Lakehouse Integration Architecture
The AINative Lakehouse is a MinIO/Parquet/DuckDB architecture that ingests, enriches, and serves data across ZeroDB services, the agent swarm, and the recursive intelligence loop.
Data Inventory
| Dataset | Records | Source | Update Frequency |
|---|---|---|---|
| SMB Businesses | 264,632 | Socrata Open Data (6 cities) | Weekly cron |
| Federal Grants | 900 | USAspending.gov | Monthly |
| WHOIS Cache | 363 (54 enriched) | Apollo/Hunter/WHOIS | Daily DAG |
| SC County Parcels | 97,182 | Santa Cruz County | Monthly |
| Knowledge Graph Entities | 39,714 | CRM + Enrichment + Grants | Daily |
| Knowledge Graph Edges | 49,326 | works_at, connected_to, received_grant_from | Daily |
| Paul Graham Essays | 231 | paulgraham.com | On-demand |
| Video Transcripts | 1,275+ | YouTube (5 channels) | Weekly |
SMB City Coverage
| City | Records | Source API |
|---|---|---|
| Denver | 50,000 | Colorado SOS (data.colorado.gov) |
| Los Angeles | 50,000 | LA Business Tax (data.lacity.org) |
| Seattle | 50,000 | WA State (data.wa.gov) |
| San Francisco | 50,000 | SF Business Registry (data.sfgov.org) |
| New York City | 43,581 | NYC Business Licenses (data.cityofnewyork.us) |
| Austin | 21,051 | Austin Food Establishments (data.austintexas.gov) |
Connectors
All connectors are Python scripts in scripts/:
| Connector | Script | What It Does |
|---|---|---|
| SMB City | smb_city_connector.py | Pulls business data from Socrata APIs for 6 US cities |
| Federal Grants | federal_grants_connector.py | Fetches grant data from USAspending.gov |
| WHOIS Discovery | whois_discovery_domains.py | Enriches domains with WHOIS data |
| Montana Livestock | montana_livestock_connector.py | Auction market data |
| Montana Web3 | montana_web3_connector.py | Crypto/blockchain entities |
| SC County | santa_cruz_county_connector.py | Property parcel data |
| SMB Enrichment | smb_enrichment_pipeline.py | Apollo/Hunter/WHOIS/LinkedIn enrichment |
| Grant Graph | build_grant_graph_edges.py | Builds KG edges between grants and SMB businesses |
| PGWisdom | pgwisdom_connector.py | Paul Graham essays → lakehouse + ZeroMemory |
| Video Transcripts | research_agent.py | YouTube transcript ingestion (5 channels) |
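The Socrata-backed connectors above read from SODA endpoints, which page results with the `$limit` and `$offset` query parameters. The source does not show how smb_city_connector.py pages through a dataset, so the following is only a minimal sketch of building page URLs. The function name `socrata_page_urls`, the page size, and the dataset id `abcd-1234` are hypothetical placeholders, not values taken from the scripts.

```python
from urllib.parse import urlencode


def socrata_page_urls(domain, dataset_id, total, page_size=1000):
    """Yield SODA page URLs covering `total` rows of one dataset.

    Assumption: dataset_id is the Socrata "4x4" resource id; the real ids
    used by the connectors are not given in this document.
    """
    for offset in range(0, total, page_size):
        params = {
            # Never request more rows than remain.
            "$limit": min(page_size, total - offset),
            "$offset": offset,
            # A stable sort keeps pages from overlapping between requests.
            "$order": ":id",
        }
        # Keep "$" and ":" unescaped so the URL stays readable.
        query = urlencode(params, safe="$:")
        yield f"https://{domain}/resource/{dataset_id}.json?{query}"


if __name__ == "__main__":
    for url in socrata_page_urls("data.sfgov.org", "abcd-1234", 2500):
        print(url)
```

Each URL can then be fetched with any HTTP client. The sketch deliberately stops at URL construction so it runs without network access.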
Recursive Integration Loop
The lakehouse data feeds into ZeroDB services via 4 Celery tasks:
Lakehouse Data
↓
lake.export_marketplace_data (05:00 UTC daily)
→ SMB/grants/WHOIS → Parquet lakehouse (MinIO)
↓
lake.sync_graph_to_zerodb (05:30 UTC daily)
→ KG entity/edge updates → ZeroDB events
↓
lake.emit_lakehouse_signals (06:00 UTC daily)
→ Data snapshot → ZeroMemory agent signals
↓
Agent Swarm (Scout, Vega, Aurora)
→ Consumes signals, triggers downstream actions
↓
lake.score_seed_quality (on project creation)
→ RLHF feedback on auto-seeded prospect quality
→ Feeds back into enrichment DAG adjustments
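The ordering in the loop above can be written down as data. This is a sketch using only the task names and UTC times listed in this section. The dictionary names and helper functions are illustrative assumptions and do not reflect the actual contents of lakehouse_integration_tasks.py. In a Celery deployment, times like these would typically live in a beat schedule, but the source does not confirm that.

```python
from datetime import time

# Daily tasks and their UTC start times, as listed in the loop above.
DAILY_SCHEDULE_UTC = {
    "lake.export_marketplace_data": time(5, 0),
    "lake.sync_graph_to_zerodb": time(5, 30),
    "lake.emit_lakehouse_signals": time(6, 0),
}

# The fourth task is event-driven rather than scheduled.
EVENT_DRIVEN = {"lake.score_seed_quality": "project creation"}


def run_order(schedule):
    """Return task names sorted by their scheduled start time."""
    return [name for name, _ in sorted(schedule.items(), key=lambda kv: kv[1])]


def gap_minutes(schedule, first, second):
    """Minutes between the start times of two scheduled tasks."""
    a, b = schedule[first], schedule[second]
    return (b.hour * 60 + b.minute) - (a.hour * 60 + a.minute)


if __name__ == "__main__":
    print(run_order(DAILY_SCHEDULE_UTC))
    print(gap_minutes(DAILY_SCHEDULE_UTC,
                      "lake.export_marketplace_data",
                      "lake.sync_graph_to_zerodb"))
```

The 30-minute gaps mean each stage only has fresh input if the previous stage finishes within its window, which is worth checking if the export step grows.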
Auto-Seed Pipeline
When a new project is created:
- POST /api/v1/projects dispatches seed_new_project.delay()
- Task pulls 500 prospects from SMB directory + WHOIS data
- Scores each prospect with get_lead_score() (0-100)
- Stores as rows in project's prospects table
- Triggers score_seed_quality for RLHF feedback
Optional params: industry (saas, ecommerce, finance, healthcare, realestate) and city.
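The seeding steps can be sketched as a single pure function: filter candidates by the optional industry and city, score each one, and stop at 500 rows. This is an illustration under stated assumptions, not the code in seed_project_task.py. The function `seed_prospects`, the candidate dictionary keys, and the `score_fn` parameter (standing in for get_lead_score(), whose signature is not shown here) are all hypothetical.

```python
SEED_LIMIT = 500  # prospects pulled per new project, per the text above


def seed_prospects(candidates, score_fn, industry=None, city=None,
                   limit=SEED_LIMIT):
    """Return up to `limit` prospect rows with a 0-100 lead score attached.

    Assumptions: each candidate is a dict with optional "industry" and
    "city" keys, and score_fn(candidate) returns a number.
    """
    rows = []
    for candidate in candidates:
        if industry and candidate.get("industry") != industry:
            continue
        if city and candidate.get("city") != city:
            continue
        # Clamp to the documented 0-100 range.
        score = max(0, min(100, int(score_fn(candidate))))
        rows.append({**candidate, "lead_score": score})
        if len(rows) == limit:
            break
    return rows


if __name__ == "__main__":
    sample = [
        {"name": "Acme", "industry": "saas", "city": "Denver"},
        {"name": "Borealis", "industry": "finance", "city": "Austin"},
    ]
    print(seed_prospects(sample, lambda c: 75, industry="saas"))
```

Keeping the scorer as a parameter makes the filter and limit logic testable without the enrichment pipeline.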
GraphRAG Prospecting
POST /api/v1/public/data/prospect runs a hybrid search that combines three weighted signals:
- Text relevance (40%): Full-text search against SMB businesses
- Graph proximity (30%): Companies connected to CRM contacts or grant recipients rank higher
- Lead score (30%): Enrichment pipeline score (phone, WHOIS, tech stack, city)
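The three weights above imply a simple weighted sum. The sketch below shows one way to combine them, assuming text relevance and graph proximity are already normalized to the range 0 to 1 and the lead score is on the 0-100 scale. The function names and the normalization are assumptions; the endpoint's actual scoring code is not shown in this document.

```python
# Weights taken from the list above; they sum to 1.0.
WEIGHTS = {"text": 0.4, "graph": 0.3, "lead": 0.3}


def hybrid_score(text_relevance, graph_proximity, lead_score):
    """Combine the three signals into one score in [0, 1].

    Assumptions: text_relevance and graph_proximity lie in [0, 1];
    lead_score lies in [0, 100] and is rescaled here.
    """
    return (WEIGHTS["text"] * text_relevance
            + WEIGHTS["graph"] * graph_proximity
            + WEIGHTS["lead"] * (lead_score / 100))


def rank(candidates):
    """Sort candidate dicts (keys: text, graph, lead) by score, best first."""
    return sorted(
        candidates,
        key=lambda c: hybrid_score(c["text"], c["graph"], c["lead"]),
        reverse=True,
    )


if __name__ == "__main__":
    companies = [
        {"name": "A", "text": 0.9, "graph": 0.0, "lead": 20},
        {"name": "B", "text": 0.5, "graph": 1.0, "lead": 80},
    ]
    for company in rank(companies):
        print(company["name"])
```

In this example B outranks A despite weaker text relevance, because its graph connection and lead score together carry 60% of the weight.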
Video Transcription Channels
| Channel | Videos | Tags |
|---|---|---|
| Peter Diamandis | 864 | moonshots, exponential-tech |
| This Week in Startups | 149 | startups, venture-capital |
| Anthropic Official | 140 | anthropic, claude, ai-safety |
| Claude Official | 103 | claude-code, developer-tools |
| AINative AI Agents | 19 | agent-cloud, ai-agents |
Transcripts are stored in scripts/outputs/transcripts/ and ingested into ZeroMemory.
Key Files
| File | Purpose |
|---|---|
| scripts/smb_city_connector.py | SMB business data ingestion |
| scripts/smb_enrichment_pipeline.py | Multi-source enrichment |
| scripts/build_grant_graph_edges.py | Grant → company KG edges |
| scripts/pgwisdom_connector.py | Paul Graham essays |
| scripts/research_agent.py | YouTube transcript ingestion |
| src/backend/app/celery_tasks/lakehouse_integration_tasks.py | 4 integration Celery tasks |
| src/backend/app/celery_tasks/seed_project_task.py | Auto-seed on project creation |
| src/backend/app/api/v1/endpoints/graphrag_prospect.py | GraphRAG prospecting API |
| src/backend/app/api/v1/endpoints/data_marketplace.py | Data marketplace endpoints |
| src/backend/app/services/seed_data.py | Seed data service |
| services/airflow/dags/smb_enrichment_dag.py | Daily enrichment DAG |