
Lakehouse Integration Architecture

The AINative Lakehouse is a MinIO/Parquet/DuckDB architecture that ingests, enriches, and serves data across ZeroDB services, the agent swarm, and the recursive intelligence loop.

Data Inventory

| Dataset | Records | Source | Update Frequency |
|---|---|---|---|
| SMB Businesses | 264,632 | Socrata Open Data (6 cities) | Weekly cron |
| Federal Grants | 900 | USAspending.gov | Monthly |
| WHOIS Cache | 363 (54 enriched) | Apollo/Hunter/WHOIS | Daily DAG |
| SC County Parcels | 97,182 | Santa Cruz County | Monthly |
| Knowledge Graph Entities | 39,714 | CRM + Enrichment + Grants | Daily |
| Knowledge Graph Edges | 49,326 | works_at, connected_to, received_grant_from | Daily |
| Paul Graham Essays | 231 | paulgraham.com | On-demand |
| Video Transcripts | 1,275+ | YouTube (5 channels) | Weekly |

SMB City Coverage

| City | Records | Source API |
|---|---|---|
| Denver | 50,000 | Colorado SOS (data.colorado.gov) |
| Los Angeles | 50,000 | LA Business Tax (data.lacity.org) |
| Seattle | 50,000 | WA State (data.wa.gov) |
| San Francisco | 50,000 | SF Business Registry (data.sfgov.org) |
| New York City | 43,581 | NYC Business Licenses (data.cityofnewyork.us) |
| Austin | 21,051 | Austin Food Establishments (data.austintexas.gov) |
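
Each of these sources is a Socrata portal, so a connector would typically page through a dataset with the SODA `$limit` and `$offset` query parameters. The sketch below only builds the page URLs; it is not taken from smb_city_connector.py. The dataset ID `abcd-1234`, the page size, and the function names are hypothetical placeholders.

```python
from urllib.parse import urlencode


def socrata_page_url(domain: str, dataset_id: str,
                     limit: int = 1000, offset: int = 0) -> str:
    """Build a SODA query URL for one page of a Socrata dataset.

    Ordering by the system field :id keeps paging stable between requests.
    """
    # safe="$:" keeps the SODA parameter names readable in the URL.
    query = urlencode(
        {"$limit": limit, "$offset": offset, "$order": ":id"},
        safe="$:",
    )
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


def page_offsets(total: int, page_size: int) -> list[int]:
    """Return the offsets needed to cover `total` rows in pages of `page_size`."""
    return list(range(0, total, page_size))


if __name__ == "__main__":
    # "abcd-1234" is a placeholder, not a real dataset ID.
    for offset in page_offsets(total=50_000, page_size=10_000):
        print(socrata_page_url("data.sfgov.org", "abcd-1234",
                               limit=10_000, offset=offset))
```

Fetching each URL and writing the rows to Parquet is left out so the sketch stays free of network and storage dependencies.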

Connectors

All connectors are Python scripts in scripts/:

| Connector | Script | What It Does |
|---|---|---|
| SMB City | smb_city_connector.py | Pulls business data from Socrata APIs for 6 US cities |
| Federal Grants | federal_grants_connector.py | Fetches grant data from USAspending.gov |
| WHOIS Discovery | whois_discovery_domains.py | Enriches domains with WHOIS data |
| Montana Livestock | montana_livestock_connector.py | Auction market data |
| Montana Web3 | montana_web3_connector.py | Crypto/blockchain entities |
| SC County | santa_cruz_county_connector.py | Property parcel data |
| SMB Enrichment | smb_enrichment_pipeline.py | Apollo/Hunter/WHOIS/LinkedIn enrichment |
| Grant Graph | build_grant_graph_edges.py | Builds KG edges between grants and SMB businesses |
| PGWisdom | pgwisdom_connector.py | Paul Graham essays → lakehouse + ZeroMemory |
| Video Transcripts | research_agent.py | YouTube transcript ingestion (5 channels) |

Recursive Integration Loop

Lakehouse data feeds into ZeroDB services through 4 Celery tasks: three run on a daily schedule and one runs on project creation.

Lakehouse Data

lake.export_marketplace_data (05:00 UTC daily)
→ SMB/grants/WHOIS → Parquet lakehouse (MinIO)

lake.sync_graph_to_zerodb (05:30 UTC daily)
→ KG entity/edge updates → ZeroDB events

lake.emit_lakehouse_signals (06:00 UTC daily)
→ Data snapshot → ZeroMemory agent signals

Agent Swarm (Scout, Vega, Aurora)
→ Consumes signals, triggers downstream actions

lake.score_seed_quality (on project creation)
→ RLHF feedback on auto-seeded prospect quality
→ Feeds back into enrichment DAG adjustments
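
The three scheduled tasks run 30 minutes apart, so each stage starts after the previous one was dispatched. A minimal sketch of that ordering, using only the standard library, is shown here. The task names and UTC times come from the flow above; `DAILY_SCHEDULE`, `next_run`, and `run_order` are hypothetical helpers, and the real Celery beat configuration is not shown in this document.

```python
from datetime import datetime, time, timedelta, timezone

# Daily UTC run times, copied from the flow above.
# In a Celery deployment these would usually be crontab entries in the
# beat schedule; that configuration is assumed, not documented here.
DAILY_SCHEDULE = {
    "lake.export_marketplace_data": time(5, 0),
    "lake.sync_graph_to_zerodb": time(5, 30),
    "lake.emit_lakehouse_signals": time(6, 0),
}


def next_run(task_name: str, now: datetime) -> datetime:
    """Return the next UTC run time of a daily task strictly after `now`.

    `now` must be a timezone-aware datetime in UTC.
    """
    candidate = datetime.combine(
        now.date(), DAILY_SCHEDULE[task_name], tzinfo=timezone.utc
    )
    if candidate <= now:
        # Today's slot has already passed, so the task runs tomorrow.
        candidate += timedelta(days=1)
    return candidate


def run_order() -> list[str]:
    """Return the task names sorted by their daily run time."""
    return sorted(DAILY_SCHEDULE, key=DAILY_SCHEDULE.get)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for name in run_order():
        print(name, next_run(name, now).isoformat())
```

lake.score_seed_quality is left out of the schedule because it is triggered by project creation rather than by the clock.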

Auto-Seed Pipeline

When a new project is created:

  1. POST /api/v1/projects dispatches seed_new_project.delay()
  2. Task pulls 500 prospects from SMB directory + WHOIS data
  3. Scores each prospect with get_lead_score() (0-100)
  4. Stores as rows in project's prospects table
  5. Triggers score_seed_quality for RLHF feedback

Optional parameters: industry (saas, ecommerce, finance, healthcare, realestate) and city.
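
The selection and scoring steps can be sketched as follows. This is an illustration under stated assumptions, not the code in seed_project_task.py: the `Prospect` fields, the `seed_prospects` helper, and the scoring rule inside `get_lead_score` are hypothetical stand-ins. Only the 500-prospect limit, the 0-100 score range, and the optional industry and city filters come from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

SEED_LIMIT = 500  # prospects pulled per new project, per the steps above


@dataclass
class Prospect:
    name: str
    city: str
    industry: str
    has_phone: bool = False
    has_whois: bool = False


def get_lead_score(prospect: Prospect) -> int:
    """Hypothetical stand-in for the real scorer; returns a value in 0-100."""
    score = 50
    if prospect.has_phone:
        score += 25
    if prospect.has_whois:
        score += 25
    return min(score, 100)


def seed_prospects(candidates: Iterable[Prospect],
                   industry: Optional[str] = None,
                   city: Optional[str] = None,
                   limit: int = SEED_LIMIT) -> list[dict]:
    """Filter candidates, take up to `limit`, and attach a lead score to each."""
    rows = []
    for p in candidates:
        if industry and p.industry != industry:
            continue
        if city and p.city != city:
            continue
        rows.append({
            "name": p.name,
            "city": p.city,
            "industry": p.industry,
            "lead_score": get_lead_score(p),
        })
        if len(rows) == limit:
            break
    return rows
```

The returned rows correspond to what step 4 would write to the project's prospects table; persistence and the RLHF trigger in step 5 are omitted.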

GraphRAG Prospecting

POST /api/v1/public/data/prospect runs a hybrid search that combines three weighted signals:

  • Text relevance (40%): Full-text search against SMB businesses
  • Graph proximity (30%): Companies connected to CRM contacts or grant recipients rank higher
  • Lead score (30%): Enrichment pipeline score (phone, WHOIS, tech stack, city)
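
The weighting above amounts to a linear blend. A minimal sketch, assuming text relevance and graph proximity are already normalized to the range 0 to 1 and the lead score is on the 0-100 scale, looks like this. The function names and the normalization are assumptions; how graphrag_prospect.py actually computes each component is not shown here.

```python
# Weights taken from the list above; they sum to 1.0.
WEIGHTS = {"text": 0.40, "graph": 0.30, "lead": 0.30}


def hybrid_score(text_relevance: float, graph_proximity: float,
                 lead_score: float) -> float:
    """Blend the three signals into one ranking score between 0 and 1.

    text_relevance and graph_proximity are assumed to be in [0, 1];
    lead_score is assumed to be in [0, 100] and is rescaled.
    """
    return (WEIGHTS["text"] * text_relevance
            + WEIGHTS["graph"] * graph_proximity
            + WEIGHTS["lead"] * (lead_score / 100))


def rank(candidates: list[dict]) -> list[dict]:
    """Sort candidate dicts by hybrid score, highest first."""
    return sorted(
        candidates,
        key=lambda c: hybrid_score(c["text"], c["graph"], c["lead"]),
        reverse=True,
    )
```

For example, a business with text relevance 0.8, graph proximity 0.5, and lead score 70 scores 0.4 × 0.8 + 0.3 × 0.5 + 0.3 × 0.7, which is about 0.68.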

Video Transcription Channels

| Channel | Videos | Tags |
|---|---|---|
| Peter Diamandis | 864 | moonshots, exponential-tech |
| This Week in Startups | 149 | startups, venture-capital |
| Anthropic Official | 140 | anthropic, claude, ai-safety |
| Claude Official | 103 | claude-code, developer-tools |
| AINative AI Agents | 19 | agent-cloud, ai-agents |

Transcripts are stored in scripts/outputs/transcripts/ and ingested into ZeroMemory.

Key Files

| File | Purpose |
|---|---|
| scripts/smb_city_connector.py | SMB business data ingestion |
| scripts/smb_enrichment_pipeline.py | Multi-source enrichment |
| scripts/build_grant_graph_edges.py | Grant → company KG edges |
| scripts/pgwisdom_connector.py | Paul Graham essays |
| scripts/research_agent.py | YouTube transcript ingestion |
| src/backend/app/celery_tasks/lakehouse_integration_tasks.py | 4 integration Celery tasks |
| src/backend/app/celery_tasks/seed_project_task.py | Auto-seed on project creation |
| src/backend/app/api/v1/endpoints/graphrag_prospect.py | GraphRAG prospecting API |
| src/backend/app/api/v1/endpoints/data_marketplace.py | Data marketplace endpoints |
| src/backend/app/services/seed_data.py | Seed data service |
| services/airflow/dags/smb_enrichment_dag.py | Daily enrichment DAG |