Open-source metadata platform for AI and data governance at scale
DataHub is a metadata management platform with 3M+ monthly PyPI downloads, built on an extensible graph architecture with lineage-driven compliance. The stack (Python, PyTorch, TensorFlow, Kafka, Spark, dbt, Airflow) reflects a data-platform organization shipping both open-source and cloud variants. Current projects focus on real-time metadata processing, scalable ingestion, and observability infrastructure, which suggests DataHub is scaling toward enterprise customers managing machine-scale metadata volumes while addressing adoption friction and customer churn risks.
DataHub provides discovery, governance, and observability for data and AI assets through a dual-product model: an open-source core (DataHub Core) and a fully managed cloud offering (DataHub Cloud). The platform connects to 80+ data sources, ingests metadata at high velocity, and includes AI-based enhancements for discovery and quality management. The company has 51–200 employees, is headquartered in Palo Alto, and maintains active engineering and support operations across the United States, India, and the United Arab Emirates. DataHub's open-source foundation has generated a community of over 13,000 users.
DataHub uses Python, PyTorch, TensorFlow, Kafka, Apache Spark, Airflow, dbt, GraphQL, React, TypeScript, Elasticsearch, Kubernetes, Docker, and AWS/GCP. The stack spans real-time processing, ML frameworks, data transformation, and cloud infrastructure.
DataHub has 51–200 employees, was founded in 2021, and is privately held. The company is headquartered in Palo Alto, California.
DataHub's technology stack, projects, and hiring signals are inferred from public hiring and company data (career pages, public listings, and company web presence), then clustered and de-duplicated. Figures are estimates that refresh over time.
This is not an official vendor or customer list. It is a technology-adoption signal inferred from public data, intended for B2B research.