Open source vector database optimized for multimodal AI and RAG applications
LanceDB is an open source vector database built for AI applications requiring multimodal search and retrieval-augmented generation (RAG). The tech stack (PyTorch, TensorFlow, Ray, Apache Spark, Iceberg, Delta Lake, and Arrow) reflects deep integration with modern ML infrastructure and data lakehouse tooling. Current hiring is engineering-focused and skews senior (4 of 5 open roles), while the project backlog centers on scaling the backend for billion-scale datasets and hardening distributed operations. Together, these signals suggest the company is moving past initial developer adoption toward production-grade infrastructure.
LanceDB, founded in 2022 and based in San Francisco, is an open source database purpose-built for vector search and AI workloads. The product targets developers building applications that require multimodal data retrieval, feature engineering, and interactive exploration of large-scale datasets. The engineering roadmap emphasizes ecosystem integrations (Spark, Hive Metastore, Presto, Trino, Ray), operational stability, and a managed cloud offering for billion-scale datasets. The team is 11–50 people, actively hiring senior engineers in the US.
Core infrastructure: PyTorch, TensorFlow, Ray, Apache Spark, Kubernetes.
Data layer: Iceberg, Delta Lake, Hudi, Parquet, Apache Arrow, DataFusion.
Ecosystem: Feast, Tecton, Presto, Trino.
Cloud: AWS, GCP, Azure.
The project backlog includes integrating the Lance format with Spark, Hive Metastore, Presto, Trino, and Ray; building efficient indices for predicate pushdown; and developing a scalable backend for LanceDB Cloud that handles billion-scale datasets and provides a serverless experience.