Data selection tools for training efficient deep learning models
DatologyAI automates data selection for deep learning training, identifying which data points to include or exclude before model training begins. The stack is heavily Python and PyTorch for ML work, with Apache Spark and Flink for data processing at scale, and multi-cloud infrastructure (AWS, Azure, GCP) suggesting customer deployments across regions. Internal pain points center on compute efficiency: wasted training compute, inefficient data curation, and training on irrelevant data are all listed as active challenges, which map directly to the product thesis that selecting better training data reduces both time and cost.
DatologyAI builds automated data-selection tools for deep learning teams. Rather than training on all available data, its platform identifies redundant, noisy, or harmful data points before training begins, allowing customers to train better-performing models on smaller, curated datasets. The approach is agnostic to data modality (text, image, or other) and does not require labeled data, which lowers adoption friction. Founded in 2023, the company is headquartered in Redwood City, California, and operates as a lean, engineering-first organization (5 engineers, 1 researcher) with early sales and marketing functions. Active projects span training infrastructure, data curation platforms, multi-cloud deployment, and model serving, all foundational to scaling the product across customer environments.
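DatologyAI's actual curation algorithms are not described here, but one common family of techniques the product description evokes is embedding-based near-duplicate pruning: score each sample against the set already kept and drop points that are too similar. The sketch below is purely illustrative, assuming hypothetical inputs (a matrix of precomputed embeddings) and a hypothetical `select_unique` helper; it is not DatologyAI's method.

```python
import numpy as np

def select_unique(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate filter over unlabeled data.

    Keeps a point only if its cosine similarity to every already-kept
    point is below `threshold`. Returns the indices of retained samples.
    """
    # Normalize rows so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if all(float(vec @ normed[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy data: rows 0 and 1 are near-duplicates; row 2 is distinct.
data = np.array([
    [1.0, 0.0],
    [0.999, 0.01],
    [0.0, 1.0],
])
print(select_unique(data))  # near-duplicates collapse to one representative
```

Because the filter operates on embeddings rather than labels or raw formats, the same logic applies to any modality for which an encoder exists, which matches the modality-agnostic, label-free framing above.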
Core: Python and PyTorch for ML; Apache Spark and Apache Flink for data processing. Infrastructure: Kubernetes, Terraform, CloudFormation, and Pulumi for multi-cloud orchestration across AWS, Azure, and GCP. Version control: GitHub.
Redwood City, California. Founded in 2023, the company is privately held with 11–50 employees and currently hiring only in the United States.