Real-time speech AI APIs optimized for inference efficiency
Subquadratic builds a speech-to-text platform designed around inference efficiency rather than raw scale, a deliberate counter to the industry's race toward ever-larger models. The tech stack is heavy on distributed computing and ML frameworks (Spark, Dask, Ray, PyTorch, TensorFlow) and production ML infrastructure (Kubernetes, Triton, TorchServe), reflecting a focus on low-latency, cost-efficient inference at scale. Active projects reveal a company wrestling with trillion-token data pipelines, edge inference, and real-time streaming stability, while pain points around GPU utilization and cold-start latency signal that they are optimizing the full inference stack, not just the model itself.
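For context on the cold-start pain point: a common mitigation on a Triton-based serving stack is to gate traffic on model readiness and fire a dummy warm-up request before routing real audio. A minimal sketch using Triton's Python HTTP client follows; the model name, tensor name, and input shape are hypothetical placeholders, not Subquadratic's actual configuration.

```python
import numpy as np
import tritonclient.http as httpclient  # pip install tritonclient[http]

TRITON_URL = "localhost:8000"
MODEL_NAME = "stt_acoustic"  # hypothetical model name

client = httpclient.InferenceServerClient(url=TRITON_URL)

# Gate on readiness so the first user request never hits a still-loading model.
if not client.is_model_ready(MODEL_NAME):
    # load_model requires the server to run with --model-control-mode=explicit.
    client.load_model(MODEL_NAME)

# One dummy inference warms CUDA kernels, memory pools, and autotuned paths,
# shaving cold-start latency off the first real request.
warmup = httpclient.InferInput("AUDIO", [1, 16000], "FP32")
warmup.set_data_from_numpy(np.zeros((1, 16000), dtype=np.float32))
client.infer(MODEL_NAME, inputs=[warmup])
```

Triton can also declare warm-up samples in the model configuration itself, which moves this step server-side; either approach addresses the same cold-start symptom.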
Subquadratic is an AI infrastructure company providing speech-to-text APIs designed for performance and cost efficiency. The platform targets developers building voice-native applications and voice agents, with a roadmap extending into text-to-speech, speech-to-speech, and multimodal interaction. The engineering- and data-heavy hiring mix (4 engineers, 2 data roles, plus leadership) matches the breadth of their active projects: trillion-token-scale data pipelines, distributed inference systems, real-time streaming APIs, and edge deployment. Based in Miami with 11–50 employees, the company is actively scaling infrastructure to support production voice workloads.
Tech stack: Python, PyTorch, TensorFlow, Apache Spark, Dask, Ray, Kubernetes, Triton, TorchServe, AWS, gRPC, WebSocket, OpenTelemetry, PostgreSQL, Redis, TypeScript, React, Node.js.
Active projects: trillion-token data pipelines, real-time speech-to-text and LLM token streaming APIs (sketched below), distributed inference optimization, edge deployment, synthetic data generation for pretraining, and systems for data versioning and reproducibility.
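To make the real-time streaming item concrete, here is a minimal sketch of what a client of a streaming speech-to-text WebSocket API typically looks like. The endpoint URL and message schema are invented for illustration; Subquadratic's actual API contract is not described here.

```python
import asyncio
import json

import websockets  # third-party: pip install websockets

# Hypothetical endpoint and message schema, for illustration only.
STT_URL = "wss://api.example.com/v1/stt/stream"

async def stream_transcription(audio_chunks):
    """Send raw audio frames upstream while printing transcripts as they arrive."""
    async with websockets.connect(STT_URL) as ws:

        async def send_audio():
            for chunk in audio_chunks:  # e.g. 20 ms PCM16 frames as bytes
                await ws.send(chunk)    # binary WebSocket frame
            # Hypothetical end-of-stream control message.
            await ws.send(json.dumps({"event": "end_of_stream"}))

        sender = asyncio.create_task(send_audio())
        try:
            async for message in ws:  # server pushes partial and final results
                result = json.loads(message)
                print(result.get("transcript", ""), flush=True)
                if result.get("is_final"):
                    break
        finally:
            await sender

# Usage: asyncio.run(stream_transcription(frames)), where frames yields audio bytes.
```

The full-duplex pattern (uploading audio while transcripts stream back) is what distinguishes this kind of API from batch transcription, and it is why streaming stability shows up as an engineering concern in its own right.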