GPU-optimized inference platform for open-weight LLMs at production scale
FriendliAI operates a specialized inference engine built on continuous batching, a foundational serving technique the team invented. The stack (Python, Rust, C++, CUDA, ROCm, Triton, Kubernetes) reflects deep GPU systems work; hiring leans heavily toward senior and staff engineers, and the marketing presence is minimal, indicating a developer-first, research-driven positioning. Active projects span custom GPU kernels, multi-modal pipelines, and an agent execution platform, suggesting the company is moving beyond basic LLM serving into more complex inference workloads.
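Continuous batching schedules at the granularity of individual decoding iterations: new requests join the running batch and finished sequences release their slots between decode steps, rather than waiting for a whole batch to complete. The Python sketch below is a toy illustration of that scheduling loop, not FriendliAI's engine; the `Request` and `ContinuousBatcher` names are hypothetical, and `model_step` stands in for a real batched forward pass.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list[int]                  # prompt token ids
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)

class ContinuousBatcher:
    """Toy iteration-level scheduler: sequences join and leave the
    running batch between decode steps instead of per batch."""

    def __init__(self, model_step, max_batch_size: int, eos_id: int):
        self.model_step = model_step       # fn: list[Request] -> list[int]
        self.max_batch_size = max_batch_size
        self.eos_id = eos_id
        self.waiting: deque = deque()
        self.running: list[Request] = []

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> list[Request]:
        # 1) Admit waiting requests into any free batch slots.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        if not self.running:
            return []
        # 2) Run exactly one decoding iteration for the whole batch.
        next_tokens = self.model_step(self.running)
        finished, still_running = [], []
        for req, tok in zip(self.running, next_tokens):
            req.generated.append(tok)
            # 3) Retire finished sequences immediately; their slots are
            #    reusable on the very next iteration (the key idea).
            if tok == self.eos_id or len(req.generated) >= req.max_new_tokens:
                finished.append(req)
            else:
                still_running.append(req)
        self.running = still_running
        return finished
```

Compared with request-level batching, this keeps GPU utilization high under mixed output lengths: a short completion no longer blocks the batch while long completions finish.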
FriendliAI builds an inference platform optimized for running open-weight and custom AI models on GPU infrastructure. Founded in 2021 and based in San Francisco, the company targets AI engineers and ML teams seeking production-grade model deployment with lower latency and cost than closed-model APIs. The platform includes a proprietary inference engine, custom GPU kernel development, and a web interface for multi-modal model deployment. The company operates as a small, engineering-heavy organization with active hiring in the United States and South Korea.
Core languages: Python, Rust, C++, Go. GPU/compute: NVIDIA CUDA, ROCm, Triton. Serving: FastAPI, gRPC, GraphQL. Data/ops: PostgreSQL, Kubernetes, OpenTelemetry. ML tooling: Hugging Face. Frontend: React, Next.js, TypeScript.
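To make the serving layer concrete, here is a minimal FastAPI sketch of a generation endpoint. The `/v1/generate` route and the request/response fields are hypothetical illustrations of this kind of API surface, not FriendliAI's actual interface.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    model: str             # e.g. a Hugging Face model id
    prompt: str
    max_tokens: int = 256

class GenerateResponse(BaseModel):
    text: str
    usage_tokens: int

@app.post("/v1/generate", response_model=GenerateResponse)
async def generate(req: GenerateRequest) -> GenerateResponse:
    # Placeholder: a real implementation would enqueue the request with
    # the inference engine and await the decoded tokens.
    text = f"[echo:{req.model}] {req.prompt}"
    return GenerateResponse(text=text, usage_tokens=len(text.split()))
```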
Active projects include multi-modal inference pipelines, custom GPU kernel optimization, an agent execution platform, a proprietary engine supporting 450k models, and a web platform for model deployment. Focus areas span low-latency inference, performance profiling, and cross-vendor GPU support (NVIDIA and AMD).
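Triton kernels, written in Python and compiled for both CUDA and ROCm backends, are one common route to this kind of cross-vendor kernel work. Below is a minimal illustrative fused elementwise kernel; it is a generic example, not an actual FriendliAI kernel.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized tile of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fuse the add and the ReLU into a single pass over memory.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```

Fusing the two operations avoids materializing the intermediate sum in global memory, the kind of memory-bandwidth saving that low-latency inference kernels target, and the same source compiles for NVIDIA and AMD GPUs.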