Inferact is a systems-level infrastructure company founded by creators and core maintainers of vLLM, the open-source LLM inference engine. Its work centers on kernel optimization, accelerator integration, and operational deployment for LLM inference. The team is small and senior-heavy (four senior engineers and one staff engineer across five total roles), reflecting a deep individual-contributor culture built around low-level performance work in CUDA, Triton, MLIR, and FlashAttention rather than sales velocity. Active projects span kernel engineering, hardware enablement, and cluster management, directly addressing the pain points the company was founded to solve: accelerator utilization, inference latency, and operational scale.
Inferact builds infrastructure to optimize LLM inference across diverse hardware and deployment contexts. The product and organizational focus is kernel-level performance, integration of new accelerators (NVIDIA, AMD, Intel, TPU), and automated deployment at scale. The stack, spanning CUDA, Triton, C++, PyTorch, TensorRT-LLM, Kubernetes, and cluster orchestration tools such as Slurm and Ray, marks this as a systems engineering organization focused on model execution speed, cost, and reliability. Inferact operates from San Francisco.
Core: CUDA, Triton, C++, Python, PyTorch, vLLM, LLVM, MLIR. Infrastructure: Kubernetes, Terraform, Helm, Slurm, Ray. Hardware support: NVIDIA (TensorRT-LLM), AMD, Intel, and TPU (via XLA).
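The kernel layer of this stack is where most of the performance work lives. As an illustration only (not Inferact code), here is a minimal Triton kernel of the kind this stack is built around: a masked elementwise add written against Triton's public @triton.jit API. Names like add_kernel and the BLOCK_SIZE choice are illustrative.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # Launch one program per block of elements.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Production kernels in vLLM (paged attention, fused MoE, quantized GEMMs) follow the same program-per-block pattern at far higher complexity; this sketch only shows the shape of the work.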
Project areas include kernel optimization for vLLM, accelerator integration, inference runtime performance, cluster management, deployment automation, and diffusion model serving. The primary focus is maximizing hardware utilization and reducing inference cost and latency at scale.
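For context on the workload being optimized, a minimal sketch of offline batched inference through vLLM's public Python API follows. The model name and sampling settings are illustrative, and running it requires a GPU plus the model weights.

```python
from vllm import LLM, SamplingParams

# Load a model with vLLM's offline inference API.
# tensor_parallel_size controls how many GPUs the model is sharded across.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain paged attention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```

vLLM's continuous batching and PagedAttention sit behind this API; the utilization, cost, and latency goals above are about making calls like this faster and cheaper per token.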
Other companies in the same industry, closest in size