Inferact is a systems-level infrastructure company founded by creators and core maintainers of vLLM, the open-source LLM inference engine. Its work centers on kernel optimization, accelerator integration, and operational deployment for LLM inference. The team is small and senior-heavy (four senior engineers and one staff engineer across five total roles), reflecting a deep individual-contributor culture built around low-level performance work in CUDA, Triton, MLIR, and FlashAttention rather than sales velocity. Active projects span kernel engineering, hardware enablement, and cluster management, directly addressing the pain points the company was founded to solve: accelerator utilization, inference latency, and operational scale.
Inferact builds infrastructure to optimize LLM inference across diverse hardware and deployment contexts. The product and organizational focus is kernel-level performance, integration of new accelerators (NVIDIA, AMD, Intel, TPU), and automated deployment at scale. The stack, spanning CUDA, Triton, C++, PyTorch, TensorRT-LLM, Kubernetes, and cluster orchestration tools such as Slurm and Ray, marks this as a systems engineering organization focused on model execution speed, cost, and reliability. Inferact operates from San Francisco.
Core: CUDA, Triton, C++, Python, PyTorch, vLLM, LLVM, MLIR. Infrastructure: Kubernetes, Terraform, Helm, Slurm, Ray. Hardware support: NVIDIA (TensorRT-LLM), AMD, Intel, and TPU (via XLA).
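The kernel layer of this stack is where most of the performance work lives. As an illustration only (not Inferact code), here is a minimal Triton kernel of the kind this stack is built around: a masked elementwise add written against Triton's public @triton.jit API. Names like add_kernel and the BLOCK_SIZE choice are illustrative.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # Launch one program per block of elements.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Production kernels in vLLM (paged attention, fused MoE, quantized GEMMs) follow the same program-per-block pattern at far higher complexity; this sketch only shows the shape of the work.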
Project areas include kernel optimization for vLLM, accelerator integration, inference runtime performance, cluster management, deployment automation, and diffusion model serving. The primary focus is maximizing hardware utilization and reducing inference cost and latency at scale.
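For context on the workload being optimized, a minimal sketch of offline batched inference through vLLM's public Python API follows. The model name and sampling settings are illustrative, and running it requires a GPU plus the model weights.

```python
from vllm import LLM, SamplingParams

# Load a model with vLLM's offline inference API.
# tensor_parallel_size controls how many GPUs the model is sharded across.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain paged attention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```

vLLM's continuous batching and PagedAttention sit behind this API; the utilization, cost, and latency goals above are about making calls like this faster and cheaper per token.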
Other companies in the same industry, closest in size