AI cluster observability and reliability platform for large-scale GPU workloads
Clockwork Systems builds a software layer for AI infrastructure, focusing on observability, determinism, and resilience in GPU clusters. The tech stack is heavily low-level, spanning kernel and kernel-bypass tooling (eBPF, RDMA, DPDK), GPU libraries (NCCL, CUDA), and networking (InfiniBand, TCP/IP monitoring), indicating deep systems engineering rather than application-layer abstraction. Hiring is concentrated in senior engineers while sales scales internationally, a pattern suggesting that infrastructure sales cycles are maturing beyond early adopters.
Clockwork Systems delivers a programmable software platform called FleetIQ that makes large-scale AI clusters observable and deterministic. The company targets enterprises training and deploying GPU-intensive workloads, addressing pain points around cluster utilization, performance bottlenecks, and infrastructure reliability. Founded in 2018 and based in Palo Alto, Clockwork is an 11–50-person company with engineering depth in high-performance systems, currently expanding sales coverage into the UK and Middle East markets.
Clockwork's stack spans kernel-level and kernel-bypass tools (eBPF, DPDK, RDMA), GPU frameworks and libraries (CUDA, NCCL, PyTorch), container orchestration (Kubernetes), and observability (OpenTelemetry, Prometheus, DCGM). Frontend layers use TypeScript, React, Vue, and D3.js. Infrastructure runs on AWS, GCP, and Azure.
Active projects include AI and GPU cluster observability, kernel-level observability via eBPF, high-performance networking sensors for TCP/IP monitoring, metric collection systems, and go-to-market (GTM) expansion into the UK and Middle East markets.