LLM On Kubernetes
LLM workloads stress Kubernetes in unusual ways: GPU capacity is scarce, model startup is slow, latency splits into distinct phases (queue wait, time to first token, and token generation), and request cost varies with token count. Treat LLM serving on Kubernetes as a platform capability with its own SLOs, capacity model, security boundary, and release process.
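To make the latency phases and token-based cost concrete, here is a minimal sketch that derives them from per-request timestamps. The `RequestTrace` fields, the function names, and the flat per-token cost model (prompt and output tokens priced the same, GPU cost amortized over an assumed sustained token rate) are all illustrative assumptions, not something a particular runtime exposes.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps in seconds and token counts for one request (hypothetical schema)."""
    enqueued_at: float      # request accepted at the gateway
    started_at: float       # runtime picked the request up
    first_token_at: float   # first output token streamed back
    finished_at: float      # last output token streamed back
    prompt_tokens: int
    output_tokens: int


def latency_phases(trace: RequestTrace) -> dict:
    """Split end-to-end latency into queue wait, TTFT, and decode throughput."""
    queue_wait = trace.started_at - trace.enqueued_at
    ttft = trace.first_token_at - trace.enqueued_at
    decode_time = trace.finished_at - trace.first_token_at
    # The first token is counted under TTFT, so decode covers the rest.
    decode_tokens = max(trace.output_tokens - 1, 0)
    decode_tps = decode_tokens / decode_time if decode_time > 0 else 0.0
    return {
        "queue_wait_s": queue_wait,
        "ttft_s": ttft,
        "decode_tokens_per_s": decode_tps,
    }


def request_cost_usd(trace: RequestTrace, gpu_hourly_usd: float,
                     tokens_per_gpu_hour: float) -> float:
    """Amortize GPU cost over tokens (assumption: all tokens cost the same)."""
    total_tokens = trace.prompt_tokens + trace.output_tokens
    return gpu_hourly_usd * total_tokens / tokens_per_gpu_hour


if __name__ == "__main__":
    trace = RequestTrace(0.0, 0.25, 0.75, 2.75, prompt_tokens=500, output_tokens=101)
    print(latency_phases(trace))
    print(request_cost_usd(trace, gpu_hourly_usd=2.0, tokens_per_gpu_hour=1_000_000))
```

Keeping queue wait separate from TTFT matters because the two respond to different fixes: queue wait usually points at capacity or autoscaling, while the remainder of TTFT points at prompt length and runtime behavior.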
Core capabilities
| Capability | Why it matters |
|---|---|
| GPU node pools | Keeps accelerator scheduling explicit and protects general workloads. |
| Model runtime | Handles batching, KV cache, streaming, and model loading. |
| Serving abstraction | Routes traffic, manages revisions, and exposes autoscaling hooks. |
| RAG services | Adds retrieval, metadata policy, and an evaluation loop. |
| Telemetry | Measures time to first token (TTFT), throughput, queue wait, GPU saturation, and cost. |
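As one illustration of explicit accelerator scheduling, the sketch below builds a Pod manifest that targets a GPU node pool and prints it as JSON, which `kubectl apply -f -` accepts. It assumes the NVIDIA device plugin is installed (so nodes advertise the `nvidia.com/gpu` resource) and that GPU nodes carry a `NoSchedule` taint with that key; the `pool: gpu` node label, the image name, and the function name are hypothetical.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpu_count: int = 1) -> dict:
    """Build a Pod manifest pinned to a GPU node pool (labels and taint are assumed)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Hypothetical node label: only nodes in the GPU pool carry it.
            "nodeSelector": {"pool": "gpu"},
            # Assumed taint on GPU nodes; general workloads lack this
            # toleration and therefore stay off the accelerator pool.
            "tolerations": [
                {
                    "key": "nvidia.com/gpu",
                    "operator": "Exists",
                    "effect": "NoSchedule",
                }
            ],
            "containers": [
                {
                    "name": "runtime",
                    "image": image,
                    # GPUs are requested through limits as an extended resource.
                    "resources": {"limits": {"nvidia.com/gpu": str(gpu_count)}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest("llm-runtime", "example.com/llm-runtime:latest")
    print(json.dumps(manifest, indent=2))
```

The taint keeps general workloads off GPU nodes, while the node selector and resource limit keep the model runtime on them, which is the isolation the first table row describes.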
Pillar guides
| Guide | Use it to decide |
|---|---|
| GPU Node Pool Kubernetes | How to isolate and schedule accelerator capacity. |
| vLLM On Kubernetes | How to run vLLM as the model runtime layer. |
| Model Serving Options | How to choose among vLLM, KServe, Ray Serve, and Triton. |
| KServe vs Ray Serve | Which serving abstraction fits the team and workload. |
| RAG On Kubernetes | How to operate ingestion, retrieval, generation, and evaluation. |
| Inference Benchmarking And Cost Model | How to measure latency phases and unit economics. |
| Inference Scaling And Cost | How to autoscale on signals other than CPU utilization. |
Operating principle
Do not choose a model-serving runtime before defining the platform contract around it: GPU placement, traffic policy, model artifact ownership, rollout and rollback, telemetry, and cost reporting. Runtime benchmarks are only useful when they include the Kubernetes gateway, autoscaling, and observability path that production users will actually hit.
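One way to make this principle checkable is to record the contract as data and refuse to start runtime selection while any item is undefined. The sketch below is only an illustration: the `PlatformContract` class, its field names, and the helper functions are hypothetical and simply mirror the six items listed above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PlatformContract:
    """Decisions to settle before choosing a runtime (hypothetical schema)."""
    gpu_placement: Optional[str] = None
    traffic_policy: Optional[str] = None
    model_artifact_owner: Optional[str] = None
    rollout_and_rollback: Optional[str] = None
    telemetry: Optional[str] = None
    cost_reporting: Optional[str] = None


def undefined_items(contract: PlatformContract) -> List[str]:
    """Return the contract items that are still empty, in declaration order."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name)]


def ready_to_choose_runtime(contract: PlatformContract) -> bool:
    """True only when every contract item has been filled in."""
    return not undefined_items(contract)


if __name__ == "__main__":
    draft = PlatformContract(
        gpu_placement="dedicated tainted node pool",
        telemetry="TTFT, queue wait, GPU saturation dashboards",
    )
    print("still undefined:", undefined_items(draft))
    print("ready:", ready_to_choose_runtime(draft))
```

A check like this could run in a design review or a CI step, so that a runtime benchmark is only accepted once the surrounding contract exists.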