LLM On Kubernetes

LLM workloads stress Kubernetes in unusual ways: GPU capacity is scarce, model startup is slow, latency splits into distinct phases such as queue wait and time to first token (TTFT), and request cost varies with token count. Treat LLM serving on Kubernetes as a platform capability with its own SLOs, capacity model, security boundary, and release process.
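
To make the latency phases and token-based cost concrete, here is a minimal sketch. It assumes you can record four timestamps per request; the `RequestTrace` fields, the choice to measure TTFT from enqueue time, and the per-1k-token prices are illustrative assumptions, not part of any specific runtime's API.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps in seconds and token counts for one request (hypothetical fields)."""
    enqueued_at: float      # request accepted at the gateway
    started_at: float       # runtime begins processing the prompt
    first_token_at: float   # first output token streamed
    finished_at: float      # last output token streamed
    prompt_tokens: int
    output_tokens: int


def latency_phases(t: RequestTrace) -> dict:
    """Split end-to-end latency into queue wait, TTFT, and decode time."""
    decode_s = t.finished_at - t.first_token_at
    return {
        "queue_wait_s": t.started_at - t.enqueued_at,
        # Assumption: TTFT is measured from enqueue, so it includes queue wait.
        "ttft_s": t.first_token_at - t.enqueued_at,
        "decode_s": decode_s,
        # Tokens after the first one, per second of decode time.
        "decode_tokens_per_s": (t.output_tokens - 1) / decode_s if decode_s > 0 else 0.0,
    }


def request_cost(t: RequestTrace, price_per_1k_prompt: float, price_per_1k_output: float) -> float:
    """Token-based cost; the prices are placeholder inputs, not real rates."""
    return (t.prompt_tokens / 1000) * price_per_1k_prompt + (t.output_tokens / 1000) * price_per_1k_output


trace = RequestTrace(0.0, 0.5, 1.5, 5.5, prompt_tokens=2000, output_tokens=201)
print(latency_phases(trace))
print(request_cost(trace, price_per_1k_prompt=0.5, price_per_1k_output=1.5))
```

Keeping the phases separate matters because each one points at a different fix: queue wait suggests a capacity problem, while slow decode suggests a runtime or batching problem.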

[Figure: LLM inference stack on Kubernetes architecture]

Core capabilities

| Capability | Why it matters |
| --- | --- |
| GPU node pools | Keep accelerator scheduling explicit and protect general workloads. |
| Model runtime | Handles batching, KV cache, streaming, and model loading. |
| Serving abstraction | Routes traffic, manages revisions, and exposes autoscaling hooks. |
| RAG services | Add retrieval, metadata policy, and an evaluation loop. |
| Telemetry | Measures TTFT, throughput, queue wait, GPU saturation, and cost. |
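
As an illustration of explicit accelerator scheduling, the sketch below builds a Pod manifest as a Python dict. It assumes the NVIDIA device plugin is installed (which exposes the `nvidia.com/gpu` extended resource) and that the GPU nodes carry a `pool=gpu` label and an `nvidia.com/gpu:NoSchedule` taint. The label, taint key, pod name, and image are hypothetical placeholders for your own cluster's values.

```python
import json


def gpu_pod_spec(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that targets a tainted GPU node pool."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Hypothetical node label identifying the GPU pool.
            "nodeSelector": {"pool": "gpu"},
            # Tolerates the (assumed) taint that keeps general workloads off GPU nodes.
            "tolerations": [{
                "key": "nvidia.com/gpu",
                "operator": "Exists",
                "effect": "NoSchedule",
            }],
            "containers": [{
                "name": "runtime",
                "image": image,
                # Extended resources such as GPUs are requested through limits.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


manifest = gpu_pod_spec("llm-runtime", "example.com/llm-runtime:latest")
print(json.dumps(manifest, indent=2))
```

The taint protects general workloads from landing on GPU nodes, while the node selector and resource limit keep the model runtime's placement explicit.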

Pillar guides

| Guide | Use it to decide |
| --- | --- |
| GPU Node Pool Kubernetes | How to isolate and schedule accelerator capacity. |
| vLLM On Kubernetes | How to run vLLM as the model runtime layer. |
| Model Serving Options | How to choose vLLM, KServe, Ray Serve, or Triton. |
| KServe vs Ray Serve | Which serving abstraction fits the team and workload. |
| RAG On Kubernetes | How to operate ingestion, retrieval, generation, and evaluation. |
| Inference Benchmarking And Cost Model | How to measure latency phases and unit economics. |
| Inference Scaling And Cost | How to scale beyond CPU metrics. |

Operating principle

Do not choose a model-serving runtime before defining the platform contract around it: GPU placement, traffic policy, model artifact ownership, rollout and rollback, telemetry, and cost reporting. Runtime benchmarks are only useful when they include the Kubernetes gateway, autoscaling, and observability path that production users will actually hit.
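
One way to enforce this ordering is to treat the contract as a checklist that must be complete before runtime selection starts. The sketch below is a hypothetical illustration: the `PlatformContract` class and its field names simply mirror the six items listed above and do not come from any existing tool.

```python
from dataclasses import dataclass, fields


@dataclass
class PlatformContract:
    """One field per contract item; an empty string means the item is undecided."""
    gpu_placement: str = ""
    traffic_policy: str = ""
    model_artifact_ownership: str = ""
    rollout_and_rollback: str = ""
    telemetry: str = ""
    cost_reporting: str = ""


def undefined_items(contract: PlatformContract) -> list:
    """Return the names of contract items that are still blank."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name).strip()]


draft = PlatformContract(
    gpu_placement="dedicated tainted GPU node pool",
    telemetry="TTFT, throughput, queue wait, GPU saturation",
)
missing = undefined_items(draft)
print("Ready to choose a runtime" if not missing else f"Define first: {', '.join(missing)}")
```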