LLM on Kubernetes

LLM workloads stress Kubernetes in unusual ways: GPU capacity is scarce, model startup is slow, latency splits into phases such as queue wait and time to first token (TTFT), and request cost varies by token count. Treat LLM serving on Kubernetes as a platform capability with its own SLOs, capacity model, security boundary, and release process.
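
To make the token-count point concrete, here is a minimal sketch of per-request cost. It assumes one replica occupies one GPU billed by the hour, and that a request costs the share of GPU time it consumes. The `GpuProfile` type, the price, and the throughput figures are hypothetical placeholders, not benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuProfile:
    """Hypothetical figures; replace them with your own measurements."""
    hourly_price_usd: float       # assumed price of one GPU-hour
    prefill_tokens_per_s: float   # assumed prompt-processing rate
    decode_tokens_per_s: float    # assumed generation rate


def request_cost_usd(profile: GpuProfile, prompt_tokens: int, output_tokens: int) -> float:
    """Estimate cost as GPU-seconds consumed times the per-second GPU price.

    Ignores batching, which in practice spreads GPU time across concurrent requests.
    """
    gpu_seconds = (prompt_tokens / profile.prefill_tokens_per_s
                   + output_tokens / profile.decode_tokens_per_s)
    return gpu_seconds * profile.hourly_price_usd / 3600.0


profile = GpuProfile(hourly_price_usd=3.60, prefill_tokens_per_s=5000.0, decode_tokens_per_s=50.0)
short_chat = request_cost_usd(profile, prompt_tokens=200, output_tokens=100)
long_summary = request_cost_usd(profile, prompt_tokens=8000, output_tokens=1000)
print(f"short chat:   ${short_chat:.5f}")
print(f"long summary: ${long_summary:.5f}")
```

With these assumed numbers the long summary costs roughly ten times as much as the short chat, mostly because of its output tokens. That spread is why a flat per-request cost figure can mislead capacity planning.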

If you arrived through the keyword k8s llm, start with K8s LLM: Kubernetes LLM Platform Guide, then use this section for deeper runtime and operating decisions.

[Figure: LLM inference stack on Kubernetes architecture]

Core capabilities

Capability	Why it matters
GPU node pools	Keep accelerator scheduling explicit and protect general-purpose workloads.
Model runtime	Handles batching, KV cache, streaming, and model loading.
Serving abstraction	Routes traffic, manages revisions, and exposes autoscaling hooks.
RAG services	Add retrieval, metadata policy, and an evaluation loop.
Telemetry	Measures TTFT, throughput, queue wait, GPU saturation, and cost.
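
As an illustration of the telemetry row, the following sketch splits one streamed request into queue wait, TTFT, and decode throughput. The `RequestTrace` schema is hypothetical, and TTFT is measured from enqueue here, so it includes queue wait; some teams measure it from the start of processing instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical schema)."""
    enqueued_at: float        # request accepted by the gateway
    started_at: float         # runtime began processing the request
    token_times: List[float]  # arrival time of each streamed output token


def latency_phases(trace: RequestTrace) -> dict:
    """Split end-to-end latency into queue wait, TTFT, and decode throughput."""
    if not trace.token_times:
        raise ValueError("trace has no output tokens")
    first, last = trace.token_times[0], trace.token_times[-1]
    decode_window = last - first
    tokens_after_first = len(trace.token_times) - 1
    return {
        "queue_wait_s": trace.started_at - trace.enqueued_at,
        # Measured from enqueue, so this value includes queue wait.
        "ttft_s": first - trace.enqueued_at,
        "decode_tokens_per_s": tokens_after_first / decode_window if decode_window > 0 else 0.0,
        "total_s": last - trace.enqueued_at,
    }


trace = RequestTrace(enqueued_at=0.0, started_at=0.5, token_times=[1.0, 1.25, 1.5, 1.75, 2.0])
phases = latency_phases(trace)
print(phases)
```

Reporting the phases separately shows whether a slow request was waiting for capacity, waiting on prefill, or generating slowly, which a single end-to-end latency number hides.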

SEO pillar guides

Guide	Use it to decide
K8s LLM: Kubernetes LLM Platform Guide	How the Kubernetes LLM platform pieces fit together.
GPU Node Pool Kubernetes	How to isolate and schedule accelerator capacity.
vLLM on Kubernetes	How to run vLLM as the model runtime layer.
Model Serving Options	How to choose vLLM, KServe, Ray Serve, or Triton.
KServe vs Ray Serve	Which serving abstraction fits the team and workload.
RAG on Kubernetes	How to operate ingestion, retrieval, generation, and evaluation.
Inference Benchmarking and Cost Model	How to measure latency phases and unit economics.
Inference Scaling and Cost	How to scale beyond CPU metrics.
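
One way to read "scale beyond CPU metrics" is to scale on a signal the runtime reports, such as queued requests per replica. The sketch below applies the proportional rule described in the Kubernetes Horizontal Pod Autoscaler documentation, desired = ceil(current × metric / target), to that signal. The function name, the choice of metric, and the replica bounds are assumptions for illustration, and the real autoscaler also applies a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current_replicas: int,
                     metric_per_replica: float,
                     target_per_replica: float,
                     min_replicas: int = 1,
                     max_replicas: int = 8) -> int:
    """Proportional scaling on a custom metric, clamped to the replica bounds."""
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(current_replicas * metric_per_replica / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical readings: average queued requests per replica, with a target of 4.
print(desired_replicas(current_replicas=2, metric_per_replica=12.0, target_per_replica=4.0))
print(desired_replicas(current_replicas=4, metric_per_replica=1.0, target_per_replica=4.0))
print(desired_replicas(current_replicas=4, metric_per_replica=40.0, target_per_replica=4.0))
```

The third call hits the `max_replicas` cap, which stands in for the limit that scarce GPU capacity imposes regardless of demand.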

Challenge labs

Lab	What you practice
vLLM Inference Challenge	GPU scheduling, runtime health, OpenAI-compatible serving, and latency checks.
RAG Retrieval Challenge	Ingestion, vector retrieval, metadata filters, answer quality, and failure drills.
Production Readiness Challenge	Launch checks for security, quota, rollback, observability, and cost.
LLM Observability Challenge	TTFT, queue wait, GPU pressure, traces, labels, and alert design.

Operating principle

Do not choose a model-serving runtime before defining the platform contract around it: GPU placement, traffic policy, model artifact ownership, rollout and rollback, telemetry, and cost reporting. Runtime benchmarks are only useful when they include the Kubernetes gateway, autoscaling, and observability path that production users will actually hit.
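
The contract can be tracked as a simple checklist before any runtime comparison starts. This is a minimal sketch: the `PlatformContract` type and its field names are hypothetical, chosen only to mirror the items listed above, and an empty string stands for a decision that has not been made.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class PlatformContract:
    """Hypothetical record of platform decisions; an empty string means undecided."""
    gpu_placement: str = ""
    traffic_policy: str = ""
    model_artifact_owner: str = ""
    rollout_and_rollback: str = ""
    telemetry: str = ""
    cost_reporting: str = ""


def undecided(contract: PlatformContract) -> List[str]:
    """Return the names of contract items that are still blank."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name).strip()]


draft = PlatformContract(
    gpu_placement="dedicated tainted node pool",
    telemetry="TTFT, queue wait, GPU saturation",
)
missing = undecided(draft)
print("ready to pick a runtime" if not missing else f"undecided: {missing}")
```

Any item still listed as undecided is a gap that a runtime benchmark would silently paper over.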
