LLM Latency on Kubernetes
LLM latency on Kubernetes is rarely a single "pod is slow" problem. The user's wait is spread across gateway routing, runtime queueing, prompt prefill, token decode, GPU memory pressure, and sometimes model cold start.
Last reviewed: June 8, 2026. Use this page when Kubernetes readiness is green but time to first token (TTFT) is unacceptable.
Scenario
A vLLM route is live behind the gateway. Pods are ready and the error rate is low, but p95 time to first token jumps from 2 seconds to 14 seconds after a model revision. Adding replicas does not help, because each new pod spends minutes pulling the image and loading the model before it can serve traffic.
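
To see why extra replicas arrive too late in a scenario like this one, it can help to add up the startup phases of a new pod. The sketch below is illustrative only: the phase names, the `ColdStart` class, and every number in it are assumptions, not measurements from the scenario.

```python
from dataclasses import dataclass


@dataclass
class ColdStart:
    """Startup phases for one new replica, in seconds (hypothetical values)."""
    schedule_s: float    # scheduler places the pod on a GPU node
    image_pull_s: float  # serving image is pulled if it is not cached on the node
    model_load_s: float  # weights are loaded into GPU memory
    warmup_s: float      # runtime finishes warmup and can take a request

    def time_to_serving(self) -> float:
        # The phases run one after another, so their durations add up.
        return self.schedule_s + self.image_pull_s + self.model_load_s + self.warmup_s


def replica_helps(cold_start: ColdStart, spike_duration_s: float) -> bool:
    """A new replica only relieves a spike if it serves before the spike ends."""
    return cold_start.time_to_serving() < spike_duration_s


# Hypothetical numbers chosen for illustration.
uncached = ColdStart(schedule_s=20, image_pull_s=180, model_load_s=240, warmup_s=30)
cached = ColdStart(schedule_s=20, image_pull_s=0, model_load_s=90, warmup_s=30)

print(uncached.time_to_serving())                     # 470
print(replica_helps(uncached, spike_duration_s=300))  # False
print(replica_helps(cached, spike_duration_s=300))    # True
```

With these assumed numbers, a replica that needs 470 seconds to start cannot help a 300-second spike, while one with a cached image and faster model load can.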
Decision table
| Layer | Signal | Platform decision |
|---|---|---|
| Gateway | request duration, stream start time, timeout | Keep gateway latency separate from runtime latency. |
| Runtime queue | queue wait, active sequences, pending requests | Scale from inference signals, not CPU alone. |
| Prefill | input token count, TTFT by route | Separate long-context traffic from short interactive traffic. |
| Decode | inter-token latency, tokens/sec | Compare runtime flags and GPU profile against baseline. |
| GPU | memory used, cache pressure, utilization | Match model revision to accelerator profile. |
| Rollout | model load duration, readiness delay | Gate traffic on model readiness, not only process readiness. |
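
One way to apply the table is to split a single request's time to first token into per-layer segments and see which layer dominates. This is a minimal sketch that assumes each request carries four timestamps; the key names (`gateway_received`, `runtime_enqueued`, `runtime_scheduled`, `first_token`) are hypothetical and would need to be mapped to whatever the gateway and runtime actually emit.

```python
from typing import Dict


def ttft_breakdown(ts: Dict[str, float]) -> Dict[str, float]:
    """Split time to first token into gateway, queue, and prefill segments (seconds).

    Expects increasing timestamps under the hypothetical keys
    gateway_received, runtime_enqueued, runtime_scheduled, first_token.
    """
    segments = {
        "gateway": ts["runtime_enqueued"] - ts["gateway_received"],
        "queue": ts["runtime_scheduled"] - ts["runtime_enqueued"],
        "prefill": ts["first_token"] - ts["runtime_scheduled"],
    }
    if any(value < 0 for value in segments.values()):
        raise ValueError("timestamps are out of order")
    return segments


def dominant_layer(segments: Dict[str, float]) -> str:
    """Return the name of the layer with the largest share of the wait."""
    return max(segments, key=segments.get)


# Hypothetical request with a 14-second TTFT.
sample = {
    "gateway_received": 0.0,
    "runtime_enqueued": 0.25,
    "runtime_scheduled": 11.75,
    "first_token": 14.0,
}
parts = ttft_breakdown(sample)
print(parts)                  # {'gateway': 0.25, 'queue': 11.5, 'prefill': 2.25}
print(dominant_layer(parts))  # queue
```

In this made-up example most of the wait is queueing, which would point at the runtime queue row of the table rather than at the gateway or the GPU.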
Commands and checks
Run these from a shell that can reach the cluster. $GATEWAY and $METRICS are assumed to hold the gateway base URL and the metrics endpoint URL.
kubectl -n llm-serving get pod -o wide
kubectl -n llm-serving logs deploy/vllm-runtime --tail=100
curl -sS "$GATEWAY/v1/models"
curl -sS "$METRICS" | grep -Ei "ttft|queue|tokens|gpu|latency"
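
The last two commands can be turned into repeatable checks. The sketch below assumes the gateway returns an OpenAI-style model list (a JSON object with a `data` array of entries that have an `id`) and that the metrics endpoint serves Prometheus text format without trailing timestamps. The function names, the model ID, and the metric names in the sample are hypothetical; real metric names depend on the runtime and its version.

```python
import json
import re
from typing import List, Tuple

# Same keywords as the grep above; adjust to the metric names your runtime exposes.
PATTERN = re.compile(r"ttft|queue|tokens|gpu|latency", re.IGNORECASE)


def model_is_listed(models_body: str, model_id: str) -> bool:
    """True if model_id appears in an OpenAI-style /v1/models response body."""
    payload = json.loads(models_body)
    return any(entry.get("id") == model_id for entry in payload.get("data", []))


def matching_samples(metrics_text: str) -> List[Tuple[str, float]]:
    """Return (series, value) pairs whose series name or labels match PATTERN."""
    samples = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # Assumes "series value" with no timestamp after the value.
        series, _, value = line.rpartition(" ")
        if series and PATTERN.search(series):
            samples.append((series, float(value)))
    return samples


# Hypothetical responses for illustration.
models_body = '{"object": "list", "data": [{"id": "example-model", "object": "model"}]}'
metrics_text = """\
# HELP example_queue_wait_seconds Hypothetical queue wait gauge
# TYPE example_queue_wait_seconds gauge
example_queue_wait_seconds{route="chat"} 11.5
example_requests_total{route="chat"} 42
"""

print(model_is_listed(models_body, "example-model"))  # True
print(matching_samples(metrics_text))
```

A listed model is a necessary signal, not proof that the weights are loaded on every pod, so treat it as one input to the model readiness check rather than the whole check.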
| Check | Pass signal |
|---|---|
| Model readiness | Logs or health endpoint prove the model is loaded before traffic shifts. |
| Queue visibility | Queue wait is visible separately from gateway latency. |
| TTFT split | TTFT is reported by model, route, and request class. |
| Cold-start budget | Model load time is included in rollout and autoscaling decisions. |
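
The TTFT split check matters because a blended percentile can hide a slow request class. Here is a minimal sketch of a per-group p95, assuming TTFT samples are available as `(model, route, request_class, ttft_seconds)` tuples. The tuple layout, the function names, and the sample values are assumptions for illustration, and the nearest-rank method is one of several common percentile definitions.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str]  # (model, route, request_class)


def nearest_rank_percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_ttft_by_group(samples: Iterable[Tuple[str, str, str, float]]) -> Dict[Key, float]:
    """Group TTFT samples by (model, route, request_class) and return p95 per group."""
    groups = defaultdict(list)
    for model, route, request_class, ttft_s in samples:
        groups[(model, route, request_class)].append(ttft_s)
    return {key: nearest_rank_percentile(vals, 95) for key, vals in groups.items()}


# Hypothetical samples: 20 interactive requests and 20 long-context requests.
samples = (
    [("example-model", "chat", "interactive", 1.0)] * 19
    + [("example-model", "chat", "interactive", 3.0)]
    + [("example-model", "chat", "long-context", 4.0)] * 18
    + [("example-model", "chat", "long-context", 14.0)] * 2
)

per_group = p95_ttft_by_group(samples)
blended = nearest_rank_percentile([s[3] for s in samples], 95)

print(per_group[("example-model", "chat", "interactive")])   # 1.0
print(per_group[("example-model", "chat", "long-context")])  # 14.0
print(blended)                                               # 4.0
```

In this made-up data the blended p95 is 4 seconds, while the long-context class alone sits at 14 seconds, which is the kind of gap that reporting by model, route, and request class is meant to expose.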
Related lab
Start the vLLM inference challenge to practice validating TTFT, queue wait, GPU placement, and runtime health.