
LLM Latency on Kubernetes

LLM latency on Kubernetes is rarely a single "pod is slow" issue. The user waits across gateway routing, runtime queueing, prompt prefill, token decode, GPU memory pressure, and sometimes model cold start.
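One way to make that decomposition concrete is to record a duration per stage and see which one dominates. The sketch below is a minimal illustration, not a real tracing schema: the field names and the sample numbers are assumptions. GPU memory pressure is not a stage of its own here, because it usually shows up as longer queue, prefill, or decode time.

```python
from dataclasses import dataclass, asdict


@dataclass
class RequestTimings:
    """Per-request stage durations in seconds (hypothetical field names)."""
    gateway_s: float            # routing and proxy overhead before the runtime
    queue_s: float              # waiting in the runtime scheduler queue
    prefill_s: float            # prompt processing up to the first token
    decode_s: float             # generating the remaining tokens
    cold_start_s: float = 0.0   # model load time if the request hit a cold pod

    def ttft(self) -> float:
        """Time to first token: everything that happens before decode."""
        return self.cold_start_s + self.gateway_s + self.queue_s + self.prefill_s

    def total(self) -> float:
        """End-to-end wait as seen by the user."""
        return self.ttft() + self.decode_s

    def dominant_stage(self) -> str:
        """Name of the stage with the largest duration."""
        stages = asdict(self)
        return max(stages, key=stages.get)


# Illustrative numbers only: a request whose TTFT is dominated by queueing.
slow = RequestTimings(gateway_s=0.05, queue_s=11.5, prefill_s=2.3, decode_s=6.0)
print(f"TTFT={slow.ttft():.2f}s total={slow.total():.2f}s "
      f"dominant={slow.dominant_stage()}")
```

With a breakdown like this, a green readiness probe and a slow first token stop looking contradictory: the pod is healthy, but the request spent most of its time waiting in one stage.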

Last reviewed: June 8, 2026. Use this page when Kubernetes readiness checks are green but time to first token (TTFT) is unacceptable.

[Figure: LLM inference stack on Kubernetes architecture]

Scenario

A vLLM route is live behind the gateway. Pods are ready and error rate is low, but p95 time to first token jumps from 2 seconds to 14 seconds after a model revision. Adding replicas does not help because new pods spend minutes pulling the image and loading the model.
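The reason extra replicas do not help can be shown with simple arithmetic: a new pod only adds capacity after image pull and model load finish. The sketch below assumes hypothetical function names and illustrative durations. None of the numbers are measurements from this scenario.

```python
def seconds_until_serving(image_pull_s: float, model_load_s: float,
                          warmup_s: float = 0.0) -> float:
    """Time from pod scheduling until the replica can take traffic."""
    return image_pull_s + model_load_s + warmup_s


def scale_out_helps(spike_duration_s: float, image_pull_s: float,
                    model_load_s: float, warmup_s: float = 0.0) -> bool:
    """True only if the new replica starts serving before the spike ends."""
    ready_after = seconds_until_serving(image_pull_s, model_load_s, warmup_s)
    return ready_after < spike_duration_s


# Illustrative numbers: a 5-minute spike, 3-minute pull, 4-minute model load.
cold = scale_out_helps(spike_duration_s=300, image_pull_s=180, model_load_s=240)
# Same spike, but the image is already present on the node.
prepulled = scale_out_helps(spike_duration_s=300, image_pull_s=0, model_load_s=240)
print(cold, prepulled)
```

Under these assumed numbers the cold replica arrives after the spike is over, while a pre-pulled image brings it in time. That is why cold-start time belongs in the scaling decision instead of being treated as a separate rollout concern.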

Decision table

| Layer | Signal | Platform decision |
| --- | --- | --- |
| Gateway | request duration, stream start time, timeout | Keep gateway latency separate from runtime latency. |
| Runtime queue | queue wait, active sequences, pending requests | Scale from inference signals, not CPU alone. |
| Prefill | input token count, TTFT by route | Separate long-context traffic from short interactive traffic. |
| Decode | inter-token latency, tokens/sec | Compare runtime flags and GPU profile against baseline. |
| GPU | memory used, cache pressure, utilization | Match model revision to accelerator profile. |
| Rollout | model load duration, readiness delay | Gate traffic on model readiness, not only process readiness. |
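Two of these decisions lend themselves to a short sketch: scaling from a queue signal and splitting traffic by prompt length. The replica calculation follows the proportional rule used by the Kubernetes Horizontal Pod Autoscaler (current replicas times current metric over target metric, rounded up). The function names, the 8192-token threshold, and the class labels are assumptions for illustration, not values from any runtime.

```python
import math


def desired_replicas(current_replicas: int, pending_per_replica: float,
                     target_pending_per_replica: float,
                     max_replicas: int) -> int:
    """Proportional scaling on pending requests per replica, clamped to [1, max]."""
    if target_pending_per_replica <= 0:
        raise ValueError("target_pending_per_replica must be positive")
    ratio = pending_per_replica / target_pending_per_replica
    raw = math.ceil(current_replicas * ratio)
    return max(1, min(max_replicas, raw))


def request_class(input_tokens: int, long_context_threshold: int = 8192) -> str:
    """Label a request so long prompts can be routed to a separate pool.

    The threshold is a hypothetical default; pick one from observed TTFT by
    input length on your own routes.
    """
    if input_tokens >= long_context_threshold:
        return "long-context"
    return "interactive"


# 4 replicas, each with 12 pending requests, against a target of 4 per replica.
print(desired_replicas(4, 12.0, 4.0, max_replicas=16))
print(request_class(512), request_class(32000))
```

A CPU-based rule would miss this backlog entirely, since a GPU-bound runtime can sit at modest CPU utilization while its queue grows.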

Commands and checks

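# List pods together with the nodes they run on.
kubectl -n llm-serving get pod -o wide
# Tail the runtime logs for startup and model-load messages.
kubectl -n llm-serving logs deploy/vllm-runtime --tail=100
# Ask the gateway which models it serves ($GATEWAY is the gateway base URL).
curl -sS "$GATEWAY/v1/models"
# Filter the metrics endpoint ($METRICS) for latency, queue, token, and GPU signals.
curl -sS "$METRICS" | grep -Ei "ttft|queue|tokens|gpu|latency"

| Check | Pass signal |
| --- | --- |
| Model readiness | Logs or health endpoint prove the model is loaded before traffic shifts. |
| Queue visibility | Queue wait is visible separately from gateway latency. |
| TTFT split | TTFT is reported by model, route, and request class. |
| Cold-start budget | Model load time is included in rollout and autoscaling decisions. |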
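The metrics check can also be automated rather than read by eye. The sketch below parses Prometheus-style text output and reports which required signal families are missing. The metric names in the samples are made up for illustration, and the keyword patterns are assumptions; real names depend on the runtime and its version.

```python
import re

# Keyword patterns per required signal (assumed; adjust to your runtime's names).
REQUIRED_SIGNALS = {
    "ttft": re.compile(r"ttft|time_to_first_token", re.IGNORECASE),
    "queue": re.compile(r"queue|waiting", re.IGNORECASE),
}


def metric_names(exposition_text: str) -> set:
    """Collect metric names from Prometheus text format, skipping comments."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The name is everything before the first '{' or whitespace.
        names.add(re.split(r"[{\s]", line, maxsplit=1)[0])
    return names


def missing_signals(exposition_text: str) -> list:
    """Return the required signals that no metric name matches."""
    names = metric_names(exposition_text)
    return [signal for signal, pattern in REQUIRED_SIGNALS.items()
            if not any(pattern.search(name) for name in names)]


# Hypothetical metric names, for illustration only.
GOOD_SAMPLE = """\
# HELP example_time_to_first_token_seconds Time to first token.
example_time_to_first_token_seconds_sum{model="m"} 12.5
example_requests_waiting{model="m"} 3
"""
BAD_SAMPLE = "process_cpu_seconds_total 1.0\n"

print(missing_signals(GOOD_SAMPLE))
print(missing_signals(BAD_SAMPLE))
```

In practice the input would be the body returned by the `curl -sS "$METRICS"` call, and a non-empty result is a reason to fix observability before tuning latency.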

Start the vLLM inference challenge to practice validating TTFT, queue wait, GPU placement, and runtime health.