LLM Latency on Kubernetes

LLM latency on Kubernetes is rarely a single "pod is slow" problem. The user's wait is the sum of gateway routing, runtime queueing, prompt prefill, and token decode, inflated at times by GPU memory pressure and model cold start.

Last reviewed: June 8, 2026. Use this page when Kubernetes readiness is green but time to first token (TTFT) is unacceptable.

[Figure: LLM inference stack on Kubernetes architecture]

Scenario

A vLLM route is live behind the gateway. Pods are ready and error rate is low, but p95 time to first token jumps from 2 seconds to 14 seconds after a model revision. Adding replicas does not help because new pods spend minutes pulling the image and loading the model.
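The reason extra replicas do not help in this scenario can be made concrete with a small budget check. The sketch below is illustrative only: `cold_start_verdict` is a hypothetical helper, and the image pull time, model load time, and budget are numbers you would measure or choose yourself, in whole seconds.

```shell
# cold_start_verdict PULL_S LOAD_S BUDGET_S
# Hypothetical helper: adds image pull time and model load time (whole seconds)
# and compares the total with the time you can afford to wait for new capacity.
cold_start_verdict() {
  local total=$(( $1 + $2 ))
  if [ "$total" -le "$3" ]; then
    echo "ok total=${total}s budget=${3}s"
  else
    echo "over total=${total}s budget=${3}s"
  fi
}
```

For example, `cold_start_verdict 120 180 90` prints `over total=300s budget=90s`: a replica that needs five minutes to become useful cannot absorb a spike that must be handled within 90 seconds.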

Decision table

Layer	Signal	Platform decision
Gateway	request duration, stream start time, timeout	Keep gateway latency separate from runtime latency.
Runtime queue	queue wait, active sequences, pending requests	Scale from inference signals, not CPU alone.
Prefill	input token count, TTFT by route	Separate long-context traffic from short interactive traffic.
Decode	inter-token latency, tokens/sec	Compare runtime flags and GPU profile against baseline.
GPU	memory used, cache pressure, utilization	Match model revision to accelerator profile.
Rollout	model load duration, readiness delay	Gate traffic on model readiness, not only process readiness.
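One way to apply the table is to split a single TTFT figure into the layers it names. The following is a minimal sketch, not a definitive method: `ttft_split` is a hypothetical helper, and it assumes that the runtime's TTFT measurement includes queue wait and that the gateway's stream start time includes the runtime's TTFT. Verify both assumptions against your own metrics before trusting the split.

```shell
# ttft_split GATEWAY_TTFT RUNTIME_TTFT QUEUE_WAIT   (all in seconds)
# Hypothetical helper. Assumes runtime TTFT includes queue wait, and that
# gateway TTFT includes runtime TTFT, so the layers can be subtracted.
ttft_split() {
  awk -v g="$1" -v r="$2" -v q="$3" 'BEGIN {
    gw = g - r                 # time added by gateway routing
    pf = r - q                 # time left after queueing, mostly prefill
    top = "gateway"; max = gw
    if (q > max)  { top = "queue";   max = q }
    if (pf > max) { top = "prefill"; max = pf }
    printf "gateway=%.2f queue=%.2f prefill=%.2f dominant=%s\n", gw, q, pf, top
  }'
}
```

With illustrative inputs, `ttft_split 14 13.2 9.5` prints `gateway=0.80 queue=9.50 prefill=3.70 dominant=queue`, which would point at the runtime queue row of the table rather than the gateway or prefill rows.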

Commands and checks

kubectl -n llm-serving get pod -o wide
kubectl -n llm-serving logs deploy/vllm-runtime --tail=100
curl -sS "$GATEWAY/v1/models"
curl -sS "$METRICS" | grep -Ei "ttft|queue|tokens|gpu|latency"
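The `grep` above only shows raw lines. If the endpoint exposes Prometheus histograms, a rough mean can be derived from the `_sum` and `_count` series. The sketch below assumes the Prometheus text format with no trailing timestamps; `prom_mean` is a hypothetical helper, and the metric name you pass it (for example a TTFT histogram such as `vllm:time_to_first_token_seconds`) should be confirmed against your own metrics output.

```shell
# prom_mean METRIC
# Hypothetical helper: reads Prometheus text format on stdin and prints the mean
# observation (sum / count) for histogram METRIC, added up across all label sets.
# Assumes the sample value is the last field on each line (no timestamps).
prom_mean() {
  awk -v m="$1" '
    /^#/ { next }                         # skip HELP and TYPE comment lines
    {
      name = $1
      sub(/[{].*/, "", name)              # drop the label set, keep the metric name
      if (name == (m "_sum"))   sum   += $NF
      if (name == (m "_count")) count += $NF
    }
    END {
      if (count > 0) printf "%.3f\n", sum / count
      else print "no-data"
    }'
}
```

A possible invocation is `curl -sS "$METRICS" | prom_mean "vllm:time_to_first_token_seconds"`. The result is a mean over the lifetime of the process, not a p95, so it is only useful for a quick comparison between revisions or pods; percentiles need the bucket series and a proper query engine.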

Check	Pass signal
Model readiness	Logs or health endpoint prove the model is loaded before traffic shifts.
Queue visibility	Queue wait is visible separately from gateway latency.
TTFT split	TTFT is reported by model, route, and request class.
Cold-start budget	Model load time is included in rollout and autoscaling decisions.
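The model readiness check can be scripted against the `/v1/models` response used earlier. This is a sketch under assumptions: `model_listed` is a hypothetical helper, it expects the OpenAI-style JSON list that this endpoint normally returns, and it uses a plain string match instead of a JSON parser to avoid extra dependencies. A listed model is evidence of readiness, not proof, and a response fetched through the gateway may come from any replica.

```shell
# model_listed MODEL_ID
# Hypothetical helper: reads a /v1/models JSON response on stdin and succeeds
# only if an entry with "id": "MODEL_ID" is present. Whitespace is stripped
# first so that compact and pretty-printed JSON both match.
model_listed() {
  tr -d '[:space:]' | grep -qF -- "\"id\":\"$1\""
}
```

A rollout step could then run `curl -sS "$GATEWAY/v1/models" | model_listed "$MODEL_ID"` and shift traffic only when it exits with status 0, where `MODEL_ID` is a variable you set to the expected model revision.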

Related lab

Start the vLLM inference challenge to practice validating TTFT, queue wait, GPU placement, and runtime health.
