LLM Latency War Room
The incident starts with a dashboard that looks calm. Pods are ready, the Service has endpoints, and the gateway is not returning errors. Users still wait 14 seconds before the first token arrives.
This is a classic failure mode for LLM serving on Kubernetes: health checks say the container is alive, but the inference path is saturated somewhere along routing, queueing, prefill, decode, GPU memory, or model loading.
Scenario
A platform team rolls out a new vLLM model revision behind the same gateway route. Request volume is stable, but p95 time to first token (TTFT) jumps from 2 seconds to 14 seconds. Horizontal scaling adds another replica, but latency stays high for 20 minutes because the new pod is still loading the model and warming its cache.
Symptoms
| Symptom | Why it misleads the team |
|---|---|
| Readiness is green | The probe may verify the server process, not model readiness under traffic. |
| Gateway latency increases | The gateway sees the delay, but it does not explain prefill, queue wait, or decode. |
| CPU looks normal | The bottleneck is usually GPU memory, batching, cache pressure, or runtime queueing. |
| Adding replicas is slow | LLM replicas have image pull, model download, model load, and warmup costs. |
Common wrong instinct
"The pod is healthy, so scale replicas."
That can help only if replica capacity is the bottleneck and new replicas become ready quickly enough. For LLM serving, scale-out can be too slow for an active incident unless warm capacity, model cache, and readiness semantics are already designed.
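To make the scale-out argument concrete, here is a minimal sketch that adds up the cold-start steps of one new replica and compares the total with the time the team can afford during an incident. The class, the function, and every duration are hypothetical; the cold-path numbers are chosen only so that they sum to the 20 minutes in the scenario, not measured from any cluster.

```python
from dataclasses import dataclass


@dataclass
class ColdStart:
    """Seconds spent in each startup step of one new replica (illustrative values)."""
    image_pull_s: float
    model_download_s: float
    model_load_s: float
    warmup_s: float

    def time_to_useful_s(self) -> float:
        # The replica adds capacity only after every step has finished.
        return (self.image_pull_s + self.model_download_s
                + self.model_load_s + self.warmup_s)


def scale_out_helps(start: ColdStart, incident_budget_s: float) -> bool:
    """True if a new replica becomes useful within the incident budget."""
    return start.time_to_useful_s() <= incident_budget_s


# Hypothetical cold path: nothing cached on the node.
cold = ColdStart(image_pull_s=240, model_download_s=600, model_load_s=270, warmup_s=90)
# Hypothetical warm path: image and model artifact already cached on the node.
warm = ColdStart(image_pull_s=0, model_download_s=0, model_load_s=270, warmup_s=90)

for name, start in (("cold", cold), ("warm", warm)):
    minutes = start.time_to_useful_s() / 60
    print(f"{name}: {minutes:.0f} min, helps within 10 min: {scale_out_helps(start, 600)}")
```

Under these assumed numbers, caching the image and model artifact is what brings the new replica inside a 10-minute budget; the replica count alone does not change the outcome.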
Production reasoning
Split the request path into measurable phases:
| Phase | Signal to inspect | Platform decision |
|---|---|---|
| Gateway | request duration, timeout, streaming behavior | Keep gateway metrics separate from runtime metrics. |
| Runtime queue | queue wait, active sequences, pending requests | Scale from inference-specific signals, not only CPU. |
| Prefill | input token count, TTFT, prompt class | Separate long-context traffic from interactive traffic. |
| Decode | tokens/sec, inter-token latency | Compare runtime flags and GPU profile against baseline. |
| GPU | memory use, utilization, cache pressure | Match model revision to node profile and capacity buffer. |
| Rollout | model load time, readiness duration | Gate traffic until the model is actually ready. |
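The phase split above can be sketched as a small calculation over per-request timestamps. This is an illustration only: the `RequestTrace` fields are hypothetical names, not metrics exposed by any particular gateway or runtime, and prefill is approximated as the time between the runtime scheduling the request and the first token.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestTrace:
    received_s: float           # gateway accepted the request
    scheduled_s: float          # runtime took the request off its queue
    token_times_s: List[float]  # emission time of each output token


def phase_breakdown(trace: RequestTrace) -> Dict[str, float]:
    """Split one request into queue wait, prefill, TTFT, and decode pace."""
    if not trace.token_times_s:
        raise ValueError("trace has no output tokens")
    first = trace.token_times_s[0]
    last = trace.token_times_s[-1]
    gaps = len(trace.token_times_s) - 1  # intervals between consecutive tokens
    decode_span = last - first
    return {
        "queue_wait_s": trace.scheduled_s - trace.received_s,
        "prefill_s": first - trace.scheduled_s,  # approximation, see lead-in
        "ttft_s": first - trace.received_s,
        "inter_token_latency_s": decode_span / gaps if gaps else 0.0,
        "decode_tokens_per_s": gaps / decode_span if decode_span > 0 else 0.0,
    }


# Hypothetical request with a 14-second TTFT, most of it spent queued.
trace = RequestTrace(received_s=0.0, scheduled_s=11.5, token_times_s=[14.0, 14.5, 15.0])
for phase, value in phase_breakdown(trace).items():
    print(f"{phase}: {value}")
```

In this made-up trace, 11.5 of the 14 seconds are queue wait, which points at runtime capacity or batching rather than at the gateway or the decode phase.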
Decision checklist
- Does readiness prove the model is loaded and able to answer a small request?
- Are TTFT and queue wait emitted with labels for model, route, tenant, and revision?
- Do long prompts share the same runtime queue as short interactive prompts?
- Is model load time included in rollout SLOs and autoscaling expectations?
- Can the team roll back the route, the runtime image, and the model artifact together?
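The first checklist item can be sketched as a readiness check that sends one tiny completion request instead of only confirming that the process is up. This sketch assumes the runtime serves an OpenAI-compatible `/v1/completions` endpoint, as vLLM's API server does; the base URL, model name, and helper names are placeholders, and the transport is injected so the logic can be exercised without a live server.

```python
import json
import urllib.request
from typing import Callable, Tuple

# A sender takes (path, payload) and returns (HTTP status, decoded JSON body).
Sender = Callable[[str, dict], Tuple[int, dict]]


def http_sender(base_url: str, timeout_s: float = 5.0) -> Sender:
    """Build a sender that POSTs JSON to base_url (placeholder address)."""
    def send(path: str, payload: dict) -> Tuple[int, dict]:
        req = urllib.request.Request(
            base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return resp.status, json.loads(resp.read())
    return send


def model_is_ready(send: Sender, model: str) -> bool:
    """Ready only if the model answers a one-token completion request."""
    payload = {"model": model, "prompt": "ping", "max_tokens": 1}
    try:
        status, body = send("/v1/completions", payload)
    except (OSError, ValueError):
        # Connection errors, timeouts, HTTP errors, or a non-JSON body.
        return False
    if status != 200 or not isinstance(body, dict):
        return False
    return len(body.get("choices") or []) > 0


# Fake sender standing in for a loaded model (no network needed).
def fake_loaded(path: str, payload: dict) -> Tuple[int, dict]:
    return 200, {"choices": [{"text": "!"}]}


print(model_is_ready(fake_loaded, "example-model"))
```

Against a real deployment, the call would look like `model_is_ready(http_sender("http://localhost:8000"), "example-model")`, with both values replaced by the actual service address and served model name. A probe like this costs GPU time on every check, so the interval and timeout need to be chosen with that in mind.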
Related lab
Run the vLLM inference challenge to practice validating GPU placement, time to first token, queue wait, and runtime health.