LLM Latency War Room

The incident starts with a dashboard that looks calm. Pods are ready, the Service has endpoints, and the gateway is not returning errors. Users still wait 14 seconds before the first token arrives.

This is a classic failure mode for LLM latency on Kubernetes: health checks say the container is alive, but the inference path is saturated somewhere along routing, queueing, prefill, decode, GPU memory, or model loading.

Scenario

A platform team rolls out a new vLLM model revision behind the same gateway route. Request volume is stable, but p95 time to first token (TTFT) jumps from 2 seconds to 14 seconds. Horizontal scaling adds another replica, but latency stays high for 20 minutes because the new pod is still loading the model and warming its cache.
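One way to confirm that the regression follows the revision rather than the traffic is to compute p95 TTFT per revision. The following is a minimal sketch, assuming TTFT samples are already available as `(revision, ttft_seconds)` pairs; the function names and the sample shape are hypothetical, and the percentile uses the simple nearest-rank method rather than whatever a given metrics backend applies.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def p95_ttft_by_revision(samples):
    """Group (revision, ttft_seconds) pairs and return p95 per revision."""
    grouped = defaultdict(list)
    for revision, ttft in samples:
        grouped[revision].append(ttft)
    return {rev: percentile(vals, 95) for rev, vals in grouped.items()}
```

If the old revision stays near 2 seconds while the new one sits near 14, the rollout is the first suspect, not request volume.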

Symptoms

| Symptom | Why it misleads the team |
|---|---|
| Readiness is green | The probe may verify the server process, not model readiness under traffic. |
| Gateway latency increases | The gateway sees the delay, but it does not explain prefill, queue wait, or decode. |
| CPU looks normal | The bottleneck is usually GPU memory, batching, cache pressure, or runtime queueing. |
| Adding replicas is slow | LLM replicas have image pull, model download, model load, and warmup costs. |

Common wrong instinct

"The pod is healthy, so scale replicas."

That helps only if replica capacity is the bottleneck and new replicas become ready quickly enough. For LLM serving, scale-out can be too slow to help during an active incident unless warm capacity, a model cache, and readiness semantics were designed in ahead of time.
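The arithmetic behind that judgment can be sketched as a simple time budget. This is an illustration only: the class and function names are hypothetical, and the per-stage durations in the usage example are made-up numbers chosen to add up to the 20 minutes from the scenario, not measurements.

```python
from dataclasses import dataclass


@dataclass
class ColdStartCosts:
    """Per-replica startup stages, in seconds (all values are assumptions)."""
    image_pull_s: float
    model_download_s: float
    model_load_s: float
    warmup_s: float

    def time_to_serving_s(self) -> float:
        # Stages run one after another, so the costs add up.
        return (self.image_pull_s + self.model_download_s
                + self.model_load_s + self.warmup_s)


def scale_out_helps(costs: ColdStartCosts, incident_budget_s: float) -> bool:
    """True if a new replica can serve traffic within the incident budget."""
    return costs.time_to_serving_s() <= incident_budget_s


# Illustrative numbers: a fully cold replica versus one with image and
# model artifact already cached on the node.
cold = ColdStartCosts(image_pull_s=120, model_download_s=600,
                      model_load_s=360, warmup_s=120)
warm = ColdStartCosts(image_pull_s=0, model_download_s=0,
                      model_load_s=360, warmup_s=120)
```

With a five-minute incident budget, neither replica in this example arrives in time; with a ten-minute budget, only the pre-cached one does. That is the case for keeping warm capacity rather than relying on reactive scale-out.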

Production reasoning

Split the request path into measurable phases:

| Phase | Signal to inspect | Platform decision |
|---|---|---|
| Gateway | request duration, timeout, streaming behavior | Keep gateway metrics separate from runtime metrics. |
| Runtime queue | queue wait, active sequences, pending requests | Scale from inference-specific signals, not only CPU. |
| Prefill | input token count, TTFT, prompt class | Separate long-context traffic from interactive traffic. |
| Decode | tokens/sec, inter-token latency | Compare runtime flags and GPU profile against baseline. |
| GPU | memory use, utilization, cache pressure | Match model revision to node profile and capacity buffer. |
| Rollout | model load time, readiness duration | Gate traffic until the model is actually ready. |
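To make the phase split concrete, here is a minimal sketch that derives queue wait, an approximate prefill time, TTFT, and inter-token latency from per-request timestamps. The `RequestTrace` shape is hypothetical; it assumes the runtime or a tracing layer can report when a request arrived, when it left the queue, and when each output token was emitted. Prefill is approximated as the time from scheduling to the first token.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestTrace:
    arrived: float            # request reached the runtime (seconds)
    scheduled: float          # request left the queue and prefill began
    token_times: List[float]  # emission time of each output token


def phase_breakdown(trace: RequestTrace) -> Dict[str, float]:
    """Split one request's latency into queue, prefill, and decode phases."""
    if not trace.token_times:
        raise ValueError("trace has no output tokens")
    first = trace.token_times[0]
    # Gaps between consecutive tokens describe decode speed.
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return {
        "queue_wait_s": trace.scheduled - trace.arrived,
        "prefill_s": first - trace.scheduled,
        "ttft_s": first - trace.arrived,
        "decode_s": trace.token_times[-1] - first,
        "inter_token_s": sum(gaps) / len(gaps) if gaps else 0.0,
    }
```

A 14-second TTFT made of 11.5 seconds of queue wait and 2.5 seconds of prefill points at admission and capacity; the same TTFT with almost no queue wait points at prompt length or the GPU profile instead.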

Decision checklist

  • Does readiness prove the model is loaded and able to answer a small request?
  • Are TTFT and queue wait emitted by model, route, tenant, and revision?
  • Do long prompts share the same runtime queue as short interactive prompts?
  • Is model load time included in rollout SLOs and autoscaling expectations?
  • Can the team roll back the route, the runtime image, and the model artifact together?
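The first question on the checklist can be turned into a readiness check that sends a tiny real request instead of only checking that the process is up. The sketch below assumes the runtime exposes an OpenAI-compatible `/v1/completions` endpoint, as vLLM's API server does; the base URL, model name, prompt, and timeout are placeholders, and the function names are hypothetical. The HTTP call is injectable so the logic can be tested without a live server.

```python
import json
import urllib.request


def send_completion(base_url, model, timeout_s):
    """POST a one-token completion (assumes an OpenAI-compatible API)."""
    body = json.dumps(
        {"model": model, "prompt": "ping", "max_tokens": 1}
    ).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status, json.loads(resp.read())


def model_is_ready(base_url, model, timeout_s=5.0, send=send_completion):
    """Ready only if a small request returns at least one completion."""
    try:
        status, payload = send(base_url, model, timeout_s)
    except Exception:
        # Connection refused, timeout, HTTP error, or invalid JSON.
        return False
    choices = payload.get("choices") if isinstance(payload, dict) else None
    return status == 200 and bool(choices)
```

A script like this could back an exec-style readiness probe or a rollout gate. The timeout matters as much as the response: a replica that answers a one-token request only after many seconds is alive but not ready for interactive traffic.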

Run the vLLM inference challenge to practice validating GPU placement, time to first token, queue wait, and runtime health.