Skip to main content

LLM Observability Lab

This lab defines the minimum telemetry needed to debug LLM workloads on Kubernetes. Generic pod health is not enough. You need user-facing latency, runtime saturation, GPU pressure, and prompt-path context.

Signal model

LayerSignals
GatewayRPS, errors, request size, route, tenant, status.
RuntimeTTFT, tokens/sec, queue wait, active requests, batch size.
GPUUtilization, memory, temperature, allocation failures.
RAGRetrieval latency, top-k score, source count, evaluation result.
KubernetesRestarts, throttling, OOM, scheduling delay, rollout state.

Step 1: define labels that will survive incidents

Every metric and trace should be attributable.

metadata:
labels:
app.kubernetes.io/name: vllm-openai
app.kubernetes.io/part-of: llm-platform
workload.k8sllm.io/model: mistral-7b
workload.k8sllm.io/route-class: interactive
workload.k8sllm.io/team: platform-ai

Step 2: create an incident dashboard checklist

PanelQuestion it should answer
TTFT by routeAre users waiting before the first token?
Tokens/sec by modelIs generation throughput degraded?
Queue waitIs the runtime saturated before GPU utilization is obvious?
GPU memoryIs KV cache or model load pressure rising?
Pod restartsDid Kubernetes instability cause the symptom?

Step 3: add trace attributes

Trace spans should carry enough context to link application behavior to runtime behavior.

llm.model=mistral-7b
llm.route_class=interactive
llm.prompt_tokens=812
llm.completion_tokens=164
llm.cache_hit=false
rag.collection=platform-guides
rag.top_k=5

Step 4: alert on symptoms, not only resources

AlertTrigger idea
Slow first tokenTTFT p95 above SLO for the interactive route.
Saturated runtimeQueue wait rising while active requests remain high.
GPU memory pressureMemory close to limit for sustained window.
RAG stale contentIngestion lag beyond review policy.
Rollout regressionNew revision increases errors or latency before full traffic shift.

Failure drills

  1. Send burst traffic and watch queue wait.
  2. Increase prompt length and watch TTFT.
  3. Restart the runtime pod and watch dashboard recovery.
  4. Break vector database connectivity and verify the RAG API fails clearly.