LLM Observability Challenge

Interactive version

Run the guided challenge at labs.k8sllm.online/challenges/observability. It includes paste-output checks, hints, a solution reveal, and progress stored privately on your device.

Challenge outcome

Produce a minimal signal model that lets an on-call engineer explain whether a slow answer came from the gateway, the runtime, GPU capacity, RAG retrieval, or Kubernetes rollout state.

Objective

Define and validate the telemetry needed to debug LLM workloads on Kubernetes. Generic pod health is not enough; the platform needs user-facing latency, runtime saturation, GPU pressure, and prompt-path context.

Scenario

Users report intermittent slow answers. Pods look healthy. You need to build the observability checklist that separates user latency from runtime queueing, GPU pressure, retrieval latency, and rollout regressions.

Prerequisites

Item	Requirement
Workload	LLM or mock inference workload with route labels.
Metrics	Prometheus, Grafana, or equivalent metrics access.
Logs	Runtime and gateway logs.
Traces	OpenTelemetry or a documented trace attribute plan.
Load source	A small script or client that can send repeated prompts.

Tasks

Define stable labels that survive incidents.
Map signals by layer: gateway, runtime, GPU, RAG, Kubernetes.
Build or specify dashboard panels for TTFT, tokens/sec, queue wait, GPU memory, pod restarts, and RAG retrieval latency.
Add trace attributes for model, route class, prompt size, completion size, cache hit, and RAG collection.
Define alerts on symptoms, not only resources.

Example label set:

metadata:
  labels:
    app.kubernetes.io/name: vllm-openai
    app.kubernetes.io/part-of: llm-platform
    workload.k8sllm.io/model: mistral-7b
    workload.k8sllm.io/route-class: interactive
    workload.k8sllm.io/team: platform-ai
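One way to keep the label set honest is to check it in CI or in an admission step. The following is a minimal sketch, assuming the five keys shown above are the required set; the function name `missing_or_invalid_labels` is hypothetical, and the only external fact it relies on is the Kubernetes 63-character limit on label values.

```python
# Hypothetical helper: checks a workload's labels against the label set
# used in this challenge. The required keys are an assumption taken from
# the example above, not a platform standard.
REQUIRED_LABELS = (
    "app.kubernetes.io/name",
    "app.kubernetes.io/part-of",
    "workload.k8sllm.io/model",
    "workload.k8sllm.io/route-class",
    "workload.k8sllm.io/team",
)

# Kubernetes limits label values to 63 characters.
MAX_LABEL_VALUE_LENGTH = 63


def missing_or_invalid_labels(labels: dict) -> list:
    """Return a sorted list of problems; an empty list means the labels pass."""
    problems = []
    for key in REQUIRED_LABELS:
        value = labels.get(key, "")
        if not value:
            problems.append(f"missing: {key}")
        elif len(value) > MAX_LABEL_VALUE_LENGTH:
            problems.append(f"too long: {key}")
    return sorted(problems)


example = {
    "app.kubernetes.io/name": "vllm-openai",
    "app.kubernetes.io/part-of": "llm-platform",
    "workload.k8sllm.io/model": "mistral-7b",
    "workload.k8sllm.io/route-class": "interactive",
    "workload.k8sllm.io/team": "platform-ai",
}

print(missing_or_invalid_labels(example))
```

A workload that drops one of these keys during a hotfix would show up here before it breaks a dashboard query.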

Validation commands

kubectl -n llm-serving get pod --show-labels
kubectl -n llm-serving logs -l app.kubernetes.io/name=vllm-openai --tail=120
kubectl -n llm-serving top pod
kubectl -n llm-serving rollout status deployment/vllm-openai

Trace attributes should carry this shape or an equivalent:

llm.model=mistral-7b
llm.route_class=interactive
llm.prompt_tokens=812
llm.completion_tokens=164
llm.cache_hit=false
rag.collection=platform-guides
rag.top_k=5
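The attribute shape above can be produced by one small function so that every call site emits the same keys. This is a sketch only: `build_trace_attributes` is a hypothetical name, and it returns a plain dictionary rather than calling any tracing SDK. It deliberately takes token counts, never the prompt text.

```python
# Hypothetical helper: builds the attribute dictionary for one LLM request.
# Attribute names follow the shape listed above; nothing here talks to a
# tracing backend.
def build_trace_attributes(
    model,
    route_class,
    prompt_tokens,
    completion_tokens,
    cache_hit,
    rag_collection=None,
    rag_top_k=None,
):
    attrs = {
        "llm.model": model,
        "llm.route_class": route_class,
        "llm.prompt_tokens": int(prompt_tokens),
        "llm.completion_tokens": int(completion_tokens),
        "llm.cache_hit": bool(cache_hit),
    }
    # RAG attributes are only attached when the request used retrieval.
    if rag_collection is not None:
        attrs["rag.collection"] = rag_collection
        if rag_top_k is not None:
            attrs["rag.top_k"] = int(rag_top_k)
    return attrs


attrs = build_trace_attributes(
    model="mistral-7b",
    route_class="interactive",
    prompt_tokens=812,
    completion_tokens=164,
    cache_hit=False,
    rag_collection="platform-guides",
    rag_top_k=5,
)
for key, value in attrs.items():
    print(f"{key}={value}")
```

If the service uses the OpenTelemetry Python SDK, a dictionary like this could likely be passed to a span's `set_attributes` call, but that wiring is left out here.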

Self-check checklist

Metrics can answer whether TTFT or tokens/sec changed.
Queue wait is visible separately from GPU utilization.
GPU memory and allocation failures are visible.
Prompt size and route class are attached to traces or logs.
Rollout state is visible next to latency signals.
Alerts target user symptoms before resource graphs become obvious.
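To make "alert on symptoms" concrete, the sketch below evaluates a p95 TTFT against a latency objective using a nearest-rank percentile. The 1.5 second objective and the function names are assumptions for illustration; in practice the same condition would normally live in the alerting system rather than in application code.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def ttft_alert(ttft_seconds, slo_seconds=1.5, pct=95):
    """Fire when the chosen TTFT percentile exceeds the (assumed) objective."""
    observed = percentile(ttft_seconds, pct)
    return {"percentile": pct, "observed": observed, "firing": observed > slo_seconds}


# Eighteen fast answers and two slow ones: the mean stays low, but the
# tail that users actually notice crosses the objective.
window = [0.4] * 18 + [2.0, 3.0]
print(ttft_alert(window))
```

The point of the example is that the alert condition is expressed in user-facing latency, so it can fire while pod and GPU resource graphs still look normal.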

Hints

Start with questions an on-call engineer must answer, then choose metrics.
Do not rely on GPU utilization alone; queue wait can rise first.
Keep high-cardinality prompt text out of labels. Use token counts and route classes instead.
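Raw token counts are fine as trace attributes, but as metric labels they can still produce many distinct values. One common workaround is to bucket them. The sketch below assumes a hypothetical set of bucket edges; pick edges that match your own route classes and context lengths.

```python
import bisect

# Assumed bucket edges, in tokens. These are illustrative, not recommended values.
BUCKET_EDGES = (128, 512, 2048, 8192)
BUCKET_NAMES = ("le_128", "le_512", "le_2048", "le_8192", "gt_8192")


def prompt_size_bucket(prompt_tokens):
    """Map a token count to one of a small, fixed set of label values."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # bisect_left keeps a count equal to an edge inside that edge's bucket.
    return BUCKET_NAMES[bisect.bisect_left(BUCKET_EDGES, prompt_tokens)]


for tokens in (64, 128, 812, 9000):
    print(tokens, prompt_size_bucket(tokens))
```

With five possible values, the label stays bounded no matter what users send.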

Expected signals

Layer	Signals to collect
Gateway	RPS, errors, request size, route, tenant, status.
Runtime	TTFT, tokens/sec, queue wait, active requests, batch size.
GPU	Utilization, memory, temperature, allocation failures.
RAG	Retrieval latency, top-k score, source count, evaluation result.
Kubernetes	Restarts, throttling, OOM, scheduling delay, rollout state.
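Once each layer reports its own timing, a first-pass attribution of a slow request is simple arithmetic. The following sketch assumes a hypothetical per-request breakdown in milliseconds; the stage names are illustrative and would need to match whatever your spans or logs actually record.

```python
def dominant_stage(timings_ms):
    """Return (stage, share_of_total) for the largest latency contributor."""
    total = sum(timings_ms.values())
    if total <= 0:
        raise ValueError("timings must sum to a positive value")
    stage = max(timings_ms, key=timings_ms.get)
    return stage, round(timings_ms[stage] / total, 2)


# Hypothetical breakdown of one slow answer.
slow_request = {
    "gateway": 20,
    "rag_retrieval": 180,
    "queue_wait": 900,
    "prefill": 300,
    "decode": 600,
}

print(dominant_stage(slow_request))
```

This does not replace looking at rollout state or GPU memory, but it tells the on-call engineer which layer's dashboard to open first.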

Failure drill

Send burst traffic and increase prompt length at the same time. The drill shows whether the platform's telemetry can distinguish latency caused by queue saturation from latency caused by longer prompts.
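A drill is easier to interpret when the traffic shape is written down before it is sent. The sketch below only generates a schedule of bursts with growing prompt sizes; it does not send any requests, and every parameter name and default is an assumption to be tuned for your environment.

```python
def drill_schedule(
    bursts=3,
    requests_per_burst=8,
    base_prompt_tokens=256,
    growth=2,
    gap_seconds=10,
):
    """Plan bursts whose prompt size grows by `growth` with each burst."""
    schedule = []
    for burst in range(bursts):
        prompt_tokens = base_prompt_tokens * growth ** burst
        for _ in range(requests_per_burst):
            schedule.append(
                {
                    "burst": burst,
                    # All requests in a burst share one send offset.
                    "offset_s": burst * gap_seconds,
                    "prompt_tokens": prompt_tokens,
                }
            )
    return schedule


plan = drill_schedule()
print(len(plan), plan[0], plan[-1])
```

Feeding this plan to your load client, then comparing queue wait and TTFT per burst, gives a record of which prompt size and concurrency each latency sample belongs to.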
