Production Readiness Challenge

Interactive version

Run the guided challenge at labs.k8sllm.online/challenges/production-readiness. It includes paste-output checks, hints, a solution reveal, and progress stored privately on your device.

Challenge outcome

Make a launch decision: approve, approve with conditions, or block. The useful output is a clear risk list with owners, not a perfect score.

Objective

Run a production readiness review for a Kubernetes LLM workload before exposing it to real users or approving a new serving runtime on the platform.

Scenario

A model endpoint passed a functional demo. Before production traffic, the platform team must verify namespace isolation, quota, rollback, telemetry, cost ownership, and launch decision criteria.

Prerequisites

Item	Requirement
Workload	Existing model server deployment or equivalent test workload.
Access	Permission to inspect RBAC, NetworkPolicy, quotas, pods, rollout history, and logs.
Telemetry	Dashboard, metrics endpoint, or trace/log access.
Owner map	Known workload owner, rollback owner, and SLO owner.

Tasks

Verify namespace boundaries and service account permissions.
Verify resource requests, GPU limits, ResourceQuota, and LimitRange.
Confirm a rollback path exists and works.
Validate launch telemetry for user latency, runtime saturation, GPU pressure, and cost.
Write a launch decision with accepted exceptions and owners.

Validation commands

kubectl auth can-i get secrets --namespace default
kubectl auth can-i list pods --namespace llm-serving
kubectl -n llm-serving get networkpolicy
kubectl -n llm-serving get resourcequota
kubectl -n llm-serving get limitrange
kubectl -n llm-serving describe pod -l app=vllm-openai
kubectl -n llm-serving rollout history deployment/vllm-openai
kubectl -n llm-serving rollout status deployment/vllm-openai
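
The two `kubectl auth can-i` commands above evaluate the permissions of whoever runs them, which is usually the reviewer. To test the workload's own identity, `kubectl auth can-i` can impersonate a service account with `--as`. The following is a minimal sketch under assumptions: the service account name `default` is a placeholder (read the real one from the pod's `.spec.serviceAccountName`), and the helper names `sa_can_i` and `expect_denied` are hypothetical.

```shell
# Sketch: ask what the workload's service account may do, not the reviewer.
# SA_NAME is an assumption; replace it with the pod's serviceAccountName.
NAMESPACE="${NAMESPACE:-llm-serving}"
SA_NAME="${SA_NAME:-default}"

# Prints "yes" or "no" for one verb/resource/namespace triple.
# kubectl exits non-zero on "no", so the exit code is ignored here.
sa_can_i() {
  verb="$1"; resource="$2"; target_ns="$3"
  kubectl auth can-i "$verb" "$resource" \
    --namespace "$target_ns" \
    --as "system:serviceaccount:${NAMESPACE}:${SA_NAME}" 2>/dev/null || true
}

# Returns 0 only when the answer is "no", meaning the boundary holds.
expect_denied() {
  [ "$(sa_can_i "$1" "$2" "$3")" = "no" ]
}
```

For example, `expect_denied get secrets default || echo "RISK: workload can read secrets in default"` turns the first validation command into a pass or fail check for the risk list.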

Use this launch decision format:

Decision: approve, approve with conditions, or block
Primary risk:
Accepted exceptions:
Rollback owner:
SLO owner:
Review date:
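
A decision record is only useful if every field is filled in. The sketch below checks a plain-text record written in the format above and reports any field left blank. The function name `check_decision` and the file path passed to it are hypothetical, and treating an empty "Accepted exceptions" as an error is a design assumption: writing `none` makes the absence explicit.

```shell
# Sketch: fail when a launch decision record leaves a required field empty.
# Field names come from the format above; the file path is supplied by the caller.
check_decision() {
  file="$1"
  missing=0
  for field in "Decision" "Primary risk" "Accepted exceptions" \
               "Rollback owner" "SLO owner" "Review date"; do
    # A field counts as filled when non-space text follows "Field:".
    if ! grep -Eq "^${field}:[[:space:]]*[^[:space:]]" "$file"; then
      echo "missing: ${field}"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_decision launch-decision.txt` prints one `missing:` line per blank field and returns non-zero if any are found, so it can gate a review pipeline.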

Self-check checklist

Workload cannot access unrelated namespaces or secrets.
GPU placement is explicit and quota-bound.
Previous revision remains deployable.
Dashboards or queries expose TTFT, tokens/sec, queue wait, GPU memory, and pod restarts.
Cost can be grouped by model, route, tenant, or workload class.
Each exception has an owner and a review date.
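
For the GPU placement item, it can help to read the scheduling fields straight from the pod template rather than from `describe` output. This is a sketch under assumptions: it assumes GPUs are exposed under the resource name `nvidia.com/gpu` (as with the NVIDIA device plugin), and the helper names are hypothetical. It only reports what the Deployment declares; whether GPU nodes are tainted against general-purpose pods still needs a separate look at the nodes.

```shell
# Sketch: surface the scheduling fields behind "GPU placement is explicit".
NAMESPACE="${NAMESPACE:-llm-serving}"
DEPLOYMENT="${DEPLOYMENT:-vllm-openai}"
# Assumption: GPUs are requested as nvidia.com/gpu.

# Reads one field under .spec.template.spec of the Deployment.
pod_template_field() {
  kubectl -n "$NAMESPACE" get deployment "$DEPLOYMENT" \
    -o jsonpath="{.spec.template.spec.$1}"
}

# Prints the declared node selector, tolerations, and GPU limits.
gpu_placement_report() {
  echo "nodeSelector: $(pod_template_field nodeSelector)"
  echo "tolerations:  $(pod_template_field tolerations)"
  echo "gpu limits:   $(pod_template_field 'containers[*].resources.limits.nvidia\.com/gpu')"
}

# Returns 0 only when at least one container declares a GPU limit.
has_gpu_limit() {
  [ -n "$(pod_template_field 'containers[*].resources.limits.nvidia\.com/gpu')" ]
}
```

An empty `nodeSelector` together with empty `tolerations` is a prompt to ask how the pod is kept on GPU nodes at all.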

Hints

Do not approve a launch only because pods are ready.
A rollback that has never been tested is a guess.
If cost per request is unknown, launch volume limits should be conservative.

Expected signals

Area	Healthy signal
Security	Namespace has RBAC, NetworkPolicy, and secret ownership.
Scheduling	Pod cannot land on general-purpose nodes.
Rollout	Previous revision remains deployable.
Observability	Incident triage can start from dashboards and traces.
Cost	Requests can be grouped by model, route, tenant, or workload class.

Failure drill

Deploy a revision with an invalid model name, confirm readiness fails, roll back, and confirm the previous model endpoint serves again.
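
The rollback half of the drill can be sketched as a small script. It assumes the bad revision has already been applied, because how the model name is configured (argument, environment variable, or ConfigMap) depends on the deployment. The function names and the 120-second timeout are assumptions; the timeout should exceed the normal model load time so a slow but healthy start is not mistaken for a failure.

```shell
# Sketch of the rollback half of the drill. Apply the bad revision first.
NAMESPACE="${NAMESPACE:-llm-serving}"
DEPLOYMENT="${DEPLOYMENT:-vllm-openai}"
TIMEOUT="${TIMEOUT:-120s}"   # assumption: tune to the model's load time

# Returns 0 when the current rollout completes within the timeout.
rollout_ready() {
  kubectl -n "$NAMESPACE" rollout status "deployment/$DEPLOYMENT" \
    --timeout="$TIMEOUT" >/dev/null 2>&1
}

rollback_drill() {
  # Step 1: the bad revision must NOT become ready within the timeout.
  if rollout_ready; then
    echo "FAIL: bad revision became ready"
    return 1
  fi
  # Step 2: roll back to the previous revision.
  kubectl -n "$NAMESPACE" rollout undo "deployment/$DEPLOYMENT" >/dev/null || return 1
  # Step 3: the previous revision must become ready again.
  if rollout_ready; then
    echo "PASS: rollback restored a ready revision"
  else
    echo "FAIL: rollback did not become ready"
    return 1
  fi
}
```

A `PASS` here only shows that pods became ready again. The drill is complete once a real request to the model endpoint succeeds after the rollback.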
