vLLM Inference Challenge

Interactive version

Run the guided challenge with paste-output checks, hints, solution reveal, and private device progress at labs.k8sllm.online/challenges/vllm-inference.

Challenge outcome

Deploy a GPU-backed vLLM endpoint and prove why it scheduled on GPU capacity, when it became ready, how traffic reaches it, and which metrics describe user-facing latency.

Objective

Build the minimum production contract around vLLM on Kubernetes: GPU placement, model artifact access, runtime flags, health checks, token latency, and rollback readiness.

Scenario

Your platform team needs to expose one OpenAI-compatible model endpoint for an internal product team. The endpoint must run only on GPU nodes, report readiness only after the model has loaded, and produce enough telemetry to debug time to first token (TTFT) and queue wait.

Prerequisites

Item	Requirement
Cluster	Kubernetes with a GPU node pool.
GPU runtime	NVIDIA device plugin or GPU Operator installed.
Namespace	Dedicated namespace such as `llm-serving`.
Model access	A model available from a registry, object store, or mounted cache.
Telemetry	Access to pod logs and metrics scraping.
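Before starting the tasks, it can help to confirm that the GPU prerequisite actually holds. The sketch below is one way to list nodes that advertise allocatable `nvidia.com/gpu`. The helper name `filter_gpu_nodes` is hypothetical, and the approach assumes the device plugin exposes the standard `nvidia.com/gpu` resource name.

```shell
# Hypothetical helper: read "NAME GPU" rows on stdin and print the names of
# nodes whose allocatable GPU count is a positive integer.
filter_gpu_nodes() {
  awk '$2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Usage (needs cluster access; the backslash escapes the dot in the resource name):
# kubectl get nodes --no-headers \
#   -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#   | filter_gpu_nodes
```

An empty result suggests that the device plugin or GPU Operator is not advertising GPUs yet, in which case the Deployment would stay Pending.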

Tasks

Create and label a serving namespace.
Deploy a vLLM runtime with explicit GPU requests, node selector, and toleration.
Expose the runtime through a ClusterIP Service.
Run one short prompt and one longer generation request.
Capture the readiness state, startup duration, TTFT, tokens/sec, queue wait, and GPU memory signal.

kubectl create namespace llm-serving
kubectl label namespace llm-serving workload-class=llm
kubectl get namespace llm-serving --show-labels

Use this workload as the starting manifest and adjust labels, model name, and resource values to your cluster:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-openai
  namespace: llm-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-openai
  template:
    metadata:
      labels:
        app: vllm-openai
        model: mistral-7b
    spec:
      nodeSelector:
        accelerator: nvidia
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          # Pin a specific release tag for real rollouts; :latest makes rollback targets ambiguous.
          image: vllm/vllm-openai:latest
          args:
            - --model
            - mistralai/Mistral-7B-Instruct-v0.2
            - --served-model-name
            - mistral-7b
            - --host
            - 0.0.0.0
            - --port
            - "8000"
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              cpu: "4"
              memory: 24Gi
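          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 60
            periodSeconds: 10
          # With the default failureThreshold of 3, the container is restarted if
          # /health is still failing roughly four minutes after start. Raise
          # initialDelaySeconds or add a startupProbe if model download and load take longer.
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 180
            periodSeconds: 20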
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-openai
  namespace: llm-serving
spec:
  selector:
    app: vllm-openai
  ports:
    - name: http
      port: 8000
      targetPort: http
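To capture the startup duration asked for in the tasks, one option is to time the rollout from `kubectl apply` until the Deployment reports ready. This is a minimal sketch, assuming the manifest is saved as `vllm-openai.yaml` (a hypothetical filename) and that a 15-minute timeout is generous enough for your model download. The helper name `time_rollout` is also hypothetical.

```shell
# Hypothetical helper: apply a manifest and print the seconds elapsed until
# the vllm-openai Deployment is fully rolled out.
time_rollout() {
  manifest="$1"
  start=$(date +%s)
  # kubectl output goes to stderr so stdout carries only the measurement.
  kubectl apply -f "$manifest" >&2 || return 1
  kubectl -n llm-serving rollout status deployment/vllm-openai --timeout=15m >&2 || return 1
  end=$(date +%s)
  echo "startup_seconds=$((end - start))"
}

# Usage:
# time_rollout vllm-openai.yaml
```

The number covers scheduling, image pull, model download, and model load together. The pod logs and events are still needed to split it into those phases.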

Validation commands

kubectl -n llm-serving get pod -o wide
kubectl -n llm-serving describe pod -l app=vllm-openai
kubectl -n llm-serving logs -l app=vllm-openai --tail=80
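# port-forward blocks the terminal; run it in a second terminal (or background it) before calling curl.
kubectl -n llm-serving port-forward svc/vllm-openai 8000:8000
curl -s http://127.0.0.1:8000/v1/models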
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mistral-7b",
    "messages": [{"role": "user", "content": "Explain Kubernetes GPU scheduling in five sentences."}],
    "temperature": 0.2,
    "max_tokens": 180
  }'
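The request above can also be timed to get a rough tokens/sec figure. The sketch below assumes the port-forward is still running and that the response carries an OpenAI-style `usage.completion_tokens` field. The helper names are hypothetical. The rate is end to end, so it includes queue wait and prompt processing, and a non-streaming request like this one cannot show TTFT; server-side metrics are likely a better source for that.

```shell
# Extract usage.completion_tokens from an OpenAI-style JSON response on stdin.
completion_tokens() {
  grep -o '"completion_tokens": *[0-9][0-9]*' | grep -o '[0-9][0-9]*$' | head -n 1
}

# End-to-end output rate: generated tokens divided by total request seconds.
tokens_per_sec() {
  awk -v t="$1" -v s="$2" \
    'BEGIN { if (t != "" && s > 0) printf "%.2f\n", t / s; else print "missing" }'
}

# Send one non-streaming request and print token count, duration, and rate.
# The prompt ($1) must not contain double quotes or backslashes.
measure_chat() {
  body='{"model":"mistral-7b","messages":[{"role":"user","content":"'"$1"'"}],"max_tokens":180}'
  out=$(curl -s -w '\n%{time_total}' http://127.0.0.1:8000/v1/chat/completions \
    -H 'Content-Type: application/json' -d "$body") || return 1
  secs=$(printf '%s\n' "$out" | tail -n 1)               # last line: curl timing
  toks=$(printf '%s\n' "$out" | sed '$d' | completion_tokens)  # rest: JSON body
  echo "completion_tokens=$toks total_seconds=$secs tokens_per_sec=$(tokens_per_sec "$toks" "$secs")"
}

# Usage:
# measure_chat "Explain Kubernetes GPU scheduling in five sentences."
```

Running it once with a short prompt and once with a longer generation gives the two data points the tasks ask for.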

Self-check checklist

The pod lands only on a GPU node.
The workload requests nvidia.com/gpu.
Readiness stays false until vLLM can serve.
The Service reaches the OpenAI-compatible endpoint.
Startup time and model load behavior are visible in logs.
TTFT, tokens/sec, queue wait, and GPU memory are captured or explicitly marked as missing.
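For the GPU memory item in the checklist, one possible source is `nvidia-smi` run inside the serving container. This sketch assumes `nvidia-smi` is available there, which depends on how the NVIDIA runtime is configured in your cluster. The helper name `gpu_mem_pct` is hypothetical.

```shell
# Hypothetical helper: read "used, total" (MiB) lines on stdin, one per GPU,
# and print the used percentage for each.
gpu_mem_pct() {
  awk -F', *' '$2 > 0 { printf "%.1f\n", 100 * $1 / $2 }'
}

# Usage (assumes nvidia-smi exists in the container):
# kubectl -n llm-serving exec deploy/vllm-openai -- \
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits \
#   | gpu_mem_pct
```

vLLM reserves a large share of GPU memory up front (controlled by `--gpu-memory-utilization`), so a high percentage here is expected even when idle. Treat it as a placement and sizing signal, not as a measure of load.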

Hints

If the pod is pending, inspect node labels, taints, tolerations, and available GPU capacity.
If readiness never passes, check model download, cache mount, runtime flags, and GPU memory.
If latency is unstable, compare short prompts with long prompts before changing autoscaling.
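For the pending-pod hint, the scheduler's `FailedScheduling` events usually name the cause. The sketch below maps common message fragments to the three checks listed above. The helper name `pending_cause` is hypothetical, and the matched phrases are assumptions about scheduler wording, which varies between Kubernetes versions.

```shell
# Hypothetical helper: read scheduler event messages on stdin and print a
# short tag for each recognizable cause.
pending_cause() {
  awk '
    /Insufficient nvidia\.com\/gpu/         { print "gpu-capacity" }
    /taint/                                 { print "toleration" }
    /node affinity\/selector|node selector/ { print "node-selector" }
  '
}

# Usage:
# kubectl -n llm-serving get events --field-selector reason=FailedScheduling \
#   -o custom-columns=MSG:.message --no-headers | pending_cause
```

No output means the events did not match any of the three patterns, so read the raw messages from `kubectl describe pod` instead.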

Expected signals

Signal	Healthy result
Pod scheduling	Pod lands only on GPU nodes and consumes expected GPU count.
Startup	Model load time is known and included in rollout timing.
Health	Readiness stays false until the runtime can serve.
Latency	TTFT and tokens/sec are measured by model and route.
Saturation	GPU memory, KV cache pressure, and queue wait are visible.
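The latency and saturation signals in the table can often be read from the Prometheus text that vLLM serves at `/metrics` on the same port. The sketch below computes a histogram's mean from its `_sum` and `_count` series. The helper name `histogram_mean` is hypothetical, and the metric names in the usage lines are assumptions that vary across vLLM versions, so check them against your own `/metrics` output.

```shell
# Hypothetical helper: read Prometheus text on stdin and print the mean of the
# named histogram (sum / count across all label sets), or "missing".
histogram_mean() {
  awk -v name="$1" '
    index($0, name "_sum") == 1   { sum += $NF }
    index($0, name "_count") == 1 { count += $NF }
    END {
      if (count > 0) printf "%.4f\n", sum / count
      else print "missing"
    }'
}

# Usage (with the port-forward running; metric names are assumptions):
# curl -s http://127.0.0.1:8000/metrics | histogram_mean vllm:time_to_first_token_seconds
# curl -s http://127.0.0.1:8000/metrics | histogram_mean vllm:request_queue_time_seconds
```

The mean is cumulative since the process started and hides tail latency. For percentiles, scrape the `_bucket` series into your metrics system and query them there.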

Failure drill

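Remove the node selector or the toleration, then redeploy. The drill shows whether platform policy, such as an admission control rule, catches accidental scheduling drift before a model server lands on the wrong capacity. Note that the `nvidia.com/gpu` limit alone still confines the pod to nodes that advertise GPUs, so removing the node selector mainly tests whether the pod can drift to a different GPU pool, while removing the toleration leaves the pod Pending if every GPU node carries the taint.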
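One way to run the node selector half of the drill without editing the manifest is a JSON patch followed by a rollback. This is a sketch only. The helper names are hypothetical, and the commands assume the Deployment and namespace names from the manifest in this challenge.

```shell
# Remove the nodeSelector from the pod template, which triggers a new rollout.
remove_node_selector() {
  kubectl -n llm-serving patch deployment vllm-openai --type=json \
    -p '[{"op":"remove","path":"/spec/template/spec/nodeSelector"}]'
}

# Show which node each pod landed on (or that it is Pending) after the change.
show_placement() {
  kubectl -n llm-serving get pod -l app=vllm-openai -o wide
}

# Return to the previous revision once the result is recorded.
restore_previous() {
  kubectl -n llm-serving rollout undo deployment/vllm-openai
}

# Usage:
# remove_node_selector && show_placement
# restore_previous
```

If the patch itself is rejected, that is a useful result: an admission policy caught the drift before any pod was created.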
