vLLM Inference Lab

This lab builds the minimum production contract around vLLM on Kubernetes: GPU placement, model artifact access, runtime flags, health checks, token latency, and rollback readiness.

Lab outcome

By the end, you should be able to explain why the pod was scheduled on a GPU node, how the model was loaded, how traffic reaches the runtime, and which signal proves user-facing latency is acceptable.

Prerequisites

Item          Requirement
Cluster       Kubernetes with a GPU node pool.
GPU runtime   NVIDIA device plugin or GPU Operator installed.
Namespace     Dedicated namespace such as llm-serving.
Model access  A model available from a registry, object store, or mounted cache.
Telemetry     Access to pod logs and metrics scraping.

Step 1: create the serving namespace

kubectl create namespace llm-serving
kubectl label namespace llm-serving workload-class=llm

Validate:

kubectl get namespace llm-serving --show-labels

Step 2: define the GPU scheduling contract

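Use an explicit GPU limit, a node selector, and a toleration. For extended resources such as nvidia.com/gpu, Kubernetes sets the request equal to the limit, so the limit alone reserves the device. Adjust the node label and the toleration key to match your cluster.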

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-openai
  namespace: llm-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-openai
  template:
    metadata:
      labels:
        app: vllm-openai
        model: mistral-7b
    spec:
      nodeSelector:
        accelerator: nvidia
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - mistralai/Mistral-7B-Instruct-v0.2
            - --served-model-name
            - mistral-7b
            - --host
            - 0.0.0.0
            - --port
            - "8000"
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              cpu: "4"
              memory: 24Gi
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 60
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 180
            periodSeconds: 20

Review questions:

  • Which label proves this workload can only land on GPU capacity?
  • Which team owns the node labels and tolerations?
  • What happens if the model needs two GPUs instead of one?
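The first review question can also be checked mechanically. The following is a minimal Python sketch, not part of the lab's tooling: it takes a pod object as exported by kubectl get pod -o json and reports which parts of the scheduling contract are missing. The function name check_gpu_contract and the default selector key accelerator are assumptions taken from the manifest above; change them to match your cluster.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def check_gpu_contract(pod: dict, selector_key: str = "accelerator") -> list[str]:
    """Return a list of contract violations for one pod object (empty means OK)."""
    spec = pod.get("spec", {})
    problems = []

    # 1. The pod must be pinned to GPU nodes through a node selector.
    if selector_key not in (spec.get("nodeSelector") or {}):
        problems.append(f"missing nodeSelector key '{selector_key}'")

    # 2. The pod must tolerate the GPU taint used in this lab's manifest.
    tolerations = spec.get("tolerations") or []
    if not any(t.get("key") == GPU_RESOURCE for t in tolerations):
        problems.append(f"missing toleration for '{GPU_RESOURCE}'")

    # 3. At least one container must ask for a GPU through resources.limits.
    gpus = 0
    for container in spec.get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        gpus += int(limits.get(GPU_RESOURCE, 0))
    if gpus < 1:
        problems.append("no container sets a GPU limit")

    return problems
```

To use it, load the JSON from kubectl -n llm-serving get pod <pod-name> -o json with json.load and pass the resulting dict to check_gpu_contract. An empty list means the pod carries all three parts of the contract. It does not prove the node itself has GPUs, which still depends on the node labels being correct.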

Step 3: expose the runtime

apiVersion: v1
kind: Service
metadata:
  name: vllm-openai
  namespace: llm-serving
spec:
  selector:
    app: vllm-openai
  ports:
    - name: http
      port: 8000
      targetPort: http

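Validate from your workstation through a port-forward. The port-forward command stays in the foreground, so run curl in a second terminal: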

kubectl -n llm-serving port-forward svc/vllm-openai 8000:8000
curl -s http://127.0.0.1:8000/v1/models
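If you want the same check in a script, the sketch below confirms that the served model name appears in the /v1/models response. It assumes the response follows the OpenAI list shape (a top-level data array whose entries carry an id field). The function names are hypothetical.

```python
import json
import urllib.request


def served_model_ids(models_response: str) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models JSON response body."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://127.0.0.1:8000") -> list[str]:
    """Fetch /v1/models through the port-forward and return the model ids."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return served_model_ids(resp.read().decode("utf-8"))
```

With the port-forward running, "mistral-7b" in fetch_model_ids() should be true, because --served-model-name sets the id that clients must send in the model field.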

Step 4: run a latency probe

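Send one short request and one longer generation request, and record the total latency of each. Time to first token (TTFT) can only be measured with a streaming request ("stream": true) and a client that timestamps the first content chunk.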

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mistral-7b",
    "messages": [{"role": "user", "content": "Explain Kubernetes GPU scheduling in five sentences."}],
    "temperature": 0.2,
    "max_tokens": 180
  }'
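The curl request above returns the whole completion at once, so it only yields total latency. A streaming client is needed for TTFT. The following Python sketch uses only the standard library and assumes the endpoint emits OpenAI-style server-sent events (data: lines carrying choices[0].delta.content, ended by data: [DONE]). The names measure_stream and probe are illustrative, and a streamed chunk is only an approximation of one token.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable


def measure_stream(lines: Iterable[bytes], start: float,
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a stream of SSE lines: first content chunk, total, and chunk rate."""
    ttft = None
    chunks = 0
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        delta = choices[0].get("delta", {}) if choices else {}
        if delta.get("content"):  # ignore role-only or empty chunks
            chunks += 1
            if ttft is None:
                ttft = clock() - start
    total = clock() - start
    # Measure the decode rate after the first chunk so prefill time is excluded.
    rate = None
    if ttft is not None and chunks > 1 and total > ttft:
        rate = (chunks - 1) / (total - ttft)
    return {"ttft_s": ttft, "total_s": total, "chunks": chunks, "chunks_per_s": rate}


def probe(base_url: str, model: str, prompt: str, max_tokens: int = 180) -> dict:
    """Send one streaming chat completion and return its timing summary."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
        "max_tokens": max_tokens,
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=120) as resp:
        return measure_stream(resp, start)  # the response iterates line by line
```

With the port-forward from Step 3 running, probe("http://127.0.0.1:8000", "mistral-7b", "Explain Kubernetes GPU scheduling in five sentences.") returns a dict with ttft_s and total_s. Run it several times for both the short and the long prompt, since a single sample says little about tail latency.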

Validation signals

Signal          Healthy result
Pod scheduling  Pod lands only on GPU nodes and consumes expected GPU count.
Startup         Model load time is known and included in rollout timing.
Health          Readiness stays false until the runtime can serve.
Latency         TTFT and tokens/sec are measured by model and route.
Saturation      GPU memory, KV cache pressure, and queue wait are visible.
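The saturation row depends on the runtime's own metrics. As a rough sketch, the Python below parses Prometheus text exposition (the format served at /metrics) and keeps the samples that relate to queueing and KV cache use. The vllm: prefix and the substrings num_requests_waiting, num_requests_running, and cache_usage_perc are assumptions about metric naming, which has changed between vLLM versions. Check the /metrics output of your image before relying on them.

```python
def parse_metrics(text: str, prefix: str = "vllm:") -> dict[str, float]:
    """Parse Prometheus text exposition into {name_with_labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        # A sample is '<name>{labels} <value> [timestamp]'; labels may hold spaces.
        if "}" in line:
            end = line.rindex("}") + 1
            key, rest = line[:end], line[end:]
        else:
            key, _, rest = line.partition(" ")
        fields = rest.split()
        if not key.startswith(prefix) or not fields:
            continue
        samples[key] = float(fields[0])
    return samples


def saturation_view(
    samples: dict[str, float],
    needles: tuple[str, ...] = (
        "num_requests_waiting",   # assumed name: queue depth
        "num_requests_running",   # assumed name: in-flight requests
        "cache_usage_perc",       # assumed name: KV cache utilization
    ),
) -> dict[str, float]:
    """Keep only the samples whose name contains one of the given substrings."""
    return {k: v for k, v in samples.items() if any(n in k for n in needles)}
```

Feed it the body of curl -s http://127.0.0.1:8000/metrics. A waiting count that keeps growing while cache usage sits near 1.0 suggests the replica is saturated rather than slow for some other reason.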

Failure drills

Drill                     Expected learning
Remove the node selector  Confirm policy catches accidental scheduling drift.
Reduce memory request     Observe whether eviction, OOM, or slow startup happens first.
Send burst traffic        Find the saturation point before production users do.
Block model download      Verify startup failure is obvious in logs and alerts.
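For the burst traffic drill, a small load driver is enough to find where latency starts to climb. The following is one possible sketch, not a replacement for a real load-testing tool. Here send is a hypothetical callable that issues one request, for example a chat completion against the endpoint from Step 4, and run_burst and percentile are illustrative names.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def run_burst(send: Callable[[], None], concurrency: int, total: int) -> dict:
    """Call send() `total` times on `concurrency` threads and summarize latency."""
    def timed(_: int) -> tuple[float, bool]:
        start = time.perf_counter()
        try:
            send()
            return time.perf_counter() - start, True
        except Exception:
            # Count any failure (timeout, HTTP error) instead of aborting the run.
            return time.perf_counter() - start, False

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, range(total)))

    ok = [latency for latency, success in results if success]
    return {
        "concurrency": concurrency,
        "errors": total - len(ok),
        "p50_s": percentile(ok, 50) if ok else None,
        "p95_s": percentile(ok, 95) if ok else None,
    }
```

Run it at increasing concurrency, for example 1, 2, 4, 8, and 16, and compare the p95 values. The level at which p95 rises sharply or errors appear is a first estimate of the saturation point for one replica.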