vLLM Kubernetes Production Deployment

vLLM on Kubernetes should be deployed as a runtime contract, not just a container. The platform still owns GPU placement, route policy, health, metrics, rollout, rollback, and model artifact control.

Last reviewed: June 8, 2026. Use this page before promoting a vLLM deployment from an experiment to a production route.

[Figure: LLM inference stack on Kubernetes architecture]

Scenario

A team deploys vLLM behind an OpenAI-compatible route. The endpoint responds in staging, but the production review cannot answer which GPU profile is required, how the model cache behaves, which readiness probe proves the model is loaded, or which metric will trigger a rollback.

Decision table

Deployment area	Production requirement
Model artifact	Pin model ID, revision, tokenizer behavior, and context limit.
Runtime image	Pin image version and startup flags; record why flags changed.
GPU scheduling	Request `nvidia.com/gpu` and use labels, taints, tolerations, and compatible node profiles.
Model cache	Make model download, cache path, and warmup time measurable.
Readiness	Probe model readiness, not only process startup.
Metrics	Emit time to first token (TTFT), queue wait, tokens/sec, GPU memory, errors, and model revision.
Rollback	Roll back route, runtime image, and model artifact as one reviewed unit.
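Several rows of the table above can be checked mechanically before review. The following is a minimal sketch, not a definitive policy: it assumes the Deployment manifest has already been loaded into a Python dict, that the vLLM runtime is the first container in the pod, and that the model revision is pinned through a `--revision` startup argument. The function name `contract_violations` is hypothetical.

```python
from typing import Any


def contract_violations(deployment: dict[str, Any]) -> list[str]:
    """Return readable contract violations for a Deployment dict (empty if none)."""
    problems: list[str] = []
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    containers = pod_spec.get("containers", [])
    if not containers:
        return ["no containers defined"]
    runtime = containers[0]  # assumption: the vLLM runtime is the first container

    # Runtime image: require a digest, or a tag other than "latest".
    image = runtime.get("image", "")
    name_part = image.rsplit("/", 1)[-1]
    unpinned_tag = ":" not in name_part or name_part.endswith(":latest")
    if "@sha256:" not in image and unpinned_tag:
        problems.append(f"image not pinned: {image!r}")

    # Model artifact: require an explicit revision among the startup args.
    args = [str(a) for a in runtime.get("args", [])]
    if not any(a == "--revision" or a.startswith("--revision=") for a in args):
        problems.append("model revision not pinned in args")

    # GPU scheduling: require a GPU limit and some node placement rule.
    limits = runtime.get("resources", {}).get("limits", {})
    if "nvidia.com/gpu" not in limits:
        problems.append("no nvidia.com/gpu limit")
    if not pod_spec.get("nodeSelector") and not pod_spec.get("affinity"):
        problems.append("no nodeSelector or affinity for the GPU node profile")

    # Readiness: require a readiness probe on the runtime container.
    if "readinessProbe" not in runtime:
        problems.append("no readinessProbe")
    return problems
```

A check like this only proves that the fields exist. Whether the probe actually reflects model load and whether the node profile is the right one still needs a human review.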

Commands and checks

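# List the workloads, Services, and endpoints in the serving namespace.
kubectl -n llm-serving get deploy,pod,svc,endpoints
# Inspect scheduling, GPU resources, probe results, and recent events for one pod.
kubectl -n llm-serving describe pod <vllm-pod>
# Read recent runtime logs to confirm the model finished loading.
kubectl -n llm-serving logs deploy/vllm-runtime --tail=120
# List the models served through the route ($GATEWAY is the gateway base URL).
curl -sS "$GATEWAY/v1/models"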
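The last command can be turned into a pass/fail check. The sketch below assumes the route returns the OpenAI-style list body (`{"object": "list", "data": [{"id": ...}]}`) and that the caller knows the expected model ID. The helper names `served_model_ids` and `route_serves` are hypothetical.

```python
import json
import urllib.request


def served_model_ids(models_payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in models_payload.get("data", []) if "id" in m]


def route_serves(base_url: str, expected_model: str, timeout: float = 5.0) -> bool:
    """Fetch <base_url>/v1/models and report whether expected_model is listed."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return expected_model in served_model_ids(payload)
```

Calling `route_serves(os.environ["GATEWAY"], "<model-id>")` from a smoke test would fail fast when the route answers but serves a different model than the one under review.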

Check	Pass signal
GPU contract	Pod requests `nvidia.com/gpu` and lands on the intended accelerator node.
Model loaded	Logs or readiness endpoint prove the model is loaded before route traffic.
Runtime visibility	Metrics include TTFT, queue wait, tokens/sec, and model revision.
Rollback evidence	Rollout history and artifact revision are available to the on-call engineer.
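The "Runtime visibility" check can be approximated by scanning the Prometheus text exposition scraped from the runtime's metrics endpoint. This is a sketch under assumptions: the required metric name prefixes are supplied by the caller, and the names used in any example are placeholders to confirm against the output of the deployed vLLM version. Matching is by prefix so that histogram series ending in `_bucket`, `_sum`, and `_count` count as present.

```python
def exposed_metric_names(exposition_text: str) -> set[str]:
    """Collect sample names from Prometheus text-format output."""
    names: set[str] = set()
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is "name{labels} value" or "name value".
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def missing_metrics(exposition_text: str, required_prefixes: list[str]) -> list[str]:
    """Return the required prefixes that no exposed metric name starts with."""
    names = exposed_metric_names(exposition_text)
    return [
        prefix
        for prefix in required_prefixes
        if not any(name.startswith(prefix) for name in names)
    ]
```

An empty result means every required signal is at least exported. It does not show that the values are sane, so the rollback threshold for each metric still has to be agreed separately.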

Related lab

Run the vLLM Kubernetes deployment lab to practice the runtime contract before live traffic.
