vLLM Kubernetes Production Deployment

vLLM on Kubernetes should be deployed as a runtime contract, not just a container. The platform still owns GPU placement, route policy, health, metrics, rollout, rollback, and model artifact control.

Last reviewed: June 8, 2026. Use this page before promoting a vLLM deployment from an experiment to a production route.

[Figure: LLM inference stack on Kubernetes architecture]

Scenario

A team deploys vLLM behind an OpenAI-compatible route. The endpoint responds in staging, but the production review cannot answer four questions: which GPU profile is required, how the model cache behaves, which readiness probe proves the model has loaded, and which metric will trigger a rollback.

Decision table

| Deployment area | Production requirement |
| --- | --- |
| Model artifact | Pin model ID, revision, tokenizer behavior, and context limit. |
| Runtime image | Pin image version and startup flags; record why flags changed. |
| GPU scheduling | Request nvidia.com/gpu and use labels, taints, tolerations, and compatible node profiles. |
| Model cache | Make model download, cache path, and warmup time measurable. |
| Readiness | Probe model readiness, not only process startup. |
| Metrics | Emit TTFT, queue wait, tokens/sec, GPU memory, errors, and model revision. |
| Rollback | Roll back route, runtime image, and model artifact as one reviewed unit. |
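
The rows above can be sketched as a single Deployment manifest rendered from pinned inputs. This is a minimal sketch, not a reference manifest. The `render_vllm_deployment` helper, the `example.com/...` label and annotation keys, the GPU taint key, the `vllm-model-cache` claim, the cache mount path, and the probe timings are all illustrative assumptions; only the `vllm-runtime` name and `llm-serving` namespace come from the commands on this page. It also assumes the runtime image provides the `vllm serve` CLI and a `/health` endpoint on port 8000, which should be confirmed against the pinned image version.

```shell
# Sketch only: label keys, taint key, PVC name, and probe timings are assumptions.
# Arguments: $1 pinned image, $2 model ID, $3 model revision, $4 max context length.
render_vllm_deployment() {
  image="$1"; model="$2"; revision="$3"; max_len="$4"
  cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-runtime
  namespace: llm-serving
  annotations:
    example.com/model-revision: "${revision}"  # hypothetical key, recorded for rollback review
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-runtime
  template:
    metadata:
      labels:
        app: vllm-runtime
    spec:
      nodeSelector:
        example.com/gpu-profile: "reviewed-profile"  # hypothetical node label
      tolerations:
        - key: nvidia.com/gpu        # assumed taint key on the GPU node pool
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: "${image}"
          command: ["vllm", "serve", "${model}"]
          args: ["--revision", "${revision}", "--max-model-len", "${max_len}", "--port", "8000"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
          startupProbe:              # allows up to 10 minutes for download and load
            httpGet: { path: /health, port: 8000 }
            periodSeconds: 10
            failureThreshold: 60
          readinessProbe:            # gates route traffic on the serving endpoint
            httpGet: { path: /health, port: 8000 }
            periodSeconds: 5
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface  # assumed cache path
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache            # hypothetical PVC
EOF
}
```

One possible use is `render_vllm_deployment "$IMAGE" "$MODEL_ID" "$MODEL_REVISION" 8192 | kubectl apply --dry-run=client -f -`, where the three variables are placeholders for values recorded in the review. Rendering image, model ID, and revision from one call keeps them changing together, which is what lets a rollback treat them as one unit.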

Commands and checks

kubectl -n llm-serving get deploy,pod,svc,endpoints
kubectl -n llm-serving describe pod <vllm-pod>
kubectl -n llm-serving logs deploy/vllm-runtime --tail=120
curl -sS "$GATEWAY/v1/models"

| Check | Pass signal |
| --- | --- |
| GPU contract | Pod requests nvidia.com/gpu and lands on the intended accelerator node. |
| Model loaded | Logs or readiness endpoint prove the model is loaded before route traffic. |
| Runtime visibility | Metrics include TTFT, queue wait, tokens/sec, and model revision. |
| Rollback evidence | Rollout history and artifact revision are available to the on-call engineer. |
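
The runtime visibility check can be scripted rather than read by eye. The sketch below assumes the runtime exposes Prometheus text-format metrics; `missing_metrics` is a hypothetical helper that reads a scrape on stdin and prints every required metric family that does not appear. The metric names passed to it are the caller's responsibility, because vLLM metric names differ between versions.

```shell
# Print each required metric family that is absent from the Prometheus text on stdin.
# Returns 0 when nothing is missing, 1 otherwise.
missing_metrics() {
  scrape="$(cat)"
  status=0
  for name in "$@"; do
    # A sample line starts with the family name followed by a label block "{",
    # a suffix such as _bucket or _sum, or a space before the value.
    # "# HELP" and "# TYPE" lines never match because they start with "#".
    if ! printf '%s\n' "$scrape" | grep -q "^${name}[{_ ]"; then
      printf '%s\n' "$name"
      status=1
    fi
  done
  return "$status"
}
```

A possible invocation, assuming the metrics endpoint is reachable at `$VLLM_URL/metrics` (an assumed variable and path) and that the names below match the pinned vLLM version, is `curl -sS "$VLLM_URL/metrics" | missing_metrics vllm:time_to_first_token_seconds vllm:num_requests_waiting vllm:generation_tokens_total`. Because the match accepts any `_` suffix, a longer family name that shares the same prefix would also satisfy the check. For the rollback evidence row, `kubectl -n llm-serving rollout history deploy/vllm-runtime` lists the recorded revisions of the Deployment.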

Run the vLLM Kubernetes deployment lab to practice the runtime contract before live traffic.