Model Serving Options

No single runtime is best for every LLM workload on Kubernetes. Choose based on batching, model lifecycle, distributed serving, framework integration, rollout model, team ownership, and how much abstraction the platform team must provide.

Comparison

Option	Strong fit	Watch out
vLLM	High-throughput text generation, OpenAI-compatible serving, efficient KV-cache management.	You still need to build deployment, routing, auth, autoscaling, and telemetry around it.
KServe	Kubernetes-native inference abstraction with revisions and autoscaling integrations.	Abstraction may hide runtime-specific knobs needed for large LLM tuning.
Ray Serve	Python-native distributed serving, complex pipelines, multi-model orchestration.	Ray cluster operations become part of platform scope.
Triton	Optimized inference server, model repository patterns, multi-framework serving.	LLM-specific serving ergonomics may require more integration work.

Decision model

Use vLLM when generation performance for a small set of models is the immediate priority.
Use KServe when you need a platform-level inference API across teams and model types.
Use Ray Serve when the serving graph includes Python logic, multi-step pipelines, or distributed orchestration.
Use Triton when standardized inference server operations and model repository control are more important than LLM-specific simplicity.
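
The four rules above can be written down as a small decision helper. This is a minimal sketch, not a recommendation engine: the `Workload` fields, the `suggest_runtime` name, and especially the order in which the rules are checked are assumptions made for illustration. The rules themselves come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of what a team needs from serving."""
    has_python_pipeline: bool = False          # multi-step or distributed serving graph
    needs_cross_team_api: bool = False         # platform-level inference API
    needs_model_repository_control: bool = False


def suggest_runtime(w: Workload) -> str:
    # Assumed priority order: serving-graph shape first, then platform API,
    # then repository control. Adjust the order to match your own constraints.
    if w.has_python_pipeline:
        return "Ray Serve"
    if w.needs_cross_team_api:
        return "KServe"
    if w.needs_model_repository_control:
        return "Triton"
    # Default: generation performance for a small set of models.
    return "vLLM"


if __name__ == "__main__":
    print(suggest_runtime(Workload(needs_cross_team_api=True)))
```

Encoding the rules this way mainly forces the team to state which requirement wins when several apply at once.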

For the most common platform decision, start with the KServe vs Ray Serve comparison.

Platform requirements around every option

Gateway auth, quotas, and tenant routing.
Immutable model and runtime versions.
GPU-aware scheduling backed by a defined Kubernetes contract for the GPU node pool.
Autoscaling signals beyond CPU.
SLO dashboards for TTFT, output throughput, queue wait, and error classes.
Clear rollback unit: model artifact, runtime image, serving config, or endpoint resource.
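
Two of the requirements above, SLO metrics and autoscaling signals beyond CPU, can be illustrated with a short sketch. It assumes that TTFT is measured from request send time to the first streamed token, that output throughput is measured over the decode phase only, and that replicas scale proportionally on queue wait, in the same ratio form that the Kubernetes Horizontal Pod Autoscaler uses. The function names, parameters, and default bounds are hypothetical.

```python
import math


def request_metrics(sent_at: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT and output throughput from one streamed response.

    sent_at and token_times are timestamps in seconds on the same clock.
    """
    if not token_times:
        raise ValueError("no tokens were streamed")
    ttft = token_times[0] - sent_at
    decode_time = token_times[-1] - token_times[0]
    # Throughput over the decode phase; undefined for a single token, so report 0.0.
    tokens_per_s = (len(token_times) - 1) / decode_time if decode_time > 0 else 0.0
    return {"ttft_s": ttft, "output_tokens_per_s": tokens_per_s}


def desired_replicas(current: int, observed_queue_wait_s: float,
                     target_queue_wait_s: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Proportional scaling on queue wait: ceil(current * observed / target)."""
    if target_queue_wait_s <= 0:
        raise ValueError("target queue wait must be positive")
    raw = math.ceil(current * observed_queue_wait_s / target_queue_wait_s)
    # Clamp to the allowed replica range.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    print(request_metrics(10.0, [10.5, 11.0, 11.5, 12.0, 12.5]))
    print(desired_replicas(current=2, observed_queue_wait_s=1.5, target_queue_wait_s=0.5))
```

Queue wait is used here because it rises before GPU utilization saturates. Other signals, such as in-flight requests or KV-cache occupancy, would slot into the same ratio.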

Failure modes

Runtime is benchmarked in isolation but fails under gateway, auth, and streaming behavior.
Model revision changes memory profile and breaks placement.
Autoscaling increases replicas but capacity remains bottlenecked by shared vector DB or reranker.
One route mixes interactive and batch traffic, destroying tail latency.
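
The second failure mode, a model revision that changes the memory profile, can be caught with a preflight estimate before rollout. The sketch below uses the usual approximation for a transformer decoder: weight memory is parameter count times bytes per parameter, and the KV cache needs 2 (keys and values) x layers x KV heads x head dimension x bytes per element for each cached token. It ignores activations and runtime overhead beyond a flat `usable_fraction`, and the model shape in the example is hypothetical.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache needed per cached token (keys plus values)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(gpu_bytes: int, n_params: int, bytes_per_param: int,
                      kv_bytes_per_token: int, usable_fraction: float = 0.9) -> int:
    """Tokens of KV cache that fit after the weights are loaded; 0 if weights do not fit."""
    budget = int(gpu_bytes * usable_fraction) - n_params * bytes_per_param
    return max(0, budget // kv_bytes_per_token)


if __name__ == "__main__":
    # Hypothetical 8B-parameter model: 32 layers, 8 KV heads, head dimension 128.
    per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
    gpu = 80 * 10**9  # one 80 GB device, decimal bytes

    fp16 = max_cached_tokens(gpu, 8 * 10**9, 2, per_token)
    fp32 = max_cached_tokens(gpu, 8 * 10**9, 4, per_token)
    print(f"KV bytes per token: {per_token}")
    print(f"cacheable tokens at 2 bytes/param: {fp16}")
    print(f"cacheable tokens at 4 bytes/param: {fp32}")
```

Running a check like this against each new revision turns a silent capacity regression into a gate that fails before scheduling does.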
