Model Serving Options
No single runtime is best for every LLM workload on Kubernetes. Choose based on batching, model lifecycle, distributed serving, framework integration, rollout model, team ownership, and how much abstraction the platform team must provide.
Comparison
| Option | Strong fit | Watch out |
|---|---|---|
| vLLM | High-throughput text generation, OpenAI-compatible serving, efficient KV cache. | You still need deployment, routing, auth, autoscaling, and telemetry around it. |
| KServe | Kubernetes-native inference abstraction with revisions and autoscaling integrations. | Abstraction may hide runtime-specific knobs needed for large LLM tuning. |
| Ray Serve | Python-native distributed serving, complex pipelines, multi-model orchestration. | Ray cluster operations become part of platform scope. |
| Triton | Optimized inference server, model repository patterns, multi-framework serving. | LLM-specific serving ergonomics may require more integration work. |
Decision model
- Use vLLM when generation performance for a small set of models is the immediate priority.
- Use KServe when you need a platform-level inference API across teams and model types.
- Use Ray Serve when the serving graph includes Python logic, multi-step pipelines, or distributed orchestration.
- Use Triton when standardized inference server operations and model repository control are more important than LLM-specific simplicity.
The most common platform decision is KServe versus Ray Serve, so start there.
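The decision rules above can be sketched as a small lookup that returns every option whose rule matches a workload. This is an illustration only: the `Workload` fields and the `candidate_runtimes` function are hypothetical names, and the booleans are a simplification of what is usually a judgment call.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload description; field names are illustrative."""
    generation_throughput_first: bool = False  # few models, raw generation performance
    needs_cross_team_api: bool = False         # one inference API across teams and model types
    needs_python_pipeline: bool = False        # Python logic, multi-step or distributed graph
    needs_model_repository: bool = False       # standardized server ops, repository control


def candidate_runtimes(w: Workload) -> List[str]:
    """Return every option whose rule matches, in the order the list above uses."""
    rules = [
        (w.generation_throughput_first, "vLLM"),
        (w.needs_cross_team_api, "KServe"),
        (w.needs_python_pipeline, "Ray Serve"),
        (w.needs_model_repository, "Triton"),
    ]
    return [name for matched, name in rules if matched]


if __name__ == "__main__":
    # A workload that matches two rules gets two candidates to compare.
    w = Workload(needs_cross_team_api=True, needs_python_pipeline=True)
    print(candidate_runtimes(w))  # ['KServe', 'Ray Serve']
```

Returning a list rather than a single winner is deliberate: a workload that matches more than one rule needs a side-by-side comparison, not an automatic pick.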
Platform requirements around every option
- Gateway auth, quotas, and tenant routing.
- Immutable model and runtime versions.
- GPU-aware scheduling backed by a defined contract for the Kubernetes GPU node pool.
- Autoscaling signals beyond CPU utilization, such as queue wait or in-flight requests.
- SLO dashboards for time to first token (TTFT), output throughput, queue wait, and error classes.
- Clear rollback unit: model artifact, runtime image, serving config, or endpoint resource.
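As one example of an autoscaling signal beyond CPU, the sketch below applies the proportional rule used by the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × metric / target)) to a per-replica queue depth. The function name, the choice of queue depth as the metric, and the replica bounds are assumptions for illustration; the real HPA also applies a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    queue_depth_per_replica: float,
    target_queue_depth: float,
    min_replicas: int = 1,
    max_replicas: int = 8,
) -> int:
    """Proportional scaling on queue depth, clamped to [min_replicas, max_replicas]."""
    if target_queue_depth <= 0:
        raise ValueError("target_queue_depth must be positive")
    raw = math.ceil(current_replicas * queue_depth_per_replica / target_queue_depth)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 2 replicas, each with 12 queued requests, target of 4 per replica.
    print(desired_replicas(2, 12.0, 4.0))  # 6
    # Demand above the ceiling is clamped to max_replicas.
    print(desired_replicas(4, 20.0, 4.0))  # 8
```

A queue-based signal reacts to saturation on the GPU path, which CPU utilization of the serving pod often does not reflect.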
Failure modes
- Runtime is benchmarked in isolation but fails once gateway, auth, and streaming are in the request path.
- Model revision changes memory profile and breaks placement.
- Autoscaling increases replicas but capacity remains bottlenecked by a shared vector DB or reranker.
- One route mixes interactive and batch traffic, destroying tail latency.
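The second failure mode, a revision that changes the memory profile, can be caught with a rough pre-deployment estimate. The sketch below assumes a single GPU, a standard transformer KV cache (keys plus values for every layer), and a fixed headroom fraction; it ignores activations, fragmentation, and tensor parallelism. All names and the example model shapes are hypothetical.

```python
from dataclasses import dataclass

GIB = 2**30


@dataclass(frozen=True)
class ModelRevision:
    """Hypothetical model shape; values below are illustrative, not a real model."""
    params_billion: float
    bytes_per_param: int      # 2 for fp16/bf16 weights
    num_layers: int
    num_kv_heads: int
    head_dim: int
    kv_bytes_per_elem: int = 2


def weights_gib(m: ModelRevision) -> float:
    return m.params_billion * 1e9 * m.bytes_per_param / GIB


def kv_cache_gib(m: ModelRevision, tokens: int) -> float:
    # Factor 2 covers keys and values, stored per layer and per KV head.
    per_token = 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.kv_bytes_per_elem
    return per_token * tokens / GIB


def fits(m: ModelRevision, gpu_mem_gib: float, concurrent_tokens: int,
         headroom: float = 0.10) -> bool:
    """True if weights plus KV cache fit within GPU memory minus headroom."""
    budget = gpu_mem_gib * (1 - headroom)
    return weights_gib(m) + kv_cache_gib(m, concurrent_tokens) <= budget


if __name__ == "__main__":
    v1 = ModelRevision(8, 2, num_layers=32, num_kv_heads=8, head_dim=128)
    v2 = ModelRevision(8, 2, num_layers=32, num_kv_heads=32, head_dim=128)
    print(kv_cache_gib(v1, 65536))  # 8.0
    print(fits(v1, 40, 65536))      # True
    print(fits(v2, 40, 65536))      # False: 4x the KV heads, 4x the cache
```

In this example both revisions have the same parameter count, yet the second no longer fits the same GPU, which is why placement checks should run per revision rather than per model family.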