4 docs tagged with "inference"

Inference Scaling and Cost

Scaling signals, latency phases, and cost controls for LLM inference on Kubernetes.

Reference architecture for LLM inference on Kubernetes.

Challenge-style lab for running vLLM on Kubernetes, covering GPU scheduling, model caching, OpenAI-compatible serving, probes, metrics, and failure drills.

Production guide for running vLLM on Kubernetes, covering GPU scheduling, model cache strategy, runtime flags, probes, metrics, and failure modes.