Kubernetes + LLM Platform Guide

This site is for engineers who already understand Kubernetes basics and need to design, operate, or audit a production platform that runs AI workloads. K8sLLM is shorthand for a Kubernetes LLM guide aimed at senior platform engineers. It covers control planes, worker pools, policy, observability, GPU scheduling, model serving, RAG, and cost-aware operations.

How to learn

Start with Kubernetes Core to align on the control plane, workloads, networking, and storage model.
Use K8s LLM: Kubernetes LLM Platform Guide as the keyword-level map for AI infrastructure on Kubernetes.
Use Production Best Practices to turn primitives into operational baselines.
Use Platform Services to choose services by capability, not hype.
Learn LLM on Kubernetes as a capacity, latency, GPU scheduling, and cost problem.
Run Kubernetes LLM Labs to turn concepts into cluster exercises.
Use Reference Architectures as blueprints for design reviews.

Editorial principles

Every guide should explain the decision being made, the main failure modes, and the metrics that prove the system is healthy.
Official docs are the source of truth. Vendor blogs can support examples, but they should not override API behavior or security guarantees.
Every architecture diagram should make boundaries, data flow, ownership, and operating concerns visible.
The About K8sLLM page explains editorial intent, source policy, and review cadence.

Production platform overview

[Figure: Production Kubernetes cluster architecture]

A strong Kubernetes platform is more than a cluster that can run workloads. It needs a reliable control plane, separated worker pools, explicit policy, enough telemetry to debug incidents, and a delivery flow that can roll back safely.

Short review checklist

Question	Healthy signal
Who owns the cluster baseline?	The platform team owns policy, versioning, and change review.
Which metric drives workload scale?	CPU and memory are a baseline; queue depth, RPS, token latency, or business metrics are used when they are better signals.
How are security exceptions handled?	Exceptions have expiry, approval, audit trail, and policy-as-code coverage.
Does LLM serving have its own SLO?	TTFT, tokens/sec, queue wait, GPU memory, and cost/request are tracked by route class.
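
The last row of the checklist can be made concrete with a small sketch. The Python below groups per-request samples by route class and reports p95 TTFT, p95 queue wait, decode throughput in tokens/sec, and average cost per request. The `RequestSample` fields, the route class names, and the nearest-rank percentile are all assumptions made for illustration, not a schema from any serving stack. GPU memory is left out because it would normally come from a device-level exporter rather than from request logs.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One served request. Field names are hypothetical, not a real schema."""
    route_class: str      # e.g. "chat" or "batch" (illustrative labels)
    queue_wait_s: float   # time spent waiting before processing started
    ttft_s: float         # time to first token
    decode_s: float       # time spent generating after the first token
    output_tokens: int    # tokens produced for this request
    cost_usd: float       # cost attributed to this request


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_by_route_class(samples):
    """Return a dict of SLO-style indicators keyed by route class."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample.route_class].append(sample)

    summary = {}
    for route, items in grouped.items():
        total_tokens = sum(i.output_tokens for i in items)
        total_decode = sum(i.decode_s for i in items)
        summary[route] = {
            "requests": len(items),
            "ttft_p95_s": percentile([i.ttft_s for i in items], 95),
            "queue_wait_p95_s": percentile([i.queue_wait_s for i in items], 95),
            # Aggregate decode throughput; 0.0 when no decode time was recorded.
            "tokens_per_sec": total_tokens / total_decode if total_decode > 0 else 0.0,
            "cost_per_request_usd": sum(i.cost_usd for i in items) / len(items),
        }
    return summary


if __name__ == "__main__":
    demo = [
        RequestSample("chat", 0.05, 0.2, 2.0, 100, 0.01),
        RequestSample("chat", 0.10, 0.4, 6.0, 300, 0.03),
        RequestSample("batch", 1.50, 0.9, 10.0, 800, 0.05),
    ]
    for route, stats in summarize_by_route_class(demo).items():
        print(route, stats)
```

Keeping the summary per route class matters because interactive and batch traffic usually have different latency targets, and a single blended percentile can hide a regression in either one.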

Lab track

If you want a practical path, start with the vLLM inference lab, then run the production readiness lab. The labs are static runbooks designed for your own cluster, not a hosted sandbox.
