Kubernetes + LLM Platform Guide
This site is for engineers who already understand Kubernetes basics and need to design, operate, or audit a production platform with AI workloads. The focus is senior platform engineering: control planes, worker pools, policy, observability, GPU scheduling, model serving, RAG, and cost-aware operations.
How to learn
- Start with Kubernetes Core to align on the control plane, workloads, networking, and storage model.
- Use Production Best Practices to turn primitives into operational baselines.
- Use Platform Services to choose services by capability, not hype.
- Treat LLM On Kubernetes as a capacity, latency, GPU scheduling, and cost problem.
- Use Reference Architectures as blueprints for design reviews.
Editorial principles
- Every guide should explain the decision being made, the main failure modes, and the metrics that prove the system is healthy.
- Official docs are the source of truth. Vendor blogs can support examples, but they should not override API behavior or security guarantees.
- Every architecture diagram should make boundaries, data flow, ownership, and operating concerns visible.
Production platform overview
A strong Kubernetes platform is more than a cluster that can run workloads. It needs a reliable control plane, separated worker pools, explicit policy, enough telemetry to debug incidents, and a delivery flow that can roll back safely.
Short review checklist
| Question | Healthy signal |
|---|---|
| Who owns the cluster baseline? | The platform team owns policy, versioning, and change review. |
| Which metric drives workload scale? | CPU and memory are a baseline; queue depth, RPS, token latency, or business metrics are used when they are better signals. |
| How are security exceptions handled? | Every exception has an expiry date, a recorded approval, an audit trail, and policy-as-code coverage. |
| Does LLM serving have its own SLO? | Time to first token (TTFT), tokens/sec, queue wait, GPU memory, and cost/request are tracked per route class. |
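
The last row of the checklist can be made concrete with a small calculation. The sketch below is illustrative only: it assumes each request is recorded as a trace with four timestamps and an output token count, and it groups traces by route class to report p95 time to first token (TTFT), p95 queue wait, and mean output tokens/sec. The `RequestTrace` fields, the route names, and the target values are hypothetical, and the metric definitions (TTFT measured from enqueue, throughput measured from server pickup to last token) are one reasonable choice rather than a standard. GPU memory and cost/request are left out because they come from other data sources.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """One served request. All field names are hypothetical."""
    route_class: str       # e.g. "chat" or "batch"
    enqueued_at: float     # seconds: request accepted by the gateway
    started_at: float      # seconds: request picked up by a model server
    first_token_at: float  # seconds: first output token emitted
    finished_at: float     # seconds: last output token emitted
    output_tokens: int


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(traces):
    """Group traces by route class and compute per-route SLO signals."""
    by_route = defaultdict(list)
    for trace in traces:
        by_route[trace.route_class].append(trace)

    summary = {}
    for route, items in by_route.items():
        # Assumption: TTFT includes queue wait, so it is measured from enqueue.
        ttft = [t.first_token_at - t.enqueued_at for t in items]
        queue_wait = [t.started_at - t.enqueued_at for t in items]
        # Assumption: throughput is output tokens over server-side time.
        rates = [
            t.output_tokens / (t.finished_at - t.started_at)
            for t in items
            if t.finished_at > t.started_at
        ]
        summary[route] = {
            "ttft_p95_s": percentile(ttft, 95),
            "queue_wait_p95_s": percentile(queue_wait, 95),
            "tokens_per_sec_mean": sum(rates) / len(rates) if rates else 0.0,
        }
    return summary


def ttft_breaches(summary, ttft_targets_s):
    """Return route classes whose p95 TTFT exceeds its target, sorted by name."""
    return sorted(
        route
        for route, stats in summary.items()
        if route in ttft_targets_s and stats["ttft_p95_s"] > ttft_targets_s[route]
    )


traces = [
    RequestTrace("chat", 0.0, 0.25, 0.5, 2.25, 40),
    RequestTrace("chat", 1.0, 1.5, 2.0, 4.5, 60),
    RequestTrace("batch", 0.0, 2.0, 3.0, 12.0, 500),
]
summary = summarize(traces)
# Hypothetical targets: interactive routes get a tighter TTFT budget than batch.
breaches = ttft_breaches(summary, {"chat": 0.8, "batch": 5.0})
```

Tracking these numbers per route class matters because an interactive route and a batch route tolerate very different TTFT and queue wait; a single cluster-wide average would hide a breach on the interactive path. In a real platform the same aggregation would more likely run in the metrics backend than in application code.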