Kubernetes + LLM Platform Guide

This site is for engineers who already understand Kubernetes basics and need to design, operate, or audit a production platform with AI workloads. The focus is senior platform engineering: control planes, worker pools, policy, observability, GPU scheduling, model serving, RAG, and cost-aware operations.

How to learn

  1. Start with Kubernetes Core to align on the control plane, workloads, networking, and storage model.
  2. Use Production Best Practices to turn primitives into operational baselines.
  3. Use Platform Services to choose services by capability, not hype.
  4. Approach LLM On Kubernetes as a capacity, latency, GPU scheduling, and cost problem.
  5. Use Reference Architectures as blueprints for design reviews.

Editorial principles

  • Every guide should explain the decision being made, the main failure modes, and the metrics that prove the system is healthy.
  • Official docs are the source of truth. Vendor blogs can support examples, but they should not override API behavior or security guarantees.
  • Every architecture diagram should make boundaries, data flow, ownership, and operating concerns visible.

Production platform overview

[Figure: Production Kubernetes cluster architecture]

A strong Kubernetes platform is more than a cluster that can run workloads. It needs a reliable control plane, separated worker pools, explicit policy, enough telemetry to debug incidents, and a delivery flow that can roll back safely.

Short review checklist

Question | Healthy signal
Who owns the cluster baseline? | The platform team owns policy, versioning, and change review.
Which metric drives workload scale? | CPU and memory are a baseline; queue depth, RPS, token latency, or business metrics are used when they are better signals.
How are security exceptions handled? | Exceptions have expiry, approval, audit trail, and policy-as-code coverage.
Does LLM serving have its own SLO? | TTFT, tokens/sec, queue wait, GPU memory, and cost/request are tracked by route class.
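
The last row of the checklist can be made concrete with a minimal sketch of how per-route-class serving signals might be derived from request timestamps. Everything named here is an assumption for illustration: the `RequestRecord` fields, the `summarize_by_route_class` helper, and the choice to measure TTFT from enqueue time and tokens/sec over the whole processing window. GPU memory is left out because it comes from device telemetry, not from request records.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    """One completed LLM request. All field names are hypothetical."""
    route_class: str       # e.g. "chat" or "batch"
    enqueued_at: float     # seconds: request entered the serving queue
    started_at: float      # seconds: model server began processing
    first_token_at: float  # seconds: first output token was emitted
    finished_at: float     # seconds: last output token was emitted
    output_tokens: int
    cost_usd: float


def summarize_by_route_class(records):
    """Aggregate queue wait, TTFT, tokens/sec, and cost/request per route class."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.route_class].append(record)

    summary = {}
    for route_class, rs in grouped.items():
        total_tokens = sum(r.output_tokens for r in rs)
        total_busy_s = sum(r.finished_at - r.started_at for r in rs)
        summary[route_class] = {
            # Time spent waiting before the server picked the request up.
            "queue_wait_s": mean([r.started_at - r.enqueued_at for r in rs]),
            # Assumption: TTFT is measured from enqueue, so it includes queue wait.
            "ttft_s": mean([r.first_token_at - r.enqueued_at for r in rs]),
            # Assumption: throughput is output tokens over total processing time.
            "tokens_per_s": total_tokens / total_busy_s if total_busy_s > 0 else 0.0,
            "cost_per_request_usd": mean([r.cost_usd for r in rs]),
        }
    return summary
```

In practice these values would usually be tracked as percentiles (for example p95 TTFT) rather than means; the mean is used here only to keep the sketch short.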