LLM Inference Stack

[Figure: LLM inference stack on Kubernetes architecture]

Intent

Separate gateway behavior, autoscaling, model runtime, GPU capacity, and telemetry into distinct layers, so that each layer can be tuned without changing the rest of the platform.

Key decisions

  • Gateway handles auth, quota, and route policy.
  • Runtime handles batching, KV cache, streaming, and model-specific serving.
  • GPU pool is isolated with taints, labels, and compatible node profiles.
  • Autoscaling uses inference-specific signals (for example, queue depth or in-flight requests) where possible, rather than CPU utilization alone.
  • Telemetry reports both user latency and GPU economics.
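As an illustration of scaling on an inference-specific signal, the sketch below applies the replica formula used by the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current replicas × current value / target value), to a queue-depth metric. The metric (queued requests per replica), the function name, and the replica bounds are assumptions made for this example; the 10% tolerance mirrors the HPA default. It is a minimal sketch, not a description of any particular autoscaler configuration.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=8, tolerance=0.1):
    """Hypothetical helper: HPA-style replica count from a per-replica signal.

    current_value and target_value are assumed to be "queued requests per
    replica"; any other per-replica inference signal would work the same way.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")

    ratio = current_value / target_value

    # Skip scaling when the signal is within the tolerance band around target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured bounds (assumed values for this sketch).
    return max(min_replicas, min(max_replicas, desired))
```

For example, `desired_replicas(2, 25, 10)` asks for 5 replicas, while a reading of 10.5 against a target of 10 stays inside the tolerance band and leaves the replica count unchanged.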

Review signals

  • Interactive and batch traffic do not share the same queue by default.
  • Model loading time is measured and budgeted.
  • Runtime version and model version are visible in telemetry.
  • Cost per request can be calculated by model and tenant.
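The last review signal can be checked with a small aggregation. The sketch below assumes telemetry yields one record per request carrying the model, the tenant, and the GPU seconds attributed to that request, and that GPU capacity has a flat hourly price. The record fields, the function name, and the pricing model are all hypothetical, and idle GPU time is not allocated here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """Hypothetical telemetry record for a single inference request."""
    model: str
    tenant: str
    gpu_seconds: float


def cost_per_request(records, gpu_hourly_rate_usd):
    """Return the average USD cost per request, keyed by (model, tenant)."""
    # Each entry holds [total GPU seconds, request count].
    totals = defaultdict(lambda: [0.0, 0])
    for record in records:
        entry = totals[(record.model, record.tenant)]
        entry[0] += record.gpu_seconds
        entry[1] += 1

    return {
        key: (gpu_seconds / 3600.0) * gpu_hourly_rate_usd / count
        for key, (gpu_seconds, count) in totals.items()
    }
```

If the records lack either the model or the tenant field, this grouping cannot be computed, which is a quick way to tell whether the telemetry layer meets the review signal.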