
Scaling And Autoscaling

Scaling is a control-loop design problem. You need to decide which layer scales, which metric drives it, what delay is acceptable, and how much overprovisioning protects the user experience.

Scaling layers

| Layer | Tooling | Use when |
| --- | --- | --- |
| Pod replicas | HPA | Request load changes and pods can scale horizontally. |
| Pod resources | VPA | Requests are unknown or workloads are vertically bounded. Avoid blind use with latency-critical services. |
| Nodes | Cluster Autoscaler or Karpenter | Pending pods need more node capacity. |
| Event-driven | KEDA | Queue length, stream lag, or custom external metrics drive demand. |
| GPU serving | Runtime-specific autoscaling | Queue wait, tokens/sec, GPU memory, and batch efficiency matter more than CPU. |
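
To make the pod-replica layer concrete, here is a minimal sketch of the replica calculation described in the Kubernetes HPA documentation: the desired count is the current count scaled by the ratio of observed to target metric, rounded up, with no change when the ratio is within a tolerance of 1.0. The function name, parameter names, and the replica bounds are illustrative assumptions. The sketch deliberately ignores stabilization windows, scaling policies, and pods that are not ready or have missing metrics.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,   # 0.1 is the documented HPA default tolerance
    min_replicas: int = 1,    # hypothetical bounds for this example
    max_replicas: int = 10,
) -> int:
    """Simplified HPA-style replica calculation (illustrative only)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Within tolerance of the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 replicas at 90% average utilization against a 60% target.
    print(desired_replicas(4, 90, 60))
```

The ratio form shows why the choice of metric matters: if the metric does not move with the real bottleneck, the ratio stays near 1.0 and the loop never reacts.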

Decision table

| Situation | Preferred signal |
| --- | --- |
| Web API | RPS, p95 latency, CPU only as secondary. |
| Worker queue | Queue depth, age of oldest message, processing rate. |
| LLM inference | Time to first token, queue wait, active sequences, GPU memory. |
| Batch jobs | Deadline, throughput, and cost window. |
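
As an illustration of the worker-queue row, the following sketch turns queue depth and processing rate into a replica count. It assumes a simple model that is not taken from any specific autoscaler: workers must keep up with the arrival rate and also drain the existing backlog within a chosen time. All names and the model itself are assumptions for the example.

```python
import math


def workers_for_queue(
    queue_depth: int,
    arrival_rate: float,      # messages per second entering the queue
    per_worker_rate: float,   # messages per second one worker processes
    drain_seconds: float,     # acceptable time to clear the current backlog
) -> int:
    """Estimate worker count from queue signals (illustrative model only)."""
    if per_worker_rate <= 0 or drain_seconds <= 0:
        raise ValueError("per_worker_rate and drain_seconds must be positive")

    # Throughput needed to absorb new messages and clear the backlog in time.
    required_rate = arrival_rate + queue_depth / drain_seconds

    # Always keep at least one worker in this sketch.
    return max(1, math.ceil(required_rate / per_worker_rate))


if __name__ == "__main__":
    # 6000 queued messages, 50 msg/s arriving, 10 msg/s per worker,
    # backlog should clear within 5 minutes.
    print(workers_for_queue(6000, 50, 10, 300))
```

Choosing `drain_seconds` is where the acceptable delay becomes an explicit parameter rather than a side effect of CPU thresholds.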

Guardrails

  • Set resource requests before enabling HPA. HPA computes CPU and memory utilization as a percentage of the request, so without requests the utilization ratios are meaningless.
  • Use load tests to validate scale-up time, not only steady-state capacity.
  • Keep headroom for node provisioning delay. Autoscaling is not instant.
  • Use separate node pools for system, general app, and GPU workloads.
  • Use PodDisruptionBudgets carefully. A budget that is too strict can block node rotation; one that is too loose can allow an outage during maintenance.
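
The headroom guardrail can be estimated with simple arithmetic. The sketch below assumes load grows linearly during a ramp and that new capacity only becomes useful after node provisioning plus pod startup. The function name, the parameters, and the linear-growth model are assumptions for illustration; real values should come from load tests.

```python
import math


def headroom_replicas(
    growth_rps_per_second: float,  # how fast request rate rises during a ramp
    provisioning_delay_s: float,   # time for a new node to become schedulable
    pod_startup_s: float,          # time for a pod to become ready on that node
    rps_per_replica: float,        # sustainable request rate of one replica
) -> int:
    """Spare replicas needed to absorb growth until new capacity is ready."""
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")

    # Total time during which only existing capacity can serve traffic.
    lag_s = provisioning_delay_s + pod_startup_s

    # Extra load that arrives before the first new replica can help.
    extra_rps = growth_rps_per_second * lag_s

    return math.ceil(extra_rps / rps_per_replica)


if __name__ == "__main__":
    # Load rising by 2 RPS every second, 120 s node provisioning,
    # 30 s pod startup, 100 RPS per replica.
    print(headroom_replicas(2.0, 120, 30, 100))
```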

Failure modes

  • HPA scales on CPU while the real bottleneck is I/O, locks, the database connection pool, or token generation.
  • Cluster Autoscaler cannot add nodes because pod constraints do not match any node group.
  • GPU node pools scale slowly, and cold-start model loading dominates user latency.
  • Too many replicas overload downstream dependencies.
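
The last failure mode suggests deriving the replica ceiling from downstream limits rather than picking it arbitrarily. As a minimal sketch, assume the constraining dependency is a database with a fixed connection limit and that each replica opens a fixed-size connection pool. The names and the single-dependency model are assumptions for illustration.

```python
def max_safe_replicas(
    db_max_connections: int,      # connection limit of the database
    reserved_connections: int,    # kept free for admin, migrations, other apps
    pool_size_per_replica: int,   # connections each replica may open
) -> int:
    """Upper bound on replicas before the database runs out of connections."""
    if pool_size_per_replica <= 0:
        raise ValueError("pool_size_per_replica must be positive")

    usable = db_max_connections - reserved_connections

    # Floor division: a partial pool does not fit. Never return a negative count.
    return max(0, usable // pool_size_per_replica)


if __name__ == "__main__":
    # 500 connections total, 50 reserved, 20 per replica.
    print(max_safe_replicas(500, 50, 20))
```

A bound computed this way is a candidate for the autoscaler's maximum replica setting, so that scale-out stops before the dependency does.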