
Scaling and Autoscaling

Scaling is a control-loop design problem. You need to decide which layer scales, which metric drives it, what delay is acceptable, and how much overprovisioning protects the user experience.

Scaling layers

Layer	Tooling	Use when
Pod replicas	HPA	Request load changes and pods can scale horizontally.
Pod resources	VPA	Resource requests are unknown or the workload can only scale vertically. Avoid blind use with latency-critical services.
Nodes	Cluster Autoscaler or Karpenter	Pending pods need more node capacity.
Event-driven	KEDA	Queue length, stream lag, or custom external metrics drive demand.
GPU serving	Runtime-specific autoscaling	Queue wait, tokens/sec, GPU memory, and batch efficiency matter more than CPU.
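
As a rough illustration of the pod-replica layer, the sketch below follows the replica calculation described in the Kubernetes HPA documentation: scale the current replica count by the ratio of observed to target metric, round up, and skip changes inside a tolerance band. The function name, the default bounds, and the sample numbers are assumptions for illustration, and the real controller also applies stabilization windows and scaling policies that are not modeled here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Return the replica count an HPA-style controller would aim for.

    All parameter names and defaults are illustrative assumptions.
    """
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured minimum and maximum.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(4, current_metric=90, target_metric=60))   # 6
print(desired_replicas(4, current_metric=63, target_metric=60))   # 4 (within tolerance)
print(desired_replicas(8, current_metric=120, target_metric=60))  # 10 (clamped to max)
```

The same ratio applies whether the metric is CPU utilization, RPS per pod, or a custom value, which is why choosing the metric matters more than the arithmetic.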

Decision table

Situation	Preferred signal
Web API	RPS and p95 latency; CPU only as a secondary signal.
Worker queue	Queue depth, age of oldest message, processing rate.
LLM inference	Time to first token, queue wait, active sequences, GPU memory.
Batch jobs	Deadline, throughput, and cost window.
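
For the worker-queue row, one way to turn queue depth and processing rate into a replica count is sketched below. It assumes a steady arrival rate, a fixed per-replica processing rate, and a target time to drain the existing backlog; the function and parameter names are hypothetical and not tied to any particular autoscaler.

```python
import math


def workers_needed(backlog, arrival_rate, per_worker_rate, drain_seconds):
    """Estimate workers needed to absorb new messages and drain the backlog.

    backlog: messages currently queued
    arrival_rate: new messages per second
    per_worker_rate: messages one worker processes per second
    drain_seconds: how quickly the existing backlog should be cleared
    """
    # Throughput must cover incoming traffic plus the backlog spread over the drain window.
    required_rate = arrival_rate + backlog / drain_seconds
    return math.ceil(required_rate / per_worker_rate)


# 6000 queued messages, 50 msg/s arriving, 20 msg/s per worker, drain in 5 minutes.
print(workers_needed(6000, 50, 20, 300))  # 4
```

Scaling on CPU alone would miss the backlog term entirely, which is why queue depth and the age of the oldest message are the preferred signals here.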

Guardrails

Set resource requests before enabling HPA. HPA computes utilization as a percentage of the requested amount, so without requests the CPU and memory utilization targets cannot be evaluated.
Use load tests to validate scale-up time, not only steady-state capacity.
Keep headroom for node provisioning delay. Autoscaling is not instant.
Use separate node pools for system, general app, and GPU workloads.
Use PodDisruptionBudgets carefully. A budget that is too strict can block node rotation; one that is too loose can cause an outage during maintenance.
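
The headroom guardrail can be estimated with simple arithmetic. The sketch below assumes traffic grows roughly linearly during a ramp and that new capacity only becomes useful after node provisioning plus pod startup; all names and figures are illustrative assumptions, and real values should come from load tests.

```python
import math


def headroom_replicas(growth_rps_per_second, scale_up_delay_seconds, rps_per_replica):
    """Spare replicas needed to absorb load that arrives before new capacity is ready."""
    # Extra load that accumulates while waiting for nodes and pods to come up.
    extra_rps = growth_rps_per_second * scale_up_delay_seconds
    return math.ceil(extra_rps / rps_per_replica)


# Traffic ramps by 2 RPS every second, scale-up takes 180 s, each replica serves 50 RPS.
print(headroom_replicas(2, 180, 50))  # 8
```

If the measured scale-up delay shrinks, for example through pre-pulled images or warm nodes, the required headroom shrinks proportionally.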

Failure modes

HPA scales on CPU while the real bottleneck is I/O, locks, the database connection pool, or token generation.
Cluster Autoscaler cannot add nodes because pod constraints do not match any node group.
GPU node pools scale slowly, and cold-start model loading dominates user latency.
Too many replicas overload downstream dependencies.
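
The last failure mode can often be caught with a back-of-the-envelope cap on the autoscaler's maximum replicas. The sketch below assumes each replica opens a fixed-size connection pool against a database with a hard connection limit; the names and numbers are hypothetical.

```python
def max_safe_replicas(db_max_connections, reserved_connections, pool_size_per_replica):
    """Largest replica count whose combined pools fit within the database limit."""
    usable = db_max_connections - reserved_connections
    # Integer division: a partially funded pool does not count as a safe replica.
    return max(0, usable // pool_size_per_replica)


# 500 connections allowed, 50 reserved for admin and migrations, 20 per replica.
print(max_safe_replicas(500, 50, 20))  # 22
```

Setting the autoscaler's maximum at or below this value keeps a scale-out event from exhausting the dependency it relies on.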
