Production Scaling Decision Guide
Scaling decisions should start from demand shape, not tool preference. Decide what changes, what signal proves demand, and how long the platform may take to react.
Decision matrix
| Demand shape | Primary mechanism | Primary signal |
|---|---|---|
| User-facing HTTP traffic | HPA | RPS and latency, with CPU saturation as a secondary signal. |
| Queue-backed workers | KEDA or HPA with external metrics | Queue depth, oldest message age, processing rate. |
| Unknown resource requests | VPA in recommendation mode first | Historical CPU and memory usage. |
| Pending pods due to capacity | Cluster Autoscaler or Karpenter | Unschedulable pods and node provisioning latency. |
| GPU inference | Runtime-aware autoscaling | Queue wait, active sequences, GPU memory, time to first token (TTFT). |
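To make the "primary signal" column concrete, here is a minimal sketch of the proportional rule that HPA documents for turning a metric into a replica count: scale the current replica count by the ratio of observed to target metric, round up, and skip changes inside a tolerance band. The function name, the default bounds, and the sample numbers are illustrative assumptions. The real controller also applies stabilization windows, readiness handling, and scaling policies that this sketch leaves out.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,   # assumed tolerance band; check your cluster's setting
    min_replicas: int = 1,    # hypothetical bounds for illustration
    max_replicas: int = 100,
) -> int:
    """Approximate the HPA proportional rule for a per-replica average metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # HTTP service: 4 replicas, 150 RPS per replica observed, 100 RPS target.
    print(desired_replicas(4, 150.0, 100.0))
    # Queue worker: 3 replicas, 300 messages per replica, target of 100 each.
    print(desired_replicas(3, 300.0, 100.0))
```

The same rule applies whether the metric is RPS per pod or queue depth per worker, which is why choosing the signal matters more than choosing the tool.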
Recommended sequence
- Fix requests, limits, probes, and rollout strategy.
- Load test the workload at expected traffic levels and beyond them, up to the level where it starts to fail.
- Choose a scale metric tied to user impact.
- Add node autoscaling only after pod placement constraints are realistic.
- Validate scale-up and scale-down behavior with real delay budgets.
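The last step above can be checked with simple arithmetic before any load test. The sketch below adds up the stages of a scale-up path and compares the total against a delay budget. The stage names, the `ScaleUpPath` type, and every number are hypothetical placeholders. Measure your own values rather than reusing these.

```python
from dataclasses import dataclass


@dataclass
class ScaleUpPath:
    """Measured (or estimated) delays, in seconds, along one scale-up path."""
    metric_lag_s: float        # time for the signal to reach the autoscaler
    autoscaler_sync_s: float   # autoscaler evaluation interval
    node_provision_s: float    # time to get a new node ready, if one is needed
    pod_start_s: float         # image pull, startup, and readiness

    def total_s(self, needs_new_node: bool) -> float:
        total = self.metric_lag_s + self.autoscaler_sync_s + self.pod_start_s
        if needs_new_node:
            total += self.node_provision_s
        return total


def within_budget(path: ScaleUpPath, budget_s: float, needs_new_node: bool) -> bool:
    """True if the whole path reacts within the allowed delay budget."""
    return path.total_s(needs_new_node) <= budget_s


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    path = ScaleUpPath(metric_lag_s=30, autoscaler_sync_s=15,
                       node_provision_s=120, pod_start_s=45)
    for needs_node in (False, True):
        print(needs_node, path.total_s(needs_node),
              within_budget(path, budget_s=120, needs_new_node=needs_node))
```

Evaluating both cases separately shows whether the budget only holds when spare node capacity already exists.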
Capacity buffers
Do not rely on just-in-time scale-up for every class of traffic. Keep explicit buffers for:
- Critical ingress and DNS infrastructure.
- High-value interactive services.
- GPU model servers with long model-load time.
- Batch windows with deadline commitments.
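One rough way to size such a buffer, assuming demand ramps roughly linearly while the platform is still reacting: the buffer must cover the traffic increase that arrives before new capacity is ready. The function name, the linear-ramp model, and the numbers below are assumptions for illustration. Bursty or step-shaped demand needs a different model.

```python
import math


def buffer_replicas(ramp_rps_per_s: float, reaction_time_s: float,
                    rps_per_replica: float) -> int:
    """Spare replicas needed to absorb a linear ramp during the reaction window."""
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    if ramp_rps_per_s < 0 or reaction_time_s < 0:
        raise ValueError("ramp and reaction time must be non-negative")
    # Extra demand present by the time newly scaled capacity becomes ready.
    extra_rps = ramp_rps_per_s * reaction_time_s
    return math.ceil(extra_rps / rps_per_replica)


if __name__ == "__main__":
    # Hypothetical: demand grows 5 RPS every second, the platform needs 210 s
    # to add ready capacity, and each replica sustains 100 RPS.
    print(buffer_replicas(5.0, 210.0, 100.0))
```

A long reaction time, such as a GPU model server with slow model loading, drives the buffer up directly, which is why those workloads appear in the list above.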
Failure review questions
- Did replicas increase but latency stay high?
- Did nodes scale but pods remain pending?
- Did scale-up overload a shared dependency?
- Did scale-down evict useful warm capacity?
- Did the autoscaler optimize infrastructure metrics while the user-facing SLO kept burning error budget?
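The last question can be turned into a check during review. The sketch below uses the common burn-rate definition (observed error ratio divided by the error budget, where the budget is one minus the SLO target) and flags windows where the budget burned faster than allowed while the scaling metric sat at or below its target. The function names, the threshold, and the sample values are hypothetical.

```python
def burn_rate(bad_requests: int, total_requests: int, slo_target: float) -> float:
    """Error-budget burn rate: 1.0 means the budget is spent exactly on schedule."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_requests / total_requests) / error_budget


def autoscaler_missed_user_impact(burn: float, scale_metric_value: float,
                                  scale_metric_target: float,
                                  burn_threshold: float = 1.0) -> bool:
    """True when the SLO burned while the scaling metric looked healthy."""
    return burn > burn_threshold and scale_metric_value <= scale_metric_target


if __name__ == "__main__":
    # Hypothetical window: 50 bad requests out of 10,000 against a 99.9% SLO,
    # while CPU utilization sat at 55% against a 70% scaling target.
    rate = burn_rate(50, 10_000, 0.999)
    print(round(rate, 2), autoscaler_missed_user_impact(rate, 0.55, 0.70))
```

A positive result suggests the scale metric is not tied to user impact and should be revisited.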