Production Scaling Decision Guide

Scaling decisions should start from demand shape, not tool preference. Decide what changes, what signal proves demand, and how long the platform may take to react.

Decision matrix

Demand shape                  Primary mechanism                  Primary signal
User-facing HTTP traffic      HPA                                RPS, latency, CPU saturation as secondary
Queue-backed workers          KEDA or HPA with external metrics  Queue depth, oldest message age, processing rate
Unknown resource requests     VPA in recommendation mode first   Historical CPU and memory usage
Pending pods due to capacity  Cluster Autoscaler or Karpenter    Unschedulable pods and node provisioning latency
GPU inference                 Runtime-aware autoscaling          Queue wait, active sequences, GPU memory, TTFT
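
As a rough illustration of how the first two rows turn signals into replica counts, the sketch below applies the proportional rule documented for the Kubernetes HPA (desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance that defaults to 0.1) and a simple backlog-drain estimate for queue workers. The function names, the drain-target parameter, and the queue formula are assumptions made for illustration; the HPA and KEDA compute their own values from whatever metrics they are configured with.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_value: float,
                         target_value: float, tolerance: float = 0.1) -> int:
    """Proportional rule: ceil(current_replicas * current_value / target_value)."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return max(1, math.ceil(current_replicas * ratio))


def queue_worker_replicas(queue_depth: int, arrival_rate_per_s: float,
                          per_worker_rate_per_s: float, drain_target_s: float,
                          max_replicas: int) -> int:
    """Hypothetical estimate: keep up with arrivals and drain the backlog
    within drain_target_s, capped at max_replicas."""
    if per_worker_rate_per_s <= 0 or drain_target_s <= 0:
        raise ValueError("rates and drain target must be positive")
    required_rate = arrival_rate_per_s + queue_depth / drain_target_s
    needed = math.ceil(required_rate / per_worker_rate_per_s)
    return min(max_replicas, max(1, needed))


if __name__ == "__main__":
    # 4 replicas at 150 RPS each against a 100 RPS-per-replica target.
    print(hpa_desired_replicas(4, 150.0, 100.0))
    # 6000 queued messages, 50 msg/s arriving, 20 msg/s per worker, 2 minute drain.
    print(queue_worker_replicas(6000, 50.0, 20.0, 120.0, max_replicas=50))
```

The queue estimate deliberately uses both depth and processing rate, since depth alone says nothing about how quickly the backlog can clear.
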

Work through these steps in order:

  1. Fix requests, limits, probes, and rollout strategy.
  2. Load test the workload at expected and failure traffic levels.
  3. Choose a scale metric tied to user impact.
  4. Add node autoscaling only after pod placement constraints are realistic.
  5. Validate scale-up and scale-down behavior with real delay budgets.
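
One way to make the delay budget in step 5 concrete is to add up the stages of a scale-up and compare the total with the time the service can tolerate. The following is a minimal sketch under assumed stage names and example numbers; none of the fields correspond to a real autoscaler API, and the values should come from measurements taken during load tests.

```python
from dataclasses import dataclass


@dataclass
class ScaleUpDelays:
    """Hypothetical breakdown of scale-up reaction time, in seconds."""
    metric_lag_s: float        # time until the scaling signal reflects new demand
    autoscaler_sync_s: float   # time until the autoscaler acts on the signal
    node_provision_s: float    # time to add a node when none has room
    pod_start_s: float         # image pull, startup, and readiness probe

    def total_s(self, needs_new_node: bool) -> float:
        # Node provisioning only counts when the new pod cannot be placed.
        total = self.metric_lag_s + self.autoscaler_sync_s + self.pod_start_s
        if needs_new_node:
            total += self.node_provision_s
        return total


def within_budget(delays: ScaleUpDelays, budget_s: float,
                  needs_new_node: bool) -> bool:
    """True when the summed reaction time fits the allowed delay budget."""
    return delays.total_s(needs_new_node) <= budget_s


if __name__ == "__main__":
    measured = ScaleUpDelays(metric_lag_s=30, autoscaler_sync_s=15,
                             node_provision_s=120, pod_start_s=45)
    print(within_budget(measured, budget_s=120, needs_new_node=False))
    print(within_budget(measured, budget_s=120, needs_new_node=True))
```

Checking both paths separately matters because a workload can meet its budget on existing nodes and miss it as soon as a new node is required.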

Capacity buffers

Do not rely on just-in-time scale-up for every class of traffic. Keep explicit buffers for:

  • Critical ingress and DNS infrastructure.
  • High-value interactive services.
  • GPU model servers with long model-load time.
  • Batch windows with deadline commitments.
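
A buffer can be sized from how fast demand may grow while scale-up is still in flight. The sketch below is one possible back-of-the-envelope estimate, assuming linear traffic growth during the reaction window; the function name, parameters, and numbers are hypothetical.

```python
import math


def buffer_replicas(peak_growth_rps_per_s: float, reaction_time_s: float,
                    rps_per_replica: float) -> int:
    """Replicas to keep warm so traffic arriving during scale-up is still served.

    Assumes demand grows linearly at peak_growth_rps_per_s for the whole
    reaction window, which is a simplification.
    """
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    if peak_growth_rps_per_s < 0 or reaction_time_s < 0:
        raise ValueError("growth rate and reaction time must be non-negative")
    # Extra load that accumulates before new capacity becomes ready.
    extra_rps = peak_growth_rps_per_s * reaction_time_s
    return math.ceil(extra_rps / rps_per_replica)


if __name__ == "__main__":
    # Demand rising by 2 RPS every second, 210 s to get new capacity ready,
    # 50 RPS handled per replica.
    print(buffer_replicas(2.0, 210.0, 50.0))
```

Long reaction times, such as GPU model loading, inflate the buffer directly, which is why those workloads appear in the list above.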

Failure review questions

  • Did replicas increase but latency stay high?
  • Did nodes scale but pods remain pending?
  • Did scale-up overload a shared dependency?
  • Did scale-down evict useful warm capacity?
  • Did the autoscaler optimize infrastructure metrics while the user-facing SLO kept burning error budget?
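
The review questions can be turned into a repeatable check against incident data. The following sketch is illustrative only: the observation fields are hypothetical and would need to be filled in from whatever monitoring the team already has.

```python
from dataclasses import dataclass


@dataclass
class IncidentObservations:
    """Hypothetical yes/no facts gathered during a scaling incident review."""
    replicas_increased: bool
    latency_recovered: bool
    nodes_added: bool
    pods_still_pending: bool
    dependency_overloaded: bool
    warm_capacity_evicted: bool
    infra_metrics_healthy: bool
    slo_burning: bool


def review_findings(obs: IncidentObservations) -> list[str]:
    """Return the review questions whose answer is 'yes' for this incident."""
    findings = []
    if obs.replicas_increased and not obs.latency_recovered:
        findings.append("Replicas increased but latency stayed high")
    if obs.nodes_added and obs.pods_still_pending:
        findings.append("Nodes scaled but pods remained pending")
    if obs.replicas_increased and obs.dependency_overloaded:
        findings.append("Scale-up overloaded a shared dependency")
    if obs.warm_capacity_evicted:
        findings.append("Scale-down evicted useful warm capacity")
    if obs.infra_metrics_healthy and obs.slo_burning:
        findings.append("Infrastructure metrics looked fine while the SLO burned")
    return findings


if __name__ == "__main__":
    obs = IncidentObservations(
        replicas_increased=True, latency_recovered=False,
        nodes_added=True, pods_still_pending=False,
        dependency_overloaded=True, warm_capacity_evicted=False,
        infra_metrics_healthy=True, slo_burning=True,
    )
    for finding in review_findings(obs):
        print(finding)
```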