
Inference Scaling And Cost

LLM serving cannot be scaled on CPU utilization alone. Users experience latency as a sequence of phases: routing, queue wait, prefill, decoding, and stream completion. Time to first token spans everything up to and including prefill.
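The phases above can be made concrete with a small per-request breakdown. This is a minimal sketch, assuming the gateway and model server record timestamps on a shared monotonic clock; `RequestTimeline` and its field names are hypothetical and do not come from any particular serving runtime.

```python
from dataclasses import dataclass


@dataclass
class RequestTimeline:
    # Hypothetical fields; all timestamps are seconds on one monotonic clock.
    received: float       # gateway accepted the request
    enqueued: float       # routing done, request placed in a model-server queue
    prefill_start: float  # scheduler admitted the request, prefill begins
    first_token: float    # first output token streamed to the client
    finished: float       # last token streamed, stream closed
    output_tokens: int    # total tokens generated for this request


def latency_breakdown(t: RequestTimeline) -> dict:
    """Split one request's wall-clock time into the phases a user experiences."""
    decode_s = t.finished - t.first_token
    # Tokens after the first one are produced during the decode phase.
    decoded = max(t.output_tokens - 1, 0)
    return {
        "routing_s": t.enqueued - t.received,
        "queue_wait_s": t.prefill_start - t.enqueued,
        "prefill_s": t.first_token - t.prefill_start,
        "ttft_s": t.first_token - t.received,
        "decode_s": decode_s,
        "decode_tokens_per_s": decoded / decode_s if decode_s > 0 else 0.0,
    }
```

Reporting queue wait separately from prefill helps distinguish a capacity shortage from long prompts, since both show up as a higher time to first token.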

Important metrics

Metric              | Why it matters
Time to first token | Primary interactive latency signal.
Tokens per second   | Throughput after generation starts.
Queue wait          | Shows insufficient serving capacity or batching pressure.
Active sequences    | Runtime-level pressure indicator.
KV cache usage      | Memory pressure and batching efficiency.
GPU memory          | Capacity limit for model, context, and parallel requests.
Cost per request    | Business-level unit economics.
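Cost per request is the one metric here that usually has to be derived rather than scraped. One simple way to approximate it is to amortize the GPU spend of a deployment over the requests it completed in the same window. The sketch below assumes a flat hourly price per GPU and ignores gateway, storage, and network costs; the function names and all example figures are illustrative.

```python
def gpu_cost_usd(gpu_hourly_usd: float, gpu_count: int, window_s: float) -> float:
    """GPU spend for a deployment over a time window, at a flat hourly price."""
    return gpu_hourly_usd * gpu_count * (window_s / 3600.0)


def cost_per_request(gpu_hourly_usd: float, gpu_count: int,
                     window_s: float, requests_completed: int) -> float:
    """Amortized GPU cost per completed request in the window."""
    if requests_completed <= 0:
        raise ValueError("requests_completed must be positive")
    return gpu_cost_usd(gpu_hourly_usd, gpu_count, window_s) / requests_completed


def cost_per_1k_tokens(gpu_hourly_usd: float, gpu_count: int,
                       window_s: float, tokens_processed: int) -> float:
    """Amortized GPU cost per 1,000 tokens (prompt plus output) in the window."""
    if tokens_processed <= 0:
        raise ValueError("tokens_processed must be positive")
    total = gpu_cost_usd(gpu_hourly_usd, gpu_count, window_s)
    return total / tokens_processed * 1000.0
```

Because idle GPUs still cost money, this figure rises when utilization drops, which is why it is worth tracking per deployment rather than only fleet-wide.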

Scaling patterns

  • Scale gateway separately from model servers.
  • Split interactive and batch traffic into different routes or deployments.
  • Keep warm capacity for high-value interactive models.
  • Use queue depth and runtime metrics for autoscaling where possible.
  • Track model load and readiness separately from pod start.
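Queue-based autoscaling with a warm floor can be sketched as a pure function. This is an illustration of the idea, not the behavior of any specific autoscaler: the parameter names are hypothetical, and the per-replica queue target is assumed to be found by load testing.

```python
import math


def desired_replicas(queued_requests: int, target_queue_per_replica: int,
                     min_warm: int, max_replicas: int) -> int:
    """Replica count that brings queue depth per replica down to the target.

    min_warm is the warm-capacity floor kept even when the queue is empty;
    max_replicas caps the result at the size of the GPU pool.
    """
    if target_queue_per_replica <= 0:
        raise ValueError("target_queue_per_replica must be positive")
    needed = math.ceil(queued_requests / target_queue_per_replica)
    return max(min_warm, min(needed, max_replicas))


def ready_count(pods: list[dict]) -> int:
    """Count pods that have finished loading the model, not just started.

    Each pod is a dict with hypothetical boolean keys 'started' and 'model_loaded'.
    """
    return sum(1 for p in pods if p.get("started") and p.get("model_loaded"))
```

Comparing `desired_replicas(...)` against `ready_count(...)` rather than against the number of started pods shows how much requested capacity is still loading weights.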

Cost controls

  • Model tiering: small model for simple requests, larger model for complex requests.
  • Prompt compression and retrieval quality to reduce context waste.
  • Batch jobs on cheaper capacity windows where acceptable.
  • GPU pool right-sizing by accelerator class and memory profile.
  • Request quotas by tenant, model, and priority.
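Model tiering and quotas can be combined at the gateway. The following is a minimal sketch under stated assumptions: the model names, the 2,000-token threshold, and the `needs_reasoning` flag are all hypothetical, and a real router would likely use a richer complexity signal than prompt length.

```python
from dataclasses import dataclass

SMALL_MODEL = "small-model"  # hypothetical deployment names
LARGE_MODEL = "large-model"


@dataclass
class Request:
    tenant: str
    prompt_tokens: int
    needs_reasoning: bool = False  # hypothetical hint set by the caller


def pick_model(req: Request, small_max_prompt_tokens: int = 2000) -> str:
    """Send short, simple requests to the small tier and everything else up."""
    if req.needs_reasoning or req.prompt_tokens > small_max_prompt_tokens:
        return LARGE_MODEL
    return SMALL_MODEL


class TokenQuota:
    """Token budget per (tenant, model) pair for the current period."""

    def __init__(self, limits: dict):
        self.remaining = dict(limits)

    def admit(self, req: Request, model: str, max_output_tokens: int) -> bool:
        """Reserve the worst-case token cost; reject if the budget is too small."""
        key = (req.tenant, model)
        cost = req.prompt_tokens + max_output_tokens
        if self.remaining.get(key, 0) < cost:
            return False
        self.remaining[key] -= cost
        return True
```

Reserving the worst case (prompt plus maximum output) keeps the check simple. A fuller version would refund unused output tokens once the stream completes.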

Failure modes

  • Autoscaler adds pods but each pod spends minutes loading model weights.
  • Interactive traffic shares queue with long batch generations.
  • Context window growth silently increases cost.
  • GPU memory becomes the bottleneck while CPU-based HPA shows low pressure.
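The last two failure modes are easier to see with a rough KV cache estimate. For a standard transformer decoder, each token stores one key and one value vector per layer per KV head. The sketch below assumes that layout, ignores runtime overheads such as block allocation and fragmentation, and uses illustrative model dimensions rather than those of any specific model.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """KV cache bytes for one token: a key and a value per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_full_context_sequences(cache_bytes: int, per_token_bytes: int,
                               context_len: int) -> int:
    """How many sequences fit if every one grows to the full context length."""
    return cache_bytes // (per_token_bytes * context_len)


# Illustrative numbers: 32 layers, 8 KV heads, head dimension 128, 16-bit cache.
per_token = kv_bytes_per_token(32, 8, 128, 2)
GIB = 2 ** 30
# Doubling the context length halves the number of sequences that fit.
at_8k = max_full_context_sequences(40 * GIB, per_token, 8192)
at_16k = max_full_context_sequences(40 * GIB, per_token, 16384)
```

None of this pressure is visible to a CPU-based autoscaler, which is why KV cache usage and active sequences belong among the scaling signals.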