
Inference Scaling and Cost

LLM serving cannot be scaled on CPU utilization alone. From the user's side, a request passes through distinct latency phases: routing, queue wait, prefill (which ends at the first token), decoding, and stream completion.

Important metrics

Metric	Why it matters
Time to first token	Primary interactive latency signal.
Tokens per second	Throughput after generation starts.
Queue wait	Shows insufficient serving capacity or batching pressure.
Active sequences	Number of sequences the runtime is generating concurrently; a runtime-level pressure indicator.
KV cache usage	Indicates memory pressure and how much room is left for batching.
GPU memory	Capacity limit for model, context, and parallel requests.
Cost per request	Business-level unit economics.
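
As a minimal sketch of how the latency metrics above relate to each other, the following Python derives queue wait, time to first token, and decode throughput from per-request timestamps. The RequestTiming class and its field names are hypothetical and not taken from any particular runtime; a real serving stack would export these values through its own metrics.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps in seconds for one streamed request (hypothetical field names)."""
    received: float      # request reached the gateway
    scheduled: float     # request left the queue and prefill started
    first_token: float   # first output token was streamed
    completed: float     # last output token was streamed
    output_tokens: int   # total tokens generated


def queue_wait(t: RequestTiming) -> float:
    """Time spent waiting before the runtime started work on the request."""
    return t.scheduled - t.received


def time_to_first_token(t: RequestTiming) -> float:
    """Measured from arrival, so it includes queue wait and prefill."""
    return t.first_token - t.received


def decode_tokens_per_second(t: RequestTiming) -> float:
    """Throughput after generation starts.

    The first token is excluded because its latency is already
    counted in time to first token.
    """
    decode_time = t.completed - t.first_token
    if t.output_tokens <= 1 or decode_time <= 0:
        return 0.0
    return (t.output_tokens - 1) / decode_time


timing = RequestTiming(received=0.0, scheduled=0.5, first_token=1.0,
                       completed=5.0, output_tokens=101)
print(queue_wait(timing), time_to_first_token(timing),
      decode_tokens_per_second(timing))
```

Measuring time to first token from arrival rather than from scheduling keeps queue wait visible in the number the user actually experiences.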

Scaling patterns

Scale gateway separately from model servers.
Split interactive and batch traffic into different routes or deployments.
Keep warm capacity for high-value interactive models.
Use queue depth and runtime metrics for autoscaling where possible.
Track model load and readiness separately from pod start.
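
Autoscaling on queue depth and runtime metrics can be sketched as a simple replica calculation. The function below is an illustration only: the name desired_replicas, the per-replica target, and the replica bounds are assumptions, and a production autoscaler would also add smoothing and cooldown so it does not react to every spike.

```python
import math


def desired_replicas(queued: int, active: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Replicas needed so each one holds about target_per_replica sequences.

    queued: requests waiting in the serving queue
    active: sequences currently being generated across all replicas
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    demand = queued + active
    wanted = math.ceil(demand / target_per_replica)
    # Clamp to the allowed range; min_replicas doubles as warm capacity.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(queued=30, active=64, target_per_replica=32))
```

Because the signal is queued plus active sequences rather than CPU, the result rises when requests back up even if CPU usage on the model servers stays flat.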

Cost controls

Model tiering: small model for simple requests, larger model for complex requests.
Prompt compression and better retrieval quality to cut wasted context tokens.
Batch jobs scheduled in cheaper capacity windows where the added delay is acceptable.
GPU pool right-sizing by accelerator class and memory profile.
Request quotas by tenant, model, and priority.
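
Model tiering and cost per request can be illustrated together. In the sketch below, the tier names, the prices, and the routing rule (prompt length plus a needs_reasoning flag) are all hypothetical placeholders; real routing usually relies on a classifier or on per-route configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_1k_input: float
    usd_per_1k_output: float


# Hypothetical tiers and prices, for illustration only.
SMALL = Tier("small", usd_per_1k_input=0.0005, usd_per_1k_output=0.0015)
LARGE = Tier("large", usd_per_1k_input=0.005, usd_per_1k_output=0.015)


def choose_tier(prompt_tokens: int, needs_reasoning: bool,
                small_limit: int = 2000) -> Tier:
    """Send short, simple requests to the small model and the rest to the large one."""
    if needs_reasoning or prompt_tokens > small_limit:
        return LARGE
    return SMALL


def cost_per_request(tier: Tier, prompt_tokens: int, output_tokens: int) -> float:
    """Token-based cost in USD for a single request."""
    return (prompt_tokens / 1000 * tier.usd_per_1k_input
            + output_tokens / 1000 * tier.usd_per_1k_output)


tier = choose_tier(prompt_tokens=800, needs_reasoning=False)
print(tier.name, cost_per_request(tier, prompt_tokens=800, output_tokens=200))
```

Since input tokens are billed as well as output tokens, the same function shows why trimming the prompt lowers cost per request even when the answer length is unchanged.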

Failure modes

The autoscaler adds pods, but each pod spends minutes loading model weights before it can serve traffic.
Interactive traffic shares a queue with long batch generations.
Context window growth silently increases cost.
GPU memory becomes the bottleneck while CPU-based HPA shows low pressure.
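
To make the link between context growth and GPU memory concrete, here is a rough estimate of KV cache size per sequence. It uses the common approximation of one key and one value vector per token per layer. The model shape in the example (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical configuration, and real runtimes add their own overhead such as paging and fragmentation.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size in bytes for one sequence.

    Each token stores one key and one value vector per layer,
    which is where the factor of 2 comes from.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, for illustration only.
per_token = kv_cache_bytes(1, layers=32, kv_heads=8, head_dim=128)
print(f"{per_token} bytes per token")

for context in (2048, 8192, 32768):
    gib = kv_cache_bytes(context, layers=32, kv_heads=8, head_dim=128) / 2**30
    print(f"{context:>6} tokens -> {gib:.2f} GiB per sequence")
```

The estimate grows linearly with context length, so longer prompts reduce the number of sequences that fit in GPU memory at once, and none of that shows up in a CPU-based signal.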
