
GPU Capacity Incident

The bill says the GPU pool is expensive. The scheduler says inference pods are pending. The runtime team says the model is ready. Everyone is correct, but they are looking at different parts of the capacity contract.

This field note frames Kubernetes GPU node pool design as a scheduling, cost, and isolation problem.

Scenario

An interactive inference deployment needs one A100-class GPU per replica. The cluster has GPU nodes online, but two pods stay pending. Another team's batch job is running on a compatible node, a platform agent is using CPU and memory on another GPU node, and a new node group cannot scale because the pod selector does not match its labels.
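As a minimal sketch of what the replica's side of the placement contract could look like, the snippet below builds a Pod manifest as a Python dict and prints it as JSON. The label key `gpu.example.com/profile`, the use of `nvidia.com/gpu` as a taint key, the pod name, and the image are all assumptions made for illustration; substitute whatever your node groups actually carry.

```python
import json

# Hypothetical names: the label key, taint key, pod name, and image below
# are illustrative and not taken from any real cluster.
GPU_PROFILE_LABEL = "gpu.example.com/profile"
GPU_TAINT_KEY = "nvidia.com/gpu"


def inference_pod(name: str, image: str, profile: str = "a100") -> dict:
    """Build a Pod manifest that states its placement contract explicitly."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"lane": "interactive"}},
        "spec": {
            # Must match a label that the target node group actually carries.
            "nodeSelector": {GPU_PROFILE_LABEL: profile},
            # Lets the pod onto nodes tainted with GPU_TAINT_KEY:NoSchedule.
            "tolerations": [
                {"key": GPU_TAINT_KEY, "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": "server",
                    "image": image,
                    # Extended resources such as GPUs are set under limits;
                    # the request defaults to the same value.
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ],
        },
    }


manifest = inference_pod("llm-serving-0", "registry.example.com/llm-server:latest")
print(json.dumps(manifest, indent=2))
```

If the selector here names a label that the new node group does not carry, the pod stays pending no matter how many nodes that group could add, which is the mismatch described in the scenario.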

Symptoms

Symptom | What it usually means
Pods are pending | Inspect node selectors, taints, tolerations, GPU requests, and node group labels.
GPUs look idle | Utilization can be low while memory, topology, or scheduling constraints block placement.
Autoscaler does nothing | The pod may not match any scalable node group.
Cost/request is unstable | Batch and interactive workloads may be sharing the same accelerator lane.
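The first and third symptoms can be reasoned about with a toy model of the scheduler's filtering step. The sketch below is a deliberate simplification, not the real scheduler: it checks only label selectors, NoSchedule taint keys, and free GPU count, and it ignores CPU, memory, affinity, topology, and quotas. All node names, labels, and counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    labels: dict
    taints: set      # taint keys with effect NoSchedule (simplified)
    free_gpus: int


@dataclass
class Pod:
    name: str
    node_selector: dict
    tolerations: set  # taint keys the pod tolerates
    gpus: int


def blockers(pod: Pod, node: Node) -> list:
    """Return human-readable reasons why the pod cannot land on the node."""
    reasons = []
    for key, want in pod.node_selector.items():
        have = node.labels.get(key)
        if have != want:
            reasons.append(f"label {key}={have!r}, pod wants {want!r}")
    for taint in sorted(node.taints - pod.tolerations):
        reasons.append(f"untolerated taint {taint}")
    if node.free_gpus < pod.gpus:
        reasons.append(f"free GPUs {node.free_gpus} < requested {pod.gpus}")
    return reasons


def explain(pod: Pod, nodes: list) -> dict:
    """Map each node name to the list of reasons blocking placement."""
    return {node.name: blockers(pod, node) for node in nodes}


# Hypothetical cluster state loosely mirroring the scenario.
pod = Pod("llm-serving-1", {"gpu.example.com/profile": "a100"}, {"nvidia.com/gpu"}, 1)
nodes = [
    # Compatible node, but a batch job already holds its GPUs.
    Node("gpu-a", {"gpu.example.com/profile": "a100"}, {"nvidia.com/gpu"}, 0),
    # Template of the new node group: right hardware, different label key.
    Node("new-group-template", {"accelerator": "a100"}, {"nvidia.com/gpu"}, 8),
]

report = explain(pod, nodes)
for name, reasons in report.items():
    print(name, "->", "; ".join(reasons) or "fits")
```

In this model the existing node is blocked on resources and the new group's template is blocked on a label, so an autoscaler that simulates placement against that template would have no reason to add nodes.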

Common wrong instinct

"Buy more GPUs."

More accelerators help only after the placement contract is clear. Without labels, taints, tolerations, quotas, workload lanes, and compatible node profiles, new capacity can stay unusable or be consumed by the wrong workload class.

Production reasoning

Treat each GPU pool as a capacity product:

Capacity decision | Why it matters
Accelerator profile | Model memory, tensor parallelism, and throughput depend on GPU type.
Taints and tolerations | Prevent general workloads and unapproved jobs from landing on GPU nodes.
Node labels | Make model-to-hardware compatibility explicit and debuggable.
Namespace quotas | Stop experiments and batch jobs from starving production inference.
Warm buffer | Protect interactive traffic from cold node provisioning and model load time.
Separate lanes | Keep batch, experimentation, embeddings, and interactive serving from fighting for the same queue.
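The namespace quota row is one of the easier decisions to make concrete. The sketch below generates one ResourceQuota per lane, capping GPU requests through the `requests.nvidia.com/gpu` key that Kubernetes uses for extended-resource quotas. The namespace names and GPU budgets are invented for illustration and assume each lane runs in its own namespace.

```python
import json

# Hypothetical lanes and budgets: one namespace per workload lane.
LANE_GPU_BUDGET = {"inference-prod": 8, "batch": 4, "experiments": 2}


def gpu_quota(namespace: str, max_gpus: int) -> dict:
    """Build a ResourceQuota that caps total GPU requests in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        # Extended resources are quota-limited via the "requests." prefix.
        "spec": {"hard": {"requests.nvidia.com/gpu": str(max_gpus)}},
    }


quotas = [gpu_quota(ns, gpus) for ns, gpus in LANE_GPU_BUDGET.items()]

# A v1 List wraps the objects so they can be applied in one step.
bundle = {"apiVersion": "v1", "kind": "List", "items": quotas}
print(json.dumps(bundle, indent=2))
```

Quotas cap what a lane may request; they do not reserve capacity. Keeping interactive traffic safe from cold starts still depends on the warm buffer and on taints that keep other lanes off the serving pool.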

Decision checklist

  • Does every GPU pod request nvidia.com/gpu explicitly?
  • Are GPU nodes tainted so only approved workloads land there?
  • Can a pending pod explain which label, taint, quota, or resource blocked placement?
  • Are batch jobs separated from interactive inference by quota, pool, or priority?
  • Is cost reported by model, tenant, request class, and GPU profile?
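For the last checklist item, a rough sketch of cost attribution follows. It assumes you can already export GPU-hours and request counts per workload; the record fields, the hourly price, and every number below are hypothetical placeholders, not measurements.

```python
from collections import defaultdict

# Hypothetical hourly price per GPU profile, in an arbitrary currency.
PRICE_PER_GPU_HOUR = {"a100": 4.0}

# Hypothetical usage records exported from metering.
usage = [
    {"model": "chat-large", "tenant": "team-a", "request_class": "interactive",
     "gpu_profile": "a100", "gpu_hours": 10.0, "requests": 20000},
    {"model": "chat-large", "tenant": "team-b", "request_class": "batch",
     "gpu_profile": "a100", "gpu_hours": 5.0, "requests": 1000},
]


def cost_per_request(records, prices):
    """Group cost by (model, tenant, request class, GPU profile)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [cost, request count]
    for r in records:
        key = (r["model"], r["tenant"], r["request_class"], r["gpu_profile"])
        totals[key][0] += r["gpu_hours"] * prices[r["gpu_profile"]]
        totals[key][1] += r["requests"]
    # Skip groups with zero requests to avoid dividing by zero.
    return {k: cost / reqs for k, (cost, reqs) in totals.items() if reqs}


costs = cost_per_request(usage, PRICE_PER_GPU_HOUR)
for key, value in costs.items():
    print(key, f"{value:.4f} per request")
```

Reporting along these four dimensions makes it visible when batch and interactive traffic on the same profile have very different unit costs, which a single blended number would hide.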

Run the GPU node pool scheduling challenge to practice inspecting placement, taints, tolerations, and accelerator profile mismatches.