GPU Capacity Incident
The bill says the GPU pool is expensive. The scheduler says inference pods are pending. The runtime team says the model is ready. Everyone is correct, but they are looking at different parts of the capacity contract.
This field note treats Kubernetes GPU node pool design as a scheduling, cost, and isolation problem.
Scenario
An interactive inference deployment needs one A100-class GPU per replica. The cluster has GPU nodes online, but two pods stay pending. Another team's batch job is running on a compatible node, a platform agent is using CPU and memory on another GPU node, and a new node group cannot scale because the pod selector does not match its labels.
Symptoms
| Symptom | What it usually means |
|---|---|
| Pods are pending | A node selector, untolerated taint, GPU request, or node group label mismatch is blocking placement. |
| GPUs look idle | Utilization can be low while memory, topology, or scheduling constraints block placement. |
| Autoscaler does nothing | The pod may not match any scalable node group. |
| Cost per request is unstable | Batch and interactive workloads may be sharing the same accelerator lane. |
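The symptoms above can be traced with a small placement explainer. The following Python sketch is a deliberately simplified model of the scheduler's filtering step, not the Kubernetes scheduler itself: it only handles exact-match node selectors, exact-match taints and tolerations, and free resource counts. The label key `accelerator`, the taint `nvidia.com/gpu=present:NoSchedule`, the `memory_gib` resource name, and all node and pod names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

GPU = "nvidia.com/gpu"


@dataclass
class Node:
    name: str
    labels: dict
    taints: set = field(default_factory=set)  # {(key, value, effect)}
    free: dict = field(default_factory=dict)  # resource name -> free quantity


@dataclass
class Pod:
    name: str
    node_selector: dict
    tolerations: set = field(default_factory=set)  # same tuple shape as taints
    requests: dict = field(default_factory=dict)


def placement_blockers(pod, node):
    """Return human-readable reasons the pod cannot land on the node."""
    reasons = []
    # 1. Every selector key must match a node label exactly.
    for key, want in pod.node_selector.items():
        have = node.labels.get(key)
        if have != want:
            reasons.append(f"label {key}: wants {want!r}, node has {have!r}")
    # 2. Every taint needs a matching toleration (exact match only in this sketch).
    for taint in sorted(node.taints - pod.tolerations):
        reasons.append(f"untolerated taint {taint[0]}={taint[1]}:{taint[2]}")
    # 3. Every requested resource must fit in what the node has free.
    for resource, amount in pod.requests.items():
        available = node.free.get(resource, 0)
        if amount > available:
            reasons.append(f"{resource}: requests {amount}, {available} free")
    return reasons


# Hypothetical nodes mirroring the scenario.
taint = ("nvidia.com/gpu", "present", "NoSchedule")
pod = Pod("inference-0", {"accelerator": "a100"}, {taint}, {GPU: 1, "memory_gib": 64})
busy = Node("gpu-a", {"accelerator": "a100"}, {taint}, {GPU: 0, "memory_gib": 200})
squeezed = Node("gpu-b", {"accelerator": "a100"}, {taint}, {GPU: 1, "memory_gib": 32})
mislabeled = Node("gpu-c", {"gpu-type": "a100"}, {taint}, {GPU: 1, "memory_gib": 200})

for node in (busy, squeezed, mislabeled):
    print(node.name, placement_blockers(pod, node))
```

Each hypothetical node reports a different blocker: the first has no free GPU because the batch job holds it, the second has a free GPU but not enough memory left after the platform agent, and the third carries a label key the pod selector never asks for. On a real cluster the equivalent evidence comes from the pod's scheduling events rather than from a model like this.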
Common wrong instinct
"Buy more GPUs."
More accelerators help only after the placement contract is clear. Without labels, taints, tolerations, quotas, workload lanes, and compatible node profiles, new capacity can stay unusable or be consumed by the wrong workload class.
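A toy simulation makes the point concrete. The sketch below assumes a strictly first-come queue, which is a simplification (a real scheduler also considers priority and many other filters), and every pod name and workload class in it is hypothetical. It hands newly added GPUs to pending pods with and without a taint on the new nodes.

```python
def assign_new_gpus(new_gpus, pending, tainted):
    """Hand out newly added GPUs to pending pods in queue order.

    pending: list of (pod_name, workload_class, tolerates_gpu_taint).
    tainted: whether the new GPU nodes carry a taint that pods must tolerate.
    """
    placed = []
    for name, workload_class, tolerates in pending:
        if len(placed) == new_gpus:
            break  # all new capacity is consumed
        if tainted and not tolerates:
            continue  # the taint keeps this pod off the new nodes
        placed.append((name, workload_class))
    return placed


# Hypothetical queue: batch pods arrived first and carry no GPU toleration.
pending = [
    ("batch-1", "batch", False),
    ("batch-2", "batch", False),
    ("infer-1", "interactive", True),
    ("infer-2", "interactive", True),
]

without_taint = assign_new_gpus(2, pending, tainted=False)
with_taint = assign_new_gpus(2, pending, tainted=True)
print("untainted pool:", without_taint)
print("tainted pool:  ", with_taint)
```

In this model, the two new GPUs go entirely to batch pods when the pool is untainted, and entirely to interactive pods when it is tainted. The purchase is identical in both runs; only the placement contract differs.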
Production reasoning
Treat each GPU pool as a capacity product:
| Capacity decision | Why it matters |
|---|---|
| Accelerator profile | Model memory, tensor parallelism, and throughput depend on GPU type. |
| Taints and tolerations | Prevent general workloads and unapproved jobs from landing on GPU nodes. |
| Node labels | Make model-to-hardware compatibility explicit and debuggable. |
| Namespace quotas | Stop experiments and batch jobs from starving production inference. |
| Warm buffer | Protect interactive traffic from cold node provisioning and model load time. |
| Separate lanes | Keep batch, experimentation, embeddings, and interactive serving from fighting for the same queue. |
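The namespace quota row can be sketched as a simple admission check. The Python below is an illustrative model, not the Kubernetes quota controller: it admits pods in order and rejects any pod that would push its namespace past a hard GPU limit. The quota key follows the `requests.<resource>` form that Kubernetes uses for extended resources; the namespaces, pod names, and limits are assumptions.

```python
from collections import defaultdict

GPU_QUOTA_KEY = "requests.nvidia.com/gpu"


def admit(pods, quotas):
    """Admit pods in order while each namespace stays within its GPU quota.

    pods:   list of (pod_name, namespace, gpus_requested).
    quotas: namespace -> {quota key -> hard limit}. A namespace with no
            entry is treated as unlimited, as it would be with no quota object.
    """
    used = defaultdict(int)
    admitted, rejected = [], []
    for name, namespace, gpus in pods:
        limit = quotas.get(namespace, {}).get(GPU_QUOTA_KEY)
        if limit is not None and used[namespace] + gpus > limit:
            rejected.append(name)
            continue
        used[namespace] += gpus
        admitted.append(name)
    return admitted, rejected


# Hypothetical lanes: batch is capped at 2 GPUs, serving at 4, sandbox has no quota.
quotas = {
    "batch": {GPU_QUOTA_KEY: 2},
    "serving": {GPU_QUOTA_KEY: 4},
}
pods = [
    ("batch-1", "batch", 2),
    ("batch-2", "batch", 1),
    ("serve-1", "serving", 1),
    ("exp-1", "sandbox", 8),
]

admitted, rejected = admit(pods, quotas)
print("admitted:", admitted)
print("rejected:", rejected)
```

The second batch pod is rejected once the batch lane is full, while the serving pod is unaffected. The sandbox pod is admitted with eight GPUs because its namespace has no quota at all, which is the gap that lets experiments starve production inference.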
Decision checklist
- Does every GPU pod request `nvidia.com/gpu` explicitly?
- Are GPU nodes tainted so only approved workloads land there?
- Can a pending pod explain which label, taint, quota, or resource blocked placement?
- Are batch jobs separated from interactive inference by quota, pool, or priority?
- Is cost reported by model, tenant, request class, and GPU profile?
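The first checklist items lend themselves to an automated lint. The sketch below assumes a pod manifest already parsed into a Python dictionary and checks three things: that some container sets a `nvidia.com/gpu` limit (Kubernetes requires GPUs to be declared under `limits`), that the pod tolerates the GPU taint, and that it pins an accelerator profile. The function name `lint_gpu_pod`, the default taint key, and the `accelerator` label are hypothetical, and the check ignores node affinity and wildcard tolerations for brevity.

```python
GPU = "nvidia.com/gpu"


def lint_gpu_pod(manifest, taint_key=GPU):
    """Return checklist findings for a pod manifest parsed into a dict."""
    spec = manifest.get("spec", {})
    findings = []

    # GPU must be declared explicitly by at least one container.
    containers = spec.get("containers", [])
    if not any(GPU in c.get("resources", {}).get("limits", {}) for c in containers):
        findings.append(f"no container sets resources.limits[{GPU!r}]")

    # The pod must tolerate the taint that guards GPU nodes (key is an assumption).
    if not any(t.get("key") == taint_key for t in spec.get("tolerations", [])):
        findings.append(f"no toleration for taint key {taint_key!r}")

    # A selector makes the model-to-hardware match explicit.
    # Node affinity would also work but is not inspected in this sketch.
    if not spec.get("nodeSelector"):
        findings.append("no nodeSelector pins the accelerator profile")

    return findings


# Hypothetical manifests: one that follows the checklist, one that does not.
good = {
    "spec": {
        "nodeSelector": {"accelerator": "a100"},
        "tolerations": [{"key": GPU, "operator": "Exists", "effect": "NoSchedule"}],
        "containers": [{"name": "server", "resources": {"limits": {GPU: 1}}}],
    }
}
bad = {
    "spec": {
        "containers": [{"name": "server", "resources": {"requests": {"cpu": "4"}}}],
    }
}

print(lint_gpu_pod(good))
for finding in lint_gpu_pod(bad):
    print("finding:", finding)
```

A check like this could run in CI or in an admission hook so that a pod missing its GPU limit or toleration is flagged before it ever sits pending.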
Related lab
Run the GPU node pool scheduling challenge to practice inspecting placement, taints, tolerations, and accelerator profile mismatches.