
GPU Node Pool Scheduling for LLM Inference

GPU node pool scheduling for LLM inference is a capacity contract. The platform must make accelerator type, taints, labels, quotas, priority, and workload lane visible before trying to optimize throughput.

Last reviewed: June 8, 2026. Use this page when GPU nodes exist but inference pods are pending, misplaced, or too expensive for the traffic served.

[Figure: Production Kubernetes cluster architecture]

Scenario

An interactive model needs A100-class GPUs. The cluster has GPU nodes online, but the model's pods are pending for three separate reasons: a batch job has consumed the GPUs on the compatible node profile, one node is blocked by a taint the pod does not tolerate, and autoscaling does not start because the pod's node selector does not match any scalable node group.
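
The three failure modes in this scenario can be sketched as a toy placement check. This is a simplified model for illustration, not the kube-scheduler's actual logic: tolerations are matched by exact equality only, and the node names, label values, and the `dedicated=training` taint are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    labels: dict
    taints: list = field(default_factory=list)  # (key, value, effect) tuples
    free_gpus: int = 0

@dataclass
class Pod:
    node_selector: dict
    tolerations: list = field(default_factory=list)  # (key, value, effect) tuples
    gpus: int = 1

def rejection_reason(pod, node):
    """Return None if the node fits the pod, otherwise a short reason."""
    for key, want in pod.node_selector.items():
        if node.labels.get(key) != want:
            return f"label mismatch on {key}"
    for taint in node.taints:
        # Only hard taints block placement; PreferNoSchedule is a soft preference.
        if taint[2] in ("NoSchedule", "NoExecute") and taint not in pod.tolerations:
            return f"untolerated taint {taint[0]}={taint[1]}"
    if node.free_gpus < pod.gpus:
        return "insufficient nvidia.com/gpu"
    return None

def explain_pending(pod, nodes):
    """Map each node name to the reason it rejects the pod (None means it fits)."""
    return {n.name: rejection_reason(pod, n) for n in nodes}

gpu_taint = ("nvidia.com/gpu", "present", "NoSchedule")  # hypothetical taint
nodes = [
    Node("gpu-a", {"accelerator": "a100"}, [gpu_taint], free_gpus=0),  # batch job holds the GPUs
    Node("gpu-b", {"accelerator": "a100"}, [("dedicated", "training", "NoSchedule")], free_gpus=8),
    Node("gpu-c", {"accelerator": "t4"}, [gpu_taint], free_gpus=4),    # wrong accelerator class
]
pod = Pod({"accelerator": "a100"}, [gpu_taint], gpus=1)

report = explain_pending(pod, nodes)
for name, reason in report.items():
    print(name, "->", reason)
```

Every node rejects the pod for a different reason, which is why "GPU nodes exist" and "the pod can schedule" are separate questions.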

Decision table

| Scheduling control | Production use |
| --- | --- |
| Node labels | Encode accelerator type, memory, topology, zone, lifecycle, and profile. |
| Taints | Keep unapproved and general workloads away from expensive GPU nodes. |
| Tolerations | Require approved model serving workloads to opt into GPU placement. |
| GPU requests | Use nvidia.com/gpu explicitly through the device plugin path. |
| Quotas | Stop experiments and batch jobs from starving interactive inference. |
| Warm buffer | Protect user-facing routes from node provisioning and model cold starts. |
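
As a rough illustration of how several of these controls land on a single workload, the sketch below builds a pod manifest as a Python dict and prints it as JSON. The `accelerator` label key and the `llm-serving` namespace come from this page; the label value, taint key, priority class name, `lane` label, and image are assumptions made up for the example.

```python
import json

def inference_pod(name, image, accelerator, gpus=1):
    """Build a pod manifest that opts into a tainted GPU pool (illustrative only)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "namespace": "llm-serving",
            "labels": {"lane": "interactive"},  # hypothetical lane label
        },
        "spec": {
            "priorityClassName": "interactive-inference",  # hypothetical PriorityClass
            # Node label: pin the pod to a compatible accelerator profile.
            "nodeSelector": {"accelerator": accelerator},
            # Toleration: explicit opt-in to the GPU pool's taint (key is assumed).
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": "server",
                    "image": image,
                    # GPU request: extended resources are set under limits.
                    "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
                }
            ],
        },
    }

manifest = inference_pod(
    name="chat-model",
    image="registry.example.com/llm-server:latest",  # placeholder image
    accelerator="a100",                              # assumed label value
)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests, output like this could be piped to `kubectl apply -f -` once the assumed names are replaced with the cluster's real ones.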

Commands and checks

kubectl get nodes -L accelerator,nvidia.com/gpu.product
kubectl -n llm-serving describe pod <model-pod>
kubectl -n llm-serving get resourcequota
kubectl -n llm-serving get pod <model-pod> -o wide
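
The event text from `describe pod` can be bucketed into the categories this page cares about. The sketch below is a hedged example: the patterns follow wording commonly seen in scheduler and quota admission messages, but exact phrasing varies across Kubernetes versions, and the sample string is illustrative rather than captured from a real cluster.

```python
import re

# Pattern -> category. The wording is an assumption; adjust to what your cluster emits.
PATTERNS = [
    (r"Insufficient nvidia\.com/gpu", "gpu_capacity"),
    (r"untolerated taint", "taint"),
    (r"didn't match Pod's node affinity/selector", "label_or_affinity"),
    (r"exceeded quota", "quota"),
]

def classify(message):
    """Return every category whose pattern appears in the event message."""
    return [category for pattern, category in PATTERNS if re.search(pattern, message)]

# Illustrative sample, not real cluster output.
sample = (
    "0/3 nodes are available: 1 Insufficient nvidia.com/gpu, "
    "2 node(s) had untolerated taint {dedicated: training}."
)
categories = classify(sample)
print(categories)
```

An empty result is itself a signal: the message did not name a label, taint, quota, or GPU shortfall, so the cause lies elsewhere.
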

| Check | Pass signal |
| --- | --- |
| Compatible profile | Node labels match the model's GPU memory and accelerator requirement. |
| Taint contract | GPU nodes reject general workloads and accept approved inference workloads. |
| Pending reason | Scheduler events identify the exact missing label, taint, quota, or GPU resource. |
| Workload lane | Batch and interactive workloads do not fight for the same unbounded pool. |
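
The workload lane check comes down to bounded accounting per lane. A minimal sketch, assuming one GPU quota per lane with made-up numbers; in Kubernetes this would typically correspond to a ResourceQuota on `requests.nvidia.com/gpu` in each lane's namespace, though the lane names and limits here are hypothetical.

```python
# Hypothetical per-lane GPU quotas and current usage.
lanes = {
    "interactive": {"hard": 8, "used": 6},
    "batch": {"hard": 4, "used": 4},
}

def admits(hard, used, requested):
    """Quota-style rule: admit only if usage stays at or below the hard limit."""
    return used + requested <= hard

def try_admit(lane, gpus):
    """Admit a request into a lane and record its usage, or refuse it."""
    quota = lanes[lane]
    if not admits(quota["hard"], quota["used"], gpus):
        return False
    quota["used"] += gpus
    return True

# The batch lane is full, so a new batch job is refused instead of
# spilling into GPUs reserved for interactive inference.
batch_ok = try_admit("batch", 1)
interactive_ok = try_admit("interactive", 2)
print("batch admitted:", batch_ok)
print("interactive admitted:", interactive_ok)
```

With both lanes bounded, a burst of batch work fails fast at admission rather than leaving interactive pods pending.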

Run the GPU node pool scheduling lab to practice unschedulable-pod debugging.