GPU Node Pools on Kubernetes

GPU node pools should be treated as a separate Kubernetes capacity class. They differ from general worker nodes in cost, scheduling constraints, startup time, failure modes, and security requirements. A production LLM platform needs an explicit scheduling contract before it tunes model throughput.

LLM inference stack on Kubernetes architecture

Baseline design

| Control | Recommendation |
| --- | --- |
| Taints | Taint GPU nodes so only approved GPU workloads land there. |
| Tolerations | Require explicit toleration from model-serving and batch-inference workloads. |
| Node labels | Label accelerator type, memory, MIG profile, topology, zone, and lifecycle. |
| GPU requests | Request nvidia.com/gpu explicitly through the device plugin. |
| Affinity | Pin workloads to compatible model, memory, and topology profiles. |
| System workloads | Keep general platform agents off GPU pools unless they are required for GPU operation. |
| Capacity buffer | Keep warm headroom for interactive workloads instead of scaling only after queue pressure. |
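The node side of these controls can be sketched as a Node object. This is an illustration under assumptions, not a definitive layout: the node name and every label key under `accelerator.platform.example.com/` other than `type` and `gpu` are hypothetical, and in practice labels and taints are usually set through the node pool configuration or with `kubectl label` and `kubectl taint` rather than by applying a Node manifest.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: gpu-a100-node-1                                  # hypothetical node name
  labels:
    accelerator.platform.example.com/type: nvidia-a100
    accelerator.platform.example.com/memory-gib: "80"    # hypothetical label key
    accelerator.platform.example.com/mig-profile: none   # hypothetical label key
    accelerator.platform.example.com/lifecycle: on-demand # hypothetical label key
    topology.kubernetes.io/zone: zone-a                  # well-known label; value is a placeholder
spec:
  taints:
    # Pods without a matching toleration are not scheduled onto this node.
    - key: accelerator.platform.example.com/gpu
      value: "true"
      effect: NoSchedule
```

Label values must be strings, which is why the memory size is quoted.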

Example scheduling contract

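apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-vllm
spec:
  # apps/v1 requires a selector that matches the pod template labels.
  # The app label is illustrative.
  selector:
    matchLabels:
      app: llama-vllm
  template:
    metadata:
      labels:
        app: llama-vllm
    spec:
      # Schedule only onto nodes labeled with the expected accelerator type.
      nodeSelector:
        accelerator.platform.example.com/type: nvidia-a100
      # Explicitly tolerate the GPU pool taint.
      tolerations:
        - key: accelerator.platform.example.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: server
          image: registry.example.com/llm/vllm:stable
          resources:
            limits:
              # Extended resources are set under limits; the request defaults to the same value.
              nvidia.com/gpu: "1"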

Node pool taxonomy

| Pool type | Use when | Design concern |
| --- | --- | --- |
| Interactive inference | User-facing chat, agents, or APIs. | Warm capacity, low queue wait, strict latency SLOs. |
| Batch inference | Offline generation, evaluation, data labeling, or synthetic data. | Throughput per GPU hour and deadline windows. |
| Embedding and reranking | RAG retrieval and ranking services. | Smaller models may need different GPU or CPU profiles. |
| Experimentation | Model trials and benchmark jobs. | Quotas and isolation prevent experiments from starving production. |

Scheduling policy

  • Use taints and tolerations to prevent accidental placement on GPU nodes.
  • Use node selectors or node affinity to match model memory requirements to GPU profiles.
  • Use topology spread and zone labels when replicas need failure-domain separation.
  • Use priority classes carefully so critical serving workloads can recover during contention.
  • Use namespace quotas or admission policy to stop unbounded GPU requests.
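Three of the policies above (priority classes, namespace quotas, and topology spread) can be sketched as manifests. This is a minimal illustration rather than a recommended configuration: the names `gpu-serving-critical`, `llm-experiments`, and `gpu-quota`, the `app` label, the priority value, and the limit of 4 GPUs are all assumptions chosen for the example.

```yaml
# Hypothetical priority class for user-facing serving pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-serving-critical
value: 100000
globalDefault: false
description: "Interactive GPU serving; may preempt lower-priority pods."
---
# Caps the total GPUs that pods in a hypothetical experimentation namespace can request.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: llm-experiments
spec:
  hard:
    requests.nvidia.com/gpu: "4"
---
# Serving Deployment that uses the priority class and spreads replicas across zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama-vllm
  template:
    metadata:
      labels:
        app: llama-vllm
    spec:
      priorityClassName: gpu-serving-critical
      topologySpreadConstraints:
        # Keep the replica count per zone within 1 of every other zone.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: llama-vllm
      tolerations:
        - key: accelerator.platform.example.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: server
          image: registry.example.com/llm/vllm:stable
          resources:
            limits:
              nvidia.com/gpu: "1"
```

`whenUnsatisfiable: DoNotSchedule` leaves a replica pending when no zone can satisfy the spread. `ScheduleAnyway` trades failure-domain separation for availability when GPU capacity is scarce.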

Autoscaling implications

GPU autoscaling is slower and more expensive than ordinary pod scaling. Before a new node can serve traffic, it must be provisioned, its GPU driver and runtime daemons must become healthy, the container image must be pulled, and the model must be loaded.
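One common way to hide part of this delay is to hold a GPU with low-priority placeholder pods that real workloads can preempt. The sketch below assumes a cluster autoscaler that still treats pods at priority -10 as worth scaling up for, so check the cutoff configured in your autoscaler. The names `gpu-warm-placeholder` and the `app` label are hypothetical.

```yaml
# Negative priority: any pod with the default priority (0) or higher can preempt these pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-warm-placeholder
value: -10
globalDefault: false
description: "Placeholder pods that reserve warm GPU capacity."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-warm-placeholder
spec:
  replicas: 1                      # one warm GPU of headroom; size this to the cold-start budget
  selector:
    matchLabels:
      app: gpu-warm-placeholder
  template:
    metadata:
      labels:
        app: gpu-warm-placeholder
    spec:
      priorityClassName: gpu-warm-placeholder
      nodeSelector:
        accelerator.platform.example.com/type: nvidia-a100
      tolerations:
        - key: accelerator.platform.example.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # does nothing; only holds the GPU reservation
          resources:
            limits:
              nvidia.com/gpu: "1"
```

This removes node provisioning and driver readiness from the critical path, but not image pulls or model loading, and the reserved GPU is billed while it sits idle.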

| Scaling signal | Use for |
| --- | --- |
| Queue wait | Interactive latency protection. |
| Active sequences | Runtime saturation and batching pressure. |
| GPU memory | Placement safety and cache pressure. |
| Tokens/sec | Throughput per replica and cost model. |
| Model load time | Cold-start budget and warm-pool sizing. |
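As an illustration of scaling on queue wait, the following HorizontalPodAutoscaler sketch targets an average per-pod queue wait of 500 ms. It assumes a custom metrics adapter already exposes a per-pod metric to the Kubernetes custom metrics API. The metric name `inference_queue_wait_seconds`, the replica bounds, and the target value are hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-vllm
  minReplicas: 2                   # warm floor for interactive traffic
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_wait_seconds   # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "500m"                 # 0.5 seconds average queue wait per pod
  behavior:
    scaleDown:
      # Scale down slowly; replacing a GPU replica means paying the cold start again.
      stabilizationWindowSeconds: 600
```

A long scale-down window costs idle GPU time, so set it with the model load time and traffic pattern in mind.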

Failure modes

  • A GPU node scales up, but model loading takes longer than the user-facing SLO allows.
  • A workload lands on a node with an incompatible GPU memory profile.
  • General workloads consume CPU and memory on GPU nodes, reducing inference capacity.
  • The autoscaler cannot provision capacity because the workload's node selectors do not match any node group.
  • A small model and a large model share a pool and cause unpredictable fragmentation.
  • Batch jobs consume all accelerators before interactive serving traffic arrives.
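The first two failure modes can be partly guarded against in the pod spec. The sketch below uses required node affinity on a hypothetical `memory-gib` label and a startup probe that gives the model up to 15 minutes to load. The label key, the `/health` path, port 8000, and the probe timings are assumptions to adapt to your serving runtime.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-vllm
spec:
  selector:
    matchLabels:
      app: llama-vllm
  template:
    metadata:
      labels:
        app: llama-vllm
    spec:
      affinity:
        nodeAffinity:
          # Hard requirement: never schedule onto a GPU with too little memory.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: accelerator.platform.example.com/memory-gib   # hypothetical label
                    operator: In
                    values: ["80"]
      tolerations:
        - key: accelerator.platform.example.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: server
          image: registry.example.com/llm/vllm:stable
          ports:
            - containerPort: 8000          # assumed serving port
          startupProbe:
            httpGet:
              path: /health                # assumed health endpoint
              port: 8000
            periodSeconds: 10
            failureThreshold: 90           # 90 x 10 s = up to 15 minutes for model loading
          resources:
            limits:
              nvidia.com/gpu: "1"
```

The startup probe keeps the pod out of service until the model is loaded, without a liveness probe restarting it mid-load. It does not shorten the load itself, so warm capacity is still needed to meet the SLO.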

Metrics

  • GPU utilization, memory used, and memory fragmentation.
  • Model load time and readiness duration.
  • Queue wait and active sequences per replica.
  • Cost per generated token and cost per request class.
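The GPU-level metrics can be sketched as a Prometheus rules file. This assumes NVIDIA's dcgm-exporter is scraped by Prometheus and exposes `DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_FB_USED`, and `DCGM_FI_DEV_FB_FREE` with a `Hostname` label. The rule names and the 95% threshold are hypothetical, and the memory ratio is approximate because it ignores reserved framebuffer memory.

```yaml
groups:
  - name: gpu-node-pool
    rules:
      # Average GPU utilization (percent) across the GPUs of each node.
      - record: node:gpu_utilization:avg
        expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)
      # Fraction of framebuffer memory in use per node.
      - record: node:gpu_memory_used:ratio
        expr: |
          sum by (Hostname) (DCGM_FI_DEV_FB_USED)
          /
          sum by (Hostname) (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
      # Warn when a node's GPU memory stays nearly full.
      - alert: GpuMemoryNearlyFull
        expr: node:gpu_memory_used:ratio > 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU memory above 95% on {{ $labels.Hostname }}"
```

Queue wait, active sequences, model load time, and cost per token come from the serving runtime and billing data rather than from the GPU exporter, so they need their own metrics and rules.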