GPU Node Pools on Kubernetes

GPU node pools should be treated as a separate Kubernetes capacity class. They differ from general-purpose worker nodes in cost, scheduling constraints, startup time, failure modes, and security requirements. A production LLM platform needs an explicit scheduling contract before it tunes model throughput.

[Figure: LLM inference stack on Kubernetes architecture]

Baseline design

Control	Recommendation
Taints	Taint GPU nodes so only approved GPU workloads land there.
Tolerations	Require explicit toleration from model-serving and batch-inference workloads.
Node labels	Label accelerator type, memory, MIG profile, topology, zone, and lifecycle.
GPU requests	Request `nvidia.com/gpu` explicitly through the device plugin.
Affinity	Pin workloads to nodes with a compatible GPU model, memory size, and topology profile.
System workloads	Keep general platform agents off GPU pools unless they are required for GPU operation.
Capacity buffer	Keep warm headroom for interactive workloads instead of scaling only after queue pressure.

Example scheduling contract

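apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-vllm
spec:
  # apps/v1 requires a selector that matches the pod template labels.
  # The "app" label is an illustrative choice.
  selector:
    matchLabels:
      app: llama-vllm
  template:
    metadata:
      labels:
        app: llama-vllm
    spec:
      # Schedule only onto nodes labeled with the matching accelerator type.
      nodeSelector:
        accelerator.platform.example.com/type: nvidia-a100
      # Tolerate the GPU taint so the pod is allowed onto the tainted pool.
      tolerations:
        - key: accelerator.platform.example.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: server
          image: registry.example.com/llm/vllm:stable
          resources:
            limits:
              # For extended resources the request defaults to the limit.
              nvidia.com/gpu: "1"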
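
The Deployment above covers the workload side of the contract. The node side might look roughly like the sketch below, which shows the labels and taint that the selector and toleration expect. The node name, the `memory-gib` and `lifecycle` label keys, and the zone value are hypothetical; in most clusters these fields are set through the node pool or node group configuration rather than by editing Node objects directly.

```yaml
# Sketch of the node-side labels and taint (illustrative values only).
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-example                                   # hypothetical node name
  labels:
    accelerator.platform.example.com/type: nvidia-a100     # matched by the nodeSelector
    accelerator.platform.example.com/memory-gib: "80"      # hypothetical label key
    accelerator.platform.example.com/lifecycle: on-demand  # hypothetical label key
    topology.kubernetes.io/zone: example-zone-a            # well-known key, example value
spec:
  taints:
    # Pods without a matching toleration are not scheduled here.
    - key: accelerator.platform.example.com/gpu
      value: "true"
      effect: NoSchedule
```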

Node pool taxonomy

Pool type	Use when	Design concern
Interactive inference	User-facing chat, agents, or APIs.	Warm capacity, low queue wait, strict latency SLOs.
Batch inference	Offline generation, evaluation, data labeling, or synthetic data.	Throughput per GPU hour and deadline windows.
Embedding and reranking	RAG retrieval and ranking services.	Smaller models may need different GPU or CPU profiles.
Experimentation	Model trials and benchmark jobs.	Quotas and isolation prevent experiments from starving production.

Scheduling policy

Use taints and tolerations to prevent accidental placement on GPU nodes.
Use node selectors or node affinity to match model memory requirements to GPU profiles.
Use topology spread and zone labels when replicas need failure-domain separation.
Use priority classes carefully so critical serving workloads can recover during contention.
Use namespace quotas or admission policy to stop unbounded GPU requests.
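
As a rough illustration of the affinity and topology-spread points above, the following sketch combines required node affinity with a zone spread constraint. The Deployment name, the `app` label, and the `memory-gib` label key are assumptions made for this example; the taint key and accelerator label follow the scheduling contract shown earlier.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-vllm-spread              # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama-vllm-spread
  template:
    metadata:
      labels:
        app: llama-vllm-spread
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: accelerator.platform.example.com/type
                    operator: In
                    values: ["nvidia-a100"]
                  # Hypothetical label; Gt compares the label value as an integer.
                  - key: accelerator.platform.example.com/memory-gib
                    operator: Gt
                    values: ["40"]
      # Keep replicas in different zones so one zone failure does not take all of them.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: llama-vllm-spread
      tolerations:
        - key: accelerator.platform.example.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: server
          image: registry.example.com/llm/vllm:stable
          resources:
            limits:
              nvidia.com/gpu: "1"
```

`DoNotSchedule` is the strict choice: if a zone has no free GPU, the replica stays pending. `ScheduleAnyway` trades failure-domain separation for availability.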

Autoscaling implications

GPU autoscaling is slower and more expensive than normal pod scaling. Before a new node can serve traffic, the node must be provisioned, the GPU driver must become ready, the runtime daemons must report healthy, images must be pulled, and the model must be loaded.
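
One way to keep a slow model load from being treated as a crash is a startup probe with a generous budget, followed by a tighter readiness probe. The sketch below assumes the server exposes an HTTP health endpoint at `/health` on port 8000; both are assumptions about the serving image, and the pod name and the 600-second budget are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llama-vllm-probe-example       # hypothetical name
spec:
  nodeSelector:
    accelerator.platform.example.com/type: nvidia-a100
  tolerations:
    - key: accelerator.platform.example.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: server
      image: registry.example.com/llm/vllm:stable
      ports:
        - containerPort: 8000          # assumed serving port
      # Allow up to 60 x 10s = 600s for process start and model load.
      startupProbe:
        httpGet:
          path: /health                # assumed health endpoint
          port: 8000
        periodSeconds: 10
        failureThreshold: 60
      # Once started, remove the pod from service after 3 failed checks.
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
        periodSeconds: 5
        failureThreshold: 3
      resources:
        limits:
          nvidia.com/gpu: "1"
```

The startup budget should be derived from the measured model load time rather than guessed, since it directly bounds how long a stuck replica can hold a GPU.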

Scaling signal	Use for
Queue wait	Interactive latency protection.
Active sequences	Runtime saturation and batching pressure.
GPU memory	Placement safety and cache pressure.
Tokens/sec	Throughput per replica and cost model.
Model load time	Cold-start budget and warm-pool sizing.
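
A common pattern for keeping warm node capacity, not necessarily the one this platform uses, is a low-priority placeholder Deployment that reserves GPUs and is preempted as soon as a real workload needs them. In the sketch below the names, the priority value, and the replica count are assumptions; some cluster autoscalers ignore pods below a configured priority cutoff, so the value needs checking against the autoscaler in use.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-warm-buffer                # hypothetical name
value: -10                             # below the default pod priority of 0
globalDefault: false
description: "Placeholder pods that real GPU workloads may preempt."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-warm-buffer
spec:
  replicas: 1                          # one GPU of warm headroom (illustrative)
  selector:
    matchLabels:
      app: gpu-warm-buffer
  template:
    metadata:
      labels:
        app: gpu-warm-buffer
    spec:
      priorityClassName: gpu-warm-buffer
      nodeSelector:
        accelerator.platform.example.com/type: nvidia-a100
      tolerations:
        - key: accelerator.platform.example.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        # The pause container does no work; it only holds the GPU reservation.
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            limits:
              nvidia.com/gpu: "1"
```

This removes node provisioning and driver startup from the cold-start path. It does not remove image pulls or model loading, which still count against the cold-start budget.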

Failure modes

A GPU node scales up, but model loading takes longer than the user-facing SLO allows.
A workload lands on a node with an incompatible GPU memory profile.
General workloads consume CPU and memory on GPU nodes, reducing inference capacity.
The autoscaler cannot provision a node because the workload's node selectors do not match any node group.
A small model and a large model share a pool and create unpredictable fragmentation.
Batch jobs consume all accelerators before interactive serving traffic arrives.
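
Two of the failure modes above, batch jobs exhausting accelerators and production losing out during contention, can be bounded with a namespace quota and a priority class. This is a minimal sketch; the namespace, the names, the quota of 4 GPUs, and the priority value are all illustrative assumptions.

```yaml
# Cap the total GPUs that pods in the batch namespace may request.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota                      # hypothetical name
  namespace: batch-inference           # hypothetical namespace
spec:
  hard:
    # Extended resources are quota-limited through the "requests." prefix.
    requests.nvidia.com/gpu: "4"
---
# Give interactive serving a higher priority than default-priority batch pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: interactive-serving            # hypothetical name
value: 100000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "User-facing model serving; may preempt lower-priority GPU pods."
```

Serving pods opt in by setting `priorityClassName: interactive-serving` in their pod spec. Preemption evicts running batch pods, so batch jobs need to tolerate restarts for this to be safe.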

Metrics

GPU utilization, memory used, and memory fragmentation.
Model load time and readiness duration.
Queue wait and active sequences per replica.
Cost per generated token and cost per request class.
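
Cost per generated token can be approximated from the GPU price and the sustained throughput. The formula below is a simplified sketch, assuming GPU time dominates cost; the symbols are introduced here for illustration: $N_{\text{gpu}}$ is the number of GPUs billed, $P_{\text{gpu}}$ is the price per GPU-hour, and $T$ is the aggregate generated tokens per second averaged over the billed time, idle periods included.

```latex
C_{\text{token}} = \frac{N_{\text{gpu}} \cdot P_{\text{gpu}}}{3600 \cdot T}
% Illustrative numbers only: N_gpu = 1, P_gpu = 2.00 USD/hour, T = 1000 tokens/s
% C_token = 2.00 / (3600 * 1000) \approx 5.6 \times 10^{-7} USD per token,
% or about 0.56 USD per million generated tokens.
```

Because $T$ is averaged over billed time, idle warm capacity raises the cost per token even when peak throughput is unchanged.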
