Workloads And Scheduling
Most production failures are not caused by choosing the wrong workload primitive. They come from missing readiness semantics, weak resource contracts, or placement rules that hide failure domains.
Choose the primitive
| Need | Kubernetes primitive |
|---|---|
| Stateless service | Deployment |
| Stable identity and ordered rollout | StatefulSet |
| Run to completion | Job |
| Scheduled run | CronJob |
| Node-local agent | DaemonSet |
Scheduling controls
- Requests and limits define the resource contract. The scheduler places pods by requests alone (capacity reservation); limits are enforced at runtime on the node, through CPU throttling and OOM kills for memory.
- Taints and tolerations keep special nodes, especially GPU or system nodes, from accepting accidental workloads.
- Node affinity selects hardware, zones, compliance domains, or accelerator profiles.
- Pod topology spread prevents all replicas from landing in one failure domain.
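- PriorityClass protects platform-critical workloads under resource pressure: higher-priority pods are scheduled first and can preempt lower-priority ones.
- PodDisruptionBudget limits voluntary disruptions such as node drains, but can block node maintenance if it permits zero evictions.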
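
As an illustration, the placement controls above could be combined in one pod template roughly as follows. This is a minimal sketch rather than a recommended baseline: the workload name `checkout`, the image, the `dedicated=payments` taint, the `example.com/pool` node label, the `platform-critical` PriorityClass, and every resource value are hypothetical placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout                       # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      priorityClassName: platform-critical   # assumes this PriorityClass already exists
      tolerations:
        # Permits scheduling onto nodes tainted dedicated=payments:NoSchedule (hypothetical taint).
        - key: dedicated
          operator: Equal
          value: payments
          effect: NoSchedule
      affinity:
        nodeAffinity:
          # Hard requirement: only nodes carrying the (hypothetical) pool label are eligible.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: example.com/pool
                    operator: In
                    values: ["payments"]
      topologySpreadConstraints:
        # Keeps the per-zone replica count within 1 of each other.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: checkout
      containers:
        - name: app
          image: example.com/checkout:1.0    # placeholder image
          resources:
            requests:                        # what the scheduler reserves on the node
              cpu: 500m
              memory: 512Mi
            limits:                          # what is enforced at runtime
              memory: 512Mi
```

The toleration only permits placement on the tainted nodes; the affinity rule is what steers the pods there. Setting the memory limit equal to the request and leaving CPU without a limit is one common choice, not a rule.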
Senior guidance
For critical services, review workload design as a tuple:
replica strategy + resource requests + probes + topology spread + PDB + rollout policy
Do not tune autoscaling before requests and probes are trustworthy. Bad requests corrupt scheduling; bad probes corrupt availability.
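
Two elements of that tuple, probes and rollout policy, might look like this in a Deployment. The sketch assumes an HTTP service on port 8080 that exposes separate readiness and liveness endpoints; the paths, names, image, and timing values are hypothetical and would need tuning against real startup behavior.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout                     # hypothetical workload name
spec:
  replicas: 4
  progressDeadlineSeconds: 300       # report the rollout as stalled after 5 minutes without progress
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                    # add at most one extra pod during the rollout
      maxUnavailable: 0              # never drop below the desired ready count
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: app
          image: example.com/checkout:1.0   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:            # gates traffic and rollout progress
            httpGet:
              path: /ready           # hypothetical endpoint
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:             # restarts the container if it stops responding
            httpGet:
              path: /live            # hypothetical endpoint
              port: 8080
            periodSeconds: 10
            failureThreshold: 6
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
```

With `maxUnavailable: 0`, a new pod that never passes its readiness probe holds the rollout at the surge pod. Once `progressDeadlineSeconds` expires the Deployment is marked as not progressing, but it is not rolled back automatically.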
Failure modes
- Pending pods because requested resources do not fit any node profile.
- Rollout deadlock because readiness never becomes true and maxUnavailable is too strict.
- Zone outage impact because topology spread was not configured.
- Node drain blocked by overly strict PDBs.
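
For the last failure mode, a budget that leaves room for a drain could be sketched as below. It assumes a workload labeled `app: checkout` running more than one replica; the names and the value of `maxUnavailable` are illustrative only.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb                 # hypothetical name
spec:
  # Allows one pod at a time to be evicted, so a node drain can make progress.
  # By contrast, minAvailable equal to the replica count would allow zero
  # evictions and stall the drain.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: checkout
```

Expressing the budget as `maxUnavailable` rather than a fixed `minAvailable` keeps it valid when the replica count changes, which is one reason it is often preferred for autoscaled workloads.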