Backup and Disaster Recovery

Disaster recovery is not a tool. It is an agreement about data loss, downtime, restore order, and who can execute the runbook.

What to back up

Scope	Why
etcd	Control plane state and Kubernetes objects.
Application data	Databases, queues, object storage, and external systems.
Kubernetes manifests	Desired state for rebuild and drift recovery.
Secrets	Required for restore, but must be protected and rotated.
Platform services	Argo CD, monitoring, policy, ingress, certificate state.
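One way to keep this scope list honest is to compare it against the backup jobs that actually exist. The sketch below assumes a hypothetical inventory that maps each backup job name to the scopes it covers; the job names and the inventory format are illustrative and do not come from any particular tool.

```python
from typing import Dict, List

# Scopes taken from the table above, lowercased for comparison.
REQUIRED_SCOPES = [
    "etcd",
    "application data",
    "kubernetes manifests",
    "secrets",
    "platform services",
]


def missing_scopes(backup_jobs: Dict[str, List[str]]) -> List[str]:
    """Return the required scopes that no backup job claims to cover.

    backup_jobs is a hypothetical inventory: job name -> scopes it covers.
    """
    covered = {
        scope.strip().lower()
        for scopes in backup_jobs.values()
        for scope in scopes
    }
    return [scope for scope in REQUIRED_SCOPES if scope not in covered]
```

An empty result only shows that every scope has an owner. It says nothing about whether those backups can be restored.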

RPO and RTO

RPO (recovery point objective) answers how much data loss is acceptable.
RTO (recovery time objective) answers how long restore may take.
Both must be measured by restore drills, not estimated from tool documentation.
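A minimal sketch of how a restore drill could be turned into measured numbers. It assumes three timestamps are recorded per drill (last usable backup, simulated failure, and the moment the restored service passed validation); the `DrillResult` type and its field names are illustrative, not part of any tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    last_backup_at: datetime  # newest backup that could actually be restored
    incident_at: datetime     # moment the (simulated) failure occurred
    validated_at: datetime    # moment the restored service passed validation


def achieved_rpo(drill: DrillResult) -> timedelta:
    """Data written in this window is lost: failure time minus last usable backup."""
    return drill.incident_at - drill.last_backup_at


def achieved_rto(drill: DrillResult) -> timedelta:
    """Downtime as experienced by users: validation time minus failure time."""
    return drill.validated_at - drill.incident_at


def meets_targets(drill: DrillResult, rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True only if both measured values are within the agreed targets."""
    return achieved_rpo(drill) <= rpo_target and achieved_rto(drill) <= rto_target
```

For example, a drill with a backup at 02:00, a failure at 02:45, and validation at 05:15 yields a measured RPO of 45 minutes and a measured RTO of 2 hours 30 minutes, which would miss a 2-hour RTO target.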

Restore order

1. Restore or rebuild the control plane.
2. Restore platform services required for deployment and policy.
3. Restore secrets and external dependencies.
4. Restore stateful workloads.
5. Restore stateless workloads.
6. Validate traffic, data consistency, and telemetry.
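The restore order above can be sketched as a runbook runner that executes steps strictly in sequence and halts at the first failure, so that later steps never run against a missing dependency. The step names mirror the list; the callables are placeholders for whatever each team's real restore actions are, and `run_restore` is a hypothetical helper.

```python
from typing import Callable, List, Tuple

# Step names mirror the restore order described above.
RESTORE_ORDER = [
    "control plane",
    "platform services",
    "secrets and external dependencies",
    "stateful workloads",
    "stateless workloads",
    "validation",
]


def run_restore(steps: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run each (name, action) pair in order and return the completed step names.

    An action returns True on success. The first failing action raises
    RuntimeError, and no later step is attempted.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"restore halted at step: {name}")
        completed.append(name)
    return completed
```

Halting instead of continuing is a deliberate choice in this sketch: a stateful workload restored before its secrets exist tends to fail in ways that are harder to diagnose than a stopped runbook.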

Failure modes

Backups exclude CRDs, so the API server cannot recognize the restored custom resources.
Backup tool has cluster credentials but no tested restore process.
Object storage bucket is in the same failure domain as the cluster.
Restore succeeds technically but application data is inconsistent.
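These failure modes can be turned into a pre-incident audit. The sketch below assumes a hypothetical `BackupPlan` record filled in by hand or from an inventory; every field name is illustrative, and the 90-day drill threshold is an arbitrary assumption rather than a recommendation from this page.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class BackupPlan:
    includes_crds: bool                    # CRDs are backed up with custom resources
    last_restore_drill: Optional[datetime] # None if restore was never exercised
    cluster_failure_domain: str            # e.g. a region or account identifier
    bucket_failure_domain: str             # where the backup bucket lives
    consistency_check_defined: bool        # application-level data check exists


def audit(plan: BackupPlan, now: datetime,
          max_drill_age: timedelta = timedelta(days=90)) -> List[str]:
    """Return one finding per failure mode the plan is exposed to."""
    findings: List[str] = []
    if not plan.includes_crds:
        findings.append("backups exclude CRDs")
    if plan.last_restore_drill is None or now - plan.last_restore_drill > max_drill_age:
        findings.append("no recent tested restore")
    if plan.cluster_failure_domain == plan.bucket_failure_domain:
        findings.append("backup bucket shares the cluster's failure domain")
    if not plan.consistency_check_defined:
        findings.append("no application data consistency check")
    return findings
```

An empty list means none of the four listed failure modes was detected from the recorded facts. It is not evidence that a restore will succeed.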
