Backup And Disaster Recovery

Disaster recovery is not a tool. It is an agreement about data loss, downtime, restore order, and who can execute the runbook.

What to back up

Scope                 Why
etcd                  Control plane state and Kubernetes objects.
Application data      Databases, queues, object storage, and external systems.
Kubernetes manifests  Desired state for rebuild and drift recovery.
Secrets               Required for restore, but must be protected and rotated.
Platform services     Argo CD, monitoring, policy, ingress, certificate state.

RPO and RTO

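  • RPO (recovery point objective) answers how much data loss is acceptable, measured as the time between the last usable backup and the failure.
  • RTO (recovery time objective) answers how long restore may take, measured from the failure until the service is validated again.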
  • Both must be measured by restore drills, not estimated from tool documentation.
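
As a minimal sketch of what "measured by restore drills" can look like, the Python below derives RPO and RTO from three timestamps recorded during a drill and compares them with agreed targets. The DrillResult class, its field names, the meets_targets helper, and the sample timestamps are all hypothetical and made up for illustration; they are not part of any backup tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Timestamps recorded during one restore drill (hypothetical structure)."""
    last_good_backup: datetime   # newest backup that restored cleanly
    incident_start: datetime     # simulated moment of failure
    service_restored: datetime   # moment validation checks passed

    @property
    def measured_rpo(self) -> timedelta:
        # Data written between the last usable backup and the failure is lost.
        return self.incident_start - self.last_good_backup

    @property
    def measured_rto(self) -> timedelta:
        # Downtime runs from the failure until the service is validated again.
        return self.service_restored - self.incident_start


def meets_targets(drill: DrillResult, rpo_target: timedelta, rto_target: timedelta) -> bool:
    """Return True only if both measured values are within the agreed targets."""
    return drill.measured_rpo <= rpo_target and drill.measured_rto <= rto_target


# Made-up drill timestamps, for illustration only.
drill = DrillResult(
    last_good_backup=datetime(2024, 1, 10, 2, 0),
    incident_start=datetime(2024, 1, 10, 9, 30),
    service_restored=datetime(2024, 1, 10, 11, 15),
)

print("measured RPO:", drill.measured_rpo)
print("measured RTO:", drill.measured_rto)
print("within targets:", meets_targets(drill, timedelta(hours=8), timedelta(hours=2)))
```

Recording these values for every drill, rather than once, shows whether the numbers drift as data volume and cluster size grow.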

Restore order

  1. Restore or rebuild the control plane.
  2. Restore platform services required for deployment and policy.
  3. Restore secrets and external dependencies.
  4. Restore stateful workloads.
  5. Restore stateless workloads.
  6. Validate traffic, data consistency, and telemetry.
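
The order above can be encoded so that a runbook stops at the first failed step instead of continuing onto a broken foundation. The following is a sketch only: run_restore, RESTORE_ORDER, and the placeholder action are hypothetical names, and a real runbook would replace the placeholder with calls to whatever tooling performs each step.

```python
from typing import Callable, List, Tuple

# A step is a name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]

# Step names mirror the restore order listed above.
RESTORE_ORDER = [
    "control plane",
    "platform services",
    "secrets and external dependencies",
    "stateful workloads",
    "stateless workloads",
    "validation",
]


def run_restore(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break  # later steps depend on this one, so do not continue
        completed.append(name)
    return completed


def placeholder() -> bool:
    # Stand-in for real restore logic; always reports success.
    return True


steps: List[Step] = [(name, placeholder) for name in RESTORE_ORDER]
done = run_restore(steps)
print("completed:", done)
print("restore finished:", len(done) == len(steps))
```

Returning the list of completed steps gives the operator a clear resume point when a drill or a real restore is interrupted.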

Failure modes

  • Backups exclude CRDs, so the API server does not recognize the restored custom resources.
  • Backup tool has cluster credentials but no tested restore process.
  • Object storage bucket is in the same failure domain as the cluster.
  • Restore succeeds technically but application data is inconsistent.
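
Each of these failure modes can be turned into a check that runs before an incident rather than during one. The sketch below assumes the relevant facts have already been collected into a report; BackupReport, its fields, find_risks, and the 90-day drill threshold are hypothetical, and comparing regions is only a rough proxy for "same failure domain".

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class BackupReport:
    """Facts gathered about one backup setup. Every field name is hypothetical."""
    crd_kinds: Set[str]                      # kinds whose CRDs are in the backup
    custom_resource_kinds: Set[str]          # kinds of custom resources in the backup
    bucket_region: str                       # where the backup bucket lives
    cluster_region: str                      # where the cluster lives
    days_since_restore_drill: Optional[int]  # None means never drilled
    consistency_check_passed: bool           # result of an application-level data check


def find_risks(report: BackupReport, max_drill_age_days: int = 90) -> List[str]:
    """Return one message per failure mode detected in the report."""
    risks: List[str] = []

    # Custom resources are useless on restore if their CRDs were not backed up.
    missing = report.custom_resource_kinds - report.crd_kinds
    if missing:
        risks.append(
            "custom resources backed up without their CRDs: " + ", ".join(sorted(missing))
        )

    # A backup that has never been restored, or not recently, is untested.
    drill_age = report.days_since_restore_drill
    if drill_age is None or drill_age > max_drill_age_days:
        risks.append("no recent restore drill")

    # Assumption: region equality stands in for a shared failure domain.
    if report.bucket_region == report.cluster_region:
        risks.append("backup bucket shares a region with the cluster")

    # A technically successful restore still needs an application-level check.
    if not report.consistency_check_passed:
        risks.append("application data consistency not verified")

    return risks


# Made-up report, for illustration only.
report = BackupReport(
    crd_kinds={"Certificate"},
    custom_resource_kinds={"Certificate", "Application"},
    bucket_region="region-a",
    cluster_region="region-a",
    days_since_restore_drill=None,
    consistency_check_passed=True,
)
for risk in find_risks(report):
    print(risk)
```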