Kubernetes Disaster Recovery

Implement robust disaster recovery strategies for Kubernetes clusters

Implementing robust disaster recovery (DR) strategies is crucial for maintaining business continuity. This guide covers essential DR practices for Kubernetes.

Prerequisites

  • Basic understanding of Kubernetes
  • Access to a Kubernetes cluster
  • kubectl CLI tool installed
  • Familiarity with backup concepts

Project Structure

.
├── disaster-recovery/
│   ├── backup/            # Backup configurations
│   ├── restore/           # Restore procedures
│   ├── drills/            # DR test scenarios
│   └── policies/          # DR policies
└── monitoring/
    ├── backup-metrics/    # Backup monitoring
    └── alerts/            # DR alert configs

Backup Configuration

1. Velero Setup

# Note: recurring backups are defined with a Schedule resource; the Backup
# kind has no schedule field. The nested template is an ordinary backup spec.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 1 * * *"        # daily at 01:00
  template:
    includedNamespaces:
    - "*"
    excludedNamespaces:
    - kube-system
    ttl: 720h                  # retain each backup for 30 days
    hooks:
      resources:
      - name: backup-hook
        includedNamespaces:
        - default
        pre:
        - exec:
            command: ["/bin/sh", "-c", "sync"]   # example: flush writes before the backup
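
Velero itself must be installed with an object-storage plugin before any Schedule or Backup resources are processed. A minimal sketch using the AWS plugin; the plugin version tag, bucket, region, and credentials file are placeholders:

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-backup-bucket \
  --backup-location-config region=us-west-1 \
  --secret-file ./credentials-velero

# Confirm the server is up and the schedule was accepted
kubectl get pods -n velero
velero schedule get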

2. Persistent Volume Backup

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-hostpath-snapclass
driver: hostpath.csi.k8s.io    # test driver; substitute your cluster's CSI driver
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snapshot          # must live in the same namespace as the PVC
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass
  source:
    persistentVolumeClaimName: data-pvc
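
Restoring a snapshot means creating a new PVC that names it as a dataSource; in the sketch below the storage class and size are placeholders:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc-restored
spec:
  storageClassName: csi-hostpath-sc      # placeholder; must be backed by the same CSI driver
  dataSource:
    name: data-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi                      # at least the size of the snapshot's source PVC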

Cluster Recovery

1. Backup Storage Location

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-backup-bucket
  config:
    region: us-west-1
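
Velero periodically validates access to the bucket; once the location is applied, its status can be checked from the CLI:

velero backup-location get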

2. Recovery Plan

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-production
  namespace: velero
spec:
  scheduleName: daily-backup   # restore from the most recent backup the schedule produced
  includedNamespaces:
  - "*"
  restorePVs: true
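
In practice restores are usually triggered from the CLI rather than by applying a manifest, and progress is followed with describe and logs:

velero restore create restore-production --from-schedule daily-backup
velero restore describe restore-production
velero restore logs restore-production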

High Availability Configuration

1. Multi-Zone Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ha-app
spec:
  replicas: 3
  selector:                    # required; must match the pod template labels
    matchLabels:
      app: ha-app
  template:
    metadata:
      labels:
        app: ha-app
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: ha-app
      containers:
      - name: app
        image: nginx:1.27      # placeholder workload
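
Spreading replicas across zones only helps if voluntary disruptions (node drains, cluster upgrades) cannot evict them all at once; a PodDisruptionBudget complements the spread constraint. A minimal sketch:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ha-app-pdb
spec:
  minAvailable: 2              # never drain below two running replicas
  selector:
    matchLabels:
      app: ha-app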

2. Pod Anti-Affinity

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dr-ready-app
spec:
  replicas: 3
  selector:                    # required; must match the pod template labels
    matchLabels:
      app: dr-ready-app
  template:
    metadata:
      labels:
        app: dr-ready-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - dr-ready-app
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: app
        image: nginx:1.27      # placeholder workload
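
The required rule refuses to co-locate two replicas on one node, which can leave pods unschedulable if a failure shrinks the cluster below the replica count. A sketch of a softer variant using a preferred rule instead:

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: dr-ready-app
              topologyKey: "kubernetes.io/hostname"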

Monitoring and Alerts

1. Backup Monitoring

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backup-monitor
  namespace: velero
spec:
  selector:
    matchLabels:
      app: velero              # must match the labels on the Velero metrics Service
  endpoints:
  - port: metrics              # the named port exposing Velero's Prometheus metrics

2. DR Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dr-alerts
spec:
  groups:
  - name: disaster-recovery
    rules:
    - alert: BackupFailure
      # velero_backup_failure_total is a counter that never resets, so alert on
      # recent increases rather than on the raw value staying above zero
      expr: increase(velero_backup_failure_total[1h]) > 0
      labels:
        severity: critical
      annotations:
        description: A Velero backup has failed within the last hour
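
Failures are only half the picture: a schedule that silently stops producing backups breaks the RPO just as surely. A sketch of an additional rule for the same group, assuming Velero's velero_backup_last_successful_timestamp gauge is being scraped:

    - alert: BackupTooOld
      expr: time() - velero_backup_last_successful_timestamp{schedule="daily-backup"} > 86400
      labels:
        severity: warning
      annotations:
        description: No successful daily-backup in over 24 hours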

DR Testing Procedures

1. Test Job

apiVersion: batch/v1
kind: Job
metadata:
  name: dr-test
spec:
  backoffLimit: 0              # a failed drill should surface, not retry silently
  template:
    spec:
      containers:
      - name: dr-test
        image: dr-test:latest            # placeholder image containing the drill scripts
        command: ["./test-recovery.sh"]
      restartPolicy: Never
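
The drill can be run and gated (for example in CI) with kubectl wait; the manifest filename is a placeholder:

kubectl apply -f dr-test-job.yaml
kubectl wait --for=condition=complete job/dr-test --timeout=15m
kubectl logs job/dr-test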

2. Validation Steps

apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-validation
data:
  validate.sh: |
    #!/bin/bash
    set -euo pipefail
    # Validate that restored services are present
    kubectl get services
    # Validate restored data; database name and credential variable are placeholders
    kubectl exec db-0 -- sh -c \
      'mysql -u root -p"$MYSQL_ROOT_PASSWORD" appdb -e "SELECT COUNT(*) FROM users;"'
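
The script can be mounted into the drill Job above rather than baked into the image; a sketch of the relevant pod-spec fragment:

      containers:
      - name: dr-test
        image: dr-test:latest
        command: ["/scripts/validate.sh"]
        volumeMounts:
        - name: validation
          mountPath: /scripts
      volumes:
      - name: validation
        configMap:
          name: dr-validation
          defaultMode: 0755    # make validate.sh executable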

Best Practices Checklist

  1. Regular backups
  2. Test recovery procedures
  3. Multi-zone deployment
  4. Data replication
  5. Monitoring setup
  6. Alert configuration
  7. Documentation
  8. Team training
  9. Regular drills
  10. Policy updates

Recovery Objectives

The targets below are expressed as RTO (Recovery Time Objective, the maximum tolerable downtime) and RPO (Recovery Point Objective, the maximum tolerable window of data loss).

Critical Services

  • RTO: 1 hour
  • RPO: 5 minutes
  • Automated recovery
  • Active-active setup

Non-Critical Services

  • RTO: 4 hours
  • RPO: 1 hour
  • Manual recovery
  • Active-passive setup

DR Scenarios

1. Zone Failure

  • Automatic failover
  • Multi-zone deployment
  • Load balancer updates
  • DNS updates

2. Region Failure

  • Cross-region recovery
  • Data replication
  • DNS failover
  • Application migration

3. Data Corruption

  • Point-in-time recovery
  • Backup restoration
  • Data validation
  • Service verification

Common DR Pitfalls

  1. Untested backups
  2. Missing documentation
  3. Incomplete coverage
  4. Poor monitoring
  5. Inadequate testing

DR Documentation

1. Recovery Runbook

apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-runbook
data:
  steps.md: |
    # Recovery Steps
    1. Assess the failure
    2. Initiate recovery plan
    3. Restore from backup
    4. Validate services
    5. Update DNS
    6. Notify stakeholders

2. Contact Information

apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-contacts
data:
  contacts.yaml: |
    teams:
      infrastructure:
        primary: "oncall@company.com"
        secondary: "backup@company.com"
      database:
        primary: "db-oncall@company.com"

Conclusion

Implementing these disaster recovery practices helps ensure business continuity and minimizes data loss when failures occur. Regularly testing and updating DR procedures is essential for maintaining readiness.
