Kubernetes Disaster Recovery
Implement robust disaster recovery strategies for Kubernetes clusters
Implementing robust disaster recovery (DR) strategies is crucial for maintaining business continuity. This guide covers essential DR practices for Kubernetes.
Prerequisites
- Basic understanding of Kubernetes
- Access to a Kubernetes cluster
- kubectl CLI tool installed
- Familiarity with backup concepts
Project Structure
.
├── disaster-recovery/
│   ├── backup/            # Backup configurations
│   ├── restore/           # Restore procedures
│   ├── drills/            # DR test scenarios
│   └── policies/          # DR policies
└── monitoring/
    ├── backup-metrics/    # Backup monitoring
    └── alerts/            # DR alert configs
Backup Configuration
1. Velero Setup
Velero's schedule field belongs to the Schedule resource (not to Backup), so a recurring daily backup is declared like this:
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 1 * * *"
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
    ttl: 720h
    hooks:
      resources:
        - name: backup-hook
          includedNamespaces:
            - default
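If the Velero CLI is installed (an assumption, as is the placeholder backup name manual-test), you can exercise the schedule's template immediately instead of waiting for the 01:00 run:

# Trigger an ad-hoc backup using the schedule's template, then inspect the result.
velero backup create manual-test --from-schedule daily-backup
velero backup get
velero backup describe manual-test --details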
2. Persistent Volume Backup
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-hostpath-snapclass
driver: hostpath.csi.k8s.io
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snapshot
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass
  source:
    persistentVolumeClaimName: data-pvc
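Before relying on snapshots in a recovery plan, confirm that the CSI driver actually completed one. A quick check (data-snapshot refers to the example above; your CSI driver must support the snapshot API):

# The snapshot can be used for restores once readyToUse reports true.
kubectl get volumesnapshotclass
kubectl get volumesnapshot data-snapshot
kubectl get volumesnapshot data-snapshot -o jsonpath='{.status.readyToUse}'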
Cluster Recovery
1. Cluster Backup
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-backup-bucket
  config:
    region: us-west-1
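After applying the storage location, verify that Velero can reach the bucket. A minimal check, assuming the velero CLI and object-storage credentials are already set up (my-backup-bucket is a placeholder):

# The location should report an Available phase; Unavailable usually means
# missing credentials or a bucket/region mismatch.
velero backup-location get
kubectl -n velero get backupstoragelocation default -o yaml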
2. Recovery Plan
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-production
  namespace: velero
spec:
  # Restores from the most recent successful backup created by the daily-backup Schedule.
  scheduleName: daily-backup
  includedNamespaces:
    - "*"
  restorePVs: true
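The same restore can be driven from the CLI, which is often quicker during an incident. A sketch (restore-production is just a name; --from-schedule resolves to the most recent successful backup created by daily-backup):

# Create the restore, then follow its progress and logs.
velero restore create restore-production --from-schedule daily-backup
velero restore describe restore-production
velero restore logs restore-production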
High Availability Configuration
1. Multi-Zone Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ha-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ha-app
  template:
    metadata:
      labels:
        app: ha-app
    spec:
      containers:
        - name: ha-app
          image: ha-app:latest
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: ha-app
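To confirm the spread constraint is working, map each replica to its node and zone. This assumes your nodes carry the standard topology.kubernetes.io/zone label:

# Show each node's zone, then check which nodes the replicas landed on.
kubectl get nodes -L topology.kubernetes.io/zone
kubectl get pods -l app=ha-app -o wide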
2. Pod Anti-Affinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dr-ready-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: dr-ready-app
  template:
    metadata:
      labels:
        app: dr-ready-app
    spec:
      containers:
        - name: dr-ready-app
          image: dr-ready-app:latest
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - dr-ready-app
              topologyKey: "kubernetes.io/hostname"
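A quick way to verify the anti-affinity rule is to list each replica alongside the node it was scheduled onto; every pod should show a different NODE value:

# Duplicate NODE values would mean the rule is not being enforced.
kubectl get pods -l app=dr-ready-app \
  -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName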
Monitoring and Alerts
1. Backup Monitoring
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backup-monitor
spec:
  selector:
    matchLabels:
      app: velero
  endpoints:
    - port: metrics
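Before depending on the ServiceMonitor, confirm that Velero is actually exposing metrics. Velero's server serves metrics on port 8085 by default; the deployment and namespace names below assume a standard install:

# Port-forward the Velero deployment and look for its backup metrics.
kubectl -n velero port-forward deploy/velero 8085:8085 &
curl -s http://localhost:8085/metrics | grep velero_backup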
2. DR Alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dr-alerts
spec:
  groups:
    - name: disaster-recovery
      rules:
        - alert: BackupFailure
          # A raw counter comparison would keep firing forever after a single failure;
          # increase() limits the alert to failures seen within the last hour.
          expr: increase(velero_backup_failure_total[1h]) > 0
          for: 15m
          labels:
            severity: critical
          annotations:
            description: A Velero backup has failed within the last hour
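You can evaluate the alert expression by hand rather than waiting for a real failure. This sketch assumes a Prometheus Operator setup where the prometheus-operated service lives in the monitoring namespace; adjust both to your cluster:

# Query the alert expression directly through the Prometheus HTTP API.
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=increase(velero_backup_failure_total[1h])'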
DR Testing Procedures
1. Test Job
apiVersion: batch/v1
kind: Job
metadata:
  name: dr-test
spec:
  template:
    spec:
      containers:
        - name: dr-test
          image: dr-test:latest
          command: ["./test-recovery.sh"]
      restartPolicy: Never
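A simple way to run the test and collect its result (dr-test.yaml is a placeholder filename for the Job manifest above):

# Launch the DR test, wait for completion, then read its output.
kubectl apply -f dr-test.yaml
kubectl wait --for=condition=complete job/dr-test --timeout=600s
kubectl logs job/dr-test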
2. Validation Steps
apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-validation
data:
  validate.sh: |
    #!/bin/bash
    # Validate restored services
    kubectl get services
    # Validate restored data
    kubectl exec db-0 -- mysql -e "SELECT COUNT(*) FROM users"
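Since the validation script only issues kubectl commands, one lightweight way to run it is to pull it out of the ConfigMap and execute it locally against the restored cluster (assumes your current kubectl context points at that cluster):

# Extract the script from the ConfigMap and run it with the current context.
kubectl get configmap dr-validation \
  -o jsonpath='{.data.validate\.sh}' | bash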
Best Practices Checklist
- Regular backups
- Test recovery procedures
- Multi-zone deployment
- Data replication
- Monitoring setup
- Alert configuration
- Documentation
- Team training
- Regular drills
- Policy updates
Recovery Time Objectives
Critical Services
- RTO (Recovery Time Objective): 1 hour
- RPO (Recovery Point Objective): 5 minutes
- Automated recovery
- Active-active setup
Non-Critical Services
- RTO: 4 hours
- RPO: 1 hour (see the hourly backup sketch below)
- Manual recovery
- Active-passive setup
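Backup cadence has to line up with the stated RPO: the daily 01:00 schedule above only supports an RPO of roughly 24 hours on its own, so tighter objectives need more frequent backups or storage-level replication. A minimal sketch of an hourly backup schedule for the non-critical tier (the name hourly-backup and the 7-day TTL are illustrative):

# Hourly backups cap data loss at about one hour for anything Velero covers.
velero schedule create hourly-backup --schedule="0 * * * *" --ttl 168h0m0s
velero schedule get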
DR Scenarios
1. Zone Failure
- Automatic failover
- Multi-zone deployment
- Load balancer updates
- DNS updates
2. Region Failure
- Cross-region recovery
- Data replication
- DNS failover
- Application migration
3. Data Corruption
- Point-in-time recovery (see the restore sketch below)
- Backup restoration
- Data validation
- Service verification
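For the data-corruption scenario, point-in-time recovery with Velero means choosing the newest backup taken before the corruption and restoring only the affected namespaces. A sketch (the timestamped backup name and the production namespace are placeholders):

# Find a backup that predates the corruption, then restore just what was damaged.
velero backup get
velero restore create pitr-restore \
  --from-backup daily-backup-20240101010000 \
  --include-namespaces production \
  --restore-volumes=true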
Common DR Pitfalls
- Untested backups
- Missing documentation
- Incomplete coverage
- Poor monitoring
- Inadequate testing
DR Documentation
1. Recovery Runbook
apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-runbook
data:
  steps.md: |
    # Recovery Steps
    1. Assess the failure
    2. Initiate recovery plan
    3. Restore from backup
    4. Validate services
    5. Update DNS
    6. Notify stakeholders
2. Contact Information
apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-contacts
data:
  contacts.yaml: |
    teams:
      infrastructure:
        primary: "oncall@company.com"
        secondary: "backup@company.com"
      database:
        primary: "db-oncall@company.com"
Conclusion
Implementing these disaster recovery practices helps maintain business continuity and keeps data loss within your defined objectives when failures occur. Test and update your DR procedures regularly so they are ready when you need them.