Kubernetes Backup: Velero, etcd Snapshots, and Disaster Recovery
Part of the Cybersecurity Skills Guide — This article is one deep-dive in our complete guide series.
By HADESS Team | February 28, 2026 | Updated: February 28, 2026 | 5 min read
Your cluster will fail. Hardware dies, someone runs kubectl delete namespace production, or a bad deployment corrupts persistent data. Without tested backups, you are rebuilding from scratch. Here is how to set up Kubernetes backup properly.
etcd Snapshots: Backing Up Cluster State
All Kubernetes state lives in etcd. Losing etcd means losing every resource definition in your cluster — deployments, services, config maps, secrets, RBAC rules, everything.
Take regular etcd snapshots:
```bash
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```
Verify every snapshot after creation with `etcdctl snapshot status` (on etcd 3.5 and later, `etcdutl snapshot status` is the preferred form). An unverified backup is not a backup. Store snapshots off-cluster: S3, GCS, or any object storage with versioning enabled. If your cluster is gone, your backup location still needs to be accessible.
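The snapshot, verify, and store-off-cluster steps above can be chained so that an upload only happens after verification succeeds. The following is a minimal sketch, not a definitive implementation: it assumes the AWS CLI is installed, and the `BACKUP_DIR` and `S3_BUCKET` values (including the bucket name `my-etcd-backups`) are illustrative placeholders.

```bash
#!/usr/bin/env bash
# Sketch: snapshot etcd, verify it, then copy it off-cluster.
# BACKUP_DIR and S3_BUCKET are illustrative assumptions.
BACKUP_DIR="${BACKUP_DIR:-/backup}"
S3_BUCKET="${S3_BUCKET:-my-etcd-backups}"

# Build a dated snapshot path; the date defaults to today (YYYYMMDD).
snapshot_path() {
  printf '%s/etcd-%s.db\n' "$BACKUP_DIR" "${1:-$(date +%Y%m%d)}"
}

backup_etcd() {
  local snap
  snap="$(snapshot_path)"

  # Same command and certificate paths as in the article.
  ETCDCTL_API=3 etcdctl snapshot save "$snap" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key || return 1

  # Verification step: stop here if the snapshot cannot be read.
  ETCDCTL_API=3 etcdctl snapshot status "$snap" --write-out=table || return 1

  # Upload only after verification succeeds.
  aws s3 cp "$snap" "s3://${S3_BUCKET}/$(basename "$snap")"
}
```

Calling `backup_etcd` from a cron job or systemd timer on a control plane node would give a regular cadence; the function returns non-zero if any step fails, so the scheduler can alert on it.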
For managed Kubernetes (EKS, GKE, AKS), the provider handles etcd. But you still need to back up your workloads and persistent data.
Velero: Workload and Volume Backup
Velero backs up Kubernetes resources and persistent volumes together. It talks to cloud provider APIs for volume snapshots and stores resource manifests in object storage.
Basic setup:
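```bash
# --plugins is required when --provider is set; the tag below is an example,
# pick the plugin release that matches your Velero version.
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket my-velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1
```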
Create scheduled backups:
```bash
# Standard five-field cron expression: every day at 02:00.
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces production,staging \
  --ttl 720h
```
Key practices:
- Use `--include-namespaces` rather than backing up everything. System namespaces rarely need backup from Velero since they are managed by your cluster provisioning.
- Set a TTL to avoid unbounded storage growth.
- Label-based selection lets you back up specific workloads: `--selector app=critical`.
- Restic/Kopia integration (file system backup) handles volumes that do not support native snapshots, such as NFS. Note that `hostPath` volumes are not supported by Velero's file system backup.
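The label-selection and file system backup practices above could look like the following. This is a sketch under stated assumptions: the function names and the `VELERO_BIN` variable are hypothetical helpers (setting `VELERO_BIN=echo` prints the command instead of running it), and the file system backup variant assumes Velero was installed with `--use-node-agent`.

```bash
#!/usr/bin/env bash
# Sketch: targeted Velero backups. Set VELERO_BIN=echo for a dry run.
VELERO_BIN="${VELERO_BIN:-velero}"

# Back up only resources carrying a given label in one namespace.
# Usage: backup_by_label <backup-name> <namespace> <label-selector>
backup_by_label() {
  "$VELERO_BIN" backup create "$1" \
    --include-namespaces "$2" \
    --selector "$3" \
    --ttl 720h
}

# Back up a namespace and copy pod volumes with file system backup
# (Kopia/Restic) instead of native volume snapshots.
# Usage: backup_with_fs_backup <backup-name> <namespace>
backup_with_fs_backup() {
  "$VELERO_BIN" backup create "$1" \
    --include-namespaces "$2" \
    --default-volumes-to-fs-backup \
    --ttl 720h
}
```

For example, `backup_by_label critical-20260228 production app=critical` creates a one-off backup of the labelled workloads with the same 30-day TTL as the schedule.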
Disaster Recovery Planning
Backups without recovery testing are wishful thinking. Schedule regular restore drills:
1. Spin up a test cluster (or use a separate namespace)
2. Restore from backup using `velero restore create --from-backup daily-backup-20260225`
3. Verify application functionality: not just that pods are running, but that the application works end-to-end
4. Document the restore time: leadership will ask for your RTO, and you need a real number
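The restore and timing steps of a drill can be scripted so that every run produces a number. The sketch below is one possible approach, not a prescribed procedure: `restore_drill`, `VELERO_BIN`, and the `restore-drill` namespace are hypothetical names, and it uses `--namespace-mappings` to restore into a scratch namespace rather than over the original.

```bash
#!/usr/bin/env bash
# Sketch: timed restore drill. Set VELERO_BIN=echo for a dry run.
VELERO_BIN="${VELERO_BIN:-velero}"

# Usage: restore_drill <backup-name> <source-namespace> <drill-namespace>
restore_drill() {
  local backup="$1" src="$2" dst="$3" start end
  start="$(date +%s)"

  # --namespace-mappings restores <src> into <dst>;
  # --wait blocks until Velero reports the restore as finished.
  "$VELERO_BIN" restore create "drill-${backup}" \
    --from-backup "$backup" \
    --namespace-mappings "${src}:${dst}" \
    --wait || return 1

  end="$(date +%s)"
  echo "restore of ${backup} took $(( end - start ))s"
}
```

The printed duration covers only Velero's restore phase. A real RTO figure also has to include the end-to-end application checks from step 3 and any time spent provisioning the target cluster.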
Cross-Cluster Restore
Velero supports restoring to a different cluster as long as both clusters can access the same backup storage location. This enables:
- Migration between cloud providers or regions
- Cluster upgrades — back up the old cluster, provision a new one, restore
- Multi-region DR — maintain a warm standby cluster that can receive restores
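Watch for differences in StorageClasses and Ingress configurations between source and target clusters. Velero's mapping features handle naming differences: `--namespace-mappings` on restore renames namespaces, and the change-storage-class plugin ConfigMap translates StorageClass names.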
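StorageClass differences between clusters are usually handled with a ConfigMap in the `velero` namespace on the target cluster. As a hedged sketch: the function name `storage_class_map` and the class names in the usage note are hypothetical, while the ConfigMap name and labels follow the shape described in Velero's restore documentation.

```bash
#!/usr/bin/env bash
# Sketch: print the ConfigMap Velero reads to remap StorageClass names on restore.
# Usage: storage_class_map <old-class> <new-class> | kubectl apply -f -
storage_class_map() {
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    velero.io/change-storage-class: RestoreItemAction
data:
  ${1}: ${2}
EOF
}
```

For instance, `storage_class_map gp2 managed-premium | kubectl apply -f -` on the target cluster would restore volumes that used `gp2` with `managed-premium` instead. Apply it before running the restore.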
Securing Your Backups
Backups contain secrets, config maps with credentials, and application data. Treat backup storage with the same security as production:
- Enable encryption at rest on your backup bucket
- Restrict access with IAM policies — only the Velero service account should write
- Enable object versioning so that accidentally or maliciously deleted backups can be recovered
- Monitor backup completion with alerts on failures
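For an S3 bucket, the encryption and versioning items above map to a few AWS CLI calls. This is a minimal sketch assuming AWS and SSE-S3 (`AES256`) encryption; `harden_backup_bucket` and `AWS_BIN` are hypothetical helper names, and the public access block is an extra precaution that the list above does not mention. IAM policies and failure alerting are not covered here.

```bash
#!/usr/bin/env bash
# Sketch: harden an S3 backup bucket. Set AWS_BIN=echo for a dry run.
AWS_BIN="${AWS_BIN:-aws}"

# Usage: harden_backup_bucket <bucket-name>
harden_backup_bucket() {
  local bucket="$1"

  # Default server-side encryption for every new object.
  "$AWS_BIN" s3api put-bucket-encryption --bucket "$bucket" \
    --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}' || return 1

  # Versioning keeps earlier object versions after an overwrite or delete.
  "$AWS_BIN" s3api put-bucket-versioning --bucket "$bucket" \
    --versioning-configuration Status=Enabled || return 1

  # Block every form of public access to the bucket.
  "$AWS_BIN" s3api put-public-access-block --bucket "$bucket" \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
}
```

Running `harden_backup_bucket my-velero-backups` once after creating the bucket is enough; the three settings persist until changed.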
—
HADESS Team consists of cybersecurity practitioners, hiring managers, and career strategists who have collectively spent 50+ years in the field.
