
Kubernetes Backup Strategy: What Actually Needs Protecting

The first instinct when someone says "back up the Kubernetes cluster" is to reach for Velero, snapshot everything, and call it done. That works for simple clusters. For production infrastructure running stateful workloads -- databases, identity providers, registries, CI/CD platforms -- it's not enough, and in some cases it's the wrong approach entirely.

The real question isn't "how do I back up Kubernetes?" It's "what actually needs protecting, and what's recoverable without a backup?"

The three categories

Everything in a Kubernetes cluster falls into one of three categories:

1. Rebuildable from code (don't back up)

If your infrastructure is GitOps-driven (and it should be), most of the cluster is rebuildable from Git:

  • Deployments, Services, ConfigMaps, Secrets -- All defined in manifests, stored in Git, applied by ArgoCD or Flux.
  • Helm releases -- Chart versions and values files live in Git.
  • Namespaces, RBAC, NetworkPolicies -- All declarative, all in Git.
  • The nodes themselves -- Worker nodes are cattle, not pets. If a node dies, drain it, replace it, join the new node to the cluster.

This is the majority of your cluster by object count. And none of it needs traditional backup. If you lose the entire cluster, you stand up new nodes, install RKE2, point ArgoCD at your Git repo, and sync. The applications come back.

The prerequisite: This only works if everything is actually in Git. If someone kubectl apply'd a manifest by hand and never committed it, that config is gone when the cluster is gone. GitOps discipline is your backup strategy for cluster state.

2. Recoverable with effort (back up if practical)

Some things can be rebuilt, but the effort is significant enough that having a backup saves hours or days:

  • etcd -- Contains all cluster state. If you lose all three control plane nodes simultaneously (rare but possible), an etcd backup lets you restore the cluster without rebuilding every application. RKE2 handles etcd snapshots automatically if configured.
  • Cert-manager certificates and keys -- Let's Encrypt will re-issue certificates, but if you hit rate limits, you're waiting. Backing up cert-manager secrets avoids this.
  • Persistent Volume Claims (data volumes) -- If the data can be regenerated (caches, indexes, build artifacts), don't bother. If regeneration takes hours, back it up.
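
The RKE2 snapshot behavior mentioned above is driven by a few settings in the server config. A sketch, assuming the default snapshot directory -- the schedule, retention, and S3 values here are illustrative, and the S3 credential settings are omitted:

```yaml
# /etc/rancher/rke2/config.yaml on each control plane node
etcd-snapshot-schedule-cron: "0 */6 * * *"   # snapshot every 6 hours
etcd-snapshot-retention: 20                  # keep the last 20 snapshots
# Optional: ship snapshots off the node entirely
etcd-s3: true
etcd-s3-endpoint: minio.yourdomain.com
etcd-s3-bucket: rke2-etcd-snapshots
# (etcd-s3 access key / secret key settings omitted)
```

Snapshots stored only on the control plane nodes don't survive losing those nodes, which is why the off-node S3 option matters.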

3. Irreplaceable data (must back up)

This is the critical category. Data that, if lost, is gone forever:

  • Databases -- PostgreSQL, MySQL, MongoDB. User data, application state, transaction history.
  • Identity provider state -- Authentik or Keycloak databases. User accounts, SSO configurations, MFA enrollments. If you lose this, every user in every application loses access.
  • Git repositories -- If you're self-hosting GitLab, the repositories are the business. The GitLab database (issues, merge requests, CI/CD configs) is equally critical.
  • Container registry images -- If you're running Harbor and your pipeline-built images only exist there, losing the registry means you can't deploy until you rebuild every image.
  • Secrets and encryption keys -- Anything that encrypts data at rest. Lose the key, lose the data.

Backup architecture

CronJobs for application-aware backups

Kubernetes CronJobs are the right primitive for scheduled backups. They run inside the cluster, have access to internal services, and produce logs you can monitor.

The pattern:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: gitlab-full-backup
  namespace: backups
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600  # kill any run stuck past 1 hour
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: your-backup-image:latest
              command: ["/scripts/backup-gitlab.sh"]
              env:
                - name: BACKUP_DEST
                  value: "/backups/gitlab"
                - name: PG_HOST
                  value: "gitlab-postgresql-rw.gitlab.svc"
              volumeMounts:
                - name: backup-storage
                  mountPath: /backups
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup-nfs-pvc

The backup script inside the container does the application-specific work: pg_dump for PostgreSQL, gitlab-backup create for GitLab, API calls for Harbor.
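
Those scripts share a common tail: prune dumps past the retention window, then refuse to exit 0 if the newest dump is missing or empty. A minimal sketch of that logic -- a throwaway directory with fake dumps stands in for the real BACKUP_DEST so it can be run anywhere:

```shell
#!/bin/sh
# Sketch of the retention + sanity-check tail shared by the backup
# scripts. A throwaway directory with fake dumps stands in for the
# real BACKUP_DEST so the logic can be exercised anywhere.
set -eu
BACKUP_DEST=$(mktemp -d)
RETENTION_DAYS=14

# Stand-ins: one fresh dump with content, one past the retention window.
printf 'fake dump data' > "$BACKUP_DEST/gitlab-20260101-020000.dump"
touch -d '30 days ago' "$BACKUP_DEST/gitlab-old.dump"

# ... pg_dump / gitlab-backup create would have run before this point ...

# Prune dumps older than the retention window.
find "$BACKUP_DEST" -name '*.dump' -type f -mtime +"$RETENTION_DAYS" -delete

# Refuse to report success if the newest dump is missing or zero bytes --
# a silent zero-byte dump is exactly the failure mode to catch here.
latest=$(ls -t "$BACKUP_DEST"/*.dump 2>/dev/null | head -n 1)
[ -n "$latest" ] && [ -s "$latest" ] || { echo "backup missing or empty" >&2; exit 1; }
echo "latest dump: $latest"
```

Exiting non-zero here matters: it is what turns a silent failure into a failed Job that the monitoring described below can alert on.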

What each service needs

PostgreSQL (via CloudNativePG):

CloudNativePG has built-in backup to object storage. Configure it in the Cluster spec:

backup:
  barmanObjectStore:
    destinationPath: s3://backups/cnpg/
    endpointURL: https://minio.yourdomain.com
    s3Credentials:
      accessKeyId:
        name: cnpg-backup-secret
        key: ACCESS_KEY_ID
      secretAccessKey:
        name: cnpg-backup-secret
        key: SECRET_ACCESS_KEY
  retentionPolicy: "14d"

This gives you continuous WAL archiving and scheduled base backups. Point-in-time recovery to any moment in the retention window. This is the gold standard for PostgreSQL backup.
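
The scheduled base backups live in a separate ScheduledBackup resource rather than the Cluster spec. A sketch, assuming a cluster named main-postgres in a databases namespace (note that CloudNativePG's schedule field takes a six-field cron expression, with seconds first):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: main-postgres-daily
  namespace: databases
spec:
  schedule: "0 0 2 * * *"   # six fields: daily at 02:00:00
  cluster:
    name: main-postgres
  backupOwnerReference: self
```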

For databases managed outside CloudNativePG, a CronJob running pg_dump to NFS or object storage works:

pg_dump -h $PG_HOST -U $PG_USER -d $PG_DB -Fc \
  -f /backups/authentik/authentik-$(date +%Y%m%d-%H%M%S).dump

GitLab:

GitLab has its own backup rake task that captures repositories, database, uploads, and CI artifacts:

gitlab-backup create STRATEGY=copy SKIP=artifacts,registry

The SKIP=registry is intentional -- Harbor handles container image storage separately. Including the registry in GitLab backup would duplicate data.

Harbor:

Harbor's database and registry storage need separate backup:

  • Database: pg_dump of the Harbor PostgreSQL instance
  • Registry blobs: If stored on Ceph, rely on Ceph-level replication and snapshots. If on local storage, copy to NFS.

Authentik:

A PostgreSQL dump captures everything. Authentik stores all state in the database -- flows, providers, applications, user accounts. The Redis data is ephemeral and doesn't need backup.

Backup destination: The 3-2-1 rule still applies

Three copies. Two different media types. One offsite.

In practice for a self-hosted Kubernetes cluster:

  1. Primary data -- Ceph (replicated 3x across OSDs)
  2. Backup copy 1 -- NFS mount on separate storage (different failure domain from Ceph)
  3. Backup copy 2 -- Object storage (MinIO on a different host, or cloud S3)

The NFS target should not be on the same Ceph cluster as the production data. If Ceph has a catastrophic failure (all monitors down, quorum lost), both your production data and backups would be affected. A dedicated NFS server -- even an old machine with spinning disks -- provides the failure domain separation you need.
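
Copies two and three are just scheduled copy-and-verify jobs. A sketch of the fan-out step, with local directories standing in for the NFS mount and the offsite bucket -- in production the second hop would be a tool like rclone or mc pointed at object storage:

```shell
#!/bin/sh
# Fan a finished backup out from the NFS target to the offsite copy.
# Local directories stand in for the NFS mount and the object-storage
# bucket; in production the second hop would be rclone/mc against S3.
set -eu
NFS_TARGET=$(mktemp -d)      # stand-in for the dedicated NFS server
OFFSITE_TARGET=$(mktemp -d)  # stand-in for the MinIO/S3 bucket

# Pretend the nightly CronJob just wrote a dump to the NFS target.
printf 'dump contents' > "$NFS_TARGET/authentik-20260101.dump"

# Mirror the NFS target to the offsite copy...
cp -a "$NFS_TARGET/." "$OFFSITE_TARGET/"

# ...and compare the copies byte-for-byte before declaring success.
cmp "$NFS_TARGET/authentik-20260101.dump" "$OFFSITE_TARGET/authentik-20260101.dump"
echo "offsite copy verified"
```

The byte-for-byte comparison is the point: a copy job that doesn't verify its output is just another way to fail silently.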

Testing restores

This is the part everyone skips. A backup that hasn't been tested is a backup that doesn't exist.

Build restore testing into your operational cadence:

  1. Monthly: Pick a service at random. Restore its database to a test namespace. Verify the data is intact and the application starts.
  2. Quarterly: Simulate a full cluster rebuild. Stand up a new cluster (or use a separate namespace), point ArgoCD at the repo, restore databases from backup. Time how long it takes. That's your actual RTO.
  3. After every backup pipeline change: Run an immediate restore test. Changes to backup scripts, storage targets, or schedules should be validated immediately.

A simple restore test CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-verify
  namespace: backups
spec:
  schedule: "0 6 * * 0"  # Sunday 6 AM
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: verify
              image: your-backup-image:latest
              command: ["/scripts/verify-backup.sh"]

The verify script lists the most recent backup for each service, checks file size (a 0-byte dump means the backup failed silently), and optionally does a pg_restore --list to validate the dump format. Send the results to your monitoring system.
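
A sketch of what that verify script might look like, assuming the /backups/<service> layout used by the CronJobs above -- a throwaway tree is built here so the checks can be exercised anywhere:

```shell
#!/bin/sh
# Sketch of verify-backup.sh: per-service freshness and size checks.
# The /backups/<service> layout is an assumption matching the CronJobs
# above; a throwaway tree stands in for the NFS-backed mount.
set -eu
BACKUP_ROOT=$(mktemp -d)
MAX_AGE_HOURS=36

mkdir -p "$BACKUP_ROOT/gitlab" "$BACKUP_ROOT/authentik"
printf 'dump' > "$BACKUP_ROOT/gitlab/gitlab-20260101.dump"
printf 'dump' > "$BACKUP_ROOT/authentik/authentik-20260101.dump"

status=0
for dir in "$BACKUP_ROOT"/*/; do
  svc=$(basename "$dir")
  latest=$(ls -t "$dir" | head -n 1)
  if [ -z "$latest" ]; then
    echo "FAIL $svc: no backups found"; status=1; continue
  fi
  file="$dir$latest"
  if [ ! -s "$file" ]; then
    # A 0-byte dump means the backup failed silently.
    echo "FAIL $svc: $latest is empty"; status=1; continue
  fi
  if [ -n "$(find "$file" -mmin +$((MAX_AGE_HOURS * 60)))" ]; then
    echo "FAIL $svc: $latest is stale"; status=1; continue
  fi
  echo "OK $svc: $latest"
done
[ "$status" -eq 0 ]  # non-zero exit fails the Job, which monitoring picks up
```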

Monitoring backup health

Backups fail silently. The CronJob runs, the pod crashes, and nobody notices until they need to restore.

At minimum, monitor:

  • CronJob last successful run -- Prometheus can scrape kube_cronjob_status_last_successful_time. Alert if any backup job hasn't succeeded in 36 hours (assuming a 24-hour schedule).
  • Backup file age and size -- A script that checks the most recent backup file on the NFS target. Alert if it's older than expected or suspiciously small.
  • Pod failure count -- If the backup CronJob pod is failing, you want to know immediately, not at restore time.
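
The first check translates directly into an alert rule. A sketch as a Prometheus Operator PrometheusRule, assuming kube-state-metrics is exporting CronJob metrics and the backup jobs live in a backups namespace:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-alerts
  namespace: monitoring
spec:
  groups:
    - name: backups
      rules:
        - alert: BackupJobStale
          # 129600s = 36h; covers a daily schedule plus slack for reruns
          expr: time() - kube_cronjob_status_last_successful_time{namespace="backups"} > 129600
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "CronJob {{ $labels.cronjob }} has not succeeded in 36 hours"
```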

What you can afford to lose

Not everything deserves the same backup treatment. A simple framework:

| Data | RPO | Backup method | Recovery |
|------|-----|---------------|----------|
| PostgreSQL (identity, GitLab) | 1 hour | WAL archiving + daily base backup | Point-in-time restore |
| Git repositories | 24 hours | GitLab backup task, daily | Restore from dump |
| Container images | Rebuildable | CI/CD pipeline | Rebuild from source |
| Cluster state | Rebuildable | GitOps (ArgoCD + Git) | Sync from repo |
| etcd | 6 hours | RKE2 automatic snapshots | Restore to new control plane |
| Monitoring data | Disposable | None | Starts fresh |

The RPO (Recovery Point Objective) for each category determines your backup frequency. Identity data with a 1-hour RPO needs continuous WAL archiving. Monitoring data with no RPO needs no backup at all.

The bottom line

Kubernetes backup isn't about snapshotting the cluster. It's about identifying the irreplaceable data, building automated pipelines to protect it, verifying those pipelines actually work, and accepting that most of the cluster is rebuildable from Git.

Back up your databases. Back up your identity provider. Back up your Git repos. Let GitOps handle everything else. And test your restores.