You can take one of the following steps when a backup partially fails because of an existing lock on the restic repository.
Problem
When using file system backups (FSB), you observe errors similar to the following in the backup logs:
time="2023-03-27T04:01:04Z" level=error msg="Error backing up item" backup=velero/bk-entirecluster-daily-20230327040009 error="pod volume backup failed: running Restic backup, stderr=unable to create lock in backend: repository is already locked exclusively by PID 12576 on velero-7fdc5bff66-z88k8 by nonroot (UID 65532, GID 65532)\nlock was created at 2023-03-27 04:01:01 (2.490799836s ago)\nstorage ID 811e7acc\nthe `unlock` command can be used to remove stale locks\n: exit status 1" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:199" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:417" name=airflow-db-migrations-56c64bd87d-mk9rc
You can confirm that the volume backups failed by using the Velero CLI to describe the backup with the --details flag and checking under the "Restic Backups" heading.
$ velero backup describe <backup name> --details
Restic Backups:
  Completed:
    tanzu-system-dashboards/grafana-658c5dbc77-xphdz: sc-dashboard-volume, sc-datasources-volume, storage
    tanzu-system-ingress/envoy-l8f5f: envoy-admin, envoy-config
    tanzu-system-ingress/envoy-lx6jc: envoy-admin, envoy-config
    tanzu-system-ingress/envoy-vb24x: envoy-admin, envoy-config
    tanzu-system-ingress/envoy-x5vwk: envoy-admin, envoy-config
    tanzu-system-monitoring/alertmanager-6546bb6b6d-sl967: storage-volume
    tanzu-system-monitoring/prometheus-server-746cc78b85-vm6pn: storage-volume
  Failed:
    airflow/airflow-db-migrations-56c64bd87d-mk9rc: dags-data, logs-data
    airflow/airflow-flower-76dfc68945-f9x4b: dags-data, logs-data
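You can also list the individual PodVolumeBackup resources created for the backup to see which pod volumes failed. The following command is a sketch; it assumes the default velero namespace and the velero.io/backup-name label that Velero applies to PodVolumeBackup resources:

$ kubectl -n velero get podvolumebackups -l velero.io/backup-name=<backup name>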
Finally, you should see evidence of the prune command being killed via a signal. (This log entry will not be present in the backup logs, only in the Velero pod logs.)
time="2023-03-27T16:49:05Z" level=warning msg="error pruning repository" error="error running command=restic prune --repo=s3:http://111.222.333.111/example-bkt/01G2Q0XXBKCJ0GM7E7V34Q85PT/restic/airflow --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=loading indexes...\nloading all snapshots...\nfinding data that is still in use for 355 snapshots\n[0:22] 100.00% 355 / 355 snapshots\n\nsearching used packs...\ncollecting packs for deletion and repacking\n[0:00] 100.00% 529 / 529 packs processed\n\n\nto repack: 106126 blobs / 98.205 MiB\nthis removes: 1540 blobs / 54.810 MiB\nto delete: 908753 blobs / 639.015 MiB\ntotal prune: 910293 blobs / 693.825 MiB\nremaining: 311670 blobs / 2.882 GiB\nunused size after prune: 439.412 MiB (14.89% of remaining size)\n\nrepacking packs\n, stderr=: signal: killed" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/repository_manager.go:296" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*repositoryManager).exec" logSource="pkg/controller/restic_repository_controller.go:198" resticRepo=velero/airflow-msk-pure-s3-bkt-ls6pk
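One way to surface this entry is to search the Velero pod logs directly. The command below is a sketch that assumes the default velero namespace and deployment name:

$ kubectl -n velero logs deploy/velero | grep "error pruning repository"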
Cause
Velero periodically "prunes" each restic repository to compact disk space. Under the hood, Velero cleans up stale locks and then spawns a child process to run the prune command. The prune command acquires a lock on the repository but is subsequently OOM-killed, most likely because the Velero container has insufficient resources. This cycle repeats, so the repository remains continuously locked. When a backup is then created, Restic fails to take a snapshot because of the existing lock.

There is a Velero issue tracking a fix to stop repeatedly attempting the prune command and thus prevent the Restic repository from being perpetually locked.
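For supporting evidence that memory pressure is the cause, you can compare the Velero container's configured memory limit with its live usage while a prune is running. This is a sketch that assumes the default velero namespace and requires the metrics-server for kubectl top:

$ kubectl -n velero get deployment velero -o jsonpath='{.spec.template.spec.containers[0].resources}'
$ kubectl -n velero top pod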
Solution
You can use one of the following options to fix the issue:
Option 1: Use CSI volume snapshots.

You may choose to back up your volumes using CSI volume snapshots instead of filesystem-level volume backups. When creating a backup, select the option to perform CSI volume snapshots and use the opt-in approach for filesystem backups.

Option 2: Increase the memory limit for the Velero deployment.

Velero maintains a restic "repository" for each namespace. The amount of memory required to perform the prune command is proportional to the size of the restic repository's index, so the solution is to increase the memory limits of the Velero container in the Velero deployment until the prune command has sufficient memory to run. (Note that the prune command is run by the Velero pod and not by the node-agent pods.)

Use the following command to edit the deployment (you do not need to update the resources under the init containers):
$ kubectl -n velero edit deployment velero
...
        resources:
          limits:
            cpu: "1"
            memory: 512Mi    <--- change this
          requests:
            cpu: 500m
            memory: 128Mi
...
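As an alternative to editing the deployment interactively, you can set the limit in one step with kubectl set resources. This is a sketch; it assumes the container is named velero (the default) and uses 1Gi purely as an example value:

$ kubectl -n velero set resources deployment velero -c velero --limits=memory=1Gi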
Wait for a few minutes and check if the prune errors are still seen:

$ kubectl -n velero logs <velero pod name> | grep prune | grep killed
If the errors are still present, increase the memory limits further and check again.
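Once the prune errors stop and the lock is released, you can verify the fix by creating a new backup and confirming that its pod volume backups complete under the "Restic Backups" heading. The backup name below is only an illustration:

$ velero backup create <new backup name>
$ velero backup describe <new backup name> --details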