You can take one of the following steps when a backup partially fails because of an existing lock on the restic repository.
Problem
When using file system backups (FSB), you observe errors similar to the following in the backup logs:
time="2023-03-27T04:01:04Z" level=error msg="Error backing up item" backup=velero/bk-entirecluster-daily-20230327040009 error="pod volume backup failed: running Restic backup, stderr=unable to create lock in backend: repository is already locked exclusively by PID 12576 on velero-7fdc5bff66-z88k8 by nonroot (UID 65532, GID 65532)\nlock was created at 2023-03-27 04:01:01 (2.490799836s ago)\nstorage ID 811e7acc\nthe `unlock` command can be used to remove stale locks\n: exit status 1" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:199" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:417" name=airflow-db-migrations-56c64bd87d-mk9rc
You can confirm that the volume backups failed by using the Velero CLI to describe the backup with the --details flag and checking under the "Restic Backups" heading.
$ velero backup describe <backup name> --details
Restic Backups:
  Completed:
    tanzu-system-dashboards/grafana-658c5dbc77-xphdz: sc-dashboard-volume, sc-datasources-volume, storage
    tanzu-system-ingress/envoy-l8f5f: envoy-admin, envoy-config
    tanzu-system-ingress/envoy-lx6jc: envoy-admin, envoy-config
    tanzu-system-ingress/envoy-vb24x: envoy-admin, envoy-config
    tanzu-system-ingress/envoy-x5vwk: envoy-admin, envoy-config
    tanzu-system-monitoring/alertmanager-6546bb6b6d-sl967: storage-volume
    tanzu-system-monitoring/prometheus-server-746cc78b85-vm6pn: storage-volume
  Failed:
    airflow/airflow-db-migrations-56c64bd87d-mk9rc: dags-data, logs-data
    airflow/airflow-flower-76dfc68945-f9x4b: dags-data, logs-data
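You can also list the individual PodVolumeBackup resources created for the backup to see which pod volumes failed. The following command is a sketch; it assumes the default velero namespace and the velero.io/backup-name label that Velero applies to PodVolumeBackup resources:

$ kubectl -n velero get podvolumebackups -l velero.io/backup-name=<backup name>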
Finally, you should see evidence of the prune command being killed via a signal. (This log entry will not be present in the backup logs, only in the Velero pod logs.)
time="2023-03-27T16:49:05Z" level=warning msg="error pruning repository" error="error running command=restic prune --repo=s3:http://111.222.333.111/example-bkt/01G2Q0XXBKCJ0GM7E7V34Q85PT/restic/airflow --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=loading indexes...\nloading all snapshots...\nfinding data that is still in use for 355 snapshots\n[0:22] 100.00% 355 / 355 snapshots\n\nsearching used packs...\ncollecting packs for deletion and repacking\n[0:00] 100.00% 529 / 529 packs processed\n\n\nto repack: 106126 blobs / 98.205 MiB\nthis removes: 1540 blobs / 54.810 MiB\nto delete: 908753 blobs / 639.015 MiB\ntotal prune: 910293 blobs / 693.825 MiB\nremaining: 311670 blobs / 2.882 GiB\nunused size after prune: 439.412 MiB (14.89% of remaining size)\n\nrepacking packs\n, stderr=: signal: killed" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/repository_manager.go:296" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*repositoryManager).exec" logSource="pkg/controller/restic_repository_controller.go:198" resticRepo=velero/airflow-msk-pure-s3-bkt-ls6pk
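One way to surface this entry is to search the Velero pod logs directly. The command below is a sketch that assumes the default velero namespace and deployment name:

$ kubectl -n velero logs deploy/velero | grep "error pruning repository"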
Cause
Velero periodically "prunes" each restic repository to compact disk space. Under the hood, Velero cleans up stale locks and then spawns a child process to run the prune command. The prune command acquires a lock on the repository but is subsequently OOM-killed, most likely because the Velero container has insufficient resources. This cycle repeats, so the repository remains continuously locked. When a backup is then created, Restic fails to take a snapshot because of the existing lock.

There is a Velero issue tracking a fix to stop repeatedly attempting the prune command and thus prevent the Restic repository from being perpetually locked.
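For supporting evidence that memory pressure is the cause, you can compare the Velero container's configured memory limit with its live usage while a prune is running. This is a sketch that assumes the default velero namespace and requires the metrics-server for kubectl top:

$ kubectl -n velero get deployment velero -o jsonpath='{.spec.template.spec.containers[0].resources}'
$ kubectl -n velero top pod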
Solution
You can use one of the following options to fix the issue:
Option 1: Use CSI volume snapshots.

You may choose to back up your volumes using CSI volume snapshots instead of filesystem-level volume backups. When creating a backup, select the option to perform CSI volume snapshots and use the opt-in approach for filesystem backups.

Option 2: Increase the memory limit for the Velero deployment.

Velero maintains a restic "repository" for each namespace. The amount of memory required to perform the prune command is proportional to the size of the restic repository's index, so the solution is to increase the memory limits of the Velero container in the Velero deployment until the prune command has sufficient memory to run. (Note that the prune command is run by the Velero pod and not by the node-agent pods.)

Use the following command to edit the deployment (you do not need to update the resources under the init containers):
$ kubectl -n velero edit deployment velero
...
        resources:
          limits:
            cpu: "1"
            memory: 512Mi    <--- change this
          requests:
            cpu: 500m
            memory: 128Mi
...
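As an alternative to editing the deployment interactively, you can set the limit in one step with kubectl set resources. This is a sketch; it assumes the container is named velero (the default) and uses 1Gi purely as an example value:

$ kubectl -n velero set resources deployment velero -c velero --limits=memory=1Gi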
Wait for a few minutes and check if the prune errors are still seen:

$ kubectl -n velero logs <velero pod name> | grep prune | grep killed
If the errors are still present, increase the memory limits further and check again.
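Once the prune errors stop and the lock is released, you can verify the fix by creating a new backup and confirming that its pod volume backups complete under the "Restic Backups" heading. The backup name below is only an illustration:

$ velero backup create <new backup name>
$ velero backup describe <new backup name> --details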