You can take these steps when a backup partially fails because volumes are present on the control plane nodes.

Problem

When using filesystem-level volume backups (FSB), the overall backup ends up partially failed and a log message similar to the following is present:

time="2023-07-18T07:48:39Z" level=error msg="Error backing up item" backup=velero/bk-2 error="daemonset pod not found in running state in node interop-fresh-kind124-control-plane" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/nodeagent/node_agent.go:74" error.function=github.com/vmware-tanzu/velero/pkg/nodeagent.IsRunningInNode logSource="pkg/backup/backup.go:425" name=app-0

Cause

Velero's filesystem backup integration works by creating a daemonset called node-agent; each of its pods is responsible for backing up volumes on its own node. By default, the daemonset pods are scheduled only on worker nodes. This is expected behavior on most clusters because the control plane nodes are tainted by default.

In the unlikely scenario that pods with volumes are present on the control plane nodes, there is no daemonset pod there to complete the volume backup, so these volumes are skipped and the overall backup partially fails. Note that this includes even emptyDir volumes.
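You can confirm this is the cause by checking where the node-agent pods run and which taints the control plane nodes carry. The commands below assume Velero is installed in the velero namespace, as elsewhere in this article; the node name is a placeholder:

# List node-agent pods and the nodes they run on
kubectl -n velero get pods -o wide | grep node-agent

# Inspect the taints on a control plane node
kubectl describe node <control-plane-node-name> | grep Taints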

Solution

  • Use one of the following options to address the issue:
    Note:

    The following assumes that you have configured filesystem backups with the opt-out option.

    Use CSI volume snapshots.

    CSI volume snapshots do not have this node-level limitation because they do not rely on the node-agent daemonset. While creating a backup, select the option to perform CSI volume snapshots and use the opt-in approach for filesystem backups.
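    If you drive Velero directly instead of through the Create Backup page, the equivalent is roughly a backup that snapshots volumes by default plus the opt-in annotation on any pods whose volumes should still use FSB. This is a sketch only; the namespace demo, pod app-0, volume vol1, and backup name bk-3 are hypothetical, and flag names can vary with the Velero version:

    # Opt a specific pod's volumes in to filesystem backup (names come from .spec.volumes)
    kubectl -n demo annotate pod app-0 backup.velero.io/backup-volumes=vol1

    # Create a backup that uses volume snapshots by default rather than FSB
    velero backup create bk-3 --snapshot-volumes=true --default-volumes-to-fs-backup=false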

    Exclude the namespace containing the problematic volumes.

    In the Create Backup page, under Advanced options, add the namespace under Excluded namespaces. Note that this prevents all resources in the namespace from being backed up, not just the volumes.
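    If you create backups from the command line instead, the same exclusion can be expressed with the velero CLI; the backup name bk-4 and namespace problem-ns below are hypothetical:

    # Skip an entire namespace in the backup
    velero backup create bk-4 --exclude-namespaces problem-ns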

    Exclude the problematic pod volumes.

    Annotate the pods containing the problematic volumes with backup.velero.io/backup-volumes-excludes=vol1,vol2,vol3. Note that the volume names in the annotation are those in the pod manifest under .spec.volumes, not the PVC or PV names. Also, if the pods were created by another resource such as a deployment or daemonset, configure the parent resource so that spawned pods receive the annotation (.spec.template.metadata.annotations for deployments and daemonsets). This ensures that the annotation persists even if the pod is recreated.
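    For example, assuming a hypothetical namespace demo, pod app-0, deployment app, and volumes vol1 and vol2:

    # Annotate an existing pod directly
    kubectl -n demo annotate pod app-0 backup.velero.io/backup-volumes-excludes=vol1,vol2

    # Or set it on the deployment's pod template so recreated pods keep the annotation
    kubectl -n demo patch deployment app --type merge \
      -p '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes-excludes":"vol1,vol2"}}}}}'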

    Exclude the problematic pods.

    Similar to the previous option, you can exclude a pod and all of its volumes from being backed up by labeling it with velero.io/exclude-from-backup=true.
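    For example, with the same hypothetical namespace and pod names as above:

    kubectl -n demo label pod app-0 velero.io/exclude-from-backup=true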

    Tolerate the taint for the node-agent daemonset.

    Edit the daemonset with:

    kubectl -n velero edit ds node-agent
    

    Add the following under .spec.template.spec.

    tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
    

    Note that the exact taint key used can differ across clusters. Afterward, verify that the daemonset pods were spawned on every node in the cluster.
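    To check the taint key in use and confirm that the rollout reached every node, you can run commands like the following; the node name is a placeholder and the velero namespace matches the edit command above:

    # Show the taint keys on a control plane node
    kubectl describe node <control-plane-node-name> | grep Taints

    # The DESIRED count should now match the total number of nodes
    kubectl -n velero get ds node-agent

    # Confirm a node-agent pod is running on every node, including control plane nodes
    kubectl -n velero get pods -o wide | grep node-agent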