You can take these steps when a backup partially fails because volumes are present on the control plane nodes.

Problem

When using filesystem-level volume backups (FSB), the overall backup ends up partially failed and a log message similar to the following is present:

time="2023-07-18T07:48:39Z" level=error msg="Error backing up item" backup=velero/bk-2 error="daemonset pod not found in running state in node interop-fresh-kind124-control-plane" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/nodeagent/node_agent.go:74" error.function=github.com/vmware-tanzu/velero/pkg/nodeagent.IsRunningInNode logSource="pkg/backup/backup.go:425" name=app-0

Cause

Velero's filesystem backup integration works by creating a daemonset called node-agent; each of its pods is responsible for backing up volumes on its own node. By default, the daemonset pods are scheduled only on worker nodes. This is expected behavior on most clusters because the control plane nodes are tainted by default.

In the unlikely scenario that pods with volumes are present on the control plane nodes, there is no daemonset pod there to complete the volume backup, so these volumes are skipped and the overall backup partially fails. Note that this includes even emptyDir volumes.
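You can confirm this is the cause by checking where the node-agent pods run and which taints the control plane nodes carry. The commands below assume Velero is installed in the velero namespace, as elsewhere in this article; the node name is a placeholder:

# List node-agent pods and the nodes they run on
kubectl -n velero get pods -o wide | grep node-agent

# Inspect the taints on a control plane node
kubectl describe node <control-plane-node-name> | grep Taints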

Solution

  • Use one of the following options to address the issue:
    Note:

    The following assumes that you have configured filesystem backups with the opt-out option.

    Use CSI volume snapshots.

    CSI volume snapshots do not have this node-level limitation because they do not rely on the node-agent daemonset. While creating a backup, select the option to perform CSI volume snapshots and use the opt-in approach for filesystem backups.
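    If you drive Velero directly instead of through the Create Backup page, the equivalent is roughly a backup that snapshots volumes by default plus the opt-in annotation on any pods whose volumes should still use FSB. This is a sketch only; the namespace demo, pod app-0, volume vol1, and backup name bk-3 are hypothetical, and flag names can vary with the Velero version:

    # Opt a specific pod's volumes in to filesystem backup (names come from .spec.volumes)
    kubectl -n demo annotate pod app-0 backup.velero.io/backup-volumes=vol1

    # Create a backup that uses volume snapshots by default rather than FSB
    velero backup create bk-3 --snapshot-volumes=true --default-volumes-to-fs-backup=false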

    Exclude the namespace containing the problematic volumes.

    In the Create Backup page, under Advanced options, add the namespace under Excluded namespaces. Note that this prevents all resources in the namespace from being backed up, not just the volumes.
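    If you create backups from the command line instead, the same exclusion can be expressed with the velero CLI; the backup name bk-4 and namespace problem-ns below are hypothetical:

    # Skip an entire namespace in the backup
    velero backup create bk-4 --exclude-namespaces problem-ns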

    Exclude the problematic pod volumes.

    Annotate the pods containing the problematic volumes with backup.velero.io/backup-volumes-excludes=vol1,vol2,vol3. Note that the volume names in the annotation are those in the pod manifest under .spec.volumes, not the PVC or PV names. Also, if the pods were created by another resource such as a deployment or daemonset, configure the parent resource so that spawned pods receive the annotation (.spec.template.metadata.annotations for deployments and daemonsets). This ensures that the annotation persists even if the pod is recreated.
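    For example, assuming a hypothetical namespace demo, pod app-0, deployment app, and volumes vol1 and vol2:

    # Annotate an existing pod directly
    kubectl -n demo annotate pod app-0 backup.velero.io/backup-volumes-excludes=vol1,vol2

    # Or set it on the deployment's pod template so recreated pods keep the annotation
    kubectl -n demo patch deployment app --type merge \
      -p '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes-excludes":"vol1,vol2"}}}}}'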

    Exclude the problematic pods.

    Similar to the previous option, you can exclude a pod and all of its volumes from being backed up by labeling it with velero.io/exclude-from-backup=true.
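    For example, with the same hypothetical namespace and pod names as above:

    kubectl -n demo label pod app-0 velero.io/exclude-from-backup=true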

    Tolerate the taint for the node-agent daemonset.

    Edit the daemonset with:

    kubectl -n velero edit ds node-agent
    

    Add the following under .spec.template.spec.

    tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
    

    Note that the exact taint key used can differ across clusters. Afterward, verify that the daemonset pods were spawned on every node in the cluster.
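    To check the taint key in use and confirm that the rollout reached every node, you can run commands like the following; the node name is a placeholder and the velero namespace matches the edit command above:

    # Show the taint keys on a control plane node
    kubectl describe node <control-plane-node-name> | grep Taints

    # The DESIRED count should now match the total number of nodes
    kubectl -n velero get ds node-agent

    # Confirm a node-agent pod is running on every node, including control plane nodes
    kubectl -n velero get pods -o wide | grep node-agent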