Back Up and Restore Management and Workload Cluster Infrastructure on vSphere (Technical Preview)

This topic explains how to back up and restore cluster infrastructure for Tanzu Kubernetes Grid (TKG) with a standalone management cluster on vSphere by:

  • Using Velero to back up and restore workload cluster objects on the standalone management cluster, and
  • Re-creating the standalone management cluster from its configuration files
Note

  • VMware does not support using Velero to back up TKG standalone management clusters.
  • If a standalone management cluster is reconfigured after it is deployed, re-creating it as described here may not recover all of its resources.

To back up and restore the workloads and dynamic storage volumes hosted on Tanzu Kubernetes Grid (TKG) workload clusters with a standalone management cluster, see Back Up and Restore Cluster Workloads.

To back up and restore vSphere with Tanzu clusters, including Supervisor Clusters and the workload clusters that they create, see Backing Up and Restoring vSphere with Tanzu in the VMware vSphere 8.0 Documentation.


Set Up Velero

You can use Velero, an open-source, community-standard tool, to back up and restore TKG standalone management cluster infrastructure and workloads.

Velero supports a variety of storage providers to store its backups. Velero also supports:

  • Pre- and post-hooks for backup and restore to run custom processes before or after backup and restore events.
  • Excluding aspects of workload or cluster state that are not well-suited to backup/restore.

A Tanzu Kubernetes Grid subscription includes support for VMware’s tested, compatible distribution of Velero available from the Tanzu Kubernetes Grid downloads page.

To back up and restore TKG clusters, you need:

After you complete the prerequisites above, you can also use Velero to migrate workloads between clusters. For instructions, see Cluster Migration and Resource Filtering in the Velero documentation.

Install the Velero CLI

Caution

If you have already installed Velero CLI v1.9.x or earlier, as distributed with prior versions of TKG, you need to upgrade to v1.10.3. Older Velero versions do not work with the CRDs used in v1.10 and later. For information, see Upgrade Velero below.

To install the Velero CLI v1.10.3, do the following:

  1. Go to the Tanzu Kubernetes Grid downloads page and log in with your VMware Customer Connect credentials.
  2. Under Product Downloads, click Go to Downloads.
  3. Scroll to the Velero entries and download the Velero CLI .gz file for your workstation OS. Its filename starts with velero-linux-, velero-mac-, or velero-windows64-.
  4. Use the gunzip command or the extraction tool of your choice to unpack the binary:

    gzip -d <RELEASE-TARBALL-NAME>.gz
    
  5. Rename the CLI binary for your platform to velero, make sure that it is executable, and add it to your PATH.

    macOS and Linux
    1. Move the binary into the /usr/local/bin folder and rename it to velero.
    2. Make the file executable:
    chmod +x /usr/local/bin/velero
    
    Windows
    1. Create a new Program Files\velero folder and copy the binary into it.
    2. Rename the binary to velero.exe.
    3. Right-click the velero folder, select Properties > Security, and make sure that your user account has the Full Control permission.
    4. Use Windows Search to search for env.
    5. Select Edit the system environment variables and click the Environment Variables button.
    6. Select the Path row under System variables, and click Edit.
    7. Click New to add a new row and enter the path to the velero binary.
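On macOS and Linux, steps 4 and 5 above can be combined in a small shell function. This is a minimal sketch rather than part of the VMware procedure: the function name install_velero and the example archive name are hypothetical, and it assumes that gzip is available and that you can write to the target folder.

```shell
#!/bin/sh
# Hypothetical helper: unpack a downloaded Velero CLI .gz archive and
# install it as an executable named "velero".
#   $1: path to the downloaded .gz file
#   $2: target folder (defaults to /usr/local/bin, as in the steps above)
install_velero() {
  archive="$1"
  dest_dir="${2:-/usr/local/bin}"
  # gzip -dc writes the unpacked binary to stdout and keeps the archive.
  gzip -dc "$archive" > "$dest_dir/velero" || return 1
  chmod +x "$dest_dir/velero"
}

# Example (placeholder archive name; /usr/local/bin may require sudo):
#   install_velero ./velero-linux-EXAMPLE.gz
```

Passing a second argument installs into a different folder, which is useful if you prefer a per-user directory that is already on your PATH.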

Upgrade Velero

Velero v1.10.3 uses different CRDs from v1.9.x. In addition, Velero v1.10 adopted Kopia alongside Restic as the uploader, which led to several changes in the naming of components and commands, and in how Velero functions. For more information about breaking changes between v1.9.x and v1.10, see Breaking Changes in the Velero v1.10 Changelog. If you installed Velero v1.9.x with a previous version of TKG, you must upgrade Velero.

  1. Follow the procedure in Install the Velero CLI to install Velero v1.10.3.
  2. Update the CRD definitions with the Velero v1.10 binary.

    velero install --crds-only --dry-run -o yaml | kubectl apply -f -
    
  3. Update the Velero deployment and daemon set configuration to match the component renaming that happened in Velero v1.10.

    In the command below, uploader_type can be either restic or kopia.

    kubectl get deploy -n velero -ojson \
    | sed "s#\"image\"\: \"velero\/velero\:v[0-9]*.[0-9]*.[0-9]\"#\"image\"\: \"velero\/velero\:v1.10.0\"#g" \
    | sed "s#\"server\",#\"server\",\"--uploader-type=$uploader_type\",#g" \
    | sed "s#default-volumes-to-restic#default-volumes-to-fs-backup#g" \
    | sed "s#default-restic-prune-frequency#default-repo-maintain-frequency#g" \
    | sed "s#restic-timeout#fs-backup-timeout#g" \
    | kubectl apply -f -
    
  4. (Optional) If you are using the restic daemon set, rename the corresponding components.

    echo $(kubectl get ds -n velero restic -ojson) \
    | sed "s#\"image\"\: \"velero\/velero\:v[0-9]*.[0-9]*.[0-9]\"#\"image\"\: \"velero\/velero\:v1.10.0\"#g" \
    | sed "s#\"name\"\: \"restic\"#\"name\"\: \"node-agent\"#g" \
    | sed "s#\[ \"restic\",#\[ \"node-agent\",#g" \
    | kubectl apply -f -
    kubectl delete ds -n velero restic --force --grace-period 0 
    

For more information, see Upgrading to Velero 1.10 in the Velero documentation.
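To decide whether a workstation still needs this upgrade, you can compare the installed CLI version against v1.10. The following is a minimal sketch: velero_needs_upgrade is a hypothetical helper name, and it assumes you pass it a version string of the form vMAJOR.MINOR.PATCH, with or without a vendor suffix.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the given Velero version is
# older than v1.10 and therefore needs the upgrade described above.
#   $1: version string such as v1.9.5 or v1.10.3_vmware.1
velero_needs_upgrade() {
  ver="${1#v}"          # drop the leading "v"
  major="${ver%%.*}"    # text before the first dot
  rest="${ver#*.}"      # text after the first dot
  minor="${rest%%.*}"   # text before the next dot
  [ "$major" -lt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -lt 10 ]; }
}

# Example:
#   if velero_needs_upgrade v1.9.5; then echo "upgrade required"; fi
```

The version string itself has to come from your own environment, for example by reading the client version that the installed velero binary reports.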

Set Up a Storage Provider

To back up Tanzu Kubernetes Grid workload cluster contents, you need storage locations for:

  • Cluster object storage backups for Kubernetes metadata in clusters
  • Volume snapshots for data used by clusters

See Backup Storage Locations and Volume Snapshot Locations in the Velero documentation. Velero supports a variety of storage providers, which can be either:

  • An online cloud storage provider.
  • An on-premises object storage service such as MinIO, for proxied or air-gapped environments.

VMware recommends dedicating a unique storage bucket to each cluster.

To set up MinIO:

  1. Run the minio container image with MinIO credentials and a storage location, for example:

    $ docker run -d --name minio --rm -p 9000:9000 -e "MINIO_ACCESS_KEY=minio" -e "MINIO_SECRET_KEY=minio123" -e "MINIO_DEFAULT_BUCKETS=mgmt" gcr.io/velero-gcp/bitnami/minio:2021.6.17-debian-10-r7
    
  2. Save the credentials to a local file to pass to the --secret-file option of velero install, for example:

    [default]
    aws_access_key_id=minio
    aws_secret_access_key=minio123
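
The credentials file in step 2 can also be generated from a script. This is a minimal sketch under stated assumptions: write_velero_creds is a hypothetical helper name, and restricting the file to its owner is a suggested precaution, not a requirement stated in this topic.

```shell
#!/bin/sh
# Hypothetical helper: write an AWS-style credentials file that can be
# passed to the --secret-file option of velero install.
#   $1: output file  $2: access key  $3: secret key
write_velero_creds() {
  creds_file="$1"
  access_key="$2"
  secret_key="$3"
  (
    # Create the file readable and writable by the owner only.
    umask 077
    printf '[default]\naws_access_key_id=%s\naws_secret_access_key=%s\n' \
      "$access_key" "$secret_key" > "$creds_file"
  )
}

# Example, using the MinIO values from step 1:
#   write_velero_creds ./VELERO-CREDS minio minio123
```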
    

Storage for vSphere

On vSphere, cluster object storage backups and volume snapshots save to the same storage location. This location must be S3-compatible external storage on Amazon Web Services (AWS), or an S3 provider such as MinIO.

To set up storage for Velero on vSphere, see Velero Plugin for vSphere in Vanilla Kubernetes Cluster for the v1.5.1 plugin.

Storage for and on AWS

To set up storage for Velero on AWS, follow the procedures in the Velero Plugins for AWS repository:

  1. Create an S3 bucket.

  2. Set permissions for Velero.

Set up S3 storage as needed for each plugin. The object store plugin stores and retrieves cluster object backups, and the volume snapshotter stores and retrieves data volumes.

Storage for and on Azure

To set up storage for Velero on Azure, follow the procedures in the Velero Plugins for Azure repository:

  1. Create an Azure storage account and blob container.

  2. Get the resource group containing your VMs and disks.

  3. Set permissions for Velero.

Set up storage as needed for each plugin. The object store plugin stores and retrieves cluster object backups, and the volume snapshotter stores and retrieves data volumes.

Deploy Velero Server to the Management Cluster

To back up workload cluster objects, install the Velero v1.10.3 server to the standalone management cluster and verify the installation.

Velero Install Options

To install Velero, run velero install with the following options:

  • --provider $PROVIDER: For example, aws
  • --plugins projects.registry.vmware.com/tkg/velero/velero-plugin-for-aws:v1.6.2_vmware.1
  • --bucket $BUCKET: The name of your S3 bucket
  • --backup-location-config region=$REGION: The AWS region the bucket is in
  • --snapshot-location-config region=$REGION: The AWS region the bucket is in
  • (Optional) --kubeconfig to install the Velero server to a cluster other than the current default.
  • (Optional) --secret-file ./VELERO-CREDS: One way to give Velero access to an S3 bucket is to pass this option a local VELERO-CREDS file that looks like:

    [default]
    aws_access_key_id=<AWS_ACCESS_KEY_ID>
    aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>
    
  • For additional options, see Install and start Velero.

Running the velero install command creates a namespace called velero on the cluster, and places a deployment named velero in it.

The following settings are required:

  • Management cluster plugin: The following plugin is required. It pauses the clusters and collects the resources related to the clusters being backed up.
    --plugins projects.registry.vmware.com/tkg/velero/velero-mgmt-cluster-plugin:v0.2.0_vmware.1
    
    Note

    You can specify multiple plugins in a single --plugins option, separated by commas. For example:

    --plugins projects.registry.vmware.com/tkg/velero/velero-plugin-for-aws:v1.6.2_vmware.1,projects.registry.vmware.com/tkg/velero/velero-mgmt-cluster-plugin:v0.2.0_vmware.1
    
  • No snapshot location: For backing up cluster infrastructure, do not set --snapshot-location-config.
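
Taken together, the options and required settings above can be assembled into one command line. The sketch below only prints the command so that you can review it before running it. The function name velero_install_cmd is hypothetical, and the bucket, region, and credentials file are placeholders that you supply. It uses only options named in this section and deliberately omits --snapshot-location-config.

```shell
#!/bin/sh
# Hypothetical helper: print a velero install command for backing up
# cluster infrastructure with the AWS plugin and the management cluster plugin.
#   $1: bucket name  $2: region  $3: credentials file
velero_install_cmd() {
  bucket="$1"
  region="$2"
  creds_file="$3"
  aws_plugin="projects.registry.vmware.com/tkg/velero/velero-plugin-for-aws:v1.6.2_vmware.1"
  mgmt_plugin="projects.registry.vmware.com/tkg/velero/velero-mgmt-cluster-plugin:v0.2.0_vmware.1"
  # Both plugins go into a single comma-separated --plugins value.
  printf 'velero install --provider aws --plugins %s,%s --bucket %s --backup-location-config region=%s --secret-file %s\n' \
    "$aws_plugin" "$mgmt_plugin" "$bucket" "$region" "$creds_file"
}

# Example (placeholder values):
#   velero_install_cmd mgmt REGION ./VELERO-CREDS
```

Printing rather than executing keeps the sketch safe to run anywhere. An S3-compatible service such as MinIO typically needs additional backup location settings that are described in the plugin documentation and are not shown here.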

Verify Velero Installation

After the velero install command completes, verify that Velero installed successfully:

  1. Verify that the Velero pod has status Running:

    kubectl -n velero get pod
    NAME                      READY   STATUS    RESTARTS   AGE
    velero-78fdbcd446-v5cqr   1/1     Running   0          3h41m
    
  2. Verify that the backup location is in the Available phase:

    velero backup-location get
    NAME      PROVIDER   BUCKET/PREFIX   PHASE       LAST VALIDATED                  ACCESS MODE   DEFAULT
    default   aws        mgmt            Available   2022-11-11 05:55:55 +0000 UTC   ReadWrite     true
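
The second check can be scripted by reading the PHASE column of that table. This is a minimal sketch: backup_location_available is a hypothetical helper name, and it assumes the column layout shown above, with PHASE as the fourth whitespace-separated field and no empty column before it.

```shell
#!/bin/sh
# Hypothetical helper: read the table printed by "velero backup-location get"
# on stdin and succeed only if the named location is in the Available phase.
#   $1: backup location name, for example "default"
backup_location_available() {
  awk -v name="$1" '
    NR > 1 && $1 == name { found = 1; ok = ($4 == "Available") }
    END { exit (found && ok) ? 0 : 1 }
  '
}

# Example:
#   velero backup-location get | backup_location_available default \
#     && echo "backup location is ready"
```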
    

Back Up Workload Cluster Objects

To back up all the workload cluster objects managed by a standalone management cluster, run:

velero backup create my-backup --exclude-namespaces tkg-system --include-resources cluster.cluster.x-k8s.io --wait
Note

  • --exclude-namespaces tkg-system excludes the management cluster itself.
  • --include-resources cluster.cluster.x-k8s.io includes the workload cluster objects.

  • VMware recommends backing up workload clusters immediately after making any structural changes, such as scaling up or down. This avoids a mismatch between backup objects and physical infrastructure that can make the restore process fail.

Scheduling Backups

When cluster objects are changed after the most recent backup, the state of the system after a restore does not match its desired, most recent state. This problem is called “drift”. See the Handling Drift section below for how to detect and recover from some common types of drift.

To minimize drift, VMware recommends using Velero to schedule frequent, regular backups. For example, to back up all workload clusters daily and retain each backup for 14 days:

velero create schedule daily-bak --schedule="@every 24h" --exclude-namespaces tkg-system --include-resources cluster.cluster.x-k8s.io --ttl 336h0m0s

For more Velero scheduling options, see Schedule a Backup in the Velero documentation.
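The --ttl value in the example above is the retention period expressed in hours: 14 days × 24 hours = 336 hours. A small helper can do this conversion for other retention periods. This is a sketch, and days_to_ttl is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: convert a retention period in whole days to the
# hours-based duration format used with --ttl in the example above.
#   $1: number of days
days_to_ttl() {
  printf '%dh0m0s\n' "$(( $1 * 24 ))"
}

# Example: days_to_ttl 14 prints 336h0m0s, so the schedule could be written as
#   velero create schedule daily-bak --schedule="@every 24h" \
#     --exclude-namespaces tkg-system \
#     --include-resources cluster.cluster.x-k8s.io --ttl "$(days_to_ttl 14)"
```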

Regenerating kubeconfig Files After Restore

After you use Velero to restore a workload cluster, you need to distribute its new kubeconfig file to anyone who uses it:

  1. Regenerate the kubeconfig:

    tanzu cluster kubeconfig get CLUSTER-NAME --namespace NAMESPACE
    
  2. Distribute the output of the above command to anyone who uses the clusters, to replace their old kubeconfig file.

Complete Restore

To restore a standalone management cluster and the workload cluster objects that it manages, you re-create the management cluster from its configuration file, use Velero to restore its workload clusters, and distribute new kubeconfig files to the people who use them:

  1. If you suspect drift between the most recent backup of workload cluster objects and their currently running state, use the Drift Detector tool to generate a remediation report, as described in Using Drift Detector.

  2. Ensure that any configuration changes that were made to the management cluster after it was originally deployed are reflected in its configuration file or in environment variables. Otherwise, it will not restore to its most recent state.

  3. Re-create the management cluster from its configuration file, mgmt-cluster-config.yaml here, as described in Deploy Management Clusters from a Configuration File.

    • If you deployed the management cluster or its workload clusters to multiple availability zones on vSphere as described in Running Clusters Across Multiple Availability Zones, also include the file with the VSphereFailureDomain and VSphereDeploymentZone object definitions, for example by including --az-file vsphere-zones.yaml in the tanzu mc create command.
  4. Immediately after the management cluster is created, there should be only one TKR:

    tanzu kubernetes-release get
    NAME                       VERSION                  COMPATIBLE  ACTIVE  UPDATES AVAILABLE
    v1.26.8---vmware.2-tkg.1  v1.26.8+vmware.1-tkg.1  True        True
    
  5. Wait a few minutes until all of the TKRs used by backed-up workload clusters become available:

    tanzu kubernetes-release get
    NAME                       VERSION                  COMPATIBLE  ACTIVE  UPDATES AVAILABLE
    v1.24.17---vmware.2-tkg.2  v1.24.17+vmware.2-tkg.2  True        True
    v1.25.13---vmware.1-tkg.1  v1.25.13+vmware.1-tkg.1  True        True
    v1.26.8---vmware.2-tkg.1   v1.26.8+vmware.1-tkg.1   True        True
    
  6. Install Velero on the management cluster, following the Deploy Velero Server to the Management Cluster instructions above. Make sure that the credentials and backup location configuration settings have the same values as when the backup was made.

  7. After Velero installs, run velero backup get until the backups are synchronized and the command lists the backup that you want to use:

    velero backup get
    NAME                 STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
    my-backup            Completed   0        0          2022-12-07 17:10:42 +0000 UTC   24d       default            <none>
    
  8. Run velero restore create to restore the workload cluster resources. VMware recommends using the most recent backup:

    velero restore create my-restore --from-backup my-backup --wait
    

    After the restoration completes, the clusters are in createdStalled status:

    tanzu cluster list
    NAME                NAMESPACE  STATUS          CONTROLPLANE  WORKERS  KUBERNETES         ROLES   PLAN  TKR
    tkg-vc-antrea       default    createdStalled  0/3           0/3      v1.26.8+vmware.1   <none>  prod  v1.26.8---vmware.2-tkg.1
    
  9. Patch the cluster objects to set their paused property to false. This is required because cluster objects are re-created in a paused state on the new management cluster, to prevent their controllers from trying to reconcile:

    • To unpause a cluster after it is restored, run:

      kubectl -n my-namespace patch cluster CLUSTER-NAME --type merge -p '{"spec":{"paused":false}}'
      
    • To unpause all clusters in multiple namespaces, run the script:

      #!/bin/bash
      
      for ns in $(kubectl get ns -o custom-columns=":metadata.name" | grep -v "tkg-system");
      do
            clusters=$(kubectl -n $ns get cluster -o name)
            if [[ -n $clusters ]];then
                    kubectl -n $ns patch $clusters --type merge -p '{"spec":{"paused":false}}'
            fi
      done
      
  10. Verify that all workload clusters are in the running state, for example:

    tanzu cluster list
    NAME                NAMESPACE  STATUS   CONTROLPLANE  WORKERS  KUBERNETES         ROLES   PLAN  TKR
    tkg-vc-antrea       default    running  3/3           3/3      v1.26.8+vmware.1   <none>  prod  v1.26.8---vmware.2-tkg.1
    
  11. For each workload cluster, run tanzu cluster get CLUSTER-NAME to check that all components are in the running state, for example:

    tanzu cluster get tkg-vc-antrea
      NAME           NAMESPACE  STATUS   CONTROLPLANE  WORKERS  KUBERNETES        ROLES   TKR
      tkg-vc-antrea  default    running  3/3           3/3      v1.26.8+vmware.1 <none>  v1.26.8---vmware.2-tkg.1
    
    Details:
    
    NAME                                                          READY  SEVERITY  REASON  SINCE  MESSAGE
    /tkg-vc-antrea                                                True                     4h14m
    ├─ClusterInfrastructure - VSphereCluster/tkg-vc-antrea-s6kl5  True                     4h36m
    ├─ControlPlane - KubeadmControlPlane/tkg-vc-antrea-ch5hn      True                     4h14m
    │ ├─Machine/tkg-vc-antrea-ch5hn-8gfvt                         True                     4h14m
    │ ├─Machine/tkg-vc-antrea-ch5hn-vdcrp                         True                     4h23m
    │ └─Machine/tkg-vc-antrea-ch5hn-x7nmm                         True                     4h32m
    └─Workers
      ├─MachineDeployment/tkg-vc-antrea-md-0-8b8zn                True                     4h23m
      │ └─Machine/tkg-vc-antrea-md-0-8b8zn-798d5b8897-bnxn9       True                     4h24m
      ├─MachineDeployment/tkg-vc-antrea-md-1-m6dvh                True                     4h24m
      │ └─Machine/tkg-vc-antrea-md-1-m6dvh-79fb858b96-p9667       True                     4h28m
      └─MachineDeployment/tkg-vc-antrea-md-2-brm2m                True                     4h21m
        └─Machine/tkg-vc-antrea-md-2-brm2m-6478cffc5f-tq5cn       True                     4h23m
    

    After all workload clusters are running, you can manage the workload clusters with the Tanzu CLI.

  12. If you ran Drift Detector before you re-created the management cluster, manually remediate or investigate any objects flagged in the Drift Detector report as described in Remediating Drift.

  13. Regenerate and distribute new kubeconfig files for the management cluster and its workload clusters:

    1. Regenerate the management cluster kubeconfig:

      tanzu management-cluster kubeconfig get
      
    2. For each workload cluster, regenerate its kubeconfig:

      tanzu cluster kubeconfig get CLUSTER-NAME --namespace NAMESPACE
      
    3. Distribute the outputs of the above commands to anyone who uses the clusters, to replace their old kubeconfig files.
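
Steps 8 through 10 of this procedure depend on the STATUS column of tanzu cluster list. The following sketch lists the clusters that are not yet running, which can help confirm that every cluster was unpaused. The function name clusters_not_running is hypothetical, and the sketch assumes the column layout shown in the sample output above, with NAME, NAMESPACE, and STATUS as the first three fields.

```shell
#!/bin/sh
# Hypothetical helper: read the table printed by "tanzu cluster list" on stdin
# and print NAMESPACE/NAME for every cluster whose STATUS is not "running".
clusters_not_running() {
  awk 'NR > 1 && NF >= 3 && $3 != "running" { print $2 "/" $1 }'
}

# Example:
#   tanzu cluster list | clusters_not_running
```

Empty output means that every listed cluster reports the running status. A cluster that still appears, for example as createdStalled, may not have been unpaused yet.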

Handling Drift

Drift occurs when cluster objects have changed since their most recent backup, and so the state of the system after a restore does not match its desired, most recent state.

To minimize drift, VMware recommends scheduling frequent, regular backups.

To help detect and remediate drift, you can use the Drift Detector tool described in the sections below.

Using Drift Detector

Drift Detector is a command-line tool that:

  • Compares the content of a TKG backup with the current state of TKG cluster object infrastructure, and
  • Generates a report that lists potential issues and steps for remediating drift
Important

Drift Detector is in the unsupported Experimental state. Drift is complicated, and the Drift Detector may not detect all instances of drift. It should only be used as a reference, and never as a substitute for regular backups.

For how to install and use Drift Detector, see Drift Detector for Tanzu Kubernetes Grid Management Cluster on the VMware KB website. The overall process is:

  1. Before you restore TKG from backup, run the drift-detector command to generate a report.

  2. Download and restore TKG from the most recent backup.

  3. Referring to the Drift Detector report, follow the guidance in Remediating Drift to take remediation actions on the restored state of TKG.

Remediating Drift

Drift cases can be complicated, but if you have a Drift Detector report or otherwise detect some drift in your cluster object state since the last backup, you can remediate some common patterns as follows:

  • Stale worker nodes:

    • Extra, unused nodes
    • Can occur if worker node count was scaled down after backup
    • Mitigation is often unnecessary. After restore, Machine Health Check deletes the stale machine objects and new nodes are created to meet the desired machine count.
  • Ghost worker node infrastructure:

    • Superfluous, unmanaged node infrastructure
    • Can occur if worker node count was scaled up after backup
    • Mitigation:

      1. Get the workload cluster kubeconfig and set it as the kubectl context.
      2. Compare the output of the following kubectl and tanzu commands:

        # Get the actual worker nodes of the workload cluster
        $ kubectl --context tkg-vc-antrea-admin@tkg-vc-antrea get node
        NAME                                        STATUS   ROLES           AGE    VERSION
        tkg-vc-antrea-md-0-p9vn5-645498f59f-42qh9   Ready    <none>          44m    v1.26.8+vmware.1
        tkg-vc-antrea-md-0-p9vn5-645498f59f-shrpt   Ready    <none>          114m   v1.26.8+vmware.1
        tkg-vc-antrea-wdsfx-2hkxp                   Ready    control-plane   116m   v1.26.8+vmware.1
        
        # Get the worker nodes managed by TKG
        $ tanzu cluster get tkg-vc-antrea
          NAME           NAMESPACE  STATUS   CONTROLPLANE  WORKERS  KUBERNETES        ROLES   TKR
          tkg-vc-antrea  default    running  1/1           1/1      v1.26.8+vmware.1  <none>  v1.26.8---vmware.2-tkg.1-zshippable
        
          Details:
        
          NAME                                                          READY  SEVERITY  REASON  SINCE  MESSAGE
          /tkg-vc-antrea                                                True                     13m
          ├─ClusterInfrastructure - VSphereCluster/tkg-vc-antrea-b7fr9  True                     13m
          ├─ControlPlane - KubeadmControlPlane/tkg-vc-antrea-wdsfx      True                     13m
          │ └─Machine/tkg-vc-antrea-wdsfx-2hkxp                         True                     13m
          └─Workers
            └─MachineDeployment/tkg-vc-antrea-md-0-p9vn5                True                     13m
              └─Machine/tkg-vc-antrea-md-0-p9vn5-645498f59f-shrpt       True                     13m
        
      3. For each worker node listed by kubectl that doesn’t have a Workers > Machine listing from tanzu cluster get:

        1. Scale up the workers to the expected value, for example:

          tanzu cluster scale ${cluster_name} --worker-machine-count 2
          
        2. Use the cluster kubeconfig to drain the ghost node, which moves its workloads to nodes managed by TKG:
          kubectl drain ${node_name} --delete-emptydir-data --ignore-daemonsets
          
        3. Remove the ghost node from the cluster:

          kubectl delete node ${node_name}
          
        4. Log in to vSphere or other infrastructure and manually remove the VM.

  • Stale nodes and ghost infrastructure on control plane:

    • Unused nodes and superfluous node infrastructure for control plane
    • Can occur if control plane node was replaced after backup
    • Mitigation:

      1. Get the workload cluster kubeconfig and set it as the kubectl context.
      2. Compare the output of the following kubectl and tanzu commands:

        # Get the actual control plane nodes of the workload cluster
        $ kubectl --context wc-admin@wc get node
        NAME                             STATUS   ROLES           AGE    VERSION
        wc-2cjn4-4xbf8                   Ready    control-plane   107s   v1.26.8+vmware.1
        wc-2cjn4-4zljs                   Ready    control-plane   26h    v1.26.8+vmware.1
        wc-2cjn4-59v95                   Ready    control-plane   26h    v1.26.8+vmware.1
        wc-2cjn4-ncgxb                   Ready    control-plane   25h    v1.26.8+vmware.1
        wc-md-0-nl928-5df8b9bfbd-nww2w   Ready    <none>          26h    v1.26.8+vmware.1
        wc-md-1-j4m55-589cfcd9d6-jxmvc   Ready    <none>          26h    v1.26.8+vmware.1
        wc-md-2-sd4ww-7b7db5dcbb-crwdv   Ready    <none>          26h    v1.26.8+vmware.1
        
        # Get the control plane nodes managed by TKG
        $ tanzu cluster get wc
        NAME  NAMESPACE  STATUS   CONTROLPLANE  WORKERS  KUBERNETES        ROLES   TKR
        wc    default    updating 4/3           3/3      v1.26.8+vmware.1 <none>  v1.26.8---vmware.2-tkg.1-zshippable
        
        Details:
        
        NAME                                               READY  SEVERITY  REASON  SINCE  MESSAGE
        /wc                                                True                     24m
        ├─ClusterInfrastructure - VSphereCluster/wc-9nq7v  True                     26m
        ├─ControlPlane - KubeadmControlPlane/wc-2cjn4      True                     24m
        │ ├─Machine/wc-2cjn4-4xbf8                         True                     24m
        │ ├─Machine/wc-2cjn4-4zljs                         True                     26m
        │ └─Machine/wc-2cjn4-59v95                         True                     26m
        └─Workers
          ├─MachineDeployment/wc-md-0-nl928                True                     26m
          │ └─Machine/wc-md-0-nl928-5df8b9bfbd-nww2w       True                     26m
          ├─MachineDeployment/wc-md-1-j4m55                True                     26m
          │ └─Machine/wc-md-1-j4m55-589cfcd9d6-jxmvc       True                     26m
          └─MachineDeployment/wc-md-2-sd4ww                True                     26m
            └─Machine/wc-md-2-sd4ww-7b7db5dcbb-crwdv       True                     26m
        
      3. For each control-plane node listed by kubectl that doesn’t have a ControlPlane > Machine listing from tanzu cluster get:

        1. Delete the node:

          kubectl delete node ${node_name}
          
        2. Log in to vSphere or other infrastructure and manually remove the VM.
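
The comparison in both mitigation procedures above amounts to finding node names that kubectl reports but that have no Machine entry in the tanzu cluster get output. The following is a minimal sketch of that comparison. The function names machines_from_tanzu and ghost_nodes are hypothetical, and the sketch assumes that node names match Machine names, as they do in the sample output above.

```shell
#!/bin/sh
# Hypothetical helper: read "tanzu cluster get" output on stdin and print the
# name of every Machine entry. MachineDeployment entries are not matched.
machines_from_tanzu() {
  awk '{
    for (i = 1; i <= NF; i++) {
      n = index($i, "Machine/")
      if (n > 0) print substr($i, n + 8)   # skip the 8 characters of "Machine/"
    }
  }'
}

# Hypothetical helper: print the nodes that have no matching Machine.
#   $1: newline-separated node names reported by kubectl
#   $2: newline-separated Machine names managed by TKG
ghost_nodes() {
  printf '%s\n' "$1" | while IFS= read -r node; do
    [ -n "$node" ] || continue
    # -x matches whole lines, -F treats the name as a fixed string.
    if ! printf '%s\n' "$2" | grep -qxF -- "$node"; then
      printf '%s\n' "$node"
    fi
  done
}

# Example (the cluster name is a placeholder):
#   nodes="$(kubectl get node -o name | sed 's#^node/##')"
#   machines="$(tanzu cluster get CLUSTER-NAME | machines_from_tanzu)"
#   ghost_nodes "$nodes" "$machines"
```

Treat the result as a list of candidates to investigate, not as a list that is safe to delete automatically. Confirm each node against the infrastructure before you drain it or remove its VM.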