Back Up and Restore Management and Workload Cluster Infrastructure on vSphere (Technical Preview)

This topic explains how to back up and restore cluster infrastructure for Tanzu Kubernetes Grid (TKG) with a standalone management cluster on vSphere by:

Using Velero to back up and restore workload cluster objects on the standalone management cluster, and
Re-creating the standalone management cluster from its configuration files

Note

VMware does not support using Velero to back up TKG standalone management clusters.

If a standalone management cluster is reconfigured after it is deployed, re-creating it as described here may not recover all of its resources.

To back up and restore the workloads and dynamic storage volumes hosted on Tanzu Kubernetes Grid (TKG) workload clusters with a standalone management cluster, see Back Up and Restore Cluster Workloads.

To back up and restore vSphere with Tanzu clusters, including Supervisor Clusters and the workload clusters that they create, see Backing Up and Restoring vSphere with Tanzu in the VMware vSphere 8.0 Documentation.

Caution

This feature is in the unsupported Technical Preview state; see TKG Feature States.

Set Up Velero

You can use Velero, an open source community standard tool, to back up and restore TKG standalone management cluster infrastructure and workloads.

Velero supports a variety of storage providers to store its backups. Velero also supports:

Pre- and post-hooks for backup and restore to run custom processes before or after backup and restore events.
Excluding aspects of workload or cluster state that are not well-suited to backup/restore.

A Tanzu Kubernetes Grid subscription includes support for VMware’s tested, compatible distribution of Velero available from the Tanzu Kubernetes Grid downloads page.

To back up and restore TKG clusters, you need:

The Velero CLI v1.10.3 running on your local workstation; see Install the Velero CLI.
A storage provider with locations to save the backups to; see Set Up a Storage Provider.
A Velero server running on the clusters that you are backing up:
- Velero on a workload cluster backs up its workloads and dynamic storage, as described below.
- Velero on a standalone management cluster backs up workload cluster objects, as described in Back Up and Restore Management and Workload Cluster Infrastructure.

After you complete the prerequisites above, you can also use Velero to migrate workloads between clusters. For instructions, see Cluster Migration and Resource Filtering in the Velero documentation.

Install the Velero CLI

Caution
If you have already installed Velero CLI v1.9.x or earlier, as distributed with prior versions of TKG, you need to upgrade to v1.10.3. Older Velero versions do not work with the CRDs used in v1.10 and later. For information, see Upgrade Velero below.

To install the Velero CLI v1.10.3, do the following:

Go to the Broadcom Support Portal and log in with your VMware customer credentials.
Go to the Tanzu Kubernetes Grid downloads page.
Scroll to the Velero entries and download the Velero CLI .gz file for your workstation OS. Its filename starts with velero-linux-, velero-mac-, or velero-windows64-.
Use the gunzip command or the extraction tool of your choice to unpack the binary:
```
gzip -d <RELEASE-TARBALL-NAME>.gz
```
Rename the CLI binary for your platform to velero, make sure that it is executable, and add it to your PATH.
macOS and Linux
1. Move the binary into the /usr/local/bin folder and rename it to velero.
2. Make the file executable:
```
chmod +x /usr/local/bin/velero
```
Windows
1. Create a new Program Files\velero folder and copy the binary into it.
2. Rename the binary to velero.exe.
3. Right-click the velero folder, select Properties > Security, and make sure that your user account has the Full Control permission.
4. Use Windows Search to search for env.
5. Select Edit the system environment variables and click the Environment Variables button.
6. Select the Path row under System variables, and click Edit.
7. Click New to add a new row and enter the path to the velero binary.
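
On macOS and Linux, the unpack, rename, and chmod steps above can be combined into one helper. The following is a minimal sketch, not a VMware-provided script: install_velero_cli is a hypothetical function name, and the archive path stands for whichever file you downloaded.

```shell
# Sketch: unpack a downloaded Velero CLI archive and install it as "velero".
# install_velero_cli is a hypothetical helper, not part of TKG or Velero.
install_velero_cli() {
  local archive="$1"                      # the downloaded velero-*.gz file
  local dest_dir="${2:-/usr/local/bin}"   # defaults to the folder used above

  if [ ! -f "$archive" ]; then
    echo "archive not found: $archive" >&2
    return 1
  fi

  # gzip -dc writes the decompressed binary to stdout and keeps the archive.
  gzip -dc "$archive" > "$dest_dir/velero" || return 1
  chmod +x "$dest_dir/velero"
}
```

Writing to /usr/local/bin typically requires elevated permissions. To try the helper without them, pass a directory that you own as the second argument and add that directory to your PATH. Running velero version --client-only afterwards should confirm that the shell finds the binary.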

Upgrade Velero

Velero v1.10.3 uses different CRDs from v1.9.x. In addition, Velero v1.10 adopted Kopia alongside Restic as an uploader, which led to several changes in the naming of components and commands, and in how Velero functions. For more information about breaking changes between v1.9.x and v1.10, see Breaking Changes in the Velero v1.10 Changelog. If you installed Velero v1.9.x with a previous version of TKG, you must upgrade Velero.

Follow the procedure in Install the Velero CLI to install Velero v1.10.3.

Update the CRD definitions with the Velero v1.10 binary.

velero install --crds-only --dry-run -o yaml | kubectl apply -f -

Update the Velero deployment and daemon set configuration to match the component renaming that happened in Velero v1.10.

In the command below, uploader_type can be either restic or kopia.

kubectl get deploy -n velero -ojson \
| sed "s#\"image\"\: \"velero\/velero\:v[0-9]*.[0-9]*.[0-9]\"#\"image\"\: \"velero\/velero\:v1.10.0\"#g" \
| sed "s#\"server\",#\"server\",\"--uploader-type=$uploader_type\",#g" \
| sed "s#default-volumes-to-restic#default-volumes-to-fs-backup#g" \
| sed "s#default-restic-prune-frequency#default-repo-maintain-frequency#g" \
| sed "s#restic-timeout#fs-backup-timeout#g" \
| kubectl apply -f -

(Optional) If you are using the restic daemon set, rename the corresponding components.

echo $(kubectl get ds -n velero restic -ojson) \
| sed "s#\"image\"\: \"velero\/velero\:v[0-9]*.[0-9]*.[0-9]\"#\"image\"\: \"velero\/velero\:v1.10.0\"#g" \
| sed "s#\"name\"\: \"restic\"#\"name\"\: \"node-agent\"#g" \
| sed "s#\[ \"restic\",#\[ \"node-agent\",#g" \
| kubectl apply -f -
kubectl delete ds -n velero restic --force --grace-period 0

For more information, see Upgrading to Velero 1.10 in the Velero documentation.
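
Before applying the renamed configuration, it can help to preview what the flag substitutions do. The sketch below reuses the sed expressions from the deployment command above, without the image substitution, in a filter that reads JSON on stdin and never touches the cluster. preview_velero_renames is a hypothetical helper name.

```shell
# Sketch: preview the v1.10 flag renames on JSON read from stdin.
# preview_velero_renames is a hypothetical helper; it only transforms text
# and does not call kubectl.
preview_velero_renames() {
  local uploader_type="$1"
  case "$uploader_type" in
    restic|kopia) ;;
    *) echo "uploader type must be restic or kopia" >&2; return 1 ;;
  esac

  # Same substitutions as the upgrade command, minus the image rewrite.
  sed "s#\"server\",#\"server\",\"--uploader-type=${uploader_type}\",#g" \
  | sed "s#default-volumes-to-restic#default-volumes-to-fs-backup#g" \
  | sed "s#default-restic-prune-frequency#default-repo-maintain-frequency#g" \
  | sed "s#restic-timeout#fs-backup-timeout#g"
}
```

For example, piping the output of kubectl get deploy -n velero -ojson into preview_velero_renames kopia shows the rewritten arguments without changing anything on the cluster.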

Set Up a Storage Provider

To back up Tanzu Kubernetes Grid workload cluster contents, you need storage locations for:

Cluster object storage backups for Kubernetes metadata in clusters
Volume snapshots for data used by clusters

See Backup Storage Locations and Volume Snapshot Locations in the Velero documentation. Velero supports a variety of storage providers, which can be either:

An online cloud storage provider.
An on-premises object storage service such as MinIO, for proxied or air-gapped environments.

VMware recommends dedicating a unique storage bucket to each cluster.

To set up MinIO:

Run the minio container image with MinIO credentials and a storage location, for example:

$ docker run -d --name minio --rm -p 9000:9000 -e "MINIO_ACCESS_KEY=minio" -e "MINIO_SECRET_KEY=minio123" -e "MINIO_DEFAULT_BUCKETS=mgmt" gcr.io/velero-gcp/bitnami/minio:2021.6.17-debian-10-r7

Save the credentials to a local file to pass to the --secret-file option of velero install, for example:
```
[default]
aws_access_key_id=minio
aws_secret_access_key=minio123
```
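
Because the credentials file holds a secret key, you may prefer to create it with restrictive permissions. The following is a minimal sketch; write_velero_creds is a hypothetical helper name, and the key values are whatever you configured for your storage provider.

```shell
# Sketch: write a credentials file in the format shown above.
# write_velero_creds is a hypothetical helper, not part of Velero.
write_velero_creds() {
  local file="$1" key="$2" secret="$3"
  # Run in a subshell so the restrictive umask does not leak into the caller.
  # Note: umask only affects newly created files, not existing ones.
  (
    umask 077
    printf '[default]\naws_access_key_id=%s\naws_secret_access_key=%s\n' \
      "$key" "$secret" > "$file"
  )
}
```

For the MinIO example above, write_velero_creds ./VELERO-CREDS minio minio123 produces the file shown.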

Storage for vSphere

On vSphere, cluster object storage backups and volume snapshots save to the same storage location. This location must be S3-compatible external storage on Amazon Web Services (AWS), or an S3 provider such as MinIO.

To set up storage for Velero on vSphere, see Velero Plugin for vSphere in Vanilla Kubernetes Cluster for the v1.5.1 plugin.

Storage for and on AWS

To set up storage for Velero on AWS, follow the procedures in the Velero Plugins for AWS repository:

Set up S3 storage as needed for each plugin. The object store plugin stores and retrieves cluster object backups, and the volume snapshotter stores and retrieves data volumes.

Storage for and on Azure

To set up storage for Velero on Azure, follow the procedures in the Velero Plugins for Azure repository:

Set up Azure Blob storage as needed for each plugin. The object store plugin stores and retrieves cluster object backups, and the volume snapshotter stores and retrieves data volumes.

Deploy Velero Server to the Management Cluster

To back up workload cluster objects, install the Velero v1.10.3 server on the standalone management cluster and verify the installation.

Velero Install Options

To install Velero, run velero install with the following options:

--provider $PROVIDER: For example, aws
--plugins projects.registry.vmware.com/tkg/velero/velero-plugin-for-aws:v1.6.2_vmware.1
--bucket $BUCKET: The name of your S3 bucket
--backup-location-config region=$REGION: The AWS region the bucket is in
--snapshot-location-config region=$REGION: The AWS region the bucket is in
(Optional) --kubeconfig to install the Velero server to a cluster other than the current default.
(Optional) --secret-file ./VELERO-CREDS: One way to give Velero access to an S3 bucket is to pass this option a local VELERO-CREDS file that looks like:
```
[default]
aws_access_key_id=<AWS_ACCESS_KEY_ID>
aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>
```
For additional options, see Install and start Velero.

Running the velero install command creates a namespace called velero on the cluster, and places a deployment named velero in it.

The following settings are required:

Management cluster plugin: The following plugin is required. The option pauses the clusters and collects the resources related to the clusters being backed up.

--plugins projects.registry.vmware.com/tkg/velero/velero-mgmt-cluster-plugin:v0.2.0_vmware.1

Note

You can specify multiple plugins, separated by commas. For example:

--plugins projects.registry.vmware.com/tkg/velero/velero-plugin-for-aws:v1.6.2_vmware.1,projects.registry.vmware.com/tkg/velero/velero-mgmt-cluster-plugin:v0.2.0_vmware.1

No snapshot location: For backing up cluster infrastructure, do not set --snapshot-location-config.
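
Putting the options and required settings together, an install command for backing up cluster infrastructure might look like the following sketch. install_velero_mgmt is a hypothetical wrapper name; the bucket, region, and credentials file are values that you supply. The plugin image references are the ones listed above.

```shell
# Sketch: install the Velero server for management cluster infrastructure backup.
# install_velero_mgmt is a hypothetical wrapper around "velero install".
install_velero_mgmt() {
  local bucket="$1"   # name of your S3 bucket
  local region="$2"   # region that the bucket is in
  local creds="$3"    # path to the credentials file, for example ./VELERO-CREDS
  local registry="projects.registry.vmware.com/tkg/velero"

  # Both plugins are passed as one comma-separated --plugins value.
  # --snapshot-location-config is deliberately not set.
  velero install \
    --provider aws \
    --plugins "${registry}/velero-plugin-for-aws:v1.6.2_vmware.1,${registry}/velero-mgmt-cluster-plugin:v0.2.0_vmware.1" \
    --bucket "$bucket" \
    --backup-location-config "region=${region}" \
    --secret-file "$creds"
}
```

For an S3-compatible service such as MinIO, the AWS plugin typically also needs endpoint settings such as s3Url and s3ForcePathStyle in --backup-location-config; check the plugin documentation for the values that apply to your environment.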

Verify Velero Installation

After the velero install command completes, verify that Velero installed successfully:

Verify that the Velero pod has status Running:

kubectl -n velero get pod
NAME                      READY   STATUS    RESTARTS   AGE
velero-78fdbcd446-v5cqr   1/1     Running   0          3h41m

Verify that the backup location is in the Available phase:

velero backup-location get
NAME      PROVIDER   BUCKET/PREFIX   PHASE       LAST VALIDATED                  ACCESS MODE   DEFAULT
default   aws        mgmt            Available   2022-11-11 05:55:55 +0000 UTC   ReadWrite     true
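
If you want to script the second check, the table output can be filtered with awk. The following is a minimal sketch that assumes the column layout shown above, with PHASE as the fourth whitespace-separated field; all_locations_available is a hypothetical helper name.

```shell
# Sketch: succeed only if every backup location row reports the Available phase.
# Reads the table printed by "velero backup-location get" on stdin.
# Assumes PHASE is the fourth column, as in the sample output above.
all_locations_available() {
  awk '
    NR == 1 { next }                             # skip the header row
    NF == 0 { next }                             # ignore blank lines
    { rows++; if ($4 != "Available") bad++ }
    END { if (rows == 0 || bad > 0) exit 1 }     # fail on no rows or any bad row
  '
}
```

Usage: velero backup-location get | all_locations_available && echo "backup locations ready".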

Back Up Workload Cluster Objects

To back up all the workload cluster objects managed by a standalone management cluster, run:

velero backup create my-backup --exclude-namespaces tkg-system --include-resources cluster.cluster.x-k8s.io --wait

Note

--exclude-namespaces tkg-system excludes the management cluster itself.

--include-resources cluster.cluster.x-k8s.io includes the workload cluster objects.

VMware recommends backing up workload clusters immediately after making any structural changes, such as scaling up or down. This avoids a mismatch between backup objects and physical infrastructure that can make the restore process fail.
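
When you take a backup after each structural change, a timestamped backup name makes it easier to identify the most recent one. The following sketch wraps the backup command above; backup_workload_clusters and the default wc-objects name prefix are hypothetical.

```shell
# Sketch: create a workload cluster object backup with a UTC timestamp in its name.
# backup_workload_clusters is a hypothetical wrapper around "velero backup create".
backup_workload_clusters() {
  local prefix="${1:-wc-objects}"
  local name
  # Example result: wc-objects-20221207-171042
  name="${prefix}-$(date -u +%Y%m%d-%H%M%S)"

  velero backup create "$name" \
    --exclude-namespaces tkg-system \
    --include-resources cluster.cluster.x-k8s.io \
    --wait
}
```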

Scheduling Backups

When cluster objects are changed after the most recent backup, the state of the system after a restore does not match its desired, most recent state. This problem is called “drift”. See the Handling Drift section below for how to detect and recover from some common types of drift.

To minimize drift, VMware recommends using Velero to schedule frequent, regular backups. For example, to back up all workload clusters daily and retain each backup for 14 days:

velero create schedule daily-bak --schedule="@every 24h"  --exclude-namespaces tkg-system --include-resources cluster.cluster.x-k8s.io --ttl 336h0m0s

For more Velero scheduling options, see Schedule a Backup in the Velero documentation.
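
The --ttl value in the example above is the retention period expressed in hours: 14 days × 24 hours = 336 hours. A small helper can do the conversion for other retention periods. This is a sketch; ttl_for_days is a hypothetical name.

```shell
# Sketch: convert a retention period in whole days to a --ttl duration string.
# ttl_for_days is a hypothetical helper, not part of Velero.
ttl_for_days() {
  local days="$1"
  case "$days" in
    ''|*[!0-9]*) echo "days must be a non-negative integer" >&2; return 1 ;;
  esac
  printf '%dh0m0s\n' "$((days * 24))"
}
```

For example, --ttl "$(ttl_for_days 14)" passes the same 336h0m0s value used in the schedule above.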

Regenerating `kubeconfig` Files After Restore

After you use Velero to restore a workload cluster, you need to distribute its new kubeconfig file to anyone who uses it:

Regenerate the kubeconfig:

tanzu cluster kubeconfig get CLUSTER-NAME --namespace NAMESPACE

Distribute the output of the above command to anyone who uses the clusters, to replace their old kubeconfig file.
- kubeconfig files do not contain identities or credentials, and are safe to distribute as described in Learn to use Pinniped for federated authentication to Kubernetes clusters in the Pinniped documentation.

Complete Restore

To restore a standalone management cluster and the workload cluster objects that it manages, you re-create the management cluster from its configuration file, use Velero to restore its workload clusters, and distribute new kubeconfig files to the people who use them:

If you suspect drift between the most recent backup of workload cluster objects and their currently running state, use the Drift Detector tool to generate a remediation report, as described in Using Drift Detector.
Ensure that any configuration changes that were made to the management cluster after it was originally deployed are reflected in its configuration file or in environment variables. Otherwise it will not restore to its most recent state.
Re-create the management cluster from its configuration file, mgmt-cluster-config.yaml here, as described in Deploy Management Clusters from a Configuration File.
- If you deployed the management cluster or its workload clusters to multiple availability zones on vSphere as described in Running Clusters Across Multiple Availability Zones, also include the file with the VSphereFailureDomain and VSphereDeploymentZone object definitions, for example by including --az-file vsphere-zones.yaml in the tanzu mc create command.

Immediately after the management cluster is created, there should be only one TKR:

tanzu kubernetes-release get
NAME                       VERSION                  COMPATIBLE  ACTIVE  UPDATES AVAILABLE
v1.26.8---vmware.2-tkg.1  v1.26.8+vmware.1-tkg.1  True        True

Wait a few minutes until all of the TKRs used by backed-up workload clusters become available:

tanzu kubernetes-release get
NAME                       VERSION                  COMPATIBLE  ACTIVE  UPDATES AVAILABLE
v1.24.17---vmware.2-tkg.2  v1.24.17+vmware.2-tkg.2  True        True
v1.25.13---vmware.1-tkg.1  v1.25.13+vmware.1-tkg.1  True        True
v1.26.8---vmware.2-tkg.1   v1.26.8+vmware.1-tkg.1   True        True

Install Velero on the management cluster, following the Deploy Velero Server to the Management Cluster instructions above. Make sure that the credentials and backup location configuration settings have the same values as when the backup was made.

After Velero installs, run velero backup get until the backups are synchronized and the command lists the backup that you want to use:

velero backup get
NAME                 STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
my-backup            Completed   0        0          2022-12-07 17:10:42 +0000 UTC   24d       default            <none>

Run velero restore create to restore the workload cluster resources. VMware recommends using the most recent backup:

velero restore create my-restore --from-backup my-backup --wait

After the restoration completes, the clusters are in createdStalled status:

tanzu cluster list
NAME                NAMESPACE  STATUS          CONTROLPLANE  WORKERS  KUBERNETES         ROLES   PLAN  TKR
tkg-vc-antrea       default    createdStalled  0/3           0/3      v1.26.8+vmware.1   <none>  prod  v1.26.8---vmware.2-tkg.1

Patch the cluster objects to set their paused property to false. This is required because cluster objects are re-created in a paused state on the new management cluster, to prevent their controllers from trying to reconcile:

To unpause a cluster after it is restored, run:

kubectl -n my-namespace patch cluster CLUSTER-NAME --type merge -p '{"spec":{"paused":false}}'

To unpause all clusters in multiple namespaces, run the script:

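#!/bin/bash

# Unpause every Cluster object in every namespace whose name does not match tkg-system.
for ns in $(kubectl get ns -o custom-columns=":metadata.name" | grep -v "tkg-system");
do
      # List the Cluster objects in this namespace, one per line.
      clusters=$(kubectl -n "$ns" get cluster -o name)
      if [[ -n $clusters ]];then
              # $clusters is intentionally unquoted so that each name is passed as a separate argument.
              kubectl -n "$ns" patch $clusters --type merge -p '{"spec":{"paused":false}}'
      fi
done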

Verify that all workload clusters are in the running state, for example:

tanzu cluster list
NAME                NAMESPACE  STATUS   CONTROLPLANE  WORKERS  KUBERNETES         ROLES   PLAN  TKR
tkg-vc-antrea       default    running  3/3           3/3      v1.26.8+vmware.1   <none>  prod  v1.26.8---vmware.2-tkg.1

For each workload cluster, run tanzu cluster get CLUSTER-NAME to check that all components are in the running state, for example:

tanzu cluster get tkg-vc-antrea
  NAME           NAMESPACE  STATUS   CONTROLPLANE  WORKERS  KUBERNETES        ROLES   TKR
  tkg-vc-antrea  default    running  3/3           3/3      v1.26.8+vmware.1 <none>  v1.26.8---vmware.2-tkg.1

Details:

NAME                                                          READY  SEVERITY  REASON  SINCE  MESSAGE
/tkg-vc-antrea                                                True                     4h14m
├─ClusterInfrastructure - VSphereCluster/tkg-vc-antrea-s6kl5  True                     4h36m
├─ControlPlane - KubeadmControlPlane/tkg-vc-antrea-ch5hn      True                     4h14m
│ ├─Machine/tkg-vc-antrea-ch5hn-8gfvt                         True                     4h14m
│ ├─Machine/tkg-vc-antrea-ch5hn-vdcrp                         True                     4h23m
│ └─Machine/tkg-vc-antrea-ch5hn-x7nmm                         True                     4h32m
└─Workers
  ├─MachineDeployment/tkg-vc-antrea-md-0-8b8zn                True                     4h23m
  │ └─Machine/tkg-vc-antrea-md-0-8b8zn-798d5b8897-bnxn9       True                     4h24m
  ├─MachineDeployment/tkg-vc-antrea-md-1-m6dvh                True                     4h24m
  │ └─Machine/tkg-vc-antrea-md-1-m6dvh-79fb858b96-p9667       True                     4h28m
  └─MachineDeployment/tkg-vc-antrea-md-2-brm2m                True                     4h21m
    └─Machine/tkg-vc-antrea-md-2-brm2m-6478cffc5f-tq5cn       True                     4h23m

After all workload clusters are running, you can manage the workload clusters with the Tanzu CLI.

If you ran Drift Detector before you re-created the management cluster, manually remediate or investigate any objects flagged in the Drift Detector report as described in Remediating Drift.
Regenerate and distribute new kubeconfig files for the management cluster and its workload clusters:
1. Regenerate the management cluster kubeconfig:
```
tanzu management-cluster kubeconfig get
```
2. For each workload cluster, regenerate its kubeconfig:
```
tanzu cluster kubeconfig get CLUSTER-NAME --namespace NAMESPACE
```
3. Distribute the outputs of the above commands to anyone who uses the clusters, to replace their old kubeconfig files.
  - kubeconfig files do not contain identities or credentials, and are safe to distribute as described in Learn to use Pinniped for federated authentication to Kubernetes clusters in the Pinniped documentation.
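
Several steps in this procedure involve waiting: for TKRs to become available, for backups to synchronize, and for clusters to reach the running state. A generic retry helper can automate the waiting. The following is a minimal sketch; retry_until is a hypothetical name, and it works for any command that exits non-zero until the condition is met.

```shell
# Sketch: run a command repeatedly until it succeeds or attempts run out.
# Usage: retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# retry_until is a hypothetical helper, not part of Velero or the Tanzu CLI.
retry_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, retry_until 30 10 velero backup get my-backup polls for up to about five minutes, on the assumption that velero backup get NAME exits with a non-zero status until the named backup has synchronized.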

Handling Drift

Drift occurs when cluster objects have changed since their most recent backup, and so the state of the system after a restore does not match its desired, most recent state.

To minimize drift, VMware recommends scheduling frequent, regular backups.

To help detect and remediate drift, you can use the Drift Detector tool described in the sections below.

Using Drift Detector

Drift Detector is a command-line tool that:

Compares the content of a TKG backup with the current state of TKG cluster object infrastructure, and
Generates a report that lists potential issues and steps for remediating drift

Important
Drift Detector is in the unsupported Experimental state. Drift is complicated, and the Drift Detector may not detect all instances of drift. It should only be used as a reference, and never as a substitute for regular backups.

For how to install and use Drift Detector, see Drift Detector for Tanzu Kubernetes Grid Management Cluster on the VMware KB website. The overall process is:

Before you restore TKG from backup, run the drift-detector command to generate a report.
Download and restore TKG from the most recent backup.
Referring to the Drift Detector report, follow the guidance in Remediating Drift to take remediation actions on the restored state of TKG.

Remediating Drift

Drift cases can be complicated, but if you have a Drift Detector report or otherwise detect some drift in your cluster object state since the last backup, you can remediate some common patterns as follows:

Stale worker nodes:
- Extra, unused nodes
- Can occur if worker node count was scaled down after backup
- Mitigation often unnecessary. After restore, Machine Health Check deletes the stale machine objects and new nodes are created to meet the desired machine count.

Ghost worker node infrastructure:

Superfluous, unmanaged node infrastructure
Can occur if worker node count was scaled up after backup

Mitigation:

Get the workload cluster kubeconfig and set it as the kubectl context.

Compare the output of the following kubectl and tanzu commands:

# Get the actual worker nodes of the workload cluster
$ kubectl --context tkg-vc-antrea-admin@tkg-vc-antrea get node
NAME                                        STATUS   ROLES           AGE    VERSION
tkg-vc-antrea-md-0-p9vn5-645498f59f-42qh9   Ready    <none>          44m    v1.26.8+vmware.1
tkg-vc-antrea-md-0-p9vn5-645498f59f-shrpt   Ready    <none>          114m   v1.26.8+vmware.1
tkg-vc-antrea-wdsfx-2hkxp                   Ready    control-plane   116m   v1.26.8+vmware.1

# Get the worker nodes managed by TKG
$ tanzu cluster get tkg-vc-antrea
  NAME           NAMESPACE  STATUS   CONTROLPLANE  WORKERS  KUBERNETES        ROLES   TKR
  tkg-vc-antrea  default    running  1/1           1/1      v1.26.8+vmware.1  <none>  v1.26.8---vmware.2-tkg.1-zshippable

  Details:

  NAME                                                          READY  SEVERITY  REASON  SINCE  MESSAGE
  /tkg-vc-antrea                                                True                     13m
  ├─ClusterInfrastructure - VSphereCluster/tkg-vc-antrea-b7fr9  True                     13m
  ├─ControlPlane - KubeadmControlPlane/tkg-vc-antrea-wdsfx      True                     13m
  │ └─Machine/tkg-vc-antrea-wdsfx-2hkxp                         True                     13m
  └─Workers
    └─MachineDeployment/tkg-vc-antrea-md-0-p9vn5                True                     13m
      └─Machine/tkg-vc-antrea-md-0-p9vn5-645498f59f-shrpt       True                     13m

For each worker node listed by kubectl that doesn’t have a Workers > Machine listing from tanzu cluster get:
1. Scale up the workers to the expected value, for example:
```
tanzu cluster scale ${cluster_name} --worker-machine-count 2
```
2. Use the cluster kubeconfig to drain the ghost node, which moves its workloads to nodes managed by TKG:
```
kubectl drain ${node_name} --delete-emptydir-data --ignore-daemonsets
```
3. Remove the ghost node from the cluster:
```
kubectl delete node ${node_name}
```
4. Log in to vSphere or other infrastructure and manually remove the VM.
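
The comparison between the kubectl and tanzu output can also be done mechanically once you have the two name lists in files. The sketch below assumes, as in the sample output above, that each node name equals the name of its Machine object. list_ghost_nodes is a hypothetical helper, and the commands in the comments are one possible way to produce the input files.

```shell
# Sketch: print node names that have no matching Machine name.
# Inputs are two text files with one name per line, for example:
#   workload cluster:   kubectl get node -o name | sed 's#.*/##' > nodes.txt
#   management cluster: kubectl -n NAMESPACE get machine -o name | sed 's#.*/##' > machines.txt
# list_ghost_nodes is a hypothetical helper; it assumes node and Machine names match.
list_ghost_nodes() {
  local nodes_file="$1" machines_file="$2"
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file.
  # grep exits 1 when nothing is printed, which is not an error here.
  grep -vxF -f "$machines_file" "$nodes_file" || true
}
```

With the sample output above, the only name printed would be tkg-vc-antrea-md-0-p9vn5-645498f59f-42qh9, which is the node to drain and delete.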

Stale nodes and ghost infrastructure on control plane

Unused nodes and superfluous node infrastructure for control plane
Can occur if control plane node was replaced after backup