This topic explains how to back up and restore cluster infrastructure for Tanzu Kubernetes Grid (TKG) with a standalone management cluster on vSphere by using Velero to back up and restore the workload cluster objects that the management cluster manages, and by re-creating the management cluster itself from its configuration file.
Note
- VMware does not support using Velero to back up TKG standalone management clusters.
- If a standalone management cluster is reconfigured after it is deployed, re-creating it as described here may not recover all of its resources.
To back up and restore the workloads and dynamic storage volumes hosted on Tanzu Kubernetes Grid (TKG) workload clusters with a standalone management cluster, see Back Up and Restore Cluster Workloads.
To back up and restore vSphere with Tanzu clusters, including Supervisor Clusters and the workload clusters that they create, see Backing Up and Restoring vSphere with Tanzu in the VMware vSphere 8.0 Documentation.
Caution
- This feature is in the unsupported Technical Preview state; see TKG Feature States.
You can use Velero, an open source community standard tool, to back up and restore TKG standalone management cluster infrastructure and workloads.
Velero supports a variety of storage providers to store its backups. Velero also supports pre and post hooks that run custom processes before and after backup and restore events.
A Tanzu Kubernetes Grid subscription includes support for VMware’s tested, compatible distribution of Velero available from the Tanzu Kubernetes Grid downloads page.
To back up and restore TKG clusters, you need the Velero CLI installed on your local workstation and a storage provider with locations to save the backups to, as described in the sections below.
After you complete the prerequisites above, you can also use Velero to migrate workloads between clusters. For instructions, see Cluster Migration and Resource Filtering in the Velero documentation.
Caution
If you have already installed Velero CLI v1.9.x or earlier, as distributed with prior versions of TKG, you need to upgrade to v1.10.3. Older Velero versions do not work with the CRDs used in v1.10 and later. For information, see Upgrade Velero below.
To install the Velero CLI v1.10.3, do the following:
Go to the Tanzu Kubernetes Grid downloads page and download the .gz file for your workstation OS. Its filename starts with velero-linux-, velero-mac-, or velero-windows64-.
Use the gunzip command or the extraction tool of your choice to unpack the binary:
gzip -d <RELEASE-TARBALL-NAME>.gz
Rename the CLI binary for your platform to velero, make sure that it is executable, and add it to your PATH.
- For macOS and Linux platforms:
  - Move the binary into the /usr/local/bin folder and rename it to velero.
  - Make the file executable:
    chmod +x /usr/local/bin/velero
- For Windows platforms:
  - Create a new Program Files\velero folder and copy the binary into it.
  - Rename the binary to velero.exe.
  - Right-click the velero folder, select Properties > Security, and make sure that your user account has the Full Control permission.
  - Search for env and select Edit environment variables for your account.
  - Select the Path row under System variables, and click Edit.
  - Add a new entry for the path to the velero binary.
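For example, on a Linux workstation the whole CLI installation can be scripted roughly as follows. This is a sketch only; the tarball name shown is illustrative and will differ depending on the build you download:
# Unpack the downloaded Velero CLI (the exact filename depends on the release you downloaded)
gzip -d velero-linux-v1.10.3+vmware.1.gz
# Rename the binary to velero, make it executable, and put it on the PATH
sudo mv velero-linux-v1.10.3+vmware.1 /usr/local/bin/velero
sudo chmod +x /usr/local/bin/velero
# Confirm that the CLI runs
velero version --client-only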
Velero v1.10.3 uses different CRDs from v1.9.x. In addition, Velero v1.10 adopted Kopia alongside Restic as an uploader option, which led to several changes in the naming of components and commands, and in how Velero functions. For more information about breaking changes between v1.9.x and v1.10, see Breaking Changes in the Velero v1.10 Changelog. If you installed Velero v1.9.x with a previous version of TKG, you must upgrade Velero.
Update the CRD definitions with the Velero v1.10 binary.
velero install --crds-only --dry-run -o yaml | kubectl apply -f -
Update the Velero deployment and daemon set configuration to match the component renaming that happened in Velero v1.10.
In the command below, uploader_type can be either restic or kopia.
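For example, set the variable in your shell before running the command (kopia is shown here as an illustrative choice; restic works the same way):
# Choose the uploader that the upgraded Velero deployment will use
uploader_type=kopia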
kubectl get deploy -n velero -ojson \
| sed "s#\"image\"\: \"velero\/velero\:v[0-9]*.[0-9]*.[0-9]\"#\"image\"\: \"velero\/velero\:v1.10.0\"#g" \
| sed "s#\"server\",#\"server\",\"--uploader-type=$uploader_type\",#g" \
| sed "s#default-volumes-to-restic#default-volumes-to-fs-backup#g" \
| sed "s#default-restic-prune-frequency#default-repo-maintain-frequency#g" \
| sed "s#restic-timeout#fs-backup-timeout#g" \
| kubectl apply -f -
(Optional) If you are using the restic daemon set, rename the corresponding components.
echo $(kubectl get ds -n velero restic -ojson) \
| sed "s#\"image\"\: \"velero\/velero\:v[0-9]*.[0-9]*.[0-9]\"#\"image\"\: \"velero\/velero\:v1.10.0\"#g" \
| sed "s#\"name\"\: \"restic\"#\"name\"\: \"node-agent\"#g" \
| sed "s#\[ \"restic\",#\[ \"node-agent\",#g" \
| kubectl apply -f -
kubectl delete ds -n velero restic --force --grace-period 0
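As a quick sanity check after the upgrade, you can confirm the image version and the renamed daemon set. This sketch assumes Velero is installed in the default velero namespace:
# Check the server image now referenced by the Velero deployment
kubectl -n velero get deployment velero -o jsonpath='{.spec.template.spec.containers[0].image}'
# If you renamed the restic daemon set, it should now appear as node-agent
kubectl -n velero get daemonset node-agent
# Confirm that the client and server versions match
velero version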
For more information, see Upgrading to Velero 1.10 in the Velero documentation.
To back up Tanzu Kubernetes Grid workload cluster contents, you need storage locations for cluster object backups and for volume snapshots of the data that the clusters use.
See Backup Storage Locations and Volume Snapshot Locations in the Velero documentation. Velero supports a variety of storage providers, which can be either an online cloud storage provider or an on-premises object storage service such as MinIO, for air-gapped environments.
VMware recommends dedicating a unique storage bucket to each cluster.
To set up MinIO:
Run the minio container image with MinIO credentials and a storage location, for example:
$ docker run -d --name minio --rm -p 9000:9000 -e "MINIO_ACCESS_KEY=minio" -e "MINIO_SECRET_KEY=minio123" -e "MINIO_DEFAULT_BUCKETS=mgmt" gcr.io/velero-gcp/bitnami/minio:2021.6.17-debian-10-r7
Save the credentials to a local file to pass to the --secret-file option of velero install, for example:
[default]
aws_access_key_id=minio
aws_secret_access_key=minio123
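To confirm that the mgmt bucket exists and is reachable before pointing Velero at it, you can list it with any S3-compatible client. This sketch assumes the AWS CLI is installed and MinIO is listening on localhost:9000 as in the example above:
# List buckets on the local MinIO endpoint using the credentials above
AWS_ACCESS_KEY_ID=minio AWS_SECRET_ACCESS_KEY=minio123 \
  aws --endpoint-url http://localhost:9000 s3 ls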
On vSphere, cluster object storage backups and volume snapshots save to the same storage location. This location must be S3-compatible external storage on Amazon Web Services (AWS), or an S3 provider such as MinIO.
To set up storage for Velero on vSphere, see Velero Plugin for vSphere in Vanilla Kubernetes Cluster for the v1.5.1 plugin.
To set up storage for Velero on AWS, follow the procedures in the Velero Plugins for AWS repository.
Set up S3 storage as needed for each plugin. The object store plugin stores and retrieves cluster object backups, and the volume snapshotter stores and retrieves data volumes.
To set up storage for Velero on Azure, follow the procedures in the Velero Plugins for Azure repository.
Set up storage as needed for each plugin. The object store plugin stores and retrieves cluster object backups, and the volume snapshotter stores and retrieves data volumes.
To back up workload cluster objects, install the Velero v1.10.3 server to the standalone management cluster and verify the installation.
To install Velero, run velero install with the following options:
- --provider $PROVIDER: For example, aws
- --plugins projects.registry.vmware.com/tkg/velero/velero-plugin-for-aws:v1.6.2_vmware.1
- --bucket $BUCKET: The name of your S3 bucket
- --backup-location-config region=$REGION: The AWS region the bucket is in
- --snapshot-location-config region=$REGION: The AWS region the bucket is in
- --kubeconfig: To install the Velero server to a cluster other than the current default
- (Optional) --secret-file ./VELERO-CREDS: One way to give Velero access to an S3 bucket is to pass in to this option a local VELERO-CREDS file that looks like:
[default]
aws_access_key_id=<AWS_ACCESS_KEY_ID>
aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>
For additional options, see Install and start Velero.
Running the velero install command creates a namespace called velero on the cluster, and places a deployment named velero in it.
The following settings are required:
- --plugins projects.registry.vmware.com/tkg/velero/velero-mgmt-cluster-plugin:v0.2.0_vmware.1
  Note
  You can add multiple options separated by a comma. For example:
  --plugins projects.registry.vmware.com/tkg/velero/velero-plugin-for-aws:v1.6.2_vmware.1,projects.registry.vmware.com/tkg/velero/velero-mgmt-cluster-plugin:v0.2.0_vmware.1
- No --snapshot-location-config: do not set this option.
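Putting these together, a complete install command might look like the sketch below. The bucket name mgmt, the MinIO endpoint placeholder, and the credentials file path are assumptions carried over from the MinIO example above, and --use-volume-snapshots=false is one way to install without creating a snapshot location:
velero install \
    --provider aws \
    --plugins projects.registry.vmware.com/tkg/velero/velero-plugin-for-aws:v1.6.2_vmware.1,projects.registry.vmware.com/tkg/velero/velero-mgmt-cluster-plugin:v0.2.0_vmware.1 \
    --bucket mgmt \
    --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://<MINIO-HOST>:9000 \
    --secret-file ./velero-creds \
    --use-volume-snapshots=false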
After the velero install command completes, verify that Velero installed successfully:
Verify that the Velero pod has status Running:
kubectl -n velero get pod
NAME READY STATUS RESTARTS AGE
velero-78fdbcd446-v5cqr 1/1 Running 0 3h41m
Verify that the backup location is in the Available phase:
velero backup-location get
NAME PROVIDER BUCKET/PREFIX PHASE LAST VALIDATED ACCESS MODE DEFAULT
default aws mgmt Available 2022-11-11 05:55:55 +0000 UTC ReadWrite true
To back up all the workload cluster objects managed by a standalone management cluster, run:
velero backup create my-backup --exclude-namespaces tkg-system --include-resources cluster.cluster.x-k8s.io --wait
Note
- --exclude-namespaces tkg-system excludes the management cluster itself.
- --include-resources cluster.cluster.x-k8s.io includes the workload cluster objects.

VMware recommends backing up workload clusters immediately after making any structural changes, such as scaling up or down. This avoids a mismatch between backup objects and physical infrastructure that can make the restore process fail.
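To confirm that the backup completed and to inspect what it captured, you can use the standard Velero commands; my-backup here matches the example name above:
# Check the overall status of the backup
velero backup get my-backup
# Show details, including the resources that were captured
velero backup describe my-backup --details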
When cluster objects are changed after the most recent backup, the state of the system after a restore does not match its desired, most recent state. This problem is called “drift”. See the Handling Drift section below for how to detect and recover from some common types of drift.
To minimize drift, VMware recommends using Velero to schedule frequent, regular backups. For example, to back up all workload clusters daily and retain each backup for 14 days:
velero create schedule daily-bak --schedule="@every 24h" --exclude-namespaces tkg-system --include-resources cluster.cluster.x-k8s.io --ttl 336h0m0s
For more Velero scheduling options, see Schedule a Backup in the Velero documentation.
Distribute New kubeconfig Files After Restore

After you use Velero to restore a workload cluster, you need to distribute its new kubeconfig file to anyone who uses it:
Regenerate the kubeconfig:
tanzu cluster kubeconfig get CLUSTER-NAME --namespace NAMESPACE
Distribute the output of the above command to anyone who uses the clusters, to replace their old kubeconfig file.
Note
kubeconfig files do not contain identities or credentials, and are safe to distribute as described in Learn to use Pinniped for federated authentication to Kubernetes clusters in the Pinniped documentation.

To restore a standalone management cluster and the workload cluster objects that it manages, you re-create the management cluster from its configuration file, use Velero to restore its workload clusters, and distribute new kubeconfig files to the people who use them:
If you suspect drift between the most recent backup of workload cluster objects and their currently running state, use the Drift Detector tool to generate a remediation report, as described in Using Drift Detector.
Ensure that any configuration changes that were made to the management cluster after it was originally deployed are reflected in its configuration file or in environment variables. Otherwise it will not restore to its most recent state.
Re-create the management cluster from its configuration file, mgmt-cluster-config.yaml here, as described in Deploy Management Clusters from a Configuration File.
If the management cluster uses availability zones, also re-apply the VSphereFailureDomain and VSphereDeploymentZone object definitions, for example by including --az-file vsphere-zones.yaml in the tanzu mc create command.
Immediately after the management cluster is created, there should be only one TKR:
tanzu kubernetes-release get
NAME VERSION COMPATIBLE ACTIVE UPDATES AVAILABLE
v1.26.8---vmware.2-tkg.1 v1.26.8+vmware.1-tkg.1 True True
Wait a few minutes until all of the TKRs used by backed-up workload clusters become available:
tanzu kubernetes-release get
NAME VERSION COMPATIBLE ACTIVE UPDATES AVAILABLE
v1.24.17---vmware.2-tkg.2 v1.24.17+vmware.2-tkg.2 True True
v1.25.13---vmware.1-tkg.1 v1.25.13+vmware.1-tkg.1 True True
v1.26.8---vmware.2-tkg.1 v1.26.8+vmware.1-tkg.1 True True
Install Velero on the management cluster, following the Deploy Velero Server to Clusters instructions above. Make sure that the credentials and backup location configuration settings have the same values as when the backup was made.
After Velero installs, run velero backup get until the backups are synchronized and the command lists the backup that you want to use:
velero backup get
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
my-backup Completed 0 0 2022-12-07 17:10:42 +0000 UTC 24d default <none>
Run velero restore create to restore the workload cluster resources. VMware recommends using the most recent backup:
velero restore create my-restore --from-backup my-backup --wait
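If you want to review the restore while it runs or after it finishes, the standard Velero command applies; my-restore matches the example name above:
# Check the status of the restore and list any warnings or errors
velero restore describe my-restore --details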
After the restoration completes, the clusters are in a createdStalled status:
tanzu cluster list
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES ROLES PLAN TKR
tkg-vc-antrea default createdStalled 0/3 0/3 v1.26.8+vmware.1 <none> prod v1.26.8---vmware.2-tkg.1
Patch the cluster objects to set their paused property to false. This is required because cluster objects are re-created in a paused state on the new management cluster, to prevent their controllers from trying to reconcile:
To unpause a cluster after it is restored, run:
kubectl -n my-namespace patch cluster CLUSTER-NAME --type merge -p '{"spec":{"paused":false}}'
To unpause all clusters in multiple namespaces, run the following script:
#!/bin/bash
for ns in $(kubectl get ns -o custom-columns=":metadata.name" | grep -v "tkg-system");
do
clusters=$(kubectl -n $ns get cluster -o name)
if [[ -n $clusters ]];then
kubectl -n $ns patch $clusters --type merge -p '{"spec":{"paused":false}}'
fi
done
Verify that all workload clusters are in the running state, for example:
tanzu cluster list
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES ROLES PLAN TKR
tkg-vc-antrea default running 3/3 3/3 v1.26.8+vmware.1 <none> prod v1.26.8---vmware.2-tkg.1
For each workload cluster, run tanzu cluster get CLUSTER-NAME to check that all components are in the running state, for example:
tanzu cluster get tkg-vc-antrea
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES ROLES TKR
tkg-vc-antrea default running 3/3 3/3 v1.26.8+vmware.1 <none> v1.26.8---vmware.2-tkg.1
Details:
NAME READY SEVERITY REASON SINCE MESSAGE
/tkg-vc-antrea True 4h14m
├─ClusterInfrastructure - VSphereCluster/tkg-vc-antrea-s6kl5 True 4h36m
├─ControlPlane - KubeadmControlPlane/tkg-vc-antrea-ch5hn True 4h14m
│ ├─Machine/tkg-vc-antrea-ch5hn-8gfvt True 4h14m
│ ├─Machine/tkg-vc-antrea-ch5hn-vdcrp True 4h23m
│ └─Machine/tkg-vc-antrea-ch5hn-x7nmm True 4h32m
└─Workers
  ├─MachineDeployment/tkg-vc-antrea-md-0-8b8zn True 4h23m
  │ └─Machine/tkg-vc-antrea-md-0-8b8zn-798d5b8897-bnxn9 True 4h24m
  ├─MachineDeployment/tkg-vc-antrea-md-1-m6dvh True 4h24m
  │ └─Machine/tkg-vc-antrea-md-1-m6dvh-79fb858b96-p9667 True 4h28m
  └─MachineDeployment/tkg-vc-antrea-md-2-brm2m True 4h21m
    └─Machine/tkg-vc-antrea-md-2-brm2m-6478cffc5f-tq5cn True 4h23m
After all workload clusters are running, you can manage the workload clusters with the Tanzu CLI.
If you ran Drift Detector before you re-created the management cluster, manually remediate or investigate any objects flagged in the Drift Detector report as described in Remediate Drift.
Regenerate and distribute new kubeconfig files for the management cluster and its workload clusters:
Regenerate the management cluster kubeconfig:
tanzu management-cluster kubeconfig get
For each workload cluster, regenerate its kubeconfig:
tanzu cluster kubeconfig get CLUSTER-NAME --namespace NAMESPACE
Distribute the outputs of the above commands to anyone who uses the clusters, to replace their old kubeconfig files.
Note
kubeconfig files do not contain identities or credentials, and are safe to distribute as described in Learn to use Pinniped for federated authentication to Kubernetes clusters in the Pinniped documentation.

Drift occurs when cluster objects have changed since their most recent backup, and so the state of the system after a restore does not match its desired, most recent state.
To minimize drift, VMware recommends scheduling frequent, regular backups.
To help detect and remediate drift, you can use the Drift Detector tool described in the sections below.
Drift Detector is a command-line tool that compares the contents of a backup with the current state of TKG cluster objects and generates a report of any drift that it finds.
Important
Drift Detector is in the unsupported Experimental state. Drift is complicated, and the Drift Detector may not detect all instances of drift. It should only be used as a reference, and never as a substitute for regular backups.
For how to install and use Drift Detector, see Drift Detector for Tanzu Kubernetes Grid Management Cluster on VMware KB website. The overall process is:
Before you restore TKG from backup, run the drift-detector command to generate a report.
Download and restore TKG from the most recent backup.
Referring to the Drift Detector report, follow the guidance in Remediating Drift to take remediation actions on the restored state of TKG.
Drift cases can be complicated, but if you have a Drift Detector report or otherwise detect some drift in your cluster object state since the last backup, you can remediate some common patterns as follows:
Stale worker nodes: worker Machine objects restored from the backup that no longer correspond to running worker node VMs, for example if workers were scaled down after the backup was taken.
Ghost worker node infrastructure: worker node VMs that are running and joined to the cluster but have no corresponding Machine objects in the restored TKG state, for example if workers were scaled up after the backup was taken.
Mitigation:
Get the workload cluster kubeconfig and set it as the kubectl context.
Compare the output of the following kubectl and tanzu commands:
# Get the actual worker nodes of the workload cluster
$ kubectl --context tkg-vc-antrea-admin@tkg-vc-antrea get node
NAME STATUS ROLES AGE VERSION
tkg-vc-antrea-md-0-p9vn5-645498f59f-42qh9 Ready <none> 44m v1.26.8+vmware.1
tkg-vc-antrea-md-0-p9vn5-645498f59f-shrpt Ready <none> 114m v1.26.8+vmware.1
tkg-vc-antrea-wdsfx-2hkxp Ready control-plane 116m v1.26.8+vmware.1
# Get the worker nodes managed by TKG
$ tanzu cluster get tkg-vc-antrea
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES ROLES TKR
tkg-vc-antrea default running 1/1 1/1 v1.26.8+vmware.1 <none> v1.26.8---vmware.2-tkg.1-zshippable
Details:
NAME READY SEVERITY REASON SINCE MESSAGE
/tkg-vc-antrea True 13m
├─ClusterInfrastructure - VSphereCluster/tkg-vc-antrea-b7fr9 True 13m
├─ControlPlane - KubeadmControlPlane/tkg-vc-antrea-wdsfx True 13m
│ └─Machine/tkg-vc-antrea-wdsfx-2hkxp True 13m
└─Workers
  └─MachineDeployment/tkg-vc-antrea-md-0-p9vn5 True 13m
    └─Machine/tkg-vc-antrea-md-0-p9vn5-645498f59f-shrpt True 13m
For each worker node listed by kubectl that doesn’t have a Workers > Machine listing from tanzu cluster get:
Scale up the workers to the expected value, for example:
tanzu cluster scale ${cluster_name} --worker-machine-count 2
Use the workload cluster kubeconfig to drain the ghost node, which moves its workloads to nodes managed by TKG:
kubectl drain ${node_name} --delete-emptydir-data --ignore-daemonsets
Remove the ghost node from the cluster:
kubectl delete node ${node_name}
Log in to vSphere or other infrastructure and manually remove the VM.
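For example, on vSphere you could remove the VM with the govc CLI. This is a sketch only; it assumes govc is configured with your vCenter credentials (GOVC_URL and related variables) and that the VM name matches the Kubernetes node name. You can also use the vSphere Client instead:
# Power off and delete the VM that backed the ghost node
govc vm.power -off -force ${node_name}
govc vm.destroy ${node_name}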
Stale nodes and ghost infrastructure on the control plane:
Mitigation:
Get the workload cluster kubeconfig and set it as the kubectl context.
Compare the output of the following kubectl and tanzu commands:
# Get the actual control plane nodes of the workload cluster
$ kubectl --context wc-admin@wc get node
NAME STATUS ROLES AGE VERSION
wc-2cjn4-4xbf8 Ready control-plane 107s v1.26.8+vmware.1
wc-2cjn4-4zljs Ready control-plane 26h v1.26.8+vmware.1
wc-2cjn4-59v95 Ready control-plane 26h v1.26.8+vmware.1
wc-2cjn4-ncgxb Ready control-plane 25h v1.26.8+vmware.1
wc-md-0-nl928-5df8b9bfbd-nww2w Ready <none> 26h v1.26.8+vmware.1
wc-md-1-j4m55-589cfcd9d6-jxmvc Ready <none> 26h v1.26.8+vmware.1
wc-md-2-sd4ww-7b7db5dcbb-crwdv Ready <none> 26h v1.26.8+vmware.1
# Get the control plane nodes managed by TKG
$ tanzu cluster get wc
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES ROLES TKR
wc default updating 4/3 3/3 v1.26.8+vmware.1 <none> v1.26.8---vmware.2-tkg.1-zshippable
Details:
NAME READY SEVERITY REASON SINCE MESSAGE
/wc True 24m
├─ClusterInfrastructure - VSphereCluster/wc-9nq7v True 26m
├─ControlPlane - KubeadmControlPlane/wc-2cjn4 True 24m
│ ├─Machine/wc-2cjn4-4xbf8 True 24m
│ ├─Machine/wc-2cjn4-4zljs True 26m
│ └─Machine/wc-2cjn4-59v95 True 26m
└─Workers
  ├─MachineDeployment/wc-md-0-nl928 True 26m
  │ └─Machine/wc-md-0-nl928-5df8b9bfbd-nww2w True 26m
  ├─MachineDeployment/wc-md-1-j4m55 True 26m
  │ └─Machine/wc-md-1-j4m55-589cfcd9d6-jxmvc True 26m
  └─MachineDeployment/wc-md-2-sd4ww True 26m
    └─Machine/wc-md-2-sd4ww-7b7db5dcbb-crwdv True 26m
For each control-plane node listed by kubectl that doesn’t have a ControlPlane > Machine listing from tanzu cluster get:
Delete the node:
kubectl delete node ${node_name}