This topic explains how to back up and restore cluster infrastructure for Tanzu Kubernetes Grid (TKG) with a standalone management cluster on vSphere by:
- Using Velero to back up and restore the workload cluster objects that the standalone management cluster manages, and
- Re-creating the standalone management cluster from its configuration file.
Note
- VMware does not support using Velero to back up TKG standalone management clusters.
- If a standalone management cluster is reconfigured after it is deployed, re-creating it as described here may not recover all of its resources.
To back up and restore the workloads and dynamic storage volumes hosted on Tanzu Kubernetes Grid (TKG) workload clusters with a standalone management cluster, see Back Up and Restore Cluster Workloads.
To back up and restore vSphere with Tanzu clusters, including Supervisor Clusters and the workload clusters that they create, see Backing Up and Restoring vSphere with Tanzu in the VMware vSphere 8.0 Documentation.
Caution
This feature is in the unsupported Technical Preview state; see TKG Feature States.
Pinniped authentication to workload clusters does not work after their management cluster has been re-created.
You can use Velero, an open source community standard tool, to back up and restore TKG standalone management cluster infrastructure and workloads.
Velero supports a variety of storage providers to store its backups.
Velero also supports:
A Tanzu Kubernetes Grid subscription includes support for VMware’s tested, compatible distribution of Velero available from the Tanzu Kubernetes Grid downloads page.
To back up and restore TKG clusters, you need:
- The Velero CLI, installed on your local workstation as described below
- A storage provider with locations to save the backups to
Caution: If you have already installed Velero CLI v1.8.1 or earlier, as distributed with prior versions of TKG, you need to upgrade to v1.9.5. Older Velero versions do not work with the CRDs used in v1.9 and later.
To install the Velero CLI v1.9.5, do the following:
1. Go to the Tanzu Kubernetes Grid downloads page and download the Velero CLI .gz file for your workstation OS. Its filename starts with velero-linux-, velero-mac-, or velero-windows64-.
2. Use the gunzip command or the extraction tool of your choice to unpack the binary:
   gzip -d <RELEASE-TARBALL-NAME>.gz
3. Rename the CLI binary for your platform to velero, make sure that it is executable, and add it to your PATH.
   - macOS and Linux platforms: Move the binary into the /usr/local/bin folder, rename it to velero, and make it executable:
     chmod +x /usr/local/bin/velero
   - Windows platforms: Create a new Program Files\velero folder and copy the binary into it. Rename the binary to velero.exe. Right-click the velero folder, select Properties > Security, and make sure that your user account has the Full Control permission. Then use Windows Search to search for env, open the system environment variables, select the Path row under System variables, click Edit, and add an entry for the path to the velero binary.

To back up Tanzu Kubernetes Grid workload cluster contents, you need storage locations for:
- Cluster object storage backups
- Volume snapshots
See Backup Storage Locations and Volume Snapshot Locations in the Velero documentation. Velero supports a variety of storage providers, which can be either:
- An online cloud storage provider
- An on-premises object storage service such as MinIO, for internet-restricted environments
VMware recommends dedicating a unique storage bucket to each cluster.
To set up MinIO:
Run the minio container image with MinIO credentials and a storage location, for example:
$ docker run -d --name minio --rm -p 9000:9000 -e "MINIO_ACCESS_KEY=minio" -e "MINIO_SECRET_KEY=minio123" -e "MINIO_DEFAULT_BUCKETS=mgmt" gcr.io/velero-gcp/bitnami/minio:2021.6.17-debian-10-r7
Save the credentials to a local file to pass to the --secret-file option of velero install, for example:
[default]
aws_access_key_id=minio
aws_secret_access_key=minio123
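Optionally, as a quick sanity check, you can confirm that the mgmt bucket from the example above is reachable. This is a sketch that assumes the AWS CLI is installed on your workstation and that MinIO is listening on localhost:9000; the region value is arbitrary for MinIO:

export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=minio123
export AWS_DEFAULT_REGION=us-east-1
# List buckets on the local MinIO endpoint; the mgmt bucket should appear
aws --endpoint-url http://localhost:9000 s3 ls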
On vSphere, cluster object storage backups and volume snapshots save to the same storage location. This location must be S3-compatible external storage on Amazon Web Services (AWS), or an S3 provider such as MinIO.
To set up storage for Velero on vSphere, see Velero Plugin for vSphere in Vanilla Kubernetes Cluster for the v1.4.2 plugin.
To set up storage for Velero on AWS, follow the procedures in the Velero Plugins for AWS repository:
Set up S3 storage as needed for each plugin. The object store plugin stores and retrieves cluster object backups, and the volume snapshotter stores and retrieves data volumes.
To set up storage for Velero on Azure, follow the procedures in the Velero Plugins for Azure repository:
Set up storage as needed for each plugin. The object store plugin stores and retrieves cluster object backups, and the volume snapshotter stores and retrieves data volumes.
To back up workload cluster objects, install the Velero v1.9.5 server on the standalone management cluster and verify the installation.
To install Velero, run velero install with the following options:
- --provider $PROVIDER: For example, aws
- --plugins projects.registry.vmware.com/tkg/velero/velero-plugin-for-aws:v1.5.3_vmware.1
- --bucket $BUCKET: The name of your S3 bucket
- --backup-location-config region=$REGION: The AWS region the bucket is in
- --snapshot-location-config region=$REGION: The AWS region the bucket is in
- --kubeconfig: Use this option to install the Velero server to a cluster other than the current default.
- (Optional) --secret-file ./VELERO-CREDS: One way to give Velero access to an S3 bucket is to pass to this option a local VELERO-CREDS file that looks like:
[default]
aws_access_key_id=<AWS_ACCESS_KEY_ID>
aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>
For additional options, see Install and start Velero.
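For example, a complete invocation for an AWS S3 bucket might look like the following. This is a sketch; the bucket name, region, and credentials file path are placeholders for your own values:

velero install \
    --provider aws \
    --plugins projects.registry.vmware.com/tkg/velero/velero-plugin-for-aws:v1.5.3_vmware.1 \
    --bucket tkg-backups \
    --backup-location-config region=us-west-2 \
    --snapshot-location-config region=us-west-2 \
    --secret-file ./velero-creds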
Running the velero install command creates a namespace called velero on the cluster, and places a deployment named velero in it.
The following settings are required:
--plugins projects.registry.vmware.com/tkg/velero/velero-mgmt-cluster-plugin:v0.1.0_vmware.1
Note: You can add multiple options separated by a comma. For example:
--plugins projects.registry.vmware.com/tkg/velero/velero-plugin-for-aws:v1.5.3_vmware.1,projects.registry.vmware.com/tkg/velero/velero-mgmt-cluster-plugin:v0.1.0_vmware.1
--snapshot-location-config
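As a reference, here is a minimal sketch of installing Velero on the management cluster with both plugins, backed by the local MinIO example above. The bucket name, MinIO endpoint, credentials file, and kubeconfig path are placeholders; adjust the --snapshot-location-config setting to match your storage setup:

velero install \
    --provider aws \
    --plugins projects.registry.vmware.com/tkg/velero/velero-plugin-for-aws:v1.5.3_vmware.1,projects.registry.vmware.com/tkg/velero/velero-mgmt-cluster-plugin:v0.1.0_vmware.1 \
    --bucket mgmt \
    --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://<MINIO-HOST>:9000 \
    --secret-file ./velero-creds \
    --kubeconfig ./mgmt-cluster-kubeconfig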
After the velero install command completes, verify that Velero installed successfully:
Verify that the Velero pod has status Running:
kubectl -n velero get pod
NAME READY STATUS RESTARTS AGE
velero-78fdbcd446-v5cqr 1/1 Running 0 3h41m
Verify that the backup location is in the Available phase:
velero backup-location get
NAME PROVIDER BUCKET/PREFIX PHASE LAST VALIDATED ACCESS MODE DEFAULT
default aws mgmt Available 2022-11-11 05:55:55 +0000 UTC ReadWrite true
To back up all the workload cluster objects managed by a standalone management cluster, run:
velero backup create my-backup --exclude-namespaces tkg-system --include-resources cluster.cluster.x-k8s.io --wait
Notes:
- --exclude-namespaces tkg-system excludes the management cluster itself.
- --include-resources cluster.cluster.x-k8s.io includes the workload cluster objects.
VMware recommends backing up workload clusters immediately after making any structural changes, such as scaling up or down. This avoids a mismatch between backup objects and physical infrastructure that can make the restore process fail.
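To review the status and contents of the backup created above, you can describe it by name; my-backup matches the example command:

velero backup describe my-backup --details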
When cluster objects are changed after the most recent backup, the state of the system after a restore does not match its desired, most recent state. This problem is called “drift”. See the Handling Drift section below for how to recover from some common types of drift.
To mitigate drift, VMware recommends using Velero to schedule regular backups. For example, to back up all workload clusters daily and retain each backup for 14 days:
velero create schedule daily-bak --schedule="@every 24h" --exclude-namespaces tkg-system --include-resources cluster.cluster.x-k8s.io --ttl 336h0m0s
For more Velero scheduling options, see Schedule a Backup in the Velero documentation.
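To confirm that the schedule was created and see when it last ran, list the configured schedules:

velero schedule get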
To restore a standalone management cluster and the workload cluster objects that it manages:
Re-create the management cluster from its configuration file, mgmt-cluster-config.yaml
here, as described in Deploy Management Clusters from a Configuration File.
Note: Any configuration changes applied to the management cluster after it was deployed must be reflected in the configuration file or environment variables, or they will not restore.
Immediately after the management cluster is created, there should be only one TKR:
tanzu kubernetes-release get
NAME VERSION COMPATIBLE ACTIVE UPDATES AVAILABLE
v1.24.10---vmware.1-tkg.1 v1.24.10+vmware.1-tkg.1 True True
Wait a few minutes until all of the TKRs used by backed-up workload clusters become available:
tanzu kubernetes-release get
NAME VERSION COMPATIBLE ACTIVE UPDATES AVAILABLE
v1.22.17---vmware.2-tkg.2 v1.22.17+vmware.2-tkg.2 True True
v1.23.16---vmware.1-tkg.1 v1.23.16+vmware.1-tkg.1 True True
v1.24.10---vmware.1-tkg.1 v1.24.10+vmware.1-tkg.1 True True
Install Velero on the management cluster, following the Deploy Velero Server to Clusters instructions above. Make sure that the credentials and backup location configuration settings have the same values as when the backup was made.
After Velero installs, run velero backup get until the backups are synchronized and the command lists the backup that you want to use:
velero backup get
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
my-backup Completed 0 0 2022-12-07 17:10:42 +0000 UTC 24d default <none>
Run velero restore create to restore the workload cluster resources. VMware recommends using the most recent backup:
velero restore create my-restore --from-backup my-backup --wait
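If the restore reports warnings or does not complete, you can inspect its progress and any errors; my-restore matches the example above:

velero restore describe my-restore --details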
After the restore completes, the clusters are in the createdStalled status:
tanzu cluster list
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES ROLES PLAN TKR
tkg-vc-antrea default createdStalled 0/3 0/3 v1.24.10+vmware.1 <none> prod v1.24.10---vmware.1-tkg.1
Patch the cluster objects to set their paused property to false. This is required because cluster objects are re-created in a paused state on the new management cluster, to prevent their controllers from trying to reconcile:
To unpause a cluster after it is restored, run:
kubectl -n my-namespace patch cluster CLUSTER-NAME --type merge -p '{"spec":{"paused":false}}'
To unpause all clusters in multiple namespaces, run a script like the following:
#!/bin/bash
# Unpause every Cluster object in all namespaces except tkg-system,
# which holds the management cluster itself.
for ns in $(kubectl get ns -o custom-columns=":metadata.name" | grep -v "tkg-system");
do
  clusters=$(kubectl -n $ns get cluster -o name)
  if [[ -n $clusters ]]; then
    kubectl -n $ns patch $clusters --type merge -p '{"spec":{"paused":false}}'
  fi
done
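Optionally, confirm that a given cluster is no longer paused; the namespace and cluster name here are placeholders matching the patch example above. The command should print false:

kubectl -n my-namespace get cluster CLUSTER-NAME -o jsonpath='{.spec.paused}'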
Verify that all workload clusters are in the running state, for example:
tanzu cluster list
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES ROLES PLAN TKR
tkg-vc-antrea default running 3/3 3/3 v1.24.10+vmware.1 <none> prod v1.24.10---vmware.1-tkg.1
For each workload cluster, run tanzu cluster get CLUSTER-NAME to check that all components are in the running state, for example:
tanzu cluster get tkg-vc-antrea
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES ROLES TKR
tkg-vc-antrea default running 3/3 3/3 v1.24.10+vmware.1 <none> v1.24.10---vmware.1-tkg.1
Details:
NAME READY SEVERITY REASON SINCE MESSAGE
/tkg-vc-antrea True 4h14m
├─ClusterInfrastructure - VSphereCluster/tkg-vc-antrea-s6kl5 True 4h36m
├─ControlPlane - KubeadmControlPlane/tkg-vc-antrea-ch5hn True 4h14m
│ ├─Machine/tkg-vc-antrea-ch5hn-8gfvt True 4h14m
│ ├─Machine/tkg-vc-antrea-ch5hn-vdcrp True 4h23m
│ └─Machine/tkg-vc-antrea-ch5hn-x7nmm True 4h32m
└─Workers
├─MachineDeployment/tkg-vc-antrea-md-0-8b8zn True 4h23m
│ └─Machine/tkg-vc-antrea-md-0-8b8zn-798d5b8897-bnxn9 True 4h24m
├─MachineDeployment/tkg-vc-antrea-md-1-m6dvh True 4h24m
│ └─Machine/tkg-vc-antrea-md-1-m6dvh-79fb858b96-p9667 True 4h28m
└─MachineDeployment/tkg-vc-antrea-md-2-brm2m True 4h21m
└─Machine/tkg-vc-antrea-md-2-brm2m-6478cffc5f-tq5cn True 4h23m
After all workload clusters are running, you can manage the workload clusters with the Tanzu CLI.
Drift cases can be complicated, but a few common patterns and mitigations include:
Stale worker nodes:
Ghost worker node infrastructure:
Mitigation:
Retrieve the workload cluster kubeconfig and set it as the kubectl context.
Compare the output of the following kubectl and tanzu commands:
# Get the actual worker nodes of the workload cluster
$ kubectl --context tkg-vc-antrea-admin@tkg-vc-antrea get node
NAME STATUS ROLES AGE VERSION
tkg-vc-antrea-md-0-p9vn5-645498f59f-42qh9 Ready <none> 44m v1.24.10+vmware.1
tkg-vc-antrea-md-0-p9vn5-645498f59f-shrpt Ready <none> 114m v1.24.10+vmware.1
tkg-vc-antrea-wdsfx-2hkxp Ready control-plane 116m v1.24.10+vmware.1
# Get the worker nodes managed by TKG
$ tanzu cluster get tkg-vc-antrea
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES ROLES TKR
tkg-vc-antrea default running 1/1 1/1 v1.24.10+vmware.1 <none> v1.24.10---vmware.1-tkg.1-zshippable
Details:
NAME READY SEVERITY REASON SINCE MESSAGE
/tkg-vc-antrea True 13m
├─ClusterInfrastructure - VSphereCluster/tkg-vc-antrea-b7fr9 True 13m
├─ControlPlane - KubeadmControlPlane/tkg-vc-antrea-wdsfx True 13m
│ └─Machine/tkg-vc-antrea-wdsfx-2hkxp True 13m
└─Workers
└─MachineDeployment/tkg-vc-antrea-md-0-p9vn5 True 13m
└─Machine/tkg-vc-antrea-md-0-p9vn5-645498f59f-shrpt True 13m
For each worker node listed by kubectl that doesn't have a Workers > Machine listing from tanzu cluster get:
Scale up the workers to the expected value, for example:
tanzu cluster scale ${cluster_name} --worker-machine-count 2
Use the workload cluster kubeconfig to drain the ghost node, which moves its workloads to nodes managed by TKG:
kubectl drain ${node_name} --delete-emptydir-data --ignore-daemonsets
Remove the ghost node from the cluster:
kubectl delete node ${node_name}
Log in to vSphere or other infrastructure and manually remove the VM.
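For example, if you manage the vSphere environment with the govc CLI (an assumption; you can also delete the VM in the vSphere Client), the removal might look like the following, where the VM name is a placeholder:

# Power off and delete the orphaned node VM
govc vm.power -off ${vm_name}
govc vm.destroy ${vm_name}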
Stale nodes and ghost infrastructure on control plane
Mitigation:
Retrieve the workload cluster kubeconfig and set it as the kubectl context.
Compare the output of the following kubectl and tanzu commands:
# Get the actual control plane nodes of the workload cluster
$ kubectl --context wc-admin@wc get node
NAME STATUS ROLES AGE VERSION
wc-2cjn4-4xbf8 Ready control-plane 107s v1.24.10+vmware.1
wc-2cjn4-4zljs Ready control-plane 26h v1.24.10+vmware.1
wc-2cjn4-59v95 Ready control-plane 26h v1.24.10+vmware.1
wc-2cjn4-ncgxb Ready control-plane 25h v1.24.10+vmware.1
wc-md-0-nl928-5df8b9bfbd-nww2w Ready <none> 26h v1.24.10+vmware.1
wc-md-1-j4m55-589cfcd9d6-jxmvc Ready <none> 26h v1.24.10+vmware.1
wc-md-2-sd4ww-7b7db5dcbb-crwdv Ready <none> 26h v1.24.10+vmware.1
# Get the control plane nodes managed by TKG
$ tanzu cluster get wc
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES ROLES TKR
wc default updating 4/3 3/3 v1.24.10+vmware.1 <none> v1.24.10---vmware.1-tkg.1-zshippable
Details:
NAME READY SEVERITY REASON SINCE MESSAGE
/wc True 24m
├─ClusterInfrastructure - VSphereCluster/wc-9nq7v True 26m
├─ControlPlane - KubeadmControlPlane/wc-2cjn4 True 24m
│ ├─Machine/wc-2cjn4-4xbf8 True 24m
│ ├─Machine/wc-2cjn4-4zljs True 26m
│ └─Machine/wc-2cjn4-59v95 True 26m
└─Workers
├─MachineDeployment/wc-md-0-nl928 True 26m
│ └─Machine/wc-md-0-nl928-5df8b9bfbd-nww2w True 26m
├─MachineDeployment/wc-md-1-j4m55 True 26m
│ └─Machine/wc-md-1-j4m55-589cfcd9d6-jxmvc True 26m
└─MachineDeployment/wc-md-2-sd4ww True 26m
└─Machine/wc-md-2-sd4ww-7b7db5dcbb-crwdv True 26m
For each control-plane node listed by kubectl that doesn't have a ControlPlane > Machine listing from tanzu cluster get:
Delete the node:
kubectl delete node ${node_name}