Overview of Backing Up and Restoring TKGI

This topic describes the back up and restore process for VMware Tanzu Kubernetes Grid Integrated Edition (TKGI), and provides high-level considerations for implementing your back up and restore strategy for TKGI.

Layers and Tools

TKGI back up and restore comprises various layers and tools. The table summarizes these layers and tools from the top-down, that is, workload to infrastructure. Refer to the individual sections for details on performing back up and restore for each layer, and additional details on test scenarios.

Layer	Tools	Comments
Backup and Restore Kubernetes Workloads	Velero	Load balancer and ingress services depend on NSX backup.
Backup and Restore Kubernetes Clusters	BOSH Backup and Restore (BBR)	Use BBR to back up and restore Kubernetes clusters provisioned by TKGI, including the control plane nodes, etcd database, and worker nodes.
Backup and Restore TKGI Components	Ops Manager, BBR	Use VMware Tanzu Operations Manager (Ops Manager) to back up and restore the BOSH Director and TKGI tile configurations. Use BBR to backup and restore the TKGI Management Plane virtual machines, including BOSH Director, TKGI Control Plane, and TKGI DB.
Backup and Restore TKGI Infrastructure	NSX Manager, vCenter Server	Use the NSX Manager UI or CLI to backup and restore the NSX Manager DB. Use vCenter Server to backup and restore vCenter objects.

Considerations

When planning your backup and restore strategy and implementation for TKGI, keep in mind the following considerations.

Develop a Back Up and Restore Plan, and Test It Often

This documentation provides guidelines for implementing a robust back up and restore practice for TKGI. You must develop a plan of your own on how you intend to implement these tools and guidelines for your TKGI foundation. In addition, you must test the back up and restore of each layer of your TKGI foundation to ensure that the components and workloads are properly backed up and restored.

Back Up Frequently and Regularly

You can only restore what is backed up. Back up key resources, such as stateful applications, cluster configurations, NSX objects on a frequent and regular schedule, such as once every 24 hours.

Back Up Critical Components

Make sure you promptly back up critical components, including:

When you deploy an application, take a backup of the application using Velero.
When you provision a Kubernetes cluster using TKGI, take a backup of the cluster using BBR and networking objects using NSX.
When you deploy, upgrade, or update TKGI components, take a backup of Ops Manager and the TKGI Management Plane using BBR.
When you create one or more of the following NSX objects, take a backup using NSX.
- Load balancer
- Namespace
- Logical switch (created when a TKGI cluster is provisioned)
- Ingress
- Network Policy

Back Up and Restore One Kubernetes Namespace at a Time

For optimal performance and assurance, only back up one Kubernetes namespace at a time using Velero. Likewise, restore only one Kubernetes namespace at a time using Velero.

Restore What Breaks

The general approach is to restore what breaks. For example, if NSX crashes, you only need to restore NSX. If Ops Manager breaks, restore Ops Manager.

The exception is a Kubernetes cluster. If the cluster breaks, you need to restore the cluster using BBR and the applications using Velero. On the NSX side, you need to create a new namespace for the restored cluster. Once the cluster is restored, use kubectl to delete the old namespace. This will force the creation of the NSX objects in the namespace. Refer to the scenarios for the TKGI cluster back up and restore for more details.

Understand What Is Restored at Each Layer

Because there are several layers and tools involved in the back up and restore of TKGI, it can be confusing what is being backed up and restored within each layer.

Velero is used for backing up and restoring Kubernetes workloads. Use Velero restore if something breaks in a workload application, and you need to restore it.

BBR is used to back up and restore Kubernetes clusters provisioned by TKGI. This includes the cluster nodes and etcd database. BBR might also restore stateless applications, depending on when the backup of the cluster was taken, but do not rely on it for such purposes. Use Velero for workload back up and restore.

For BBR to work restoring a cluster, the NSX objects need to be in the NSX database. BBR will only work if the objects are in NSX. BBR recreates the VM, not the logical switch. You need the logical switch to be present for each Kubernetes cluster.

If NSX crashes, you only need to restore NSX. Applications will continue to run, but you can’t add anything.