Overview of Backing Up and Restoring TKGI

This topic provides an overview of the backup and restore process for TKGI, and provides high-level considerations for implementing your backup and restore strategy for TKGI.

Layers and Tools

TKGI backup and restore comprises various layers and tools. The table summarizes these layers and tools from the top-down, that is, workload to infrastructure. Refer to the individual sections for details on performing backup and restore for that layer, as well as additional details on test scenarios.

Layer	Tool(s)	Comments
Backup and Restore Kubernetes Workloads	Velero, Restic	Use Velero for stateless applications and Velero with Restic for stateful applications. Load balancer and ingress services depend on NSX-T backup.
Backup and Restore Kubernetes Clusters	BOSH Backup and Restore (BBR)	Use BBR to back up and restore Kubernetes clusters provisioned by TKGI, including the control plane nodes, etcd database, and worker nodes.
Backup and Restore TKGI Components	Ops Manager, BBR	Use Ops Manager to back up and restore the BOSH Director and TKGI tile configurations. Use BBR to backup and restore the TKGI Management Plane virtual machines, including BOSH Director, TKGI Control Plane, and TKGI DB.
Backup and Restore TKGI Infrastructure	NSX-T Manager, vCenter Server	Use the NSX Manager UI or CLI to backup and restore the NSX Manager DB. Use vCenter Server to backup and restore vCenter objects.

Considerations

When planning your backup and restore strategy and implementation for TKGI, keep in mind the following considerations.

Develop a Backup and Restore Plan, and Test It Often

This documentation provides guidelines for implementing a robust backup and restore practice for TKGI. You must develop a plan of your own on how you intend to implement these tools and guidelines for your TKGI foundation. In addition, you must test the backup and restore of each layer of your TKGI foundation to ensure that the components and workloads are properly backed up and restored.

Back Up Frequently and Regularly

You can only restore what is backed up. Key resources, such as stateful applications, cluster configurations, NSX-T objects, should be backed up on a frequent and regular schedule, such as once every 24 hours.

Back Up Critical Components

Make sure you promptly back up critical components, including:

When you deploy an application, take a backup of the application using Velero and Restic.
When you provision a Kubernetes cluster using TKGI, take a backup of the cluster using BBR and networking objects using NSX-T.
When you deploy, upgrade, or update TKGI components, take a backup of Ops Manager and the TKGI Management Plane using BBR.
When you create one or more of the following NSX-T objects, take a backup using NSX-T.
- Load balancer
- Namespace
- Logical switch (created when a TKGI cluster is provisioned)
- Ingress
- Network Policy

Backup and Restore One Kubernetes Namespace at a Time

For optimal performance and assurance, only back up one Kubernetes namespace at a time using Velero. Likewise, you should only restore one Kubernetes namespace at a time using Velero.

Restore What Breaks

The general approach is to restore what breaks. For example, if NSX-T crashes, you only need to restore NSX-T. If Ops Manager breaks, restore Ops Manager.

The exception is a Kubernetes cluster. If the cluster breaks, you need to restore the cluster using BBR and the applications using Velero. On the NSX-T side, you need to create a new namespace for the restored cluster. Once the cluster is restored, use kubectl to delete the old namespace. This will force the creation of the NSX-T objects in the namespace. Refer to the scenarios for the TKGI cluster backup and restore for more details.

Understand What Is Restored at Each Layer

Because there are several layers and tools involved in the backup and restore of TKGI, it can be confusing what is being backed up and restored within each layer.

Velero is used for backing up and restoring Kubernetes workloads. Restic is used with Velero for backup and restore of stateful workloads. Use Velero restore if something breaks in a workload application, and you need to restore it.

BBR is used to backup and restore Kubernetes clusters provisioned by TKGI. This includes the cluster nodes and etcd database. BBR may also restore stateless applications, depending on when the backup of the cluster was taken, but you shouldn’t rely on it for such purposes. Use Velero for workload backup and restore.

For BBR to work restoring a cluster, the NSX-T objects need to be in the NSX-T database. BBR will only work if the objects are in NSX-T. BBR recreates the VM, not the logical switch. You need the logical switch to be present for each Kubernetes cluster.

If NSX-T crashes, you only need to restore NSX-T. Applications will continue to run, but you can’t add anything.