Disaster recovery

There is a range of approaches for ensuring that you can recover your deployment, apps, and data in case of a disaster.

The approaches fall into two categories:

Using data from a backup to restore the data in the deployment. For more information, see Back up and restore using BOSH Backup and Restore (BBR).
Re-creating the data in the deployment by automating the creation of state. For more information, see Disaster Recovery by re-creating the deployment.

Back up and restore using BOSH Backup and Restore (BBR)

BOSH Backup and Restore (BBR) is a CLI for orchestrating backing up and restoring BOSH deployments and BOSH Directors. BBR triggers the backup or restore process on the deployment or BOSH Director, and transfers the backup artifact to and from the deployment or BOSH Director.

Use BBR to reliably create backups of core components of your deployments and their data. These core components include CredHub, UAA, BOSH Director, and VMware Tanzu Application Service for VMs (TAS for VMs).

Each component includes its own backup scripts. This decentralized structure helps keep scripts synchronized with the components. At the same time, locking features ensure data integrity and consistent, distributed backups across your deployment.

For more information about the BBR framework, see BOSH Backup and Restore in the open-source Cloud Foundry documentation.

Backing up your deployment

Backing up your deployment requires backing up the following components:

Tanzu Operations Manager settings
BOSH Director, including CredHub and UAA
TAS for VMs
Data services

For more information, see Backing up Deployments with BBR. With these backup artifacts, operators can re-create their deployment exactly as it was when the backup was taken.

Restoring your deployment

The restore process involves creating a new deployment starting with the Tanzu Operations Manager VM. For more information, see Restoring Deployments from Backup with BBR.

The time required to restore the data is proportionate to the size of the data because the restore process includes copying data. For example, restoring a 1 TB blobstore takes 1,000 times as long as restoring a 1 GB blobstore.
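As a rough illustration of this proportionality, the following Python sketch scales a measured restore time linearly with data size. The function name and the 60-second reference measurement are hypothetical, and real restores also include fixed overhead that a linear model ignores.

```python
def estimate_restore_seconds(size_gb: float,
                             reference_gb: float,
                             reference_seconds: float) -> float:
    """Scale a measured restore time linearly with the size of the data.

    Assumes restore time is dominated by copying data, so it grows in
    proportion to size. Fixed overhead is not modeled.
    """
    if reference_gb <= 0:
        raise ValueError("reference_gb must be positive")
    return reference_seconds * (size_gb / reference_gb)


if __name__ == "__main__":
    # Hypothetical measurement: a 1 GB blobstore restored in 60 seconds.
    one_tb_estimate = estimate_restore_seconds(1000, 1, 60)
    print(f"Estimated restore time for 1 TB: {one_tb_estimate / 3600:.1f} hours")
```

Measuring a small restore in your own environment and scaling it this way gives a first estimate for planning, not a guarantee.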

Benefits

Unlike other backup solutions, using BBR to back up your deployment enables:

Completeness: BBR supports backing up BOSH, including CredHub, UAA, and service instances created with an on-demand service broker.
Consistency: BBR provides referential integrity between the database and the blobstore because a lock is held while both the database and blobstore are backed up.
Correctness: Using the BBR restore flow addresses container-to-container networking and routing issues that can occur during restore.

API downtime during backups

Apps are not affected during backups, but certain APIs are unavailable. The downtime occurs only while the backup is being taken, not while the backup is being copied to the jumpbox.

In a consistent backup, the blobs in the blobstore match the blobs in the Cloud Controller database. To ensure a consistent backup, changes to the data are prevented during the backup. This means that the Cloud Foundry API (CAPI), Routing API, Usage service, Autoscaler, Notification Service, Network Policy Server, and CredHub are unavailable while the backup is being taken. UAA is in read-only mode during the backup.
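One way to picture how writes are prevented is a lock that wraps the backup step. The following Python sketch is a conceptual model only, not BBR code: the `api_locked` and `run_backup` names are hypothetical, and releasing the lock in a `finally` clause is a design choice of this sketch.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def api_locked(log: List[str]) -> Iterator[None]:
    """Model the lock and unlock phases around a backup."""
    log.append("lock")        # writes are blocked from here on
    try:
        yield
    finally:
        log.append("unlock")  # always release, even if the backup fails


def run_backup(backup_step: Callable[[], None], log: List[str]) -> None:
    """Run the backup step while the (modeled) APIs are locked."""
    with api_locked(log):
        log.append("backup")
        backup_step()


if __name__ == "__main__":
    phases: List[str] = []
    run_backup(lambda: None, phases)
    print(phases)
```

Because no writes can occur between the lock and unlock steps in this model, the database and blobstore are captured at the same logical point in time.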

Backup timings

The first three phases of the backup are lock, backup, and unlock. During this time, the API is unavailable. The drain and checksum phase starts after the backup scripts finish. BBR downloads the backup artifacts from the instances to the BBR VM, and performs a checksum to ensure the artifacts are not corrupted. The size of the blobstore significantly influences backup time.

The following table provides an indication of the downtime that you can expect. Actual downtime varies based on hardware and configuration. These example timings were recorded with TAS for VMs deployed on Google Cloud Platform (GCP) with all components scaled to one and only one app pushed.

API state	Backup phase	Duration for external versioned S3-compatible blobstore	Duration for external unversioned S3-compatible blobstore	Duration for internal blobstore
API unavailable	lock	15 seconds	15 seconds	15 seconds
API unavailable	backup	<30 seconds	Proportional to blobstore size	10 seconds
API unavailable	unlock	3 minutes	3 minutes	3 minutes
API available	drain and checksum	<10 seconds	<10 seconds	Proportional to blobstore size
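
To turn the example timings into an expected API downtime, add up the phases during which the API is unavailable. The following Python sketch does this for the two blobstore types with fixed example durations. It treats "<30 seconds" as an upper bound of 30 seconds, and the dictionary and function names are hypothetical.

```python
# Example durations in seconds, taken from the table above.
# "<30 seconds" is treated as an upper bound of 30.
PHASE_SECONDS = {
    "external versioned S3-compatible": {"lock": 15, "backup": 30, "unlock": 180},
    "internal": {"lock": 15, "backup": 10, "unlock": 180},
}


def api_downtime_seconds(blobstore: str) -> int:
    """Sum the lock, backup, and unlock phases for a blobstore type."""
    return sum(PHASE_SECONDS[blobstore].values())


if __name__ == "__main__":
    for name in PHASE_SECONDS:
        print(f"{name}: up to {api_downtime_seconds(name)} seconds of API downtime")
```

The external unversioned blobstore is omitted because its backup phase scales with blobstore size and has no fixed example duration.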

Blobstore backup and restore

Blobstores can be very large. To minimize downtime, BBR takes only blob metadata during the backup. For example, in the case of internal blobstores such as WebDAV and NFS, BBR takes a list of hard links that point to the blobs. After the API becomes available, BBR makes copies of the blobs.
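
The hard-link idea can be illustrated with a short Python sketch. This is not BBR's implementation: the function names and directory layout are hypothetical, and it assumes the blobs and the snapshot directory are on the same filesystem, which hard links require.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import List


def snapshot_with_hard_links(blob_dir: Path, snapshot_dir: Path) -> List[Path]:
    """Record the current blobs as hard links. No blob data is copied."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for blob in sorted(p for p in blob_dir.iterdir() if p.is_file()):
        link = snapshot_dir / blob.name
        os.link(blob, link)  # fast: creates a second name for the same data
        links.append(link)
    return links


def copy_snapshot(links: List[Path], artifact_dir: Path) -> None:
    """Copy the snapshotted blobs. This is the slow step, done after unlock."""
    artifact_dir.mkdir(parents=True, exist_ok=True)
    for link in links:
        shutil.copy2(link, artifact_dir / link.name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "blobs").mkdir()
        (root / "blobs" / "droplet-1").write_bytes(b"app bits")
        links = snapshot_with_hard_links(root / "blobs", root / "snapshot")
        # Simulate a change after the API is available again.
        (root / "blobs" / "droplet-1").unlink()
        copy_snapshot(links, root / "artifact")
        print((root / "artifact" / "droplet-1").read_bytes())
```

Because a hard link keeps the blob data reachable even if the original file name is later deleted, the slow copy can run after the API is unlocked without losing the state captured during the lock.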

Unsupported products

Data services: The following data services do not support BBR. Operators of these services can use the automatic backups feature of each tile, available within Tanzu Operations Manager.
- VMware Tanzu SQL [MySQL]
- VMware Tanzu GemFire
- RabbitMQ for VMware Tanzu [VMs]
- VMware Tanzu Valkey on Cloud Foundry (formerly Redis for VMware Tanzu)
External blobstores and databases: BBR support for backing up and restoring external databases and blobstores varies across Tanzu Operations Manager versions. For more information, see Supported Components and External Storage Support Across Tanzu Operations Manager Versions in Backing up Deployments with BBR.

Best practices

This section describes best practices for backing up your deployment.

Frequency of backups

VMware recommends that you take backups in proportion to the rate of change of the data in your deployment to minimize the number of changes lost if a restore is required. VMware recommends starting with backing up every 24 hours. If app developers make frequent changes, you can increase the frequency of backups.

Retention of backup artifacts

You must retain backup artifacts based on the timeframe you need to be able to restore to. For example, if backups are taken every 24 hours and the deployment must be able to be restored to three days prior, three sets of backup artifacts must be retained.
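
The retention arithmetic in this example can be written as a small Python helper. The function name is hypothetical, and rounding up when the window is not a whole multiple of the interval is an assumption of this sketch.

```python
import math


def artifact_sets_to_retain(restore_window_hours: float,
                            backup_interval_hours: float) -> int:
    """Return how many sets of backup artifacts cover the restore window."""
    if backup_interval_hours <= 0:
        raise ValueError("backup_interval_hours must be positive")
    # Round up so a partial interval still keeps one more set,
    # and always keep at least one set.
    return max(1, math.ceil(restore_window_hours / backup_interval_hours))


if __name__ == "__main__":
    # Backups every 24 hours, restorable to three days (72 hours) prior.
    print(artifact_sets_to_retain(72, 24))
```

With the same three-day window, backing up every 12 hours would double the number of sets to retain.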

Artifacts can be stored in two data centers other than the data center that hosts the deployment. When deciding the restore timeframe, you can take into account other factors, such as compliance and auditability.

Security

VMware recommends that you encrypt artifacts and store them securely.

Disaster Recovery by re-creating the deployment

An alternative strategy for recovering your deployment after a disaster is to have automation in place so that all the data can be re-created. This requires that every modification to settings and state is automated, typically through the use of a pipeline.

Recovery steps include creating a new deployment, re-creating orgs, spaces, users, services, service bindings and other state, and re-pushing apps.
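
As a minimal sketch of what such automation might look like, the following Python function turns a declarative description of orgs and spaces into cf CLI commands. The `desired_state` structure and the function name are hypothetical. A real pipeline would also cover users, services, bindings, routes, and app pushes.

```python
from typing import Dict, List


def cf_commands(desired_state: Dict[str, List[str]]) -> List[List[str]]:
    """Build cf CLI commands that re-create orgs and their spaces.

    desired_state maps each org name to the list of its space names.
    """
    commands: List[List[str]] = []
    for org, spaces in desired_state.items():
        commands.append(["cf", "create-org", org])
        for space in spaces:
            commands.append(["cf", "create-space", space, "-o", org])
    return commands


if __name__ == "__main__":
    # Hypothetical state, normally read from version control.
    state = {"finance": ["dev", "prod"]}
    for command in cf_commands(state):
        print(" ".join(command))
```

Keeping the desired state in version control and generating commands from it means the same input can be replayed against a newly created deployment.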

For more information about this approach, see the Cloud Foundry Summit presentation Multi-DC Cloud Foundry: What, Why and How? on YouTube.

Disaster Recovery for different topologies

This section describes disaster recovery strategies for different deployment topologies.

Active-Active

To prevent app downtime, some customers run an active-active deployment, where they run two or more identical deployments in different data centers. If one deployment becomes unavailable, traffic is seamlessly routed to the other deployment. To achieve identical deployments, all operations are automated so they can be applied to both deployments in parallel.

Because all operations have been automated, the automation approach to disaster recovery is a viable option for an active-active deployment. Disaster recovery requires re-creating the deployment, then running all the automation to re-create state.

This option requires discipline to automate all changes to the deployment. Some of the operations that need to be automated are:

App push, restage, scale
Org, space, and user create, read, update, and delete (CRUD)
Service instance CRUD
Service bindings CRUD
Routes CRUD
Security groups CRUD
Quota CRUD

Human-initiated changes always make their way into the system. These changes can include quotas being raised, new settings being enabled, and incident responses. For this reason, VMware recommends taking backups even when using an automated disaster recovery strategy.

Using BOSH Backup and Restore versus re-creating a failed deployment in active-active

The following table compares backing up and restoring using BBR to re-creating a failed deployment in active-active:

	Restore the data	Re-create the data
Preconditions	IaaS prepared for Tanzu Operations Manager and runtime install	IaaS prepared for Tanzu Operations Manager and runtime install
Steps	1. Re-create the deployment. 2. Restore. 3. Apply changes to make the restored deployment match the other active deployment.	1. Re-create the deployment. 2. Trigger automation to re-create orgs, spaces, etc. 3. Notify app developers to re-push apps and re-create service instances and bindings.
RTO (recovery time objective), platform	Time to re-create the deployment	Time to re-create the deployment
RTO (recovery time objective), apps	Time to restore	Time until orgs, spaces, etc. have been re-created and apps have been re-pushed
RPO (recovery point objective), platform	Time of the last backup	Current time
RPO (recovery point objective), apps	Time of the last backup	Current time

Active-Passive

Instead of having a true active-active deployment across all layers, some customers prefer to install a deployment on a backup site. The backup site resides on-premises, in a co-location facility, or the public cloud. The backup site includes an operational deployment, with only the most critical apps ready to accept traffic if a failure occurs in the primary data center. Disaster recovery in this scenario involves:

Switching traffic to the passive deployment, making it active.
Recovering the formerly active deployment. Operators can choose to do this through automation, if that option is available, or by using BBR and the restore process.

The RTO and RPO for re-creating the active deployment are the same as outlined in the preceding table.

Reducing RTO

Both the restore and re-create data disaster recovery options require standing up a new deployment, which can take hours. If you require a shorter RTO, several options involving pre-created standby hardware are available. The following table outlines these options:

Active-cold
Active-warm	The deployment is installed on standby hardware and kept up to date, with VMs scaled down to zero (you spin them up each time there is a deployment update), no apps installed, and no orgs or spaces defined.
Active-inflate platform	Bare-minimum deployment installation, either with no apps, or a small number of each app in a stopped state. On recovery, push a small number of apps or start current apps, while simultaneously triggering automation to scale the deployment to the primary node size, or a smaller version if large percentages of loss are acceptable. This mode allows you to start sending some traffic immediately, while not paying for a full non-primary deployment. This method requires data to be seeded, but it is usually acceptable to complete data sync while the deployment is scaling up.
Active-inflate apps	Non-primary deployment scaled to the primary node size, or a smaller version if large percentages of loss are acceptable, with a small number of Diego Cells (VMs). On fail-over, scale Diego Cells up to primary node counts. This mode allows you to start sending most traffic immediately, while not paying for all the AIs of a fully-fledged node. This method requires data to be available very quickly after a failure. It does not require real-time sync, but it does require near-real-time sync.

There is a trade-off between cost and RTO: the less the replacement deployment needs to be deployed and scaled, the faster the restore.

Automating backups

BBR generates the required backup artifacts, but does not handle scheduling, artifact management, or encryption. You can use the starter Concourse pipeline from the BBR PCF Pipeline Tasks repository on GitHub to automate backups with BBR.

You can also use Stark & Wayne’s SHIELD as a front-end management tool using the BBR plug-in.
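
If you write your own scheduler or pipeline task instead, the core of it is invoking the `bbr` CLI. The following Python sketch only builds the argument list for a deployment backup. The function name and example values are hypothetical, and the flags reflect the commonly documented form of `bbr deployment ... backup`, so verify them against `bbr --help` for your installed version.

```python
from typing import List, Optional


def bbr_deployment_backup_argv(target: str,
                               username: str,
                               deployment: str,
                               ca_cert_path: str,
                               artifact_path: Optional[str] = None) -> List[str]:
    """Build the argument list for a BBR deployment backup.

    The client secret is deliberately not an argument here. This sketch
    assumes it is supplied through the BOSH_CLIENT_SECRET environment
    variable so it does not appear in process listings.
    """
    argv = [
        "bbr", "deployment",
        "--target", target,
        "--username", username,
        "--deployment", deployment,
        "--ca-cert", ca_cert_path,
        "backup",
    ]
    if artifact_path:
        # Assumed optional flag for choosing where the artifact is written.
        argv += ["--artifact-path", artifact_path]
    return argv


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(" ".join(bbr_deployment_backup_argv(
        "bosh.example.com", "ops_client", "cf-example", "/var/certs/root_ca.pem")))
```

A scheduler could pass this list to `subprocess.run` with `check=True`, then encrypt the resulting artifact and move it to off-site storage, because BBR itself does not perform those steps.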

Validating backups

To ensure that backup artifacts are valid, the BBR tool creates checksums of the generated backup artifacts, and ensures that the checksums match the artifacts on the jumpbox.

However, the only way to be sure that the backup artifact can be used to successfully re-create the deployment is to test it in the restore process. This is a cumbersome, dangerous process that must be done with care. For instructions, see Step 11: (Optional) Validate your backup in Backing up Deployments with BBR.
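
If you copy artifacts onward from the jumpbox, for example to off-site storage, you can repeat a checksum comparison of your own after each copy. The following Python sketch assumes a hypothetical `expected` mapping of file names to SHA-256 digests that you recorded earlier. It does not read BBR's own metadata format.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file without loading it all at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(artifact_dir: Path, expected: Dict[str, str]) -> List[str]:
    """Return names of files that are missing or whose digest differs."""
    bad = []
    for name, checksum in expected.items():
        candidate = artifact_dir / name
        if not candidate.is_file() or sha256_of(candidate) != checksum:
            bad.append(name)
    return bad


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        artifact = Path(tmp)
        (artifact / "blobstore.tar").write_bytes(b"example data")
        recorded = {"blobstore.tar": hashlib.sha256(b"example data").hexdigest()}
        print(mismatched_files(artifact, recorded))
```

A matching checksum shows only that the copy is intact. It does not show that the artifact can be restored, which still requires a test restore.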