A recovery plan is the automated plan (runbook) for full or partial failover from the protected to the recovery VMware Cloud Foundation instance.

Recovery Time Objective

The recovery time objective (RTO) is the targeted duration of time, and the service level, within which a business process must be restored after an IT service or data loss issue, such as a natural disaster.

Table 1. Design Decisions on the Configuration of Protected Management Components

Decision ID: SPR-SRM-CFG-005

Design Decision: Use Site Recovery Manager and vSphere Replication together to automate the recovery of the following management components:

  • VMware Aria Suite Lifecycle appliance

  • Clustered Workspace ONE Access

  • VMware Aria Operations analytics cluster

  • VMware Aria Automation appliance instances

Design Justification:

  • Provides an automated run book for the recovery of the management components in the event of a disaster.

  • Ensures that the recovery of management applications can be delivered in a recovery time objective (RTO) of 4 hours or less.

Design Implication: None.

Replication and Recovery Configuration between VMware Cloud Foundation instances

You configure virtual machines in the management domain vCenter Server in the protected VMware Cloud Foundation instance to replicate to the management domain vCenter Server in the recovery VMware Cloud Foundation instance. If a disaster affects the protected VMware Cloud Foundation instance, you have redundant copies of your virtual machines. When you configure replication between the two vCenter Server instances, the following options are available:

Guest OS Quiescing

Quiescing a virtual machine just before replication helps improve the reliability of recovering the virtual machine and its applications. However, any solution, including vSphere Replication, that quiesces an operating system and application might impact performance. For example, such an impact could appear in virtual machines that generate higher levels of I/O and where quiescing occurs often.

Network Compression

Network compression can be defined for each virtual machine to further reduce the amount of data transmitted between source and target locations.

Recovery Point Objective

The recovery point objective (RPO) is configured per virtual machine. RPO defines the maximum acceptable age of the data in the replicated copy (replica) that is recovered after an IT service or data loss issue, such as a natural disaster. The lower the RPO, the closer the replica's data is to the original. However, a lower RPO requires more bandwidth between the source and target locations, and more storage capacity in the target location.
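The RPO definition can be read as a simple age check on the newest replica. The following is a minimal sketch, not a vSphere Replication API: the function names and timestamps are hypothetical, and the 15-minute value is the RPO that this design sets for the management virtual machines.

```python
from datetime import datetime, timedelta

# Assumption: 15 minutes is the RPO this design uses for management VMs.
RPO = timedelta(minutes=15)


def replica_age(last_replica_time: datetime, failure_time: datetime) -> timedelta:
    """Age of the newest replica at the moment the failure occurs."""
    return failure_time - last_replica_time


def rpo_met(last_replica_time: datetime, failure_time: datetime,
            rpo: timedelta = RPO) -> bool:
    """True if the replica is no older than the RPO allows."""
    return replica_age(last_replica_time, failure_time) <= rpo


# Hypothetical timestamps for illustration only.
failure = datetime(2024, 1, 1, 12, 0)
print(rpo_met(datetime(2024, 1, 1, 11, 50), failure))  # replica is 10 minutes old
print(rpo_met(datetime(2024, 1, 1, 11, 40), failure))  # replica is 20 minutes old
```

The replica age is also the amount of recent change that is lost on failover, which is why a lower RPO brings the replica closer to the original.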

Point-in-Time Instance

You define multiple recovery points (point-in-time instances or PIT instances) for each virtual machine so that, when a virtual machine has data corruption, data integrity issues, or host OS infections, administrators can recover and revert to a recovery point from before the compromising issue occurred.
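Choosing which PIT instance to revert to amounts to picking the newest instance taken before the issue began. The sketch below illustrates that selection only. The function name and the instance timestamps are hypothetical, and it does not model how vSphere Replication itself schedules or retains instances.

```python
from datetime import datetime
from typing import Optional, Sequence


def pick_recovery_point(instances: Sequence[datetime],
                        issue_time: datetime) -> Optional[datetime]:
    """Return the newest PIT instance taken strictly before the issue began.

    Returns None when every retained instance is at or after the issue,
    which means no clean recovery point is available.
    """
    candidates = [t for t in instances if t < issue_time]
    return max(candidates) if candidates else None


# Hypothetical retained instances: three copies within one 24-hour period.
retained = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 16, 0),
]

# Corruption detected as starting at 10:30, so the 08:00 instance is chosen.
print(pick_recovery_point(retained, datetime(2024, 1, 1, 10, 30)))
```

Retaining more instances gives more candidates to choose from, at the cost of more storage in the target location.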

Table 2. Design Decisions on vSphere Replication Configuration

Decision ID: SPR-VR-CFG-003

Design Decision: Do not activate guest OS quiescing in the policies for the management virtual machines in vSphere Replication.

Design Justification: Not all management virtual machines support the use of guest OS quiescing. Using the quiescing operation might result in an outage.

Design Implication: The replicas of the management virtual machines that are stored in the target VMware Cloud Foundation instance are crash-consistent rather than application-consistent.

Decision ID: SPR-VR-CFG-004

Design Decision: Activate network compression on the management virtual machine policies in vSphere Replication.

Design Justification:

  • Ensures the vSphere Replication traffic over the network has a reduced footprint.

  • Reduces the amount of buffer memory used on the vSphere Replication VMs.

Design Implication: To perform compression and decompression of data, the vSphere Replication VM might require more CPU resources on the source site as more virtual machines are protected.

Decision ID: SPR-VR-CFG-005

Design Decision: Configure a recovery point objective (RPO) of 15 minutes on the management virtual machine policies in vSphere Replication.

Design Justification: Ensures that the management application that is failing over after a disaster recovery event contains all data except any changes made in the 15 minutes before the event.

Design Implication: Changes that are made in the 15 minutes before a disaster recovery event can be lost.

Decision ID: SPR-VR-CFG-006

Design Decision: Configure point-in-time (PIT) instances, keeping 3 copies over a 24-hour period on the management virtual machine policies in vSphere Replication.

Design Justification: Ensures application integrity for the management application that is failing over after a disaster recovery event occurs.

Design Implication: Increasing the number of retained recovery point instances increases the disk usage on the vSAN datastore.

Startup Order and Response Time

Virtual machine priority determines the virtual machine startup order.

  • All priority 1 virtual machines are started before priority 2 virtual machines.

  • All priority 2 virtual machines are started before priority 3 virtual machines.

  • All priority 3 virtual machines are started before priority 4 virtual machines.

  • All priority 4 virtual machines are started before priority 5 virtual machines.

  • You can also set the startup order of virtual machines within each priority group.
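The ordering rules above amount to a two-level sort: first by priority group, then by position within the group. The following is a minimal sketch of that ordering, not Site Recovery Manager code. The `PlanVm` type, the virtual machine names, and the priority assignments are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PlanVm:
    name: str
    priority: int   # priority group, 1 (started first) through 5 (started last)
    order: int = 0  # position within the priority group


def startup_sequence(vms: Iterable[PlanVm]) -> List[PlanVm]:
    """Sort by priority group, then by the order set within each group."""
    vms = list(vms)
    for vm in vms:
        if not 1 <= vm.priority <= 5:
            raise ValueError(f"{vm.name}: priority must be between 1 and 5")
    return sorted(vms, key=lambda vm: (vm.priority, vm.order))


# Hypothetical plan contents for illustration only.
plan = [
    PlanVm("vm-c", priority=2, order=1),
    PlanVm("vm-a", priority=1),
    PlanVm("vm-d", priority=2, order=2),
    PlanVm("vm-b", priority=1, order=1),
]
print([vm.name for vm in startup_sequence(plan)])
```

Every virtual machine in priority group 1 appears before any virtual machine in group 2, and the `order` field only breaks ties inside a group.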

You can configure the following timeout parameters:

  • Response time, which defines the time to wait after the first virtual machine powers on before proceeding to the next virtual machine in the plan.

  • Maximum time to wait if the virtual machine fails to power on before proceeding to the next virtual machine.

During a recovery plan test, you can adjust the response time values as necessary to determine the appropriate settings.
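One way to reason about these timeout values is to bound the startup phase of a plan against the RTO. The sketch below assumes, as a simplification, that the virtual machines power on strictly one after another and that each one uses its full power-on timeout and its full response time. The function name, the virtual machine count, and the timeout values are hypothetical, and the bound covers only the startup step, not the rest of a recovery.

```python
from datetime import timedelta

# RTO target stated in this design for the management components.
RTO = timedelta(hours=4)


def worst_case_startup(vm_count: int, response_time: timedelta,
                       power_on_timeout: timedelta) -> timedelta:
    """Upper bound on startup time under strictly sequential power-on.

    Each VM is assumed to take the full power-on timeout and then the
    full response time before the plan proceeds to the next VM.
    """
    return vm_count * (power_on_timeout + response_time)


# Hypothetical values for illustration only.
bound = worst_case_startup(
    vm_count=10,
    response_time=timedelta(minutes=5),
    power_on_timeout=timedelta(minutes=5),
)
print(bound, "fits within RTO:", bound <= RTO)
```

If the bound exceeds the RTO, shorter timeouts or fewer sequential dependencies are the parameters to revisit.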

Table 3. Design Decisions on the Startup Order Configuration in Site Recovery Manager

Decision ID: SPR-SRM-RP-001

Design Decision: Use a prioritized startup order for VMware Aria Suite Lifecycle and the clustered Workspace ONE Access nodes.

Design Justification:

  • Ensures that the VMware Aria Suite Lifecycle is started in such an order that the life cycle management services are restored after a disaster.

  • Ensures that the clustered Workspace ONE Access is started in such an order that the authentication services are restored after a disaster.

  • Ensures that the VMware Aria Suite Lifecycle and the clustered Workspace ONE Access services are restored within the target of 4 hours.

Design Implication: You must have VMware Tools running on VMware Aria Suite Lifecycle and each of the clustered Workspace ONE Access nodes.

Decision ID: SPR-SRM-RP-002

Design Decision: Use a prioritized startup order for VMware Aria Operations analytics cluster nodes.

Design Justification: Ensures that the individual nodes in the VMware Aria Operations analytics cluster are started in such an order that the operational monitoring services are restored after a disaster.

Design Implication:

  • You must have VMware Tools running on the VMware Aria Operations analytics cluster nodes.

  • You must maintain the customized recovery plan if you increase the number of analytics nodes in the VMware Aria Operations cluster.

Decision ID: SPR-SRM-RP-003

Design Decision: Use a prioritized startup order for VMware Aria Operations remote collector nodes.

Design Justification: Ensures that the VMware Aria Operations remote collectors are started in such an order that the operational monitoring services are restored after a disaster.

Design Implication:

  • You must have VMware Tools running on the VMware Aria Operations remote collector nodes.

  • You must maintain the customized recovery plan if you increase the number of remote collector nodes in the VMware Aria Operations cluster.

Decision ID: SPR-SRM-RP-004

Design Decision: Use a prioritized startup order for VMware Aria Automation nodes.

Design Justification:

  • Ensures that the individual nodes within VMware Aria Automation are started in such an order that cloud automation services are restored after a disaster.

  • Ensures that the VMware Aria Automation services are restored within the target of 4 hours.

Design Implication: You must have VMware Tools installed and running on each VMware Aria Automation node.

Testing a Recovery Plan

When you create a recovery plan, you must configure test network options as follows:

Isolated Network

Automatically created. For a virtual machine that is being recovered, Site Recovery Manager creates an isolated private network on each ESXi host in the cluster. Site Recovery Manager creates a standard switch and a port group on it.

A limitation of this automatic configuration is that a virtual machine that is connected to the isolated port group on one ESXi host cannot communicate with a virtual machine on another ESXi host. This option limits testing scenarios and provides an isolated test network only for basic virtual machine testing.

Port Group

Selecting an existing port group provides a more granular configuration to meet your testing requirements. If you want virtual machines across ESXi hosts to communicate, use a standard or distributed switch with uplinks to the production network, and create a port group on that switch that is tagged with a non-routable VLAN. The VLAN tagging isolates the test network, so it is not connected to other production networks.

Because the protected applications use an NSX load balancer, it is not possible to bring the applications online in an isolated test network. Recovery plan tests are therefore out of scope of this design.

Table 4. Design Decisions on Testing Recovery

Decision ID: SPR-SRM-RP-005

Design Decision: Do not run test recovery of recovery plans.

Design Justification: Because the protected applications use an NSX load balancer, it is not possible to bring the applications online in an isolated test network. DNS resolution is also unavailable in an isolated test network.

Design Implication: You cannot test disaster recovery without impacting the running production applications.