A recovery plan is the automated plan (runbook) for full or partial failover from Region A to Region B.

Recovery Time Objective

The recovery time objective (RTO) is the targeted duration of time, and the service level, within which a business process must be restored after an IT service or data loss issue, such as a natural disaster.

Decision ID: SDDC-OPS-DR-013

Design Decision: Use Site Recovery Manager and vSphere Replication together to automate the recovery of the following management components:

  • vRealize Operations analytics cluster

  • vRealize Automation Appliance instances

  • vRealize Automation IaaS components

  • vRealize Business Server

Design Justification:

  • Provides an automated runbook for the recovery of the management components in the event of a disaster.

  • Ensures that the recovery of management applications can be delivered in a recovery time objective (RTO) of 4 hours or less.

Design Implication: None.

Replication and Recovery Configuration between Regions

You configure virtual machines in the Management vCenter Server in Region A to replicate to the Management vCenter Server in Region B such that, in the event of a disaster in Region A, you have redundant copies of your virtual machines. During the configuration of replication between the two vCenter Server instances, the following options are available:

Guest OS Quiescing

Quiescing a virtual machine just before replication helps improve the reliability of recovering the virtual machine and its application(s). However, any solution, including vSphere Replication, that quiesces an operating system and application might impact performance. This is especially true in virtual machines that generate higher levels of I/O and where quiescing occurs often.

Network Compression

Network compression can be defined for each virtual machine to further reduce the amount of data transmitted between source and target locations.

Recovery Point Objective

The recovery point objective (RPO) is configured per virtual machine. RPO defines the maximum acceptable age of the data in the replicated copy (replica) that is recovered after an IT service or data loss issue, such as a natural disaster. The lower the RPO, the closer the replica's data is to the original. However, a lower RPO requires more bandwidth between source and target locations, and more storage capacity in the target location.
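To make the RPO definition concrete, the following minimal sketch checks whether the age of a replica at the moment of a failure stays within a configured RPO. The function name and the timestamps are hypothetical and chosen for illustration only; this is not a vSphere Replication API.

```python
from datetime import datetime, timedelta


def replica_age_within_rpo(last_replica: datetime,
                           failure_time: datetime,
                           rpo: timedelta) -> bool:
    """Return True if the replica is no older than the RPO at failure time."""
    age = failure_time - last_replica
    return timedelta(0) <= age <= rpo


# Illustrative values only.
rpo = timedelta(minutes=15)
failure = datetime(2024, 1, 1, 12, 0)

# Replica taken 10 minutes before the failure: within a 15-minute RPO.
print(replica_age_within_rpo(datetime(2024, 1, 1, 11, 50), failure, rpo))  # True

# Replica taken 20 minutes before the failure: RPO violated.
print(replica_age_within_rpo(datetime(2024, 1, 1, 11, 40), failure, rpo))  # False
```

The age of the replica is also the upper bound on the data that is lost when you fail over to it.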

Point-in-Time Instance

You define multiple recovery points (point-in-time instances or PIT instances) for each virtual machine so that, when a virtual machine suffers data corruption, data integrity issues, or host OS infections, administrators can revert to a recovery point taken before the compromising issue occurred.
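The selection that an administrator makes among PIT instances can be sketched as follows: pick the newest recovery point that predates the issue. The function name and the timestamps are hypothetical, and the sketch assumes that the time at which the issue started is known.

```python
from datetime import datetime
from typing import Optional, Sequence


def latest_clean_instance(instances: Sequence[datetime],
                          issue_time: datetime) -> Optional[datetime]:
    """Return the newest recovery point taken strictly before issue_time."""
    candidates = [t for t in instances if t < issue_time]
    return max(candidates) if candidates else None


# Hypothetical PIT instances retained for one virtual machine.
pit_instances = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 16, 0),
]

# The issue started at 10:30, so the 08:00 instance is the newest clean one.
print(latest_clean_instance(pit_instances, datetime(2024, 1, 1, 10, 30)))
# 2024-01-01 08:00:00
```

If every retained instance is newer than the issue, the function returns None, which corresponds to having no usable recovery point.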

Table 1. Design Decisions about vSphere Replication

Decision ID: SDDC-OPS-DR-014

Design Decision: Do not enable guest OS quiescing on the management virtual machine policies in vSphere Replication.

Design Justification: Not all management virtual machines support the use of guest OS quiescing. Using the quiescing operation might result in an outage.

Design Implication: The replicas of the management virtual machines that are stored in the target region are crash-consistent rather than application-consistent.

Decision ID: SDDC-OPS-DR-015

Design Decision: Enable network compression on the management virtual machine policies in vSphere Replication.

Design Justification:

  • Ensures the vSphere Replication traffic over the network has a reduced footprint.

  • Reduces the amount of buffer memory used on the vSphere Replication VMs.

Design Implication: To perform compression and decompression of data, the vSphere Replication VM might require more CPU resources on the source site as more virtual machines are protected.

Decision ID: SDDC-OPS-DR-016

Design Decision: Enable a recovery point objective (RPO) of 15 minutes on the management virtual machine policies in vSphere Replication.

Design Justification:

  • Ensures that the management application that is failing over after a disaster recovery event contains all data except any changes made in the 15 minutes before the event.

  • Achieves the availability and recovery target of 99% of this VMware Validated Design.

Design Implication: Changes that are made in the 15 minutes before a disaster recovery event might be lost.

Decision ID: SDDC-OPS-DR-017

Design Decision: Enable point-in-time (PIT) instances, keeping 3 copies over a 24-hour period on the management virtual machine policies in vSphere Replication.

Design Justification: Ensures application integrity for the management application that is failing over after a disaster recovery event occurs.

Design Implication: Increasing the number of retained recovery point instances increases the disk usage on the vSAN datastore.
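The four replication decisions can be captured as a baseline and compared against the settings of an individual virtual machine. The following is a minimal sketch under stated assumptions: ReplicationPolicy, MANAGEMENT_POLICY, and deviations are hypothetical names introduced for illustration and are not part of any vSphere Replication API. Only the values come from decisions SDDC-OPS-DR-014 through SDDC-OPS-DR-017.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class ReplicationPolicy:
    """Hypothetical record of per-VM replication settings."""
    guest_os_quiescing: bool
    network_compression: bool
    rpo_minutes: int
    pit_instances: int
    pit_window_hours: int


# Baseline values from SDDC-OPS-DR-014 through SDDC-OPS-DR-017.
MANAGEMENT_POLICY = ReplicationPolicy(
    guest_os_quiescing=False,
    network_compression=True,
    rpo_minutes=15,
    pit_instances=3,
    pit_window_hours=24,
)


def deviations(policy: ReplicationPolicy,
               baseline: ReplicationPolicy = MANAGEMENT_POLICY
               ) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (actual, expected)} for settings that differ."""
    actual, expected = asdict(policy), asdict(baseline)
    return {k: (actual[k], expected[k])
            for k in expected if actual[k] != expected[k]}


# A hypothetical VM whose settings drifted from the design.
drifted = ReplicationPolicy(True, True, 60, 3, 24)
print(deviations(drifted))
# {'guest_os_quiescing': (True, False), 'rpo_minutes': (60, 15)}
```

How the actual settings are collected from the environment is outside the scope of this sketch.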

Startup Order and Response Time

Virtual machine priority determines virtual machine startup order.

  • All priority 1 virtual machines are started before priority 2 virtual machines.

  • All priority 2 virtual machines are started before priority 3 virtual machines.

  • All priority 3 virtual machines are started before priority 4 virtual machines.

  • All priority 4 virtual machines are started before priority 5 virtual machines.

  • You can additionally set startup order of virtual machines within each priority group.

You can configure the following timeout parameters:

  • Response time, which defines the time to wait after the first virtual machine powers on before proceeding to the next virtual machine in the plan.

  • Maximum time to wait if the virtual machine fails to power on before proceeding to the next virtual machine.

You can adjust response time values as necessary during execution of the recovery plan test to determine the appropriate response time values. 
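The ordering rules can be illustrated with a small, simplified model: virtual machines are sorted by priority group first, then by their position within the group. The PlannedVM type, the order_in_group field, and the virtual machine names are hypothetical; the sketch does not model response times or timeouts and does not use a Site Recovery Manager API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlannedVM:
    name: str
    priority: int        # 1 (started first) through 5 (started last)
    order_in_group: int  # position within its priority group


def startup_sequence(vms: List[PlannedVM]) -> List[str]:
    """Order VMs by priority group, then by position within the group."""
    for vm in vms:
        if not 1 <= vm.priority <= 5:
            raise ValueError(f"{vm.name}: priority must be between 1 and 5")
    ordered = sorted(vms, key=lambda vm: (vm.priority, vm.order_in_group))
    return [vm.name for vm in ordered]


# Hypothetical virtual machines, for illustration only.
plan = [
    PlannedVM("vm-c", priority=2, order_in_group=1),
    PlannedVM("vm-b", priority=1, order_in_group=2),
    PlannedVM("vm-a", priority=1, order_in_group=1),
]
print(startup_sequence(plan))
# ['vm-a', 'vm-b', 'vm-c']
```

Both priority 1 virtual machines appear before the priority 2 virtual machine, regardless of the order in which they were added to the plan.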

Table 2. Startup Order Design Decisions for Site Recovery Manager

Decision ID: SDDC-OPS-DR-018

Design Decision: Use a prioritized startup order for vRealize Operations Manager nodes.

Design Justification:

  • Ensures that the individual nodes in the vRealize Operations Manager analytics cluster are started in such an order that the operational monitoring services are restored after a disaster.

  • Ensures that the vRealize Operations Manager services are restored within the target of 4 hours.

Design Implication:

  • You must have VMware Tools running on each vRealize Operations Manager node.

  • You must maintain the customized recovery plan if you increase the number of analytics nodes in the vRealize Operations Manager cluster.

Decision ID: SDDC-OPS-DR-019

Design Decision: Use a prioritized startup order for vRealize Automation and vRealize Business nodes.

Design Justification:

  • Ensures that the individual nodes within vRealize Automation and vRealize Business are started in such an order that cloud provisioning and cost management services are restored after a disaster.

  • Ensures that the vRealize Automation and vRealize Business services are restored within the target of 4 hours.

Design Implication:

  • You must have VMware Tools installed and running on each vRealize Automation and vRealize Business node.

  • You must maintain the customized recovery plan if you increase the number of nodes in vRealize Automation.

Recovery Plan Test Network

When you create a recovery plan, you must configure test network options as follows:

Isolated Network

Automatically created. For a virtual machine that is being recovered, Site Recovery Manager creates an isolated private network on each ESXi host in the cluster. Site Recovery Manager creates a standard switch and a port group on it.

A limitation of this automatic configuration is that a virtual machine that is connected to the isolated port group on one ESXi host cannot communicate with a virtual machine on another ESXi host. This option limits testing scenarios and provides an isolated test network only for basic virtual machine testing.

Port Group

Selecting an existing port group provides a more granular configuration to meet your testing requirements. If you want virtual machines across ESXi hosts to communicate, use a standard or distributed switch with uplinks to the production network, and create a port group on the switch that has VLAN tagging enabled with a non-routable VLAN. In this way, you isolate the test network so that it cannot communicate with other production networks.

Because the application virtual networks for failover are fronted by a load balancer, the recovery plan test network is equal to the recovery plan production network and provides realistic verification of a recovered management application.

Table 3. Recovery Plan Test Network Design Decision

Decision ID: SDDC-OPS-DR-020

Design Decision: Use the target recovery production network for testing.

Design Justification: The design of the application virtual networks supports their use as recovery plan test networks.

Design Implication: During recovery testing, a management application is not reachable using its production FQDN. Access the application using its VIP address or assign a temporary FQDN for testing. Note that this approach results in certificate warnings because of the mismatch between the assigned temporary host name and the host name in the certificate.