A recovery plan is the automated plan (runbook) for full or partial failover from Region A to Region B.
Recovery Time Objective
The recovery time objective (RTO) is the targeted duration of time, and the service level, within which a business process must be restored after an IT service or data loss issue, such as a natural disaster.
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
SDDC-OPS-DR-018 | Use Site Recovery Manager and vSphere Replication together to automate the recovery of the following management components: | | None. |
Replication and Recovery Configuration between Regions
You configure virtual machines in the Management vCenter Server in Region A to replicate to the Management vCenter Server in Region B such that, in the event of a disaster in Region A, you have redundant copies of your virtual machines. During the configuration of replication between the two vCenter Server instances, the following options are available:
- Guest OS Quiescing
- Quiescing a virtual machine just before replication helps improve the reliability of recovering the virtual machine and its applications. However, any solution, including vSphere Replication, that quiesces an operating system and application might impact performance. For example, such an impact could appear in virtual machines that generate higher levels of I/O and where quiescing occurs often.
- Network Compression
- Network compression can be defined for each virtual machine to further reduce the amount of data transmitted between source and target locations.
- Recovery Point Objective
- The recovery point objective (RPO) is configured per virtual machine. RPO defines the maximum acceptable age of the data in the replicated copy (replica) that is recovered after an IT service or data loss issue, such as a natural disaster. The lower the RPO, the closer the replica's data is to the original. However, a lower RPO requires more bandwidth between source and target locations, and more storage capacity in the target location.
- Point-in-Time Instance
- You define multiple recovery points (point-in-time instances or PIT instances) for each virtual machine so that, when a virtual machine experiences data corruption, data integrity issues, or host OS infections, administrators can recover and revert to a recovery point from before the compromising issue occurred.
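The RPO definition above can be read as a simple age check on each replica. The following is a minimal sketch of that reading, not a vSphere Replication API: the `ReplicaStatus` record, its field names, and the sample timestamps are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ReplicaStatus:
    """Hypothetical record of one replicated virtual machine."""
    vm_name: str
    rpo: timedelta        # configured per virtual machine
    last_sync: datetime   # when the replica was last brought up to date


def replica_age(status: ReplicaStatus, now: datetime) -> timedelta:
    """Age of the replica's data relative to `now`."""
    return now - status.last_sync


def violates_rpo(status: ReplicaStatus, now: datetime) -> bool:
    """True when the replica's data is older than the RPO allows."""
    return replica_age(status, now) > status.rpo


# Illustrative values only.
now = datetime(2024, 1, 1, 12, 0)
fresh = ReplicaStatus("vm-a", timedelta(minutes=15), datetime(2024, 1, 1, 11, 50))
stale = ReplicaStatus("vm-b", timedelta(minutes=15), datetime(2024, 1, 1, 11, 30))

print(violates_rpo(fresh, now))  # False: the replica is 10 minutes old
print(violates_rpo(stale, now))  # True: the replica is 30 minutes old
```

A check of this kind only reports whether the replica is within its objective; it does not model the bandwidth or storage cost of choosing a lower RPO.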
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
SDDC-OPS-DR-019 | Do not enable guest OS quiescing in the policies for the management virtual machines in vSphere Replication. | Not all management virtual machines support the use of guest OS quiescing. Using the quiescing operation might result in an outage. | The replicas of the management virtual machines that are stored in the target region are crash-consistent rather than application-consistent. |
SDDC-OPS-DR-020 | Enable network compression on the management virtual machine policies in vSphere Replication. | | To perform compression and decompression of data, the vSphere Replication VM might require more CPU resources on the source site as more virtual machines are protected. |
SDDC-OPS-DR-021 | Enable a recovery point objective (RPO) of 15 minutes on the management virtual machine policies in vSphere Replication. | | Any changes that are made up to 15 minutes before a disaster recovery event are lost. |
SDDC-OPS-DR-022 | Enable point-in-time (PIT) instances, keeping 3 copies over a 24-hour period on the management virtual machine policies in vSphere Replication. | Ensures application integrity for the management application that is failing over after a disaster recovery event occurs. | Increasing the number of retained recovery point instances increases the disk usage on the vSAN datastore. |
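Decisions SDDC-OPS-DR-019 through SDDC-OPS-DR-022 can be captured as one expected policy and compared against the settings of each management virtual machine. The sketch below assumes a hypothetical `ReplicationSettings` structure; the field names are illustrative and do not correspond to any vSphere Replication API. It also assumes that "3 copies over a 24-hour period" means three instances per day retained for one day.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ReplicationSettings:
    """Hypothetical per-VM replication settings; names are illustrative."""
    quiescing: bool
    network_compression: bool
    rpo_minutes: int
    pit_instances_per_day: int
    pit_days: int


# Values taken from decisions SDDC-OPS-DR-019 through SDDC-OPS-DR-022.
# The PIT split into "per day" and "days" is an assumed interpretation.
MANAGEMENT_POLICY = ReplicationSettings(
    quiescing=False,
    network_compression=True,
    rpo_minutes=15,
    pit_instances_per_day=3,
    pit_days=1,
)


def deviations(actual, expected=MANAGEMENT_POLICY):
    """Return {field: (actual, expected)} for every setting that differs."""
    return {
        f.name: (getattr(actual, f.name), getattr(expected, f.name))
        for f in fields(expected)
        if getattr(actual, f.name) != getattr(expected, f.name)
    }


# A virtual machine configured with quiescing on and a 60-minute RPO.
vm = ReplicationSettings(True, True, 60, 3, 1)
print(deviations(vm))
# {'quiescing': (True, False), 'rpo_minutes': (60, 15)}
```

An empty result means that the virtual machine matches the design decisions in the table.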
Startup Order and Response Time
Virtual machine priority determines the virtual machine startup order.
- All priority 1 virtual machines are started before priority 2 virtual machines.
- All priority 2 virtual machines are started before priority 3 virtual machines.
- All priority 3 virtual machines are started before priority 4 virtual machines.
- All priority 4 virtual machines are started before priority 5 virtual machines.
- You can also set the startup order of virtual machines within each priority group.
You can configure the following timeout parameters:
- Response time, which defines the time to wait after the first virtual machine powers on before proceeding to the next virtual machine in the plan.
- Maximum time to wait if the virtual machine fails to power on before proceeding to the next virtual machine.
During recovery plan tests, you can adjust the response time values as necessary to determine the appropriate settings.
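The priority rules above amount to sorting virtual machines by priority group first and by their order within the group second. The following minimal sketch illustrates that ordering only; the `RecoveryVM` type and the virtual machine names are hypothetical, and the response time and maximum wait timeouts are not modeled.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class RecoveryVM:
    """Hypothetical entry in a recovery plan."""
    name: str
    priority: int    # 1 (started first) through 5 (started last)
    order: int = 0   # position within the priority group


def startup_sequence(vms):
    """Group VM names by priority, ordered within each group."""
    ordered = sorted(vms, key=lambda vm: (vm.priority, vm.order))
    return [
        (priority, [vm.name for vm in group])
        for priority, group in groupby(ordered, key=lambda vm: vm.priority)
    ]


# Illustrative names only.
vms = [
    RecoveryVM("app-node-2", priority=2, order=2),
    RecoveryVM("database", priority=1),
    RecoveryVM("app-node-1", priority=2, order=1),
    RecoveryVM("proxy", priority=3),
]

for priority, names in startup_sequence(vms):
    print(priority, names)
# 1 ['database']
# 2 ['app-node-1', 'app-node-2']
# 3 ['proxy']
```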
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
SDDC-OPS-DR-023 | Use a prioritized startup order for vRealize Operations Manager and vRealize Suite Lifecycle Manager nodes. | | |
SDDC-OPS-DR-024 | Use a prioritized startup order for vRealize Automation and vRealize Business nodes. | | |
Recovery Plan Test Network
When you create a recovery plan, you must configure test network options as follows:
- Isolated Network
- Automatically created. For a virtual machine that is being recovered, Site Recovery Manager creates an isolated private network on each ESXi host in the cluster. Site Recovery Manager creates a standard switch and a port group on it. A limitation of this automatic configuration is that a virtual machine that is connected to the isolated port group on one ESXi host cannot communicate with a virtual machine on another ESXi host. This option limits testing scenarios and provides an isolated test network only for basic virtual machine testing.
- Port Group
- Selecting an existing port group provides a more granular configuration to meet your testing requirements. If you want virtual machines across ESXi hosts to communicate, use a standard or distributed switch with uplinks to the production network, and create a port group on the switch that is tagged with a non-routable VLAN. This isolates the network so that it is not connected to other production networks.
Because the application virtual networks for failover are fronted by a load balancer, you can use the recovery plan production network as the recovery plan test network, which provides realistic verification of a recovered management application.
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
SDDC-OPS-DR-025 | Use the target recovery production network for testing. | The design of the application virtual networks supports their use as recovery plan test networks. This allows the re-use of existing networks. | During recovery testing, a management application is not reachable using its production FQDN. Access the application using its VIP address or assign a temporary FQDN for testing. Note that this approach results in certificate warnings because the assigned temporary host name does not match the host name in the certificate. |