HCX Disaster Recovery service operation requires planning for the amount of storage consumed at the target location.
HCX Protection Workflow
Replication-based operations such as HCX Bulk Migration, Replication Assisted vMotion, and HCX Disaster Recovery use vSphere Replication technologies to transfer virtual machine disk data. When a virtual machine protection operation is first run, the replication engine performs a full synchronization of all the data that makes up the virtual machine to the target location datastore. Following that baseline synchronization, the system performs delta synchronizations, meaning that only changed data blocks are replicated.
Delta synchronization occurs based on the recovery point objective (RPO) interval configured for the virtual machine, creating a replication instance. The selectable RPO ranges from 5 minutes to 24 hours. For example, setting the Recovery Point Objective (RPO) to 2 hours means that the maximum data loss that your organization can tolerate is 2 hours.
Setting an RPO does not mean replication occurs at a fixed interval. A replication instance reflects the state of a virtual machine at the time the synchronization starts. The system schedules replications so that the RPO is not violated. For example, assuming a 15-minute RPO, if the synchronization starts at 12:00 and it takes five minutes to transfer to the target site, the instance becomes available on the target site at 12:05. That instance reflects the state of the virtual machine at 12:00. The next synchronization can start no later than 12:10 so that the next instance is available no later than 12:15.
To determine the replication transfer time, the replication scheduler uses the duration of the last few instances to estimate the next one.
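The timing rule in the 15-minute example can be sketched as follows. This is a minimal illustration and not the HCX scheduler itself: the function names are hypothetical, and averaging the recent transfer durations is an assumption, since the exact estimation method is not described here.

```python
from datetime import datetime, timedelta


def estimate_transfer(recent_durations):
    """Estimate the next transfer time as the mean of recent ones (assumed heuristic)."""
    total = sum(recent_durations, timedelta())
    return total / len(recent_durations)


def latest_next_start(previous_start, rpo, estimated_transfer):
    """Latest time the next synchronization can start without violating the RPO.

    The next instance must be available on the target no later than
    previous_start + rpo, so the transfer has to begin early enough to finish by then.
    """
    return previous_start + rpo - estimated_transfer


# Values from the example: synchronization at 12:00, 15-minute RPO,
# recent transfers that average five minutes (the individual durations are made up).
previous_start = datetime(2024, 1, 1, 12, 0)
rpo = timedelta(minutes=15)
estimate = estimate_transfer(
    [timedelta(minutes=5), timedelta(minutes=4), timedelta(minutes=6)]
)
deadline = latest_next_start(previous_start, rpo, estimate)
print(deadline.strftime("%H:%M"))  # 12:10
```

If transfers start taking longer, the estimate grows and the latest allowed start time moves earlier, which is why replication does not run on a fixed clock interval.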
Following a full synchronization, the HCX DR service prompts you to run a test recovery operation to verify the replication.
Using Snapshots with HCX DR Protection
HCX allows for multiple recovery points, or replica instances, which are converted to snapshots when you recover a virtual machine. You set a retention policy for these instances by configuring a snapshot interval along with the number of snapshots to retain for each protected virtual machine. Snapshot intervals range from 1 hour to 7 days. The maximum number of snapshots taken during that interval can range from 1 to 24. For example, setting the number of snapshots to 4 and the snapshot interval to 1 day means you can restore that virtual machine to any of 4 recovery points over the past 24 hours. In another example, setting the number of snapshots to 24 and the snapshot interval to 3 hours results in 8 snapshots per day for 4 days.
The RPO interval and the snapshot interval do not have to be the same. Snapshots are taken from the latest replication instance based on the RPO. The RPO must be set low enough to create the number of configured snapshots. For example, setting a retention policy of 6 snapshots per day means the RPO period must not exceed 4 hours, so that at least 6 replication instances are created in 24 hours.
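The relationship between the retention policy and the RPO is simple arithmetic. The sketch below uses a hypothetical helper name and assumes snapshots are spread evenly across a 24-hour day.

```python
def max_rpo_hours(snapshots_per_day):
    """Largest RPO (in hours) that still yields at least this many
    replication instances in 24 hours."""
    if snapshots_per_day < 1:
        raise ValueError("snapshots_per_day must be at least 1")
    return 24 / snapshots_per_day


print(max_rpo_hours(6))  # 4.0
```

Any RPO at or below the returned value produces enough replication instances for the configured snapshots.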
With snapshots, delta synchronizations are written to a new (replica) disk created for the snapshot in the same datastore as the baseline. Each new snapshot becomes the child of the previous version. For example, the first snapshot (replica 1) becomes the child, the baseline becomes the parent, and all delta synchronizations are written to replica 1. When a second snapshot (replica 2) is created, replica 2 becomes the child, replica 1 becomes the parent, and all delta synchronizations are written to replica 2.
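The parent and child relationship can be pictured with a toy model. This is only an illustration of the chain described above, not how HCX or vSphere Replication stores disks; the class and method names are hypothetical.

```python
class ReplicaChain:
    """Toy model of the disk chain on the target datastore (hypothetical names)."""

    def __init__(self):
        self.disks = ["baseline"]        # oldest first; the last entry receives writes
        self.writes = {"baseline": []}

    def take_snapshot(self):
        """Create a new replica disk as the child of the current newest disk."""
        name = f"replica {len(self.disks)}"
        self.disks.append(name)
        self.writes[name] = []
        return name

    def write_delta(self, label):
        """Record a delta synchronization against the newest disk in the chain."""
        self.writes[self.disks[-1]].append(label)

    def parent_of(self, name):
        """Return the parent disk name, or None for the baseline."""
        index = self.disks.index(name)
        return self.disks[index - 1] if index > 0 else None


chain = ReplicaChain()
chain.take_snapshot()            # replica 1 becomes the child of the baseline
chain.write_delta("delta A")     # written to replica 1
chain.take_snapshot()            # replica 2 becomes the child of replica 1
chain.write_delta("delta B")     # written to replica 2
print(chain.parent_of("replica 2"))  # replica 1
```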
Best Practices for HCX Protection Planning
Storage and bandwidth planning for replication at the target site depends on several factors:
- Data set size
Consider the data set for replication and the capacity of the virtual disks (VMDK files) that make up the target site virtual machine. Consider whether the target site virtual disks are thick- or thin-provisioned. For example, a 100 GB virtual disk that is thick-provisioned always consumes 100 GB. A 100 GB disk that is thin-provisioned consumes only the actual amount of data stored on the disk, up to 100 GB. While a thin-provisioned disk may initially use only a fraction of the provisioned storage, it can grow to fill the total storage space.
- Data change rate
Consider the amount of data replicated to the target location based on the rate of change in source virtual machine data. For example, if a source virtual machine disk with 50 GB of data has an estimated daily change rate of 5 percent, about 2.5 GB of data is replicated each day.
Also, consider the maximum amount of data transferred for any one replication instance. Network bandwidth must be capable of meeting the RPO interval for the amount of data transferred.
- Recovery Point Objective interval
Assuming a consistent rate of change on the source virtual machine, a lower RPO generally means smaller delta synchronizations but higher bandwidth consumption to meet the lower RPO. Setting the RPO interval to the largest interval that your organization can tolerate can help to reduce network issues.
- Network bandwidth
The replication network bandwidth must be sufficient to meet the RPO interval for the amount of data transferred. For example, if the RPO interval is 15 minutes, and the amount of data changed during that period is 1 GB, the network must be capable of transferring that amount of data during the 15-minute interval.
- Retention policy
Having multiple recovery points means having a copy of the point-in-time changes for each snapshot, which increases storage requirements by the amount of changes over the RPO interval times the number of snapshots configured. Set the number of recovery points as low as possible while still meeting business requirements.
- Protection concurrency with migration operations
Ongoing HCX migrations use the HCX Interconnect (HCX-IX) appliance for virtual machine disk replications. Resources used during a Bulk or RAV transfer affect the total resources available for HCX Disaster Recovery (and vice versa) when the same service mesh appliances are used for both services.
- Recovery and recovery testing
During recovery operations, or when testing a recovery plan with HCX Disaster Recovery, additional space is consumed by each recovered virtual machine. Normally, redo logs are consolidated into the replica base disk or into other redo logs if multiple recovery points are activated. During a test recovery, some or all of the redo logs may be in use until the test recovery is cleaned up (completed). If redo logs are in use, HCX cannot consolidate the redo logs. Replication continues during a test recovery, which generates additional redo logs. The actual amount of storage capacity consumed depends on factors such as data change rates, replication frequencies, and how long the test recovery lasts.
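The arithmetic behind several of the planning factors above can be sketched in a few lines. This is a rough estimate under stated assumptions, not an HCX sizing tool: the function names are hypothetical, decimal units are used (1 GB = 8000 megabits), and protocol overhead, compression, and concurrent migration traffic are ignored. The figure of 4 snapshots is only an illustrative value.

```python
def daily_change_gb(data_gb, daily_change_rate):
    """Data replicated per day for a given data set size and daily change rate."""
    return data_gb * daily_change_rate


def required_mbps(delta_gb, rpo_minutes):
    """Minimum sustained throughput (megabits per second) needed to transfer
    delta_gb within one RPO interval."""
    return delta_gb * 8000 / (rpo_minutes * 60)


def retention_overhead_gb(change_per_rpo_gb, snapshots):
    """Extra target storage for retained recovery points: the changes over one
    RPO interval multiplied by the number of configured snapshots."""
    return change_per_rpo_gb * snapshots


print(daily_change_gb(50, 0.05))        # 2.5 (50 GB at a 5 percent daily change rate)
print(round(required_mbps(1, 15), 1))   # 8.9 (1 GB within a 15-minute RPO)
print(retention_overhead_gb(1, 4))      # 4 (1 GB per RPO interval, 4 snapshots)
```

Treat these numbers as lower bounds. Sizing for the largest expected delta rather than the average leaves headroom for bursts in the change rate and for test recoveries that hold redo logs open.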