When you run a recovery plan for failover, the Orchestrator moves your VMs to the failover site as defined in your plan, optimizing for recovery time objective (RTO).

Temporary Snapshots and Failover Operations

VMware Live Cyber Recovery creates temporary snapshots as part of the failover workflow to reduce the risk of data loss.

During failover, VMware Live Cyber Recovery uses these temporary snapshots to accumulate deltas. If a disruptive operation occurs, it takes another snapshot to make sure VMs can be restored completely.

These temporary snapshots are deleted once failback completes, when no VMs are running on the cloud file system, and when all plans are committed.

No Overwriting of Existing VMs

To avoid undesirable side effects of executing potentially erroneous plans, a running plan never overwrites VMs that exist on the destination datastore. If another VM is already present on the destination datastore at the exact datastore path matching the path of the recovering VM, VMware Live Cyber Recovery does not attempt to recover that VM.

Such VM recoveries are flagged as ‘failed’ during failover or test failover and the existing VMs are preserved. To make automatic recovery of these types of VMs possible, the conflicting VMs must be explicitly deleted from the destination datastore before running a failover operation.
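
The no-overwrite rule can be illustrated with a minimal Python sketch. The function name `plan_recoveries`, the VM names, and the representation of datastore paths as plain strings are assumptions made for illustration only; this is not the product's implementation or API.

```python
def plan_recoveries(recovering_vms, existing_paths):
    """Split VMs into those that can be recovered and those that conflict.

    recovering_vms: dict of VM name -> destination datastore path (hypothetical model).
    existing_paths: set of datastore paths already occupied on the destination.
    """
    to_recover, failed = [], []
    for name, path in recovering_vms.items():
        if path in existing_paths:
            # Exact path match: the existing VM is preserved and the
            # recovery of this VM is flagged as failed.
            failed.append(name)
        else:
            to_recover.append(name)
    return to_recover, failed


# Hypothetical example data.
vms = {
    "web-01": "[WorkloadDatastore] web-01/web-01.vmx",
    "db-01": "[WorkloadDatastore] db-01/db-01.vmx",
}
existing = {"[WorkloadDatastore] db-01/db-01.vmx"}

to_recover, failed = plan_recoveries(vms, existing)
print("recover:", to_recover)
print("failed:", failed)
```

In this model, deleting the conflicting entry from `existing` before the run is the equivalent of removing the conflicting VM from the destination datastore.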

Batching

VMware Live Cyber Recovery recovers VMs in fixed-size batches, also called substeps. VM batching is done to:

  • Recover VMs concurrently to improve RTO
  • Allow fine-grained retry when errors occur
  • Control the load on external components

All VMs in a batch are recovered concurrently, improving overall RTO. Recovering an individual VM can involve many different stages, such as selecting snapshots, customizing IP addresses, reconfiguring the VM to reflect failover mappings, powering on the VM, and other configuration steps. Parallelizing the stages of a failover plan across all VMs in a batch improves overall throughput and reduces RTO.
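
As a rough illustration of fixed-size batching with concurrent recovery inside each batch, consider the following Python sketch. The batch size, the `recover_vm` placeholder, and the use of a thread pool are assumptions chosen for the example; the actual batch size and the internal concurrency mechanism of the product are not described here.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(vms, batch_size):
    """Partition the VM list into consecutive fixed-size batches."""
    return [vms[i:i + batch_size] for i in range(0, len(vms), batch_size)]


def recover_vm(vm):
    """Hypothetical placeholder for the per-VM stages (snapshot selection,
    IP customization, reconfiguration, power-on)."""
    return f"{vm}: recovered"


def run_plan(vms, batch_size):
    """Recover batches one after another; VMs within a batch run concurrently."""
    results = []
    for batch in make_batches(vms, batch_size):
        # At most len(batch) recoveries are in flight at any time.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            results.extend(pool.map(recover_vm, batch))
    return results


print(make_batches(["vm-1", "vm-2", "vm-3", "vm-4", "vm-5"], 2))
print(run_plan(["vm-1", "vm-2", "vm-3"], 2))
```

Because the next batch starts only after the previous one finishes, the number of concurrent operations never exceeds the batch size.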

Improving error handling is another reason for failover batching. VMware Live Cyber Recovery supports retrying VM recovery on transient errors or following error remediation. When the “stop on all errors” setting is configured in a plan, the running plan stops after any batch in which one or more VMs encounter errors.

Batch execution is atomic in that execution stops after all VMs in a batch have reached a terminal state: either successful recovery or an error. After you address the error condition, the failed batch can be retried. Similarly, VMware Live Cyber Recovery can automatically retry the last batch upon observing transient errors (for example, transient network connectivity problems).

As part of retry, VMware Live Cyber Recovery rolls back the batch operation and then restarts it. Batching reduces the throwaway work for large plans by limiting the rollback to the failing batch only.
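
The stop-and-retry behavior can be sketched as follows. This is a simplified model under stated assumptions: `run_batches`, `max_retries`, and the `recover` and `rollback` callbacks are hypothetical names, and VMs within a batch are processed sequentially here to keep the example short.

```python
def run_batches(batches, recover, rollback, max_retries=1):
    """Run batches in order, retrying only the batch that failed.

    Returns the index of the batch where the plan stopped,
    or None if every batch succeeded.
    """
    for index, batch in enumerate(batches):
        attempts = 0
        while True:
            # The batch is evaluated only after every VM has reached a
            # terminal state (True = recovered, False = error).
            outcomes = {vm: recover(vm) for vm in batch}
            if all(outcomes.values()):
                break
            if attempts >= max_retries:
                return index  # stop the plan at the failed batch
            # Roll back only the failing batch; earlier batches are kept.
            rollback(batch)
            attempts += 1
    return None


# Hypothetical demo: vm-3 fails once with a transient error, then succeeds.
failures_left = {"vm-3": 1}
rolled_back = []


def recover(vm):
    if failures_left.get(vm, 0) > 0:
        failures_left[vm] -= 1
        return False
    return True


def rollback(batch):
    rolled_back.append(list(batch))


stopped_at = run_batches([["vm-1", "vm-2"], ["vm-3", "vm-4"]], recover, rollback)
print("stopped at:", stopped_at)
print("rolled back:", rolled_back)
```

In the demo, only the second batch is rolled back and rerun; the work done for the first batch is not thrown away.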

Batching limits the concurrency imposed on external components involved in recovery. For example, batching limits the number of concurrent requests issued to the vCenter Server on the recovery site. When failover involves fetching remote snapshots or performing Storage vMotion, batching naturally limits concurrency for these operations, resulting in better overall system throughput.

Skip VMs Not Registered with vCenter

If a VM is not registered with vCenter on the protected site, it will not be automatically recovered and registered with vCenter on the recovery SDDC.

VM Tags and Tag Categories

The failover process applies to each recovered VM the vSphere tags that were associated with it on the original protected site. However, the tags and their associated categories must be pre-configured on the recovery SDDC for successful failover and failback.

When you fail over VMs with tags, be aware of two possible environmental situations:

  • A) Tags are present on both the protected site vCenter configuration and on the recovery SDDC.
  • B) Tags are present on the protected site vCenter configuration, but the tags do not exist on the recovery SDDC.

During a recovery plan compliance check, the system scans every VM in the protection group to make sure all tags associated with those VMs are available on the recovery SDDC.

If the category and the tags present on a VM do not exist on the recovery SDDC, then they will be flagged by the compliance checks as errors.
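
Conceptually, this check is a set difference between the tags used by protected VMs and the tags available on the recovery SDDC. The Python sketch below models tags as (category, tag) pairs; the function name `missing_tags` and the sample data are assumptions for illustration and do not reflect the product's actual compliance check or any vSphere API.

```python
def missing_tags(protected_vm_tags, recovery_tags):
    """Return, per VM, the (category, tag) pairs absent from the recovery SDDC.

    protected_vm_tags: dict of VM name -> set of (category, tag) pairs.
    recovery_tags: set of (category, tag) pairs defined on the recovery SDDC.
    """
    errors = {}
    for vm, tags in protected_vm_tags.items():
        missing = tags - recovery_tags
        if missing:
            # Each missing pair would surface as a compliance error.
            errors[vm] = sorted(missing)
    return errors


# Hypothetical example data.
protected = {
    "web-01": {("Environment", "Prod"), ("Owner", "TeamA")},
    "db-01": {("Environment", "Prod")},
}
recovery = {("Environment", "Prod")}

print(missing_tags(protected, recovery))
```

An empty result corresponds to scenario (A); any entry in the result corresponds to scenario (B).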

When you perform a failover in these two scenarios, you might have to perform extra steps before you can commit or fail back the recovery plan.

In scenario (A), all categories and tags have been created on the recovery SDDC. After failover, each VM is started on the recovery SDDC and tagged with the same tags it had on the source site. In this situation, no extra action is required before committing or failing back the recovery plan.

In scenario (B), where some tags are missing (and the compliance check was failing), failover proceeds but completes with errors.

Hence, after the failover completes with errors, you must manually create the missing tags on the recovery SDDC from the VMware Cloud on AWS Console before you try to commit the recovery plan or fail back. Then, you can proceed to commit the plan or run a failback operation with the plan.

Migration Limits During Failover/Failback

During failover or failback, VMs in the recovery plan are migrated to the 'WorkloadDatastore' in vSphere. When vSphere migration limits are reached, the failover or failback tasks might report ‘Resources currently in use by other operations. Waiting’.

For more information, see Limits on Simultaneous Migrations.

You can choose to bypass Storage vMotion migration to the recovery SDDC and run failed-over VMs live on the cloud file system. This failover runtime setting uses cloud backup as highly available (HA) storage and runs recovered VMs directly from the cloud file system. With this option, failover is faster and there is no dependency on SDDC hosts for storage capacity. For more information, see Configure VM Storage for Failover.