In the event of a failure, the site is recovered by executing an automated recovery plan. Data replication across protected and failover zones is necessary to recover the state of the site.

Multi-Site Operation and Disaster Recovery

In a hierarchical model, each leaf Edge site is treated as a self-contained entity with a limited number of hosts performing specific functions. When an Edge site fails, whether because of connectivity issues or a complete outage caused by fire, flood, or another disaster-related shutdown, the next level in the hierarchy (the central site) should take action. There are two options to accomplish this:

  • Restart the workloads that were on the failed Edge site on another Edge site, assuming that the new Edge site has spare capacity.

  • Spin up these workloads on the central site itself.

    This option is similar to the DR to cloud use case and assumes that the cloud/central site has sufficient capacity to accommodate the workloads that must be recovered and restarted. Under that assumption, VMware HCX can be used as the engine for migrating the workloads to the central site.
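The choice between the two options can be sketched as a simple capacity check. The following is a minimal illustration only, not part of any VMware API: the `Site` type, the host-count capacity model, and the rule of trying another Edge site before the central site are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    """Hypothetical site record; capacity is modeled as a plain host count."""
    name: str
    kind: str          # "edge" or "central" (labels assumed for this sketch)
    total_hosts: int
    used_hosts: int

    @property
    def spare_hosts(self) -> int:
        return self.total_hosts - self.used_hosts


def choose_recovery_site(required_hosts: int,
                         edge_sites: List[Site],
                         central: Site) -> Optional[Site]:
    """Pick a site on which to restart the workloads of a failed Edge site.

    Assumed policy: try the surviving Edge sites first (option 1), then
    fall back to the central site (option 2). Returns None if neither
    has enough spare capacity.
    """
    candidates = [s for s in edge_sites if s.spare_hosts >= required_hosts]
    if candidates:
        # Among Edge sites that fit, take the one with the most headroom.
        return max(candidates, key=lambda s: s.spare_hosts)
    if central.spare_hosts >= required_hosts:
        return central
    return None
```

For example, with surviving Edge sites that have 1 and 4 spare hosts, a failed site needing 3 hosts would be restarted on the second Edge site, while one needing 10 hosts would fall back to the central site if it has room.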

While both options are viable for restarting workloads in a software-defined infrastructure environment, some networking issues remain unsolved. For example, the user equipment must now establish connectivity to the new site where the workloads, such as VNFs, are being hosted. In addition, the network bandwidth requirements for Internet breakout and the additional workload traffic (from either an Edge site or the central site) are higher. A related issue is IP address management for the restarted workloads on the new site.
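To make the IP address management issue concrete, the sketch below shows one possible re-addressing scheme: each restarted workload keeps its host offset but moves into a subnet owned by the new site. This is an assumption for illustration, not a method prescribed by this document; the function name and the example subnets are hypothetical, and only the Python standard library `ipaddress` module is used.

```python
import ipaddress


def remap_address(old_ip: str, old_subnet: str, new_subnet: str) -> str:
    """Map an address from the failed site's subnet into the new site's
    subnet, preserving the host offset (hypothetical re-addressing scheme)."""
    old_net = ipaddress.ip_network(old_subnet)
    new_net = ipaddress.ip_network(new_subnet)
    addr = ipaddress.ip_address(old_ip)

    if addr not in old_net:
        raise ValueError(f"{old_ip} is not in {old_subnet}")

    # Offset of the host within its original subnet.
    offset = int(addr) - int(old_net.network_address)
    if offset >= new_net.num_addresses:
        raise ValueError(f"{new_subnet} is too small to hold offset {offset}")

    return str(new_net.network_address + offset)
```

A workload at `10.1.0.25` in `10.1.0.0/24` would be placed at `10.9.0.25` in `10.9.0.0/24` under this scheme. A real deployment would also need to update routing, DNS, and any address-dependent configuration in the workloads themselves.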