Business Continuity and Disaster Recovery

Business continuity and disaster recovery solutions are an integral part of the vCloud NFV OpenStack Edition platform. To achieve a robust solution, the platform uses three components.

Table 1. NFVI Business Continuity and Disaster Recovery Components
Component Name	Description
VMware vSphere Replication	Hypervisor-based asynchronous replication solution that provides granular replication and recovery of management components.
VMware Site Recovery Manager	Disaster recovery management and orchestration engine that provides predictable failover of management components.
VMware vSphere Data Protection	Data protection solution that performs backup and recovery of management components.

The methods for using these three business continuity and disaster recovery tools to ensure healthy operations in the NFV environment are described in the following sections of the document. While a multi-site design will be the subject of a future document, this reference architecture provides an overview of the business continuity capabilities built into vCloud NFV OpenStack Edition.

VMware vSphere Replication

vSphere Replication is the technology used to replicate virtual machine data between data center objects within a single site or across sites. It fully supports vSAN. vSphere Replication is deployed as an appliance within the management cluster to provide a Recovery Point Objective (RPO) of five minutes to 24 hours.

The two most important aspects to consider when designing or executing a disaster recovery plan are RPO and Recovery Time Objective (RTO). RPO is the maximum period of data loss that is acceptable. It is fulfilled by the replication technology. RTO is a target duration, with an attached service-level agreement, within which the business process must be restored. It includes the time needed for recovery and for the service to reach a state in which the business can operate as usual.

vSphere Replication provides the ability to set the RPO; however, RTO is application dependent.
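The relationship between the two objectives can be illustrated with a minimal Python sketch. It is not a vSphere Replication API: the function names and the recovery step names in the usage note are hypothetical, and only the RPO range of five minutes to 24 hours comes from the text above.

```python
from datetime import timedelta

# RPO range stated in this section: five minutes to 24 hours.
MIN_RPO = timedelta(minutes=5)
MAX_RPO = timedelta(hours=24)


def validate_rpo(rpo: timedelta) -> timedelta:
    """Return rpo unchanged if it lies within the stated range, else raise ValueError."""
    if not (MIN_RPO <= rpo <= MAX_RPO):
        raise ValueError(f"RPO {rpo} is outside the range {MIN_RPO} to {MAX_RPO}")
    return rpo


def meets_rto(recovery_steps: dict[str, timedelta], rto: timedelta) -> bool:
    """Hypothetical check: compare the summed, application-specific
    recovery step durations against the RTO target."""
    total = sum(recovery_steps.values(), timedelta())
    return total <= rto
```

For example, `meets_rto({"failover": timedelta(minutes=20), "service checks": timedelta(minutes=30)}, timedelta(hours=1))` evaluates whether 50 minutes of assumed recovery work fits a one-hour RTO. The step durations must come from the application owner, which is why the RTO cannot be set by the replication layer.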

VMware Site Recovery Manager

Site Recovery Manager provides a solution for automating the recovery and execution of a disaster recovery plan, in the event of a disaster in a data center. When a catastrophe occurs, components in the Management pod must be available to recover and continue the healthy operations of the NFV-based services.

To ensure robust business continuity and disaster recovery, network connectivity between the protected and recovery sites is required, with enough bandwidth capacity to replicate the management components using vSphere Replication. Each site must have an instance of vCenter Server that governs the Management pod and its ESXi hosts, and a Site Recovery Manager server and vSphere Replication appliance to orchestrate the disaster recovery workflows and replicate content across the sites. The protected site provides business critical services, while the recovery site is an alternative infrastructure on which services are recovered in the event of a disaster.
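As a rough aid for sizing the inter-site link, the following Python sketch estimates the average throughput needed to ship the data changed during one RPO window within that same window. It is a simplifying assumption, not a VMware sizing formula: it ignores compression, protocol overhead, and bursts, and the function name and parameters are hypothetical.

```python
def min_replication_bandwidth_mbps(changed_gb_per_rpo: float, rpo_minutes: float) -> float:
    """Average throughput in megabits per second needed to transfer the data
    changed during one RPO window before that window elapses.

    Assumes decimal units (1 GB = 8000 megabits) and no overhead.
    """
    if changed_gb_per_rpo < 0 or rpo_minutes <= 0:
        raise ValueError("changed data must be >= 0 and the RPO must be > 0")
    megabits = changed_gb_per_rpo * 8 * 1000  # gigabytes -> megabits
    seconds = rpo_minutes * 60
    return megabits / seconds
```

Under these assumptions, 4.5 GB of changed data with a 15-minute RPO needs an average of about 40 Mbit/s. Real links should be provisioned with headroom above such an estimate.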

Networking Considerations

Moving a service from one site to another represents a networking challenge in terms of maintaining IP addressing and security policies and ensuring ample network bandwidth. Some of these challenges, such as IP addressing, are managed by using NSX for vSphere.

Distributed Resource Scheduler Considerations

Some management components for the vCloud NFV OpenStack Edition platform, such as NSX for vSphere, Edge Services Gateway, PSCs, vCloud Director cells, vRealize Operations Manager, and vRealize Log Insight, have specific affinity or anti-affinity rules configured for availability. When protected management components are recovered at a recovery site, DRS rules, reservations, and limits are not carried over as part of the recovery plan. However, it is possible to manually configure rules, reservations, and limits on placeholder virtual machines at the recovery site during the platform build.

Inventory Mappings

Elements in the vCenter Server inventory list can be mapped from the protected site to their vCenter Server inventory counterparts on the recovery site. Such elements include virtual machine folders, clusters or resource pools, and networks. All items within a single data center on the protected site must map to a single data center on the recovery site.

These inventory mapping details are used across both the protected and recovery sites:

Resource mapping maps cluster objects on the protected site to cluster objects on the recovery site
Folder mapping maps folder structures, such as data centers or virtual machine folders, on the protected site to folder structures on the recovery site
Network mapping maps the management networks on the protected site to management networks on the recovery site
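The mappings above can be pictured as simple lookup tables from protected-site objects to recovery-site objects. The Python sketch below is illustrative only and does not use the Site Recovery Manager API; the function names and the object names in the usage note are hypothetical. It checks two things: that every protected object has a counterpart, and that all items in one protected data center map into a single recovery data center.

```python
def unmapped_objects(protected_objects: list[str], mapping: dict[str, str]) -> list[str]:
    """Return the protected-site objects that have no recovery-site counterpart."""
    return sorted(name for name in protected_objects if name not in mapping)


def maps_to_single_datacenter(
    mapping: dict[str, str],
    protected_dc_of: dict[str, str],
    recovery_dc_of: dict[str, str],
) -> bool:
    """True if all mapped items from each protected data center land in
    exactly one recovery data center."""
    targets: dict[str, set[str]] = {}
    for protected_name, recovery_name in mapping.items():
        source_dc = protected_dc_of[protected_name]
        targets.setdefault(source_dc, set()).add(recovery_dc_of[recovery_name])
    return all(len(dcs) == 1 for dcs in targets.values())
```

For instance, with a hypothetical mapping `{"mgmt-cluster-a": "mgmt-cluster-b", "mgmt-net-a": "mgmt-net-b"}`, `unmapped_objects(["mgmt-cluster-a", "mgmt-net-a", "mgmt-folder-a"], mapping)` reports that `mgmt-folder-a` still needs a folder mapping.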

VNF Recovery Considerations

Every vendor must provide a specific strategy for disaster recovery for any VNF managed directly by the VNF Managers.

Protection Groups

A protection group is a group of management components at the protected site that can fail over together to the recovery site during testing and recovery. All protected management components are placed within a single protection group.

Recovery Plans

Recovery plans are the run books associated with a disaster recovery scenario. A recovery plan determines which management components are started, what needs to be powered down, which scripts to run, the startup order, and the overall automated execution of the failover.

A complete site failure is the only scenario that invokes disaster recovery. There is no requirement for recovery plans to handle planned migrations or to move a single failed application within the management cluster. A single recovery plan is created for the automated failover of the primary site, and the placement of management components into priority groups ensures the correct startup order.
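How priority groups translate into a startup order can be sketched in a few lines of Python. This is an illustration, not Site Recovery Manager behavior: the function name is hypothetical, and the priority numbers in the usage note are assumed values rather than recommendations from this document.

```python
from itertools import groupby


def startup_order(priority_of: dict[str, int]) -> list[list[str]]:
    """Group components by priority number, lowest number first.

    Components that share a priority group are returned together,
    sorted by name so that the output is deterministic.
    """
    ordered = sorted(priority_of.items(), key=lambda item: (item[1], item[0]))
    return [
        [name for name, _ in group]
        for _, group in groupby(ordered, key=lambda item: item[1])
    ]
```

With the assumed priorities `{"vCenter Server": 1, "NSX Manager": 2, "vRealize Log Insight": 3, "vRealize Operations Manager": 3}`, the function yields three groups: vCenter Server first, NSX Manager second, and the two vRealize components together in the last group.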

Recovery of the resource cluster, edge cluster, vCenter Server, and NSX Manager is required to maintain management capabilities where additional physical data centers are managed within the site.