In this scenario, primary site 1 is down, either due to scheduled maintenance or an unplanned power failure. All workloads are running at the secondary site 2 (the promoted primary site), and traffic is routed through the UDLR and the NSX Edges at site 2. Now that the original primary site 1 is back up, the NSX administrator wants to recover the NSX components and restore all workloads at the original primary site 1.
The NSX administrator wants to meet the following key objectives:
- Achieve a full failback of all workloads from site 2 to original primary site 1 with minimal downtime.
- Retain the application IP addresses after failback to site 1.
- Automatically recover all Edge interface settings and BGP protocol configuration settings at site 1.
Note:
- The administrator can perform the failback tasks manually by using either the vSphere Web Client or the NSX REST APIs. In addition, the administrator can automate some failback tasks by running a script file that contains the API calls to run during the failback. This scenario explains the manual failback steps using the vSphere Web Client. However, if any step requires the use of either the CLI or the NSX REST APIs, adequate instructions are provided.
- In this scenario, the disaster recovery workflow is specific to the topology explained earlier, which has a primary NSX Manager and a single secondary NSX Manager. The workflow with multiple secondary NSX Managers is not in the scope of this scenario.
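For administrators who choose API-driven automation instead of the vSphere Web Client, a script typically begins by confirming each NSX Manager's current universal sync role before any role reassignment. The sketch below only builds the request for that query; the manager host name and credentials are hypothetical, and the endpoint path is based on the NSX Data Center for vSphere REST API and should be verified against the API guide for your version.

```python
import base64

# Hypothetical NSX Manager address for illustration; substitute your own.
SITE1_NSX_MANAGER = "nsxmgr-site1.corp.local"

def build_role_request(nsx_manager, user, password):
    """Build the URL and headers for querying an NSX Manager's
    universal sync role (for example, PRIMARY or SECONDARY).

    The endpoint path is an assumption based on the NSX-v REST API;
    confirm it in the API guide for your NSX version."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    url = f"https://{nsx_manager}/api/2.0/universalsync/configuration/role"
    headers = {"Authorization": f"Basic {token}", "Accept": "application/xml"}
    return url, headers

url, headers = build_role_request(SITE1_NSX_MANAGER, "admin", "secret")
print(url)
```

A failback script would issue this request against both managers first, then proceed with role changes only if the reported roles match the expected pre-failback state.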
Prerequisites
- NSX Data Center 6.4.5 or later is installed at both sites 1 and 2.
- The vCenter Server systems at sites 1 and 2 are deployed with Enhanced Linked Mode.
- At site 1 and site 2, the following conditions are met:
- No application-specific security policies are configured on a non-NSX firewall, if any.
- No application-specific firewall rules are configured on a non-NSX firewall, if any.
- The firewall is disabled on both ESGs because ECMP is enabled on the UDLRs, which also ensures that all traffic is allowed.
- At site 2 (promoted primary), no changes are made in the universal logical components before initiating the failback process.
Procedure
Results
The manual failback of all NSX components and workloads from the secondary site (site 2) to the primary site (site 1) is complete.
What to do next
Verify whether the failback to primary site 1 is fully complete by performing these steps at site 1:
- Check whether the NSX Manager has the primary role.
- Check whether the Control VM (Edge Appliance VM) is deployed on the UDLR.
- Check whether the status of all controller cluster nodes is Connected.
- Perform a Communication Health Check on each host cluster that is prepared for NSX.
- Navigate to .
- Select the NSX Manager at site 1.
- Select one cluster at a time, and check whether the Communication Channel Health status of the cluster is UP.
- For each host in the cluster, check whether the Communication Channel Health status of the host is UP.
- Check whether the host preparation status is Green.
- Log in to the CLI console of the UDLR Control VM (Edge Appliance VM), and perform these checks:
- Check whether all BGP neighbors are established and the status is UP by running the show ip bgp neighbors command.
- Check whether all BGP routes are being learned from all BGP neighbors by running the show ip route bgp command.
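The controller-cluster check in the list above can also be scripted against the NSX Manager REST API. The sketch below only parses a response; the XML fragment is an illustrative shape of a `GET /api/2.0/vdn/controller` response (controller IDs are made up, and the real payload contains additional fields), so validate the actual schema against your version's API guide.

```python
import xml.etree.ElementTree as ET

# Illustrative response shape; controller IDs are made up and the real
# payload contains additional fields per controller.
SAMPLE = """<controllers>
  <controller><id>controller-1</id><status>RUNNING</status></controller>
  <controller><id>controller-2</id><status>RUNNING</status></controller>
  <controller><id>controller-3</id><status>RUNNING</status></controller>
</controllers>"""

def controllers_healthy(xml_text):
    """Return True only if every controller node reports RUNNING status."""
    root = ET.fromstring(xml_text)
    statuses = [c.findtext("status") for c in root.findall("controller")]
    return bool(statuses) and all(s == "RUNNING" for s in statuses)

print(controllers_healthy(SAMPLE))  # → True
```

A verification script would fetch the live response from the NSX Manager at site 1 and alert if this function returns False for any node.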
After a complete failback to site 1, all workloads run on the primary site 1 and traffic is routed through the UDLR and the NSX Edges at site 1.
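The BGP checks performed on the UDLR Control VM can likewise be automated by capturing the `show ip bgp neighbors` output and parsing the neighbor state lines. The output fragment below is an abbreviated illustration (neighbor addresses and AS numbers are made up, and the real command prints many more fields), so adjust the parsing to the exact output format of your NSX version.

```python
# Abbreviated, illustrative fragment of `show ip bgp neighbors` output;
# addresses and AS numbers are made up.
SAMPLE_OUTPUT = """\
BGP neighbor is 192.168.10.1, remote AS 65001,
BGP state = Established, up
BGP neighbor is 192.168.10.2, remote AS 65001,
BGP state = Established, up
"""

def all_neighbors_established(cli_output):
    """Return True only if every 'BGP state' line reports Established."""
    states = [line.split("=", 1)[1].strip()
              for line in cli_output.splitlines()
              if line.startswith("BGP state")]
    return bool(states) and all(s.startswith("Established") for s in states)

print(all_neighbors_established(SAMPLE_OUTPUT))  # → True
```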