In this scenario, a natural disaster strikes at the primary site 1 in Palo Alto, and site 1 goes down completely. The NSX administrator performs a manual failover to the secondary site 2 in Austin.
As the primary site has gone down due to unforeseen circumstances, the administrator cannot do any failover preparation before the actual failure occurs.
The NSX administrator wants to meet the following key objectives:
- Achieve a full site failover at site 2 with minimal downtime.
- Retain site 1 application IP addresses at site 2 after the failover.
- Automatically recover all Edge interface settings and BGP protocol configuration settings at site 2.
Note:
- The administrator can do the failover tasks manually by using either the vSphere Web Client or the NSX REST APIs. In addition, the administrator can automate some failover tasks by running a script that contains the API calls to run during the failover. This scenario explains the manual failover steps using the vSphere Web Client. However, if a step requires either the CLI or the NSX REST APIs, adequate instructions are provided.
- In this scenario, the disaster recovery workflow is specific to the topology explained earlier, which has a primary NSX Manager and a single secondary NSX Manager. The workflow with multiple secondary NSX Managers is not in the scope of this scenario.
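As a minimal sketch of the scripted approach mentioned in the note above, the following Python runner executes failover tasks strictly in order and stops at the first task that fails, so that later tasks never run against a half-recovered site. The step names and the `lambda` placeholders are hypothetical; in a real script, each callable would wrap the NSX REST API call for the corresponding step of the procedure in this scenario, in the order that the procedure gives.

```python
from typing import Callable, List, Tuple

# A step is a label plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_failover_steps(steps: List[Step]) -> List[str]:
    """Run steps in order and return the names of the completed steps.

    Raises RuntimeError at the first step that reports failure.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            raise RuntimeError(
                f"Failover step failed: {name} (completed so far: {completed})"
            )
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder steps (assumptions, not the documented procedure).
    example_steps: List[Step] = [
        ("promote secondary NSX Manager", lambda: True),
        ("power on ESGs at site 2", lambda: True),
    ]
    print(run_failover_steps(example_steps))
```

Stopping at the first failure is a design choice for this sketch: it keeps the script's state easy to reason about when the administrator has to resume manually.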
Important: If the primary site 1 powers on while the failover to the secondary site 2 is in progress, first complete the failover by using the procedure in this scenario. Only after a clean failover to the secondary site 2 is done, fail back all the workloads to the original primary site 1. For detailed instructions about the failback process, see Scenario 3: Full Failback to Primary Site.
Prerequisites
- NSX Data Center 6.4.5 or later is installed at both sites 1 and 2.
- The vCenter Server systems at sites 1 and 2 are deployed in Enhanced Linked Mode.
- At site 1 and site 2, the following conditions are met:
  - No application-specific security policies are configured on a non-NSX firewall, if any.
  - No application-specific firewall rules are configured on a non-NSX firewall, if any.
  - The firewall is disabled on both ESGs, because ECMP is enabled on the UDLRs and all traffic must be allowed.
- At site 2, the following conditions are met before the failover:
  - Downlink interfaces similar to those at site 1 are configured manually on the ESGs.
  - A BGP configuration similar to the one at site 1 is done manually on the ESGs.
  - The ESGs are powered off while the primary site 1 is active or running.
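Two of the prerequisites above lend themselves to a quick automated check before a disaster occurs: the minimum NSX version at both sites, and the powered-off state of the site 2 ESGs. The following Python sketch assumes that the administrator has already collected the installed version strings and the ESG power state by hand. The function names and input shapes are hypothetical, and version strings are assumed to be purely numeric and dot-separated.

```python
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '6.4.5' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def unmet_prerequisites(
    nsx_versions: Dict[str, str],
    site2_esgs_powered_on: bool,
    minimum: str = "6.4.5",
) -> List[str]:
    """Return a description of each prerequisite that is not met."""
    problems: List[str] = []
    for site, version in sorted(nsx_versions.items()):
        # Tuples compare element by element, so 6.4.10 sorts after 6.4.5.
        if parse_version(version) < parse_version(minimum):
            problems.append(f"{site}: NSX {version} is older than {minimum}")
    if site2_esgs_powered_on:
        problems.append("site 2: ESGs must stay powered off while site 1 is active")
    return problems


if __name__ == "__main__":
    print(unmet_prerequisites({"site 1": "6.4.6", "site 2": "6.4.4"}, False))
```

Comparing integer tuples instead of raw strings avoids the common mistake of ranking "6.4.10" below "6.4.5".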
Procedure
Results
The manual recovery of NSX components and the failover from the primary site (site 1) to the secondary site (site 2) are complete.
What to do next
Verify whether the failover to site 2 is 100% complete by doing these steps on site 2 (promoted primary site):
- Check whether the NSX Manager has the primary role.
- Check whether the Control VM (Edge Appliance VM) is deployed on the UDLR.
- Check whether the status of all controller cluster nodes is Connected.
- Check whether the host preparation status is Green.
- Log in to the CLI console of the UDLR Control VM (Edge Appliance VM), and do these steps:
  - Run the show ip bgp neighbors command to check whether all BGP neighbors are established and their status is UP.
  - Run the show ip route bgp command to check whether all BGP routes are being learned from all BGP neighbors.
After a complete failover to site 2, all workloads run on the secondary site (promoted primary) and traffic is routed through the UDLR and the NSX Edges at site 2.
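The verification checklist above can be captured as a small Python sketch that takes the values an administrator observes at site 2 and reports which checks fail. The `Site2Status` class and its field names are hypothetical, and the values are assumed to be collected by hand from the vSphere Web Client and the UDLR Control VM CLI; the sketch does not call NSX itself. The route check is only a proxy: it confirms that at least one route is learned from each neighbor, not that every expected route is present.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Site2Status:
    """Observed state at site 2 (all field names are hypothetical)."""
    nsx_manager_role: str
    udlr_control_vm_deployed: bool
    controller_statuses: List[str]
    host_prep_status: str
    bgp_neighbor_states: Dict[str, str]      # neighbor IP -> BGP state
    bgp_routes_per_neighbor: Dict[str, int]  # neighbor IP -> learned routes


def failed_checks(status: Site2Status) -> List[str]:
    """Return one message per failed check; an empty list means all passed."""
    failures: List[str] = []
    if status.nsx_manager_role.lower() != "primary":
        failures.append("NSX Manager does not have the primary role")
    if not status.udlr_control_vm_deployed:
        failures.append("Control VM is not deployed on the UDLR")
    # An empty list counts as a failure: no controller nodes were reported.
    if not status.controller_statuses or any(
        state != "Connected" for state in status.controller_statuses
    ):
        failures.append("Not all controller cluster nodes are Connected")
    if status.host_prep_status != "Green":
        failures.append("Host preparation status is not Green")
    if not status.bgp_neighbor_states or any(
        state != "Established" for state in status.bgp_neighbor_states.values()
    ):
        failures.append("Not all BGP neighbors are established")
    for neighbor in status.bgp_neighbor_states:
        if status.bgp_routes_per_neighbor.get(neighbor, 0) == 0:
            failures.append(f"No BGP routes learned from neighbor {neighbor}")
    return failures
```

An empty result from `failed_checks` corresponds to a fully verified failover in the sense of the checklist; any returned message names the item to investigate first.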