In this scenario, a natural disaster strikes the primary site 1 in Palo Alto, and site 1 goes down completely. The NSX administrator performs a manual failover to the secondary site 2 in Austin.

Because the primary site goes down due to unforeseen circumstances, the administrator cannot do any failover preparation before the actual failure occurs.

The NSX administrator wants to meet the following key objectives:
  • Achieve a full site failover to site 2 with minimal downtime.
  • Retain site 1 application IP addresses at site 2 after the failover.
  • Automatically recover all Edge interface settings and BGP protocol configuration settings at site 2.
Note:
  • The administrator can perform the failover tasks manually by using either the vSphere Web Client or the NSX REST APIs. The administrator can also automate some failover tasks by running a script file that contains the API calls to run during the failover (see the sketch after the Important note below). This scenario explains the manual failover steps by using the vSphere Web Client. However, if any step requires the use of either the CLI or the NSX REST APIs, adequate instructions are provided.
  • In this scenario, the disaster recovery workflow is specific to the topology explained earlier, which has a primary NSX Manager and a single secondary NSX Manager. The workflow with multiple secondary NSX Managers is not in the scope of this scenario.
Important: If the primary site 1 comes back online while the failover to the secondary site 2 is in progress, first ensure that the failover process is completed by using the procedure in this scenario. Only after a clean failover to the secondary site 2 is complete, restore or fail back all the workloads to the original primary site 1. For detailed instructions about the failback process, see Scenario 3: Full Failback to Primary Site.
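The following minimal Python sketch shows one way to structure such a failover script, assuming the NSX Manager REST API is reachable over HTTPS with basic authentication. The hostname, credentials, and the example endpoint (GET /api/2.0/vdn/controller, which lists the controller nodes) are placeholders and assumptions; confirm them against the NSX API Guide for your version before use.

    # Minimal sketch of calling the NSX REST API from a failover script.
    # Assumptions: HTTPS with basic authentication; the hostname and
    # credentials below are placeholders.
    import requests

    NSX_MANAGER = "nsxmgr-site2.example.com"   # placeholder FQDN
    AUTH = ("admin", "changeme")               # placeholder credentials

    def nsx_get(path):
        # In production, validate the NSX Manager certificate instead of verify=False.
        resp = requests.get(f"https://{NSX_MANAGER}{path}",
                            auth=AUTH, verify=False, timeout=30)
        resp.raise_for_status()
        return resp.text

    if __name__ == "__main__":
        # Example read-only call: list the controller nodes known to this manager.
        print(nsx_get("/api/2.0/vdn/controller"))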

Prerequisites

  • NSX Data Center 6.4.5 or later is installed at both sites 1 and 2.
  • The vCenter Server systems at sites 1 and 2 are deployed with Enhanced Linked Mode.
  • At site 1 and site 2, the following conditions are met:
    • No application-specific security policies are configured on any non-NSX firewall.
    • No application-specific firewall rules are configured on any non-NSX firewall.
    • The firewall is disabled on both ESGs because ECMP is enabled on the UDLRs, and to ensure that all traffic is allowed.
  • At site 2, the following conditions are met before the failover:
    • Downlink interfaces are configured manually on the ESGs to match the site 1 configuration.
    • BGP is configured manually on the ESGs to match the site 1 configuration (see the verification sketch after this list).
    • The ESGs are in a powered-down state while the primary site 1 is active.
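  One way to confirm that the ESG configuration at site 2 mirrors site 1 is to pull the BGP configuration from each edge through the routing API and compare the results. The following Python sketch assumes the /api/4.0/edges/{edge-id}/routing/config/bgp path from the NSX edge routing API; the hostnames, credentials, and edge IDs are placeholders.

    # Sketch: retrieve the BGP configuration of an ESG for comparison
    # between sites. Hostnames, credentials, and edge IDs are placeholders.
    import requests

    AUTH = ("admin", "changeme")   # placeholder credentials

    def get_bgp_config(manager, edge_id):
        resp = requests.get(
            f"https://{manager}/api/4.0/edges/{edge_id}/routing/config/bgp",
            auth=AUTH, verify=False, timeout=30)
        resp.raise_for_status()
        return resp.text

    # Print both configurations for a side-by-side comparison.
    print(get_bgp_config("nsxmgr-site1.example.com", "edge-10"))   # placeholder edge ID
    print(get_bgp_config("nsxmgr-site2.example.com", "edge-20"))   # placeholder edge ID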

Procedure

  1. Verify that the primary NSX Manager at site 1 is down.
    1. On the Installation and Upgrade page, navigate to Management > NSX Managers.
      • If you refresh the NSX Managers page in the current browser session, the role of the primary NSX Manager changes to Unknown.
      • If you log out from the vSphere Web Client and log in again or start a new vSphere Web Client browser session, the primary NSX Manager is no longer displayed on the NSX Managers page.
    2. Navigate to Networking & Security > Dashboard > Overview.
      • If you refresh the Dashboard page in the current browser session, the following error message is displayed: Could not establish connection with NSX Manager. Please contact administrator. This error means that the primary NSX Manager is no longer reachable.
      • If you log out from the vSphere Web Client and log in again or start a new vSphere Web Client browser session, the primary NSX Manager is no longer available in the NSX Manager drop-down menu.
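    If you are scripting the failover, the same check can be approximated by probing the site 1 NSX Manager from the script. The sketch below assumes the appliance-management summary endpoint; confirm the exact path against the NSX API Guide for your version, and treat the hostname and credentials as placeholders.

      # Sketch: treat the site 1 NSX Manager as down if its API does not respond.
      import requests

      PRIMARY_MANAGER = "nsxmgr-site1.example.com"   # placeholder FQDN
      AUTH = ("admin", "changeme")                   # placeholder credentials

      def primary_is_down(timeout=10):
          try:
              resp = requests.get(
                  f"https://{PRIMARY_MANAGER}/api/1.0/appliance-management/summary/system",
                  auth=AUTH, verify=False, timeout=timeout)
              return resp.status_code >= 500
          except requests.exceptions.RequestException:
              # Connection refused, timed out, or otherwise unreachable.
              return True

      print("Primary NSX Manager down:", primary_is_down())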
  2. Promote the secondary NSX Manager to a primary role.
    1. On the Installation and Upgrade page, navigate to Management > NSX Managers.
    2. Select the secondary NSX Manager.
    3. Click Actions > Disconnect from Primary NSX Manager. When prompted to continue with the disconnect operation, click Yes.
      The secondary NSX Manager is disconnected from the primary NSX Manager and transitions to the Transit role.
    4. Click Actions > Assign Primary Role.
      The secondary NSX Manager at site 2 is promoted to the primary role. For a scripted alternative to this promotion step, see the API sketch after the caution that follows.
    Caution: Because local egress is disabled on the UDLR, the UDLR Control VM (Edge Appliance VM) is deployed only at the original primary site (site 1). As a result, no UDLR Control VM exists at the secondary site (site 2), which is now promoted to primary. Redeploy the UDLR Control VM at the promoted primary site (site 2) before redeploying the NSX Controller Cluster.

    If the controller nodes are deployed before the UDLR Control VM, the forwarding tables on the UDLR are flushed. This causes downtime, and possibly communication outages, immediately after the first controller node is deployed at site 2. To avoid this situation, deploy the UDLR Control VM before deploying the NSX Controller nodes.
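    If you perform this promotion from a script instead of the vSphere Web Client, the universalsync role endpoints can be used. The paths and action values below are based on the cross-vCenter section of the NSX API Guide; confirm them for your NSX version, and treat the hostname and credentials as placeholders.

      # Sketch: disconnect the secondary NSX Manager from the failed primary
      # (Transit role), then assign it the primary role.
      import requests

      SECONDARY_MANAGER = "nsxmgr-site2.example.com"   # placeholder FQDN
      AUTH = ("admin", "changeme")                     # placeholder credentials

      def nsx_post(path):
          resp = requests.post(f"https://{SECONDARY_MANAGER}{path}",
                               auth=AUTH, verify=False, timeout=60)
          resp.raise_for_status()

      # Equivalent of substep 3: disconnect from the primary NSX Manager.
      nsx_post("/api/2.0/universalsync/configuration/role?action=set-as-standalone")
      # Equivalent of substep 4: assign the primary role.
      nsx_post("/api/2.0/universalsync/configuration/role?action=set-as-primary")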

  3. Power on the NSX Edges that are in the powered-down state, and deploy the UDLR Control VM (Edge Appliance VM) at the secondary site 2 (promoted primary).
    For instructions about deploying the UDLR Control VM, see the NSX Cross-vCenter Installation Guide. An API-based sketch also follows the note at the end of this step.
    While deploying the UDLR Control VM, configure the following resource settings:
    • Select the data center as site 2.
    • Select the cluster/resource pool.
    • Select the datastore.
    Note: After deploying the UDLR Control VM, the following configuration settings are automatically recovered at site 2:
    • BGP protocol routing configuration
    • BGP password configuration
    • Uplink and internal interface settings
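    If you prefer to redeploy the UDLR Control VM through the API, an appliance configuration can be pushed to the universal DLR edge. The path and XML shape below are assumptions based on the NSX edge appliances API, and the edge, cluster, and datastore IDs are placeholders; verify all of them against the NSX API Guide for your version.

      # Sketch: deploy the UDLR Control VM by updating the appliance
      # configuration of the universal DLR edge with site 2 resources.
      import requests

      NSX_MANAGER = "nsxmgr-site2.example.com"   # placeholder FQDN
      AUTH = ("admin", "changeme")               # placeholder credentials
      UDLR_EDGE_ID = "edge-1"                    # placeholder universal DLR edge ID

      APPLIANCES_XML = """<appliances>
        <applianceSize>compact</applianceSize>
        <appliance>
          <resourcePoolId>domain-c123</resourcePoolId>  <!-- placeholder site 2 cluster -->
          <datastoreId>datastore-456</datastoreId>      <!-- placeholder site 2 datastore -->
        </appliance>
      </appliances>"""

      resp = requests.put(
          f"https://{NSX_MANAGER}/api/4.0/edges/{UDLR_EDGE_ID}/appliances",
          data=APPLIANCES_XML,
          headers={"Content-Type": "application/xml"},
          auth=AUTH, verify=False, timeout=120)
      resp.raise_for_status()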
  4. Deploy the three NSX Controller Cluster nodes at site 2 (promoted primary).
    For detailed instructions about deploying NSX Controllers, see the NSX Cross-vCenter Installation Guide.
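    If you deploy the controller nodes through the API instead, a controllerSpec can be posted once per node (deploy one node at a time and wait for each deployment to finish). The path and spec fields below follow the NSX controller API, but treat the resource IDs, network ID, IP pool ID, and password as placeholders and confirm the payload against the NSX API Guide for your version.

      # Sketch: deploy a single controller node at site 2; repeat for the
      # remaining two nodes after each deployment completes.
      import requests

      NSX_MANAGER = "nsxmgr-site2.example.com"   # placeholder FQDN
      AUTH = ("admin", "changeme")               # placeholder credentials

      CONTROLLER_SPEC = """<controllerSpec>
        <name>nsx-controller-site2-01</name>
        <ipPoolId>ipaddresspool-1</ipPoolId>
        <resourcePoolId>domain-c123</resourcePoolId>
        <datastoreId>datastore-456</datastoreId>
        <networkId>dvportgroup-789</networkId>
        <deployType>medium</deployType>
        <password>controller-cli-password</password>
      </controllerSpec>"""

      resp = requests.post(
          f"https://{NSX_MANAGER}/api/2.0/vdn/controller",
          data=CONTROLLER_SPEC,
          headers={"Content-Type": "application/xml"},
          auth=AUTH, verify=False, timeout=120)
      resp.raise_for_status()
      print(resp.text)   # the response references the deployment job to poll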
  5. Update the NSX Controller Cluster state.
    1. On the Installation and Upgrade page, click NSX Managers.
    2. Select the promoted primary NSX Manager.
    3. Click Actions > Update Controller State.
  6. Force sync the routing service on each cluster at site 2.
    1. On the Installation and Upgrade page, click Host Preparation.
    2. Select the promoted primary NSX Manager.
    3. Select one cluster at a time, and then click Actions > Force Sync Services.
    4. Select Routing, and click OK.
  7. Migrate the workload VMs from site 1 to site 2.
    Note: The workload VMs continue to exist at site 1. Therefore, you must manually migrate the workload VMs to site 2.

Results

The manual recovery of the NSX components and the failover from the primary site (site 1) to the secondary site (site 2) are complete.

What to do next

Verify that the failover to site 2 is fully complete by doing these steps at site 2 (the promoted primary site):
  1. Check whether the NSX Manager has the primary role.
  2. Check whether the Control VM (Edge Appliance VM) is deployed on the UDLR.
  3. Check whether the status of all controller cluster nodes is Connected.
  4. Check whether the host preparation status is Green.
  5. Log in to the CLI console of the UDLR Control VM (Edge Appliance VM), and do these steps:
    1. Check whether all BGP neighbors are established and the status is UP by running the show ip bgp neighbors command.
    2. Check whether all BGP routes are being learned from all BGP neighbors by running the show ip route bgp command.
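    The same verification can be scripted over SSH, assuming SSH access to the UDLR Control VM is enabled and the CLI accepts non-interactive commands (otherwise use an interactive shell channel). The host and credentials below are placeholders.

      # Sketch: run the BGP verification commands on the UDLR Control VM over SSH.
      import paramiko

      HOST = "udlr-control-vm.example.com"   # placeholder address
      USER, PASSWORD = "admin", "changeme"   # placeholder credentials

      client = paramiko.SSHClient()
      client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
      client.connect(HOST, username=USER, password=PASSWORD)

      for cmd in ("show ip bgp neighbors", "show ip route bgp"):
          stdin, stdout, stderr = client.exec_command(cmd)
          print(f"### {cmd}\n{stdout.read().decode()}")

      client.close()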

After a complete failover to site 2, all workloads run on the secondary site (promoted primary) and traffic is routed through the UDLR and the NSX Edges at site 2.