In this scenario, primary site 1 is down, either due to scheduled maintenance or an unplanned power failure. All workloads are running at the secondary site 2 (the promoted primary site), and traffic is routed through the UDLR and the NSX Edges at site 2. Now that the original primary site 1 is back up, the NSX administrator wants to recover the NSX components and restore all workloads at the original primary site 1.
The NSX administrator wants to meet the following key objectives:
- Achieve a full failback of all workloads from site 2 to original primary site 1 with minimal downtime.
- Retain the application IP addresses after failback to site 1.
- Automatically recover all Edge interface settings and BGP protocol configuration settings at site 1.
Note:
- The administrator can perform the failback tasks manually by using either the vSphere Web Client or the NSX REST APIs. In addition, the administrator can automate some failback tasks by running a script file that contains the API calls to run during the failback. This scenario explains the manual failback steps using the vSphere Web Client. However, if any step requires the use of either the CLI or the NSX REST APIs, adequate instructions are provided.
- In this scenario, the disaster recovery workflow is specific to the topology explained earlier, which has a primary NSX Manager and a single secondary NSX Manager. The workflow with multiple secondary NSX Managers is not in the scope of this scenario.
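For administrators who choose API-driven automation instead of the vSphere Web Client, a script typically begins by confirming each NSX Manager's current universal sync role before any role reassignment. The sketch below only builds the request for that query; the manager host name and credentials are hypothetical, and the endpoint path is based on the NSX Data Center for vSphere REST API and should be verified against the API guide for your version.

```python
import base64

# Hypothetical NSX Manager address for illustration; substitute your own.
SITE1_NSX_MANAGER = "nsxmgr-site1.corp.local"

def build_role_request(nsx_manager, user, password):
    """Build the URL and headers for querying an NSX Manager's
    universal sync role (for example, PRIMARY or SECONDARY).

    The endpoint path is an assumption based on the NSX-v REST API;
    confirm it in the API guide for your NSX version."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    url = f"https://{nsx_manager}/api/2.0/universalsync/configuration/role"
    headers = {"Authorization": f"Basic {token}", "Accept": "application/xml"}
    return url, headers

url, headers = build_role_request(SITE1_NSX_MANAGER, "admin", "secret")
print(url)
```

A failback script would issue this request against both managers first, then proceed with role changes only if the reported roles match the expected pre-failback state.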
Prerequisites
- NSX Data Center 6.4.5 or later is installed at both sites 1 and 2.
- The vCenter Server systems at sites 1 and 2 are deployed with Enhanced Linked Mode.
- At site 1 and site 2, the following conditions are met:
- No application-specific security policies are configured on a non-NSX firewall, if any.
- No application-specific firewall rules are configured on a non-NSX firewall, if any.
- The firewall is disabled on both ESGs because ECMP is enabled on the UDLRs, which also ensures that all traffic is allowed.
- At site 2 (promoted primary), no changes are made in the universal logical components before initiating the failback process.
Procedure
Results
The manual failback of all NSX components and workloads from the secondary site (site 2) to the primary site (site 1) is complete.
What to do next
Verify whether the failback to primary site 1 is fully complete by performing these steps at site 1:
- Check whether the NSX Manager has the primary role.
- Check whether the Control VM (Edge Appliance VM) is deployed on the UDLR.
- Check whether the status of all controller cluster nodes is Connected.
- Perform a Communication Health Check on each host cluster that is prepared for NSX.
- Navigate to .
- Select the NSX Manager at site 1.
- Select one cluster at a time, and check whether the Communication Channel Health status of the cluster is UP.
- For each host in the cluster, check whether the Communication Channel Health status of the host is UP.
- Check whether the host preparation status is Green.
- Log in to the CLI console of the UDLR Control VM (Edge Appliance VM), and perform these checks:
- Check whether all BGP neighbors are established and the status is UP by running the show ip bgp neighbors command.
- Check whether all BGP routes are being learned from all BGP neighbors by running the show ip route bgp command.
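The controller-cluster check in the list above can also be scripted against the NSX Manager REST API. The sketch below only parses a response; the XML fragment is an illustrative shape of a `GET /api/2.0/vdn/controller` response (controller IDs are made up, and the real payload contains additional fields), so validate the actual schema against your version's API guide.

```python
import xml.etree.ElementTree as ET

# Illustrative response shape; controller IDs are made up and the real
# payload contains additional fields per controller.
SAMPLE = """<controllers>
  <controller><id>controller-1</id><status>RUNNING</status></controller>
  <controller><id>controller-2</id><status>RUNNING</status></controller>
  <controller><id>controller-3</id><status>RUNNING</status></controller>
</controllers>"""

def controllers_healthy(xml_text):
    """Return True only if every controller node reports RUNNING status."""
    root = ET.fromstring(xml_text)
    statuses = [c.findtext("status") for c in root.findall("controller")]
    return bool(statuses) and all(s == "RUNNING" for s in statuses)

print(controllers_healthy(SAMPLE))  # → True
```

A verification script would fetch the live response from the NSX Manager at site 1 and alert if this function returns False for any node.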
After a complete failback to site 1, all workloads run on the primary site 1 and traffic is routed through the UDLR and the NSX Edges at site 1.
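The BGP checks performed on the UDLR Control VM can likewise be automated by capturing the `show ip bgp neighbors` output and parsing the neighbor state lines. The output fragment below is an abbreviated illustration (neighbor addresses and AS numbers are made up, and the real command prints many more fields), so adjust the parsing to the exact output format of your NSX version.

```python
# Abbreviated, illustrative fragment of `show ip bgp neighbors` output;
# addresses and AS numbers are made up.
SAMPLE_OUTPUT = """\
BGP neighbor is 192.168.10.1, remote AS 65001,
BGP state = Established, up
BGP neighbor is 192.168.10.2, remote AS 65001,
BGP state = Established, up
"""

def all_neighbors_established(cli_output):
    """Return True only if every 'BGP state' line reports Established."""
    states = [line.split("=", 1)[1].strip()
              for line in cli_output.splitlines()
              if line.startswith("BGP state")]
    return bool(states) and all(s.startswith("Established") for s in states)

print(all_neighbors_established(SAMPLE_OUTPUT))  # → True
```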