NSX-T Data Center supports multisite deployments where you can manage all the sites from one NSX Manager cluster.

Two types of multisite deployments are supported:
  • Disaster recovery
  • Active-active

The following diagram illustrates a disaster recovery deployment.



In an active-active deployment, all the sites are active and layer 2 traffic crosses the site boundaries. In a disaster recovery deployment, NSX-T Data Center at the primary site handles networking for the enterprise, and the secondary site stands by to take over if a catastrophic failure occurs at the primary site.

The following diagram illustrates an active-active deployment.



For either type of deployment, you can configure the two sites for automatic or manual/scripted recovery of the management plane and the data plane.

Automatic Recovery of the Management Plane

Requirements:
  • A vSphere cluster stretched across the two sites, with vSphere HA configured.
  • A stretched management VLAN.

The NSX Manager cluster is deployed on the management VLAN and physically resides in the primary site. If the primary site fails, vSphere HA restarts the NSX Managers in the secondary site, and all the transport nodes reconnect to the restarted NSX Managers automatically. This process takes about 10 minutes. During this time, the management plane is not available, but the data plane is not impacted.
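For example, after vSphere HA restarts the NSX Managers in the secondary site, you can confirm that the management cluster has reformed by polling the cluster status API and waiting for its status fields (such as mgmt_cluster_status) to report STABLE before making configuration changes. This is a minimal check, not a full validation of the recovery:

    GET /api/v1/cluster/status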

The following diagram illustrates automatic recovery of the management plane.

Automatic Recovery of the Data Plane

Requirements:
  • The latency between Edge nodes must not exceed 10 ms.
  • The HA mode for the tier-0 gateway must be active-standby, and the failover mode must be preemptive.

Note: The failover mode of the tier-1 gateway can be preemptive or non-preemptive.

Configuration steps:
  • Using the API, create failure domains for the two sites, for example, FD1A-Preferred_Site1 and FD2A-Preferred_Site1. Set the parameter preferred_active_edge_services to true for the primary site and set it to false for the secondary site.
    POST /api/v1/failure-domains
    {
        "display_name": "FD1A-Preferred_Site1",
        "preferred_active_edge_services": true
    }

    POST /api/v1/failure-domains
    {
        "display_name": "FD2A-Preferred_Site1",
        "preferred_active_edge_services": false
    }
  • Using the API, configure an Edge cluster that is stretched across the two sites. For example, the cluster has Edge nodes EdgeNode1A and EdgeNode1B in the primary site, and Edge nodes EdgeNode2A and EdgeNode2B in the secondary site. The active tier-0 and tier-1 gateways will run on EdgeNode1A and EdgeNode1B. The standby tier-0 and tier-1 gateways will run on EdgeNode2A and EdgeNode2B.
  • Using the API, associate each Edge node with the failure domain for its site. First call the GET /api/v1/transport-nodes/<transport-node-id> API to get the data about the Edge node. Use the result of the GET API as the input for the PUT /api/v1/transport-nodes/<transport-node-id> API, with the additional property failure_domain_id set to the ID of the failure domain for the node's site. For example:
    GET /api/v1/transport-nodes/<transport-node-id>
    Response:
    {
        "resource_type": "TransportNode",
        "description": "Updated NSX configured Test Transport Node",
        "id": "77816de2-39c3-436c-b891-54d31f580961",
        ...
    }
    
    PUT /api/v1/transport-nodes/<transport-node-id>
    {
        "resource_type": "TransportNode",
        "description": "Updated NSX configured Test Transport Node",
        "id": "77816de2-39c3-436c-b891-54d31f580961",
        ...
        "failure_domain_id": "<UUID>",
    }
  • Using the API, configure the Edge cluster to allocate nodes based on failure domain. First call the GET /api/v1/edge-clusters/<edge-cluster-id> API to get the data about the Edge cluster. Use the result of the GET API as the input for the PUT /api/v1/edge-clusters/<edge-cluster-id> API, with the additional property allocation_rules set to allocate based on failure domain. For example:
    GET /api/v1/edge-clusters/<edge-cluster-id>
    Response:
    {
        "_revision": 0,
        "id": "bf8d4daf-93f6-4c23-af38-63f6d372e14e",
        "resource_type": "EdgeCluster",
        ...
    }
    
    PUT /api/v1/edge-clusters/<edge-cluster-id>
    {
        "_revision": 0,
        "id": "bf8d4daf-93f6-4c23-af38-63f6d372e14e",
        "resource_type": "EdgeCluster",
        ...
        "allocation_rules": [
            {
                "action": {
                          "enabled": true,
                          "action_type": "AllocationBasedOnFailureDomain"
                          }
            }
        ],
    }
  • Create tier-0 and tier-1 gateways using the API or NSX Manager UI.
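For the last step, the tier-0 gateway must use active-standby HA with preemptive failover so that services return to the preferred site after a recovery. The following Policy API call is a minimal sketch; the identifier Tier0-Stretched is an example name, and the gateway must still be associated with the stretched Edge cluster (through its locale services or the NSX Manager UI):

    PATCH /policy/api/v1/infra/tier-0s/Tier0-Stretched
    {
        "display_name": "Tier0-Stretched",
        "ha_mode": "ACTIVE_STANDBY",
        "failover_mode": "PREEMPTIVE"
    }

Tier-1 gateways can be created in the same way with PATCH /policy/api/v1/infra/tier-1s/<tier-1-id>.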

When an Edge node in the primary site fails, the tier-0 and tier-1 gateways hosted on that node will be migrated to an Edge node in the secondary site.

The following diagram illustrates automatic recovery of the data plane.

Manual/Scripted Recovery of the Management Plane

Requirements:
  • DNS for NSX Managers with a short TTL (for example, 5 minutes).
  • Continuous backup.

Neither vSphere HA nor a stretched management VLAN is required. The NSX Managers must be associated with a DNS name that has a short TTL. All transport nodes (Edge nodes and hypervisors) must connect to the NSX Managers using this DNS name. To save time, you can optionally pre-install an NSX Manager cluster in the secondary site.
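For example, you can publish the FQDN of the NSX Managers so that transport nodes register with the managers by DNS name instead of IP address. The _revision value must match the current revision returned by GET /api/v1/configs/management:

    PUT /api/v1/configs/management
    {
        "publish_fqdns": true,
        "_revision": 0
    }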

The recovery steps are:
  1. Change the DNS record so that the NSX Manager FQDN resolves to the IP addresses of the NSX Manager cluster in the recovery site.
  2. Restore the NSX Manager cluster from a backup.
  3. Connect the transport nodes to the new NSX Manager cluster.

The following diagram illustrates manual/scripted recovery of the management plane.

Manual/Scripted Recovery of the Data Plane

Requirement:
  • The latency between Edge nodes must not exceed 150 ms.

The Edge nodes can be VMs or bare metal. The tier-0 gateway can be active-standby or active-active. Edge node VMs can be installed in different vCenter Servers. No vSphere HA is required.

The recovery steps are:
  1. Create a standby tier-0 gateway on an existing Edge cluster in the disaster recovery (DR) site.
  2. Using the API, move the tier-1 gateways that are connected to a tier-0 gateway to the tier-0 gateway in the DR site, as shown in the example after these steps.
  3. Using the API, move the standalone tier-1 gateways to the DR site.
  4. Using the API, move the layer-2 bridges to the DR site.
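For step 2, a tier-1 gateway is moved by re-pointing its tier0_path at the tier-0 gateway created in the DR site. The following Policy API call is a minimal sketch; the identifiers Tier1-Web and Tier0-DR are example names, and a tier-1 gateway that runs stateful services must also have its Edge cluster association (locale services) moved to an Edge cluster in the DR site:

    PATCH /policy/api/v1/infra/tier-1s/Tier1-Web
    {
        "tier0_path": "/infra/tier-0s/Tier0-DR"
    }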

The following diagram illustrates manual/scripted recovery of the data plane.

Requirements for Multisite Deployments

Inter-site Communication
  • The bandwidth must be at least 1 Gbps and the latency (RTT) must be less than 150 ms.
  • The MTU must be at least 1600; 9000 is recommended.
NSX Manager Configuration
  • Automatic backup, triggered when the NSX-T Data Center configuration changes, must be enabled (see the example following these requirements).
  • NSX Manager must be set up to use FQDN.
Data Plane Recovery
  • The same internet provider must be used if public IP addresses are exposed through services such as NAT or load balancing.
  • The HA mode for the tier-0 gateway must be active-standby, and the failover mode must be preemptive.
Cloud Management System
  • The cloud management system (CMS) must support an NSX-T Data Center plug-in. In this release, VMware Integrated OpenStack (VIO) and vRealize Automation (vRA) satisfy this requirement.
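To meet the automatic backup requirement, you can enable scheduled backups to a remote file server through the API. The following call is a sketch only; the SFTP server, fingerprint, credentials, schedule, and passphrase are placeholders that must be replaced with site-specific values:

    PUT /api/v1/cluster/backups/config
    {
        "backup_enabled": true,
        "backup_schedule": {
            "resource_type": "IntervalBackupSchedule",
            "seconds_between_backups": 3600
        },
        "remote_file_server": {
            "server": "sftp.example.com",
            "port": 22,
            "directory_path": "/nsx-backups",
            "protocol": {
                "protocol_name": "sftp",
                "ssh_fingerprint": "SHA256:<fingerprint>",
                "authentication_scheme": {
                    "scheme_name": "PASSWORD",
                    "username": "backup_user",
                    "password": "<password>"
                }
            }
        },
        "passphrase": "<passphrase>"
    }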

Limitations

  • No local-egress capabilities. All north-south traffic must occur within one site.
  • The compute disaster recovery software must support NSX-T Data Center, for example, VMware SRM 8.1.2 or later.