High Availability Design for the NSX Edge Nodes for the Management Domain

The NSX Edge cluster runs on the default management vSphere cluster of VMware Cloud Foundation. vSphere HA and vSphere DRS protect the NSX Edge appliances. In an environment with multiple availability zones, you use vSphere DRS to configure the first availability zone as the main location for the NSX Edge nodes.

NSX Edge Cluster Design

The NSX Edge cluster is a logical grouping of NSX Edge transport nodes. These NSX Edge appliances run on a vSphere cluster and provide north-south routing and network services for the management workloads. You can dedicate this vSphere cluster to edge appliances only, or share it with the other management appliances.

Default management vSphere cluster: The default management vSphere cluster contains all components for managing VMware Cloud Foundation. See the vSphere Cluster Design for the Management Domain.
Dedicated edge vSphere cluster: A dedicated edge vSphere cluster contains only NSX Edge appliances for the management domain.

Table 1. Design Decisions on the NSX Edge Cluster Configuration
Decision ID	Design Decision	Design Justification	Design Implications
VCF-MGMT-NSX-EDGE-CFG-002	Deploy the NSX Edge virtual appliances in the default management vSphere cluster, sharing the cluster between the management workloads and the edge appliances.	Because of the prescriptive nature of the management domain, resource contention from unknown workloads is minimized. Simplifies the configuration and minimizes the number of ESXi hosts required for initial deployment. Keeps the management components located in the same domain and cluster, isolated from tenant workloads.	None.
VCF-MGMT-NSX-EDGE-CFG-003	Deploy two NSX Edge appliances in an edge cluster in the default vSphere cluster in the management domain.	Creates the NSX Edge cluster for satisfying the requirements for availability and scale.	None.
VCF-MGMT-NSX-EDGE-CFG-004	Apply VM-VM anti-affinity rules for vSphere DRS to the virtual machines of the NSX Edge cluster.	Keeps the NSX Edge nodes running on different ESXi hosts for high availability.	None.
VCF-MGMT-NSX-EDGE-CFG-005	In vSphere HA, set the restart priority policy for each NSX Edge appliance to high.	The NSX Edge nodes are part of the north-south data path for overlay segments for management components. vSphere HA restarts the NSX Edge appliances first so that other virtual machines that are being powered on or migrated by using vSphere vMotion while the edge nodes are offline lose connectivity only for a short time. Setting the restart priority to high reserves the highest setting for future needs.	If the restart priority for another management appliance is set to highest, the connectivity delays for management appliances will be longer.
VCF-MGMT-NSX-EDGE-CFG-006	Configure all edge nodes as transport nodes.	Enables the participation of edge nodes in the overlay network for delivery of services to the SDDC management components such as routing and load balancing.	None.
VCF-MGMT-NSX-EDGE-CFG-007	Create an NSX Edge cluster with the default Bidirectional Forwarding Detection (BFD) configuration between the NSX Edge nodes in the cluster.	Satisfies the availability requirements by default. Edge nodes must remain available to create services such as NAT, routing to physical networks, and load balancing.	None.
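
The placement and restart decisions in Table 1 can be checked against an inventory export. The following is a minimal sketch in plain Python, assuming a hypothetical EdgeVm record that holds the current ESXi host and the vSphere HA restart priority of each appliance. The record, the function name, and the host names are illustrative and do not come from any VMware API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeVm:
    """Hypothetical inventory record for one NSX Edge appliance."""
    name: str
    host: str              # ESXi host the VM currently runs on
    restart_priority: str  # vSphere HA restart priority, for example "high"


def check_edge_cluster(vms):
    """Return a list of findings; an empty list means all checks passed."""
    findings = []
    # The design deploys two NSX Edge appliances in the edge cluster.
    if len(vms) < 2:
        findings.append("edge cluster has fewer than two appliances")
    # Anti-affinity: no two edge appliances on the same ESXi host.
    hosts = Counter(vm.host for vm in vms)
    for host, count in sorted(hosts.items()):
        if count > 1:
            findings.append(f"{count} edge appliances share host {host}")
    # Restart priority: the design sets it to high for every edge appliance.
    for vm in vms:
        if vm.restart_priority != "high":
            findings.append(f"{vm.name} has restart priority {vm.restart_priority}")
    return findings
```

A check like this only reports drift from the design. The anti-affinity rule and the restart priority themselves are still configured in vSphere DRS and vSphere HA.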

High Availability for a Single VMware Cloud Foundation Instance with Multiple Availability Zones

NSX Edge nodes connect to top of rack switches in each data center to support northbound uplinks and route peering for SDN network advertisement. This connection is specific to the top of rack switch that the edge node is connected to.

If an outage of an availability zone occurs, vSphere HA fails over the edge appliances to the other availability zone. The second availability zone must provide network infrastructure equivalent to the infrastructure that the edge node is connected to in the first availability zone.

To support failover of the NSX Edge appliances, the following networks are stretched across the first and second availability zones. For information about all networks in a management domain with multiple availability zones, see Physical Network Infrastructure Design for NSX-T Data Center for the Management Domain.

Table 2. Networks That Are Stretched Across Availability Zones
Function	HA Layer 3 Gateway - Across Availability Zones
Management for the first availability zone	✓
Uplink01	x
Uplink02	x
Edge overlay	✓

Note:

The VLAN ID and Layer 3 network must be the same across both the availability zones. Additionally, the Layer 3 gateway at the first hop must be highly available such that it tolerates the failure of an entire availability zone.
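
The requirement in this note can be expressed as a simple comparison. The sketch below assumes a hypothetical Segment record that describes one network as it is presented in one availability zone. The record, the function name, and any VLAN IDs or subnets passed to it are illustrative values, not values from this design.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical description of one network as seen in one availability zone."""
    vlan_id: int
    subnet: str  # CIDR notation, for example "172.16.31.0/24"


def is_stretched(az1: Segment, az2: Segment) -> bool:
    """True when both zones present the same VLAN ID and the same Layer 3 network."""
    same_vlan = az1.vlan_id == az2.vlan_id
    # Compare parsed networks so that equivalent notations match.
    same_subnet = ipaddress.ip_network(az1.subnet) == ipaddress.ip_network(az2.subnet)
    return same_vlan and same_subnet
```

The comparison does not cover the highly available first-hop gateway, which depends on the physical network and has to be verified there.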

Table 3. Design Decisions on High Availability of the NSX Edge Nodes for a Single VMware Cloud Foundation Instance with Multiple Availability Zones
Decision ID	Design Decision	Design Justification	Design Implications
VCF-MGMT-NSX-EDGE-CFG-008	Add the NSX Edge appliances to the virtual machine group for the first availability zone.	Ensures that, by default, vSphere powers on the NSX Edge appliances on a host in the first availability zone.	None.
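
To confirm that the edge appliances are running in the first availability zone during normal operation, you can compare the current placement with a host-to-zone mapping. This is a minimal sketch that assumes two hypothetical dictionaries, one mapping VM names to hosts and one mapping hosts to zone labels. The zone label "az1" is also an assumption.

```python
def vms_outside_zone(placement, host_zone, zone="az1"):
    """Return the sorted names of VMs whose current host is not in the given zone.

    placement maps a VM name to the ESXi host it runs on.
    host_zone maps an ESXi host to its availability zone label.
    A host that is missing from host_zone counts as outside the zone.
    """
    return sorted(vm for vm, host in placement.items() if host_zone.get(host) != zone)
```

After a failover to the second availability zone, the function is expected to list the edge appliances, so a non-empty result is not necessarily a misconfiguration.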

High Availability for Multiple VMware Cloud Foundation Instances

In an environment with multiple VMware Cloud Foundation instances, each instance has its own NSX Edge cluster for the management domain. In each instance, the edge nodes and cluster are deployed with the same design but with instance-specific settings such as IP addressing, VLAN IDs, and names. Each edge cluster is managed by the NSX Local Manager instance for the management domain.

Workload traffic between VMware Cloud Foundation instances traverses the inter-instance overlay tunnel, which terminates on the remote tunnel endpoints (RTEPs) on the NSX Edge nodes. This tunnel is the data plane for inter-instance traffic.

Take into account the following considerations:

The RTEP network segment has a VLAN ID and Layer 3 range that are specific to the individual data center fault domain.
If a VMware Cloud Foundation instance is deployed with multiple availability zones, the RTEP network segment must be stretched between the zones with the same VLAN ID and IP range. Additionally, the Layer 3 gateway at the first hop must be highly available such that it tolerates the failure of an entire availability zone.
With multiple VMware Cloud Foundation instances, each instance requires an Edge RTEP VLAN configured with a VLAN ID and IP range that are appropriate for that instance.

Table 4. Edge RTEP VLAN Configuration for the Management Domains for Multiple VMware Cloud Foundation Instances
Function	First Availability Zone	Second Availability Zone	High Availability Layer 3 Gateway
Edge RTEP in the first VMware Cloud Foundation instance	✓	✓	✓
Edge RTEP in the second VMware Cloud Foundation instance	✓	✓	✓
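
The RTEP considerations above can be sketched as a consistency check over a plan of RTEP segments. The data shape, the function name, and the instance names used with it are hypothetical. The check that ranges of different instances do not overlap is an assumption drawn from the statement that each segment is specific to its data center fault domain.

```python
import ipaddress
from itertools import combinations


def rtep_findings(instances):
    """Return a list of findings for a plan of Edge RTEP segments.

    instances maps an instance name to {zone name: (vlan_id, cidr)}.
    """
    findings = []
    for name, zones in sorted(instances.items()):
        # Within one instance, every zone must present the same VLAN ID and IP range.
        if len(set(zones.values())) > 1:
            findings.append(f"{name}: RTEP segment differs between availability zones")
    # Assumption: RTEP ranges of different instances should not overlap.
    for (a, zones_a), (b, zones_b) in combinations(sorted(instances.items()), 2):
        nets_a = {ipaddress.ip_network(cidr) for _, cidr in zones_a.values()}
        nets_b = {ipaddress.ip_network(cidr) for _, cidr in zones_b.values()}
        if any(x.overlaps(y) for x in nets_a for y in nets_b):
            findings.append(f"{a} and {b}: RTEP ranges overlap")
    return findings
```

An empty result means the plan is internally consistent. It does not verify that the first-hop Layer 3 gateway is highly available.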