High Availability (HA) ensures that the services provided by NSX Edge appliances are available even when a hardware or software failure renders a single appliance unavailable. NSX Edge HA minimizes failover downtime instead of delivering zero downtime, as the failover between appliances might require some services to be restarted.

For example, NSX Edge HA synchronizes the connection tracker of the stateful firewall and the stateful information held by the load balancer, but the time required to bring all services back up is not zero. Examples of known service restart impacts include a non-zero downtime with dynamic routing when an NSX Edge is operating as a router.

Sometimes, the two NSX Edge HA appliances are unable to communicate, and each unilaterally decides to become active. This behavior is expected, because it maintains availability of the active NSX Edge services if the standby NSX Edge is unavailable. If the other appliance still exists when communication is re-established, the two NSX Edge HA appliances renegotiate active and standby status. If this negotiation does not finish and both appliances declare that they are active after connectivity is re-established, unexpected behavior is observed. This condition, known as split brain, can occur under the following environmental conditions:

  • Physical network connectivity issues, including a network partition.
  • CPU or memory contention on the NSX Edge.
  • Transient storage problems that might cause at least one NSX Edge HA VM to become unavailable.

    For example, an improvement in NSX Edge HA stability and performance is observed when the VMs are moved off overprovisioned storage. In particular, during large overnight backups, large spikes in storage latency can impact NSX Edge HA stability.

  • Congestion on the physical or virtual network adapter involved with the exchange of packets.

In addition to environmental issues, a split-brain condition is observed when the HA configuration engine falls into a bad state or when the HA daemon fails.
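
The split-brain condition described above can be illustrated with a minimal sketch. It assumes that each appliance in the pair reports its HA state as the string "active" or "standby". The function name classify_ha_pair and the returned labels are hypothetical and are not part of any NSX API.

```python
def classify_ha_pair(state_a: str, state_b: str) -> str:
    """Classify an HA pair from the state each appliance reports.

    Hypothetical helper for illustration only; the state strings are assumptions.
    """
    states = sorted([state_a, state_b])
    if states == ["active", "active"]:
        # Both appliances declare themselves active: split brain.
        return "split-brain"
    if states == ["active", "standby"]:
        # Exactly one active and one standby appliance: the expected case.
        return "healthy"
    # Any other combination (for example, no active appliance) needs attention.
    return "degraded"
```

A monitoring script could call classify_ha_pair("active", "active") after connectivity is restored and raise an alert if the result is "split-brain".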

Stateful High Availability

The primary NSX Edge appliance is in the active state and the secondary appliance is in the standby state. NSX Manager replicates the configuration of the primary appliance for the standby appliance, or you can manually add two appliances. Create the primary and secondary appliances on separate resource pools and datastores. If you create the primary and secondary appliances on the same datastore, the datastore must be shared across all hosts in the cluster for the HA appliance pair to be deployed on different ESXi hosts. If the datastore is local storage, both virtual machines are deployed on the same host.

All NSX Edge services run on the active appliance. The primary appliance maintains a heartbeat with the standby appliance and sends service updates through an internal interface.

If a heartbeat is not received from the primary appliance within the specified time (default value is 15 seconds), the primary appliance is declared dead. The standby appliance moves to the active state, takes over the interface configuration of the primary appliance, and starts the NSX Edge services that were running on the primary appliance. When the switchover takes place, a system event is displayed in the System Events tab of Settings & Reports. Load Balancer and VPN services need to re-establish TCP connections with NSX Edge, so service is disrupted for a short while. Logical switch connections and firewall sessions are synchronized between the primary and standby appliances. However, service is disrupted during the switchover while waiting for the standby appliance to become active and take over.
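
The declare-dead behavior described above can be sketched as a simple timer. This is an illustrative model only, not the NSX Edge implementation: the class name StandbyMonitor, its methods, and the state strings are hypothetical, and the only value taken from the text is the 15-second default.

```python
from dataclasses import dataclass

DEAD_TIME_SECONDS = 15.0  # default declare-dead time stated in the text


@dataclass
class StandbyMonitor:
    """Hypothetical model of the standby appliance watching its peer."""

    dead_time: float = DEAD_TIME_SECONDS
    last_heartbeat: float = 0.0
    state: str = "standby"

    def on_heartbeat(self, now: float) -> None:
        # Record the time at which the latest heartbeat arrived.
        self.last_heartbeat = now

    def tick(self, now: float) -> str:
        # Move to active once no heartbeat has arrived within dead_time.
        if self.state == "standby" and now - self.last_heartbeat > self.dead_time:
            self.state = "active"
        return self.state


monitor = StandbyMonitor()
monitor.on_heartbeat(now=100.0)
state_before = monitor.tick(now=110.0)  # 10 s of silence: still standby
state_after = monitor.tick(now=116.0)   # 16 s of silence: takes over
```

In this sketch, a shorter dead_time would speed up failover but would also make the pair more sensitive to the transient network or storage delays listed earlier.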

If the NSX Edge appliance fails and a bad state is reported, HA force syncs the failed appliance to revive it. When revived, it takes on the configuration of the now-active appliance and stays in a standby state. If the NSX Edge appliance is dead, you must delete the appliance and add a new one.

NSX Edge ensures that the two HA NSX Edge virtual machines are not on the same ESXi host even after you use DRS and vMotion (unless you manually vMotion them to the same host). Two virtual machines are deployed on vCenter in the same resource pool and datastore as the appliance you configured. Local link IPs are assigned to HA virtual machines in the NSX Edge HA so that they can communicate. You can specify management IP addresses to override the local links.

If syslog servers are configured, logs in the active appliance are sent to the syslog servers.

High Availability in a Cross-vCenter NSX Environment

If you enable high availability on an NSX Edge in a cross-vCenter NSX environment, both the active and standby NSX Edge Appliances must reside in the same vCenter Server. If you migrate one of the appliances of an NSX Edge HA pair to a different vCenter Server, the two HA appliances no longer operate as an HA pair, and you might experience traffic disruption.

vSphere High Availability

NSX Edge HA is compatible with vSphere HA. If the host on which an NSX Edge instance is running fails, the NSX Edge is restarted on the standby host, ensuring that the NSX Edge HA pair is still available to take another failover.

If vSphere HA is not enabled, the active-standby NSX Edge HA pair survives one failover. However, if another failover happens before the second appliance of the HA pair is restored, NSX Edge availability can be compromised.

For more information on vSphere HA, see vSphere Availability.