This chapter reviews the typical failure scenarios that might affect components of the NSX routing subsystem and outlines the effects of these failures.

NSX Manager

Table 1. NSX Manager Failure Modes and Effects

Failure Mode

Failure Effects

Loss of network connectivity to NSX Manager VM

  • Total outage of all NSX Manager functions, including CRUD for NSX routing/bridging

  • No configuration data loss

  • No data or control-plane outage

Loss of network connectivity between NSX Manager and ESXi hosts or RabbitMQ server failure

  • If a DLR Control VM or ESG is running on the affected hosts, CRUD operations on it fail

  • Creation and deletion of DLR instances on affected hosts fail

  • No configuration data loss

  • No data or control-plane outage

  • Any dynamic routing updates continue to work

Loss of network connectivity between NSX Manager and Controllers

  • Create, update, and delete operations for NSX distributed routing and bridging fail

  • No configuration data loss

  • No data or control-plane outage

NSX Manager VM is destroyed (datastore failure)

  • Total outage of all NSX Manager functions, including CRUD for NSX routing/bridging

  • Risk of a subset of routing/bridging instances becoming orphaned if the NSX Manager is restored from an older configuration, requiring manual clean-up and reconciliation

  • No data or control-plane outage, unless reconciliation is required

Controller Cluster

Table 2. NSX Controller Failure Modes and Effects

Failure Mode

Failure Effects

Controller cluster loses network connectivity with ESXi hosts

  • Total outage of DLR Control Plane functions (create, update, and delete of routes, including dynamic ones)

  • Outage of DLR Management Plane functions (create, update, and delete of LIFs on hosts)

  • VXLAN forwarding is affected, which may cause the end-to-end (L2+L3) forwarding process to fail as well

  • Data plane continues working based on the last-known state

One or two Controllers lose connectivity with ESXi hosts

  • If the affected Controller can still reach the other Controllers in the cluster, any DLR instances mastered by this Controller experience the same effects as described above. The other Controllers do not automatically take over

One Controller loses network connectivity with the other Controllers (or loses connectivity completely)

  • The two remaining Controllers take over the VXLANs and DLRs handled by the isolated Controller

  • The affected Controller goes into read-only mode, drops its sessions to hosts, and refuses new ones

Controllers lose connectivity with each other

  • All Controllers go into read-only mode, close their connections to hosts, and refuse new ones

  • Create, update, and delete operations for all DLRs’ LIFs and routes (including dynamic) fail

  • NSX routing configuration (LIFs) might get out of sync between the NSX Manager and Controller Cluster, requiring manual intervention to resync

  • Hosts continue operating on the last-known control-plane state

One Controller VM is lost

  • Controller Cluster loses redundancy

  • Management/Control plane continues to operate as normal

Two Controller VMs are lost

  • The remaining Controller goes into read-only mode; the effect is the same as when Controllers lose connectivity with each other (above). Manual cluster recovery is likely to be required
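The cluster behaviors above follow from majority quorum: a Controller keeps serving its slices only while it can see a strict majority of the three-node cluster, which is why an isolated node goes read-only while the remaining pair stays active, and why losing two nodes forces the survivor into read-only mode. A minimal sketch of that decision rule (illustrative only; the function and names are assumptions, not NSX code):

```python
CLUSTER_SIZE = 3  # NSX Controller clusters are deployed as three nodes


def controller_mode(reachable_peers: int) -> str:
    """Decide whether a Controller may keep serving its VXLAN/DLR slices.

    reachable_peers counts the *other* Controllers this node can see;
    the node itself always counts toward the majority.
    """
    members_visible = reachable_peers + 1       # self + visible peers
    if members_visible * 2 > CLUSTER_SIZE:      # strict majority: 2 of 3
        return "active"      # keeps mastership of its VXLANs and DLRs
    return "read-only"       # drops host sessions and refuses new ones


# An isolated node sees no peers and goes read-only;
# a node that still sees one peer holds the 2-of-3 majority.
assert controller_mode(0) == "read-only"
assert controller_mode(1) == "active"
```

The same rule explains the "two Controller VMs are lost" row: the survivor sees zero peers, so it cannot distinguish its peers failing from itself being partitioned, and must stop accepting changes.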

Host Modules

netcpa relies on the host's SSL key and certificate, plus SSL thumbprints, to establish secure communication with the Controllers. These are obtained from the NSX Manager via the message bus (provided by vsfwd).

If the certificate exchange process fails, netcpa cannot successfully connect to the Controllers.
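Thumbprint checking amounts to hashing the peer's DER-encoded certificate and comparing the result against the value distributed out of band. The sketch below illustrates the idea only (hypothetical helpers, assuming SHA-256 thumbprints; not the actual netcpa implementation):

```python
import hashlib


def thumbprint(der_cert: bytes) -> str:
    """Colon-separated SHA-256 thumbprint of a DER-encoded certificate."""
    digest = hashlib.sha256(der_cert).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def verify_peer(der_cert: bytes, expected: str) -> bool:
    """Accept the TLS peer only if its thumbprint matches the value
    obtained out of band (in NSX, via the vsfwd message bus)."""
    return thumbprint(der_cert) == expected


# If the out-of-band exchange never completed, there is no expected
# value to compare against, and the connection must be refused.
cert = b"\x30\x82..."  # placeholder DER bytes, for illustration only
assert verify_peer(cert, thumbprint(cert))
```

This is why a broken message bus cascades into a control-plane failure: without the exchanged thumbprints there is nothing to verify the Controllers against.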

Note: This section does not cover failure of kernel modules, because the effect (a PSOD) is severe and such failures are rare.

Table 3. Host Module Failure Modes and Effects

Failure Mode

Failure Effects

vsfwd uses username/password authentication to access the message bus server; these credentials can expire

  • If vsfwd on a freshly prepared ESXi host cannot reach the NSX Manager within two hours, the temporary login/password supplied during installation expires, and the message bus on this host becomes inoperable
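The two-hour window described above behaves like a simple credential time-to-live that stops mattering once the host checks in. A sketch of that logic (names and structure are assumptions from the text, not NSX code):

```python
from datetime import datetime, timedelta
from typing import Optional

TEMP_CREDENTIAL_TTL = timedelta(hours=2)  # window stated for freshly prepared hosts


def message_bus_usable(installed_at: datetime,
                       first_contact_at: Optional[datetime],
                       now: datetime) -> bool:
    """The temporary login/password only has to survive until the first
    successful contact with the NSX Manager; after that, permanent
    credentials are in place and the window no longer applies."""
    if first_contact_at is not None:
        return True  # host checked in before the window closed
    return now - installed_at < TEMP_CREDENTIAL_TTL


t0 = datetime(2024, 1, 1, 9, 0)
assert message_bus_usable(t0, None, t0 + timedelta(hours=1))      # within window
assert not message_bus_usable(t0, None, t0 + timedelta(hours=3))  # bus inoperable
```

The practical takeaway is that host preparation and NSX Manager reachability should be verified within the same maintenance window.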

The effects of a Message Bus Client (vsfwd) failure depend on its timing.

If it fails before other parts of the NSX control plane have had a chance to reach a steady running state

  • Distributed routing on the host stops functioning, because the host is not able to talk to the Controllers

  • The host does not learn DLR instances from the NSX Manager

If it fails after the host has reached steady state

  • ESGs and DLR Control VMs running on the host will not be able to receive configuration updates

  • The host does not learn of new DLRs and is not able to delete existing DLRs

  • The host datapath continues operating based on the configuration the host had at the time of failure

Table 4. netcpa Failure Modes and Effects

Failure Mode

Failure Effects

The effects of a Control Plane Agent (netcpa) failure depend on its timing

If it fails before the NSX datapath kernel modules have had a chance to reach a steady running state

  • Distributed routing on the host stops functioning

If it fails after the host has reached steady state

  • DLR Control VM(s) running on the host will not be able to send their forwarding-table updates to the Controller(s)

  • The distributed routing datapath will not receive any LIF or route updates from the Controller(s), but will continue operating based on the state it had before the failure

DLR Control VM

Table 5. DLR Control VM Failure Modes and Effects

Failure Mode

Failure Effects

DLR Control VM is lost or powered off

  • Create, update, and delete operations for this DLR’s LIFs and routes fail

  • Any dynamic route updates will not be sent to hosts (including withdrawal of prefixes received via now-broken adjacencies)

DLR Control VM loses connectivity with the NSX Manager and Controllers

  • Same effects as above, except that if the DLR Control VM and its routing adjacencies are still up, traffic to and from previously learned prefixes is not affected

DLR Control VM loses connection with the NSX Manager

  • The NSX Manager’s create, update, and delete operations for this DLR’s LIFs and routes fail and are not retried

  • Dynamic routing updates continue to propagate

DLR Control VM loses connection with the Controllers

  • Any routing changes (static or dynamic) for this DLR do not propagate to hosts