When one of the NSX Controller nodes fails, you still have two controllers that are working. The cluster majority is maintained, and the control plane continues to function.
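
The majority rule described above can be sketched in a few lines of Python. This is an illustration only, not part of NSX: it assumes a three-node cluster and a simple strict-majority rule, and the function name has_majority is hypothetical.

```python
def has_majority(total_nodes: int, failed_nodes: int) -> bool:
    """Return True if the surviving nodes still form a strict majority.

    Illustrative helper only; it assumes that a cluster stays functional
    while more than half of its configured nodes are working.
    """
    if not 0 <= failed_nodes <= total_nodes:
        raise ValueError("failed_nodes must be between 0 and total_nodes")
    working = total_nodes - failed_nodes
    # Strict majority: more than half of the configured nodes are working.
    return working > total_nodes // 2


if __name__ == "__main__":
    # Assumed three-node controller cluster with 0 to 3 failed nodes.
    for failed in range(4):
        print(f"3 nodes, {failed} failed -> majority: {has_majority(3, failed)}")
```

With three nodes, the check passes when one node has failed and no longer passes when two nodes have failed.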

Problem

NSX Controller cluster has failed.

Solution

  1. Log in to the vSphere Web Client.
  2. Navigate to Networking & Security > Installation and Upgrade > Management > NSX Controller Nodes.
  3. For each node, check the Peers column. Green boxes indicate that there are no errors in peer controller connectivity within the cluster. A red box indicates an error with a peer. Click the box to view details.
  4. If the Peers column shows a problem in the controller cluster, log in to the CLI of each NSX Controller to perform a detailed diagnosis. Run the show control-cluster status command to diagnose the state of each controller. All controllers in the cluster must have the same cluster UUID; however, the cluster UUID might not be the same as the UUID of the master controller. For information about deployment issues, see NSX Controller Deployment Issues.
  5. You can try the following steps to resolve the issue before redeploying the controller node or the controller cluster:
    1. Check that the controller is powered on.
    2. Ping the other nodes and NSX Manager from the affected controller, and ping the affected controller from them, to check the network paths. If you find any network issues, address them as described in NSX Controller Deployment Issues.
    3. Check the Internet Protocol Security (IPSec) status using the following CLI commands.
      • Verify whether IPSec is enabled by running the show control-cluster network ipsec status command.
      • Verify the status of the IPSec tunnels by running the show control-cluster network ipsec tunnels command.
      You can also use the IPSec status information when you open a ticket with VMware technical support.
    4. Manage the IPSec VPN shared keys for the controller cluster:

      Controller nodes communicate with each other for clustering and storage operations. This communication is protected by IPSec VPN. When IPSec VPN is enabled for the controller cluster, a shared key for IPSec is generated. If the keys are out of sync, or if you suspect that a key has been compromised, you must rotate the pre-shared keys.

      • To change the IPSec VPN key, disable IPSec VPN and then immediately enable it again. This generates a new key, which is pushed to all controllers.

      For more information about enabling and disabling IPSec VPN, see the NSX Administration Guide.

    5. If the issue is not a network issue, you can choose whether to reboot or redeploy.

    If you want to reboot a node, ensure that only one controller is rebooted at a time. However, if more than one controller node in the cluster has failed, reboot all of the failed nodes at the same time. When you reboot a node in a healthy cluster, always confirm afterwards that the cluster has re-formed properly, and then confirm that resharding has completed properly.

  6. If you decide to redeploy controllers, use one of the following two approaches:
    • Approach 1: Delete the broken controller node and redeploy a new controller node.
    • Approach 2: Delete the controller cluster and redeploy a new controller cluster.

      VMware recommends the second approach.
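
The cluster UUID comparison in step 4 can be sketched in Python as follows. This is an illustration only, under stated assumptions: you have already copied the cluster UUID reported by the show control-cluster status command on each controller into a dictionary by hand. The sketch does not parse CLI output, and the function names, node names, and UUID values are hypothetical placeholders.

```python
def group_by_cluster_uuid(cluster_uuid_by_node: dict) -> dict:
    """Group controller names by the cluster UUID that each one reports."""
    groups = {}
    for node, cluster_uuid in cluster_uuid_by_node.items():
        groups.setdefault(cluster_uuid, []).append(node)
    # Sort the node names so that the result is stable and easy to read.
    return {uuid: sorted(nodes) for uuid, nodes in groups.items()}


def uuids_are_consistent(cluster_uuid_by_node: dict) -> bool:
    """Return True if every controller reports the same cluster UUID."""
    return len(set(cluster_uuid_by_node.values())) <= 1


if __name__ == "__main__":
    # Placeholder values, entered manually from each controller CLI.
    reported = {
        "controller-1": "uuid-aaaa",
        "controller-2": "uuid-aaaa",
        "controller-3": "uuid-bbbb",
    }
    print("Consistent:", uuids_are_consistent(reported))
    for cluster_uuid, nodes in group_by_cluster_uuid(reported).items():
        print(cluster_uuid, "reported by", ", ".join(nodes))
```

A controller that appears in a group of its own reports a different cluster UUID from its peers and is the one to investigate first.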

What to do next