This topic describes how to identify the cause of an NSX Controller failure and how to troubleshoot NSX Controllers.

Problem

  • Deployment of NSX Controller(s) fails.

  • NSX Controller fails to join the cluster.

  • Running the show control-cluster status command shows the Majority status flapping between Connected to cluster majority and Interrupted connection to cluster majority.
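To make the flapping symptom concrete, the following minimal sketch counts how often the Majority status changes across a series of samples. It assumes you have already collected the status strings yourself, in time order (for example, by running show control-cluster status repeatedly). The function name count_majority_flaps and the sample list are hypothetical and are not part of NSX.

```python
from typing import Iterable

# The two Majority status values named in this topic.
CONNECTED = "Connected to cluster majority"
INTERRUPTED = "Interrupted connection to cluster majority"


def count_majority_flaps(samples: Iterable[str]) -> int:
    """Count transitions between the two majority states in time-ordered samples."""
    flaps = 0
    previous = None
    for status in samples:
        if status not in (CONNECTED, INTERRUPTED):
            continue  # ignore anything that is not one of the two states
        if previous is not None and status != previous:
            flaps += 1
        previous = status
    return flaps


if __name__ == "__main__":
    # Hypothetical samples collected by hand, oldest first.
    collected = [CONNECTED, INTERRUPTED, CONNECTED, CONNECTED, INTERRUPTED]
    print("Majority status changes:", count_majority_flaps(collected))
```

A count of zero over a long sampling window suggests a stable cluster. How many changes count as flapping is a judgment call for your environment.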

Controller Clustering Issues

Refer to Deploying NSX Controllers.

Host Connectivity Issues

Check for host connectivity errors using the following commands. Run these commands on each of the controller nodes.

  • Check for any abnormal error statistics using the show log cloudnet/cloudnet_java-vnet-controller*.log filtered-by host_IP command.

  • Verify the logical switch/router message statistics or high message rate using the following commands:

    • show control-cluster core stats: overall stats

    • show control-cluster core stats-sample: latest stats samples

    • show control-cluster core connection-stats ip: per connection stats

    • show control-cluster logical-switches stats

    • show control-cluster logical-routers stats

    • show control-cluster logical-switches stats-sample

    • show control-cluster logical-routers stats-sample

    • show control-cluster logical-switches vni-stats vni

    • show control-cluster logical-switches vni-stats-sample vni

    • show control-cluster logical-switches connection-stats ip

    • show control-cluster logical-routers connection-stats ip

  • You can use the show host hostID health-status command to check the health status of hosts in your prepared clusters. For controller troubleshooting, the following health checks are supported:

    • Check whether the net-config-by-vsm.xml file is synchronized with the controller list.

    • Check whether there is a socket connection to the controller.

    • Check whether the VNI is created and whether the configuration is correct.

    • Check whether the VNI connects to the master controllers (if the control plane is enabled).
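The controller commands listed above take a host IP address or a VNI as an argument. As a convenience, the following sketch fills in those placeholders and returns the command strings so that you can paste them into each controller's CLI. It only builds text and does not connect to a controller. The function name connectivity_commands is hypothetical, and the sketch assumes that the ip placeholder in the connection-stats commands is the IP address of the host you are investigating.

```python
from typing import List


def connectivity_commands(host_ip: str, vni: int) -> List[str]:
    """Return the host connectivity commands from this topic with placeholders filled in."""
    return [
        # Error statistics for one host (host_IP placeholder).
        f"show log cloudnet/cloudnet_java-vnet-controller*.log filtered-by {host_ip}",
        # Overall, sampled, and per-connection core statistics.
        "show control-cluster core stats",
        "show control-cluster core stats-sample",
        f"show control-cluster core connection-stats {host_ip}",
        # Logical switch and logical router message statistics.
        "show control-cluster logical-switches stats",
        "show control-cluster logical-routers stats",
        "show control-cluster logical-switches stats-sample",
        "show control-cluster logical-routers stats-sample",
        # Per-VNI statistics (vni placeholder).
        f"show control-cluster logical-switches vni-stats {vni}",
        f"show control-cluster logical-switches vni-stats-sample {vni}",
        # Per-connection statistics (ip placeholder, assumed to be the host IP).
        f"show control-cluster logical-switches connection-stats {host_ip}",
        f"show control-cluster logical-routers connection-stats {host_ip}",
    ]


if __name__ == "__main__":
    # 192.0.2.10 and 5000 are example values only.
    for command in connectivity_commands("192.0.2.10", 5000):
        print(command)
```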

Installation and Deployment Issues

  • Verify that at least three controller nodes are deployed in a cluster. VMware recommends using the native vSphere anti-affinity rules to avoid deploying more than one controller node on the same ESXi host.

  • Verify that all NSX Controllers display a Connected status. If any controller node displays a Disconnected status, run the show control-cluster status command on all controller nodes and ensure that the following information is consistent:

    Type               Status
    Join status        Join complete
    Majority status    Connected to cluster majority
    Cluster ID         Same information on all controller nodes

  • Ensure that all roles are consistent on all controller nodes:

    Role                  Configured status    Active status
    api_provider          enabled              activated
    persistence_server    enabled              activated
    switch_manager        enabled              activated
    logical_manager       enabled              activated
    directory_server      enabled              activated

  • Verify that the vnet-controller process is running. Run the show process command on all controller nodes and ensure that the java-dir-server service is running.

  • Verify the cluster history and ensure that there is no sign of host connection flapping, VNI join failures, or abnormal cluster membership changes. To verify this, run the show control-cluster history command. The command also shows whether the node is restarted frequently. Verify that there are not many log files with zero (0) size and with different process IDs.

  • Verify that the VXLAN Network Identifier (VNI) is configured. For more information, see the VXLAN Preparation Steps section of the VMware VXLAN Deployment Guide.

  • Verify that SSL is enabled on the controller cluster. Run the show log cloudnet/cloudnet_java-vnet-controller*.log filtered-by sslEnabled command on each of the controller nodes.
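The status and role checks above can be compared across nodes with a short script once you have saved the output of show control-cluster status from each controller. The following is a minimal sketch, not a supported tool. It assumes the output uses "Name: value" rows for the status fields and whitespace-separated rows for the roles. The SAMPLE text, the node names, the cluster ID value, and the function names are all hypothetical, so adjust the parsing to match the output you actually see.

```python
import re
from typing import Dict, List

# Expected values taken from the status and role checks in this topic.
EXPECTED_STATUS = {
    "Join status": "Join complete",
    "Majority status": "Connected to cluster majority",
}
EXPECTED_ROLES = (
    "api_provider",
    "persistence_server",
    "switch_manager",
    "logical_manager",
    "directory_server",
)

# Hypothetical captured output; the layout is an assumption.
SAMPLE = """\
Type                Status
Join status:        Join complete
Majority status:    Connected to cluster majority
Cluster ID:         example-cluster-id

Role                Configured status   Active status
api_provider        enabled             activated
persistence_server  enabled             activated
switch_manager      enabled             activated
logical_manager     enabled             activated
directory_server    enabled             activated
"""


def parse_status(text: str) -> Dict[str, dict]:
    """Split captured output into status fields and role rows."""
    fields: Dict[str, str] = {}
    roles: Dict[str, tuple] = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"^([A-Za-z ]+):\s+(.*)$", line)
        if match:
            # Keep the first column only; drop any trailing column.
            fields[match.group(1)] = re.split(r"\s{2,}", match.group(2).strip())[0]
            continue
        parts = line.split()
        if len(parts) == 3 and parts[0] in EXPECTED_ROLES:
            roles[parts[0]] = (parts[1], parts[2])
    return {"fields": fields, "roles": roles}


def check_cluster(outputs: Dict[str, str]) -> List[str]:
    """Return a list of problems found across all nodes (empty if consistent)."""
    problems: List[str] = []
    cluster_ids = set()
    for node, text in outputs.items():
        parsed = parse_status(text)
        for key, expected in EXPECTED_STATUS.items():
            actual = parsed["fields"].get(key)
            if actual != expected:
                problems.append(f"{node}: {key} is {actual!r}, expected {expected!r}")
        cluster_ids.add(parsed["fields"].get("Cluster ID"))
        for role in EXPECTED_ROLES:
            if parsed["roles"].get(role) != ("enabled", "activated"):
                problems.append(f"{node}: role {role} is not enabled/activated")
    if len(cluster_ids) > 1:
        problems.append("Cluster ID differs between nodes")
    return problems


if __name__ == "__main__":
    interrupted = SAMPLE.replace(
        "Connected to cluster majority",
        "Interrupted connection to cluster majority",
    )
    for problem in check_cluster({"controller-1": SAMPLE, "controller-2": interrupted}):
        print(problem)
```

An empty result means that every node reports the expected join status, majority status, and roles, and that all nodes report the same cluster ID.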