In NSX for vSphere, the control plane agent (netcpa) works as a local agent daemon, communicating with NSX Manager and with the controller cluster. The Communication Channel Health feature is a proactive health check that periodically reports the status of the channel between the central control plane and the local control plane to NSX Manager, where it is displayed in the NSX Manager UI. This report also serves as a heartbeat to detect the operational status of the channel between NSX Manager and netcpa on the ESXi host. The feature provides error details during communication faults, generates an event when a channel goes into a bad state, and also generates heartbeat messages from NSX Manager to hosts.

Problem

Connectivity issues between the control plane agent and the controllers.

Cause

If any connection is missing, the control plane agent might not be working properly.

Procedure

  1. When the channel goes into a bad state, validate the connection status by using the following API call:

    GET https://<NSX_Manager_IP>/api/2.0/vdn/inventory/host/{hostId}/connection/status

    The following is an example of the response:

    <?xml version="1.0" encoding="UTF-8"?>
    <hostConnStatus>
    <hostName>10.161.246.20</hostName>
    <hostId>host-21</hostId>
    <nsxMgrToFirewallAgentConn>UP</nsxMgrToFirewallAgentConn>
    <nsxMgrToControlPlaneAgentConn>UP</nsxMgrToControlPlaneAgentConn>
    <hostToControllerConn>DOWN</hostToControllerConn>
    <fullSyncCount>-1</fullSyncCount>
    <hostToControllerConnectionErrors>
    <hostToControllerConnectionError>
    <controllerIp>10.160.203.236</controllerIp>
    <errorCode>1255604</errorCode>
    <errorMessage>Connection Refused</errorMessage>
    </hostToControllerConnectionError>
    <hostToControllerConnectionError>
    <controllerIp>10.160.203.237</controllerIp>
    <errorCode>1255603</errorCode>
    <errorMessage>SSL Handshake Failure</errorMessage>
    </hostToControllerConnectionError>
    </hostToControllerConnectionErrors>
    </hostConnStatus>

    The following error codes are supported:

    1255602: Incomplete Controller Certificate
    1255603: SSL Handshake Failure
    1255604: Connection Refused
    1255605: Keep-alive Timeout
    1255606: SSL Exception
    1255607: Bad Message
    1255620: Unknown Error
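
As a rough illustration of how the response above might be interpreted, the following Python sketch parses a hostConnStatus document and pairs each reported error with the descriptions from the error code list. The function name summarize_host_conn_status and the ERROR_CODES table are hypothetical helpers written for this example and are not part of any NSX SDK; the element names are taken from the example response.

```python
import xml.etree.ElementTree as ET

# Error codes listed in this procedure.
ERROR_CODES = {
    1255602: "Incomplete Controller Certificate",
    1255603: "SSL Handshake Failure",
    1255604: "Connection Refused",
    1255605: "Keep-alive Timeout",
    1255606: "SSL Exception",
    1255607: "Bad Message",
    1255620: "Unknown Error",
}


def summarize_host_conn_status(xml_text):
    """Return (channel_states, errors) parsed from a hostConnStatus response."""
    root = ET.fromstring(xml_text)
    channels = {
        tag: root.findtext(tag)
        for tag in (
            "nsxMgrToFirewallAgentConn",
            "nsxMgrToControlPlaneAgentConn",
            "hostToControllerConn",
        )
    }
    errors = []
    # iter() matches the exact tag, so the plural wrapper element is skipped.
    for err in root.iter("hostToControllerConnectionError"):
        code = int(err.findtext("errorCode"))
        errors.append(
            {
                "controller": err.findtext("controllerIp"),
                "code": code,
                "meaning": ERROR_CODES.get(code, "Unrecognized code"),
            }
        )
    return channels, errors


# Copy of the example response shown in this step.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<hostConnStatus>
<hostName>10.161.246.20</hostName>
<hostId>host-21</hostId>
<nsxMgrToFirewallAgentConn>UP</nsxMgrToFirewallAgentConn>
<nsxMgrToControlPlaneAgentConn>UP</nsxMgrToControlPlaneAgentConn>
<hostToControllerConn>DOWN</hostToControllerConn>
<fullSyncCount>-1</fullSyncCount>
<hostToControllerConnectionErrors>
<hostToControllerConnectionError>
<controllerIp>10.160.203.236</controllerIp>
<errorCode>1255604</errorCode>
<errorMessage>Connection Refused</errorMessage>
</hostToControllerConnectionError>
<hostToControllerConnectionError>
<controllerIp>10.160.203.237</controllerIp>
<errorCode>1255603</errorCode>
<errorMessage>SSL Handshake Failure</errorMessage>
</hostToControllerConnectionError>
</hostToControllerConnectionErrors>
</hostConnStatus>"""

if __name__ == "__main__":
    channels, errors = summarize_host_conn_status(SAMPLE)
    for name, state in channels.items():
        print(name, state)
    for err in errors:
        print(err["controller"], err["code"], err["meaning"])
```

The sketch only reads a response that has already been retrieved; how the GET request is authenticated and sent is left out on purpose.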

  2. Determine the reason for the control plane agent being down as follows:
    1. Check the control plane agent status by running the /etc/init.d/netcpad status command on the ESXi hosts.

      [root@esx-01a:~] /etc/init.d/netcpad status
      netCP agent service is running
      
    2. Check the control plane agent configuration by running the more /etc/vmware/netcpa/config-by-vsm.xml command. The IP addresses of the NSX Controllers should be listed.

      [root@esx-01a:~] more /etc/vmware/netcpa/config-by-vsm.xml
      <config>
        <connectionList>
          <connection id="0000">
            <port>1234</port>
            <server>192.168.110.31</server>
            <sslEnabled>true</sslEnabled>
            <thumbprint>A5:C6:A2:B2:57:97:36:F0:7C:13:DB:64:9B:86:E6:EF:1A:7E:5C:36</thumbprint>
          </connection>
          <connection id="0001">
            <port>1234</port>
            <server>192.168.110.32</server>
            <sslEnabled>true</sslEnabled>
            <thumbprint>12:E0:25:B2:E0:35:D7:84:90:71:CF:C7:53:97:FD:96:EE:ED:7C:DD</thumbprint>
          </connection>
          <connection id="0002">
            <port>1234</port>
            <server>192.168.110.33</server>
            <sslEnabled>true</sslEnabled>
            <thumbprint>BD:DB:BA:B0:DC:61:AD:94:C6:0F:7E:F5:80:19:44:51:BA:90:2C:8D</thumbprint>
          </connection>
        </connectionList>
       ...
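
To compare the configured controllers against what is expected, the connection entries can be extracted programmatically. The following Python sketch is one possible way to do that, assuming the file has been copied off the host and is complete, well-formed XML. The function name list_controllers is hypothetical, and the sample is a shortened copy of the example above with closing tags added, because the original output is truncated.

```python
import xml.etree.ElementTree as ET


def list_controllers(config_xml):
    """Return one dict per <connection> entry in the netcpa configuration."""
    root = ET.fromstring(config_xml)
    controllers = []
    for conn in root.iter("connection"):
        controllers.append(
            {
                "id": conn.get("id"),
                "server": conn.findtext("server"),
                "port": int(conn.findtext("port")),
                "ssl": conn.findtext("sslEnabled") == "true",
            }
        )
    return controllers


# Shortened copy of the example above (thumbprints omitted). The closing
# </config> tag is an assumption, because the original output is truncated.
SAMPLE_CONFIG = """<config>
  <connectionList>
    <connection id="0000">
      <port>1234</port>
      <server>192.168.110.31</server>
      <sslEnabled>true</sslEnabled>
    </connection>
    <connection id="0001">
      <port>1234</port>
      <server>192.168.110.32</server>
      <sslEnabled>true</sslEnabled>
    </connection>
    <connection id="0002">
      <port>1234</port>
      <server>192.168.110.33</server>
      <sslEnabled>true</sslEnabled>
    </connection>
  </connectionList>
</config>"""

if __name__ == "__main__":
    for entry in list_controllers(SAMPLE_CONFIG):
        print(entry["id"], entry["server"], entry["port"], entry["ssl"])
```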
      
  3. Validate the connections from the control plane agent to the controllers by running the following command. The output should show one connection for each controller.
    [root@esx-01a:~] esxcli network ip connection list | grep 1234
    tcp     0   0  192.168.110.51:16594     192.168.110.31:1234   ESTABLISHED     36754  newreno  netcpa-worker
    tcp     0   0  192.168.110.51:46917     192.168.110.33:1234   ESTABLISHED     36754  newreno  netcpa-worker
    tcp     0   0  192.168.110.51:47891     192.168.110.32:1234   ESTABLISHED     36752  newreno  netcpa-worker
    
  4. Check whether any connections from the control plane agent to the controllers are in the CLOSED or CLOSE_WAIT state by running the following command:
    esxcli network ip connection list | grep "1234.*netcpa*" | egrep "CLOSED|CLOSE_WAIT"
  5. If the control plane agent has been down for a long time, the connections might not be present at all. To check this, run the following command. The expected output is one connection for each controller.
    esxcli network ip connection list | grep "1234.*netcpa*" | grep ESTABLISHED
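
The connection checks in the preceding steps can be summarized per controller. The following Python sketch is a minimal example, assuming the esxcli output has been captured as text and that its columns appear in the order shown in the sample output in this procedure. The function name controller_connection_states and the MISSING label are hypothetical and exist only for this illustration.

```python
def controller_connection_states(esxcli_output, controllers, port=1234):
    """Map each controller IP to the state of its netcpa connection.

    Controllers with no matching line are reported as "MISSING". If a
    controller appears on several lines, the last line wins.
    """
    states = {ip: "MISSING" for ip in controllers}
    for line in esxcli_output.splitlines():
        fields = line.split()
        # Expect: proto, recv-q, send-q, local, foreign, state, world id,
        # congestion algorithm, world name (as in the sample output).
        if len(fields) < 9 or not fields[-1].startswith("netcpa"):
            continue
        remote_ip, _, remote_port = fields[4].rpartition(":")
        if remote_port == str(port) and remote_ip in states:
            states[remote_ip] = fields[5]
    return states


# Sample output copied from the connection list example in this procedure.
SAMPLE_OUTPUT = """\
tcp     0   0  192.168.110.51:16594     192.168.110.31:1234   ESTABLISHED     36754  newreno  netcpa-worker
tcp     0   0  192.168.110.51:46917     192.168.110.33:1234   ESTABLISHED     36754  newreno  netcpa-worker
tcp     0   0  192.168.110.51:47891     192.168.110.32:1234   ESTABLISHED     36752  newreno  netcpa-worker
"""

if __name__ == "__main__":
    expected = ["192.168.110.31", "192.168.110.32", "192.168.110.33"]
    for ip, state in controller_connection_states(SAMPLE_OUTPUT, expected).items():
        print(ip, state)
```

Any controller reported as MISSING, CLOSED, or CLOSE_WAIT is a candidate for the recovery steps that follow.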
  6. Control plane agent (netcpa) auto-recovery mechanism: The automatic control plane agent monitoring process detects when the control plane agent is in a bad state. When the control plane agent stops responding, the monitoring process automatically tries to recover it.
    1. When the control plane agent stops responding, a live core file is generated. You can find the core file as follows:

      ls /var/core       
       netcpa-worker-zdump.000

    2. A syslog error is reported in the vmkwarning.log file.
      cat /var/run/log/vmkwarning.log | grep NETCPA
      2017-08-11T06:32:17.994Z cpu1:1000044539)ALERT: Critical - NETCPA is hanged
      Taking live-dump & restarting netcpa process!
      
    Note:

    If the control plane agent monitor experiences a temporary failure because of a delayed response to the status check, a warning message similar to the following might be reported in the VMkernel logs.

    Warning - NETCPA getting netcpa status failed!

    You can ignore this warning.
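
When reviewing many log lines, it can help to separate the critical alert from the warning that can be ignored. The following Python sketch shows one way to do so, assuming the messages contain the same wording as the two examples in this step. The function name classify_netcpa_log_line and its return labels are hypothetical.

```python
def classify_netcpa_log_line(line):
    """Classify a log line as 'critical', 'ignorable', 'other', or None.

    None means the line does not mention NETCPA at all.
    """
    if "NETCPA" not in line:
        return None
    if "NETCPA is hanged" in line:
        return "critical"  # the monitor takes a live dump and restarts netcpa
    if "getting netcpa status failed" in line:
        return "ignorable"  # temporary status-check delay
    return "other"


# Message text copied from the examples in this step.
SAMPLE_LINES = [
    "2017-08-11T06:32:17.994Z cpu1:1000044539)ALERT: Critical - NETCPA is hanged",
    "Warning - NETCPA getting netcpa status failed!",
]

if __name__ == "__main__":
    for entry in SAMPLE_LINES:
        print(classify_netcpa_log_line(entry), "|", entry)
```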

  7. If the control plane agent does not recover automatically, restart it as follows:
    1. Log in as root to the ESXi host through SSH or through the console.
    2. Run the /etc/init.d/netcpad restart command to restart the control plane agent on the ESXi host.