Control Plane Agent (netcpa) Issues

In NSX Data Center for vSphere, the control plane agent (netcpa) runs as a local daemon on each ESXi host and communicates with NSX Manager and with the controller cluster. The Communication Channel Health feature is a periodic health check that reports the status of the channel between the central control plane and the local control plane to NSX Manager, where it is displayed in the NSX Manager UI. This report also serves as a heartbeat for detecting the operational status of the channel between NSX Manager and netcpa on the ESXi host. The feature provides error details during communication faults, generates an event when a channel goes into an unhealthy state, and generates heartbeat messages from NSX Manager to hosts.

Problem

Connectivity issues between the control plane agent and the controllers.

Cause

If any of the expected connections is missing, the control plane agent might not be working properly.

Solution

When a channel goes into an unhealthy state, validate the connection status by using the following API request:

GET https://<NSX_Manager_IP>/api/2.0/vdn/inventory/host/{hostId}/connection/status

For more information about using this API, see the NSX API Guide.
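
The request above can be prepared from a script. The following is a minimal sketch, assuming that NSX Manager accepts HTTP basic authentication for this call. The function name `build_status_request`, the manager address, and the credentials are placeholders invented for illustration. The function only builds the request so that it can be inspected before anything is sent.

```python
import base64
import urllib.request


def build_status_request(manager, host_id, user, password):
    """Build (but do not send) the GET request for a host's connection status."""
    url = (
        f"https://{manager}/api/2.0/vdn/inventory/host/"
        f"{host_id}/connection/status"
    )
    # Assumption: the API is called with HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/xml"},
        method="GET",
    )


if __name__ == "__main__":
    # Placeholder values; replace with your NSX Manager address and credentials.
    req = build_status_request("nsxmgr.example.com", "host-21", "admin", "changeme")
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req, timeout=10).read().decode()
```

If NSX Manager presents a certificate that the client does not trust, `urllib.request.urlopen` raises an error. In that case, pass an `ssl.SSLContext` that trusts the certificate through the `context` argument.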

To find the "hostId", you can use either the NSX Manager CLI or the vCenter Managed Object Browser (MOB).

To use the vCenter MOB, open a web browser and enter the URL of the vCenter MOB at http://vCenter-IP-Address/mob. See the instructions about finding cluster or host MOID in the NSX API Guide.
To use the NSX Manager CLI, log in as an admin user, and run the show cluster all or show cluster clusterID command. For more information about these commands, see the NSX Command Line Quick Reference.

Following is an example of the API response:

<?xml version="1.0" encoding="UTF-8"?>
<hostConnStatus>
  <hostName>10.161.246.20</hostName>
  <hostId>host-21</hostId>
  <nsxMgrToFirewallAgentConn>UP</nsxMgrToFirewallAgentConn>
  <nsxMgrToControlPlaneAgentConn>UP</nsxMgrToControlPlaneAgentConn>
  <hostToControllerConn>DOWN</hostToControllerConn>
  <fullSyncCount>-1</fullSyncCount>
  <hostToControllerConnectionErrors>
    <hostToControllerConnectionError>
      <controllerIp>10.160.203.236</controllerIp>
      <errorCode>1255604</errorCode>
      <errorMessage>Connection Refused</errorMessage>
    </hostToControllerConnectionError>
    <hostToControllerConnectionError>
      <controllerIp>10.160.203.237</controllerIp>
      <errorCode>1255603</errorCode>
      <errorMessage>SSL Handshake Failure</errorMessage>
    </hostToControllerConnectionError>
  </hostToControllerConnectionErrors>
</hostConnStatus>
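
A response like the one above can be reduced to the fields that matter for troubleshooting. The following sketch uses the Python standard library to extract the three channel states and the per-controller errors. The element names come from the example response. The function name `summarize_status` and the abbreviated sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example response above.
SAMPLE_RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<hostConnStatus>
  <hostId>host-21</hostId>
  <nsxMgrToFirewallAgentConn>UP</nsxMgrToFirewallAgentConn>
  <nsxMgrToControlPlaneAgentConn>UP</nsxMgrToControlPlaneAgentConn>
  <hostToControllerConn>DOWN</hostToControllerConn>
  <hostToControllerConnectionErrors>
    <hostToControllerConnectionError>
      <controllerIp>10.160.203.236</controllerIp>
      <errorCode>1255604</errorCode>
      <errorMessage>Connection Refused</errorMessage>
    </hostToControllerConnectionError>
  </hostToControllerConnectionErrors>
</hostConnStatus>"""

CHANNEL_TAGS = (
    "nsxMgrToFirewallAgentConn",
    "nsxMgrToControlPlaneAgentConn",
    "hostToControllerConn",
)


def summarize_status(xml_data):
    """Return (channel states, list of (controller IP, code, message))."""
    root = ET.fromstring(xml_data)
    channels = {tag: root.findtext(tag) for tag in CHANNEL_TAGS}
    errors = [
        (
            err.findtext("controllerIp"),
            int(err.findtext("errorCode")),
            err.findtext("errorMessage"),
        )
        for err in root.iter("hostToControllerConnectionError")
    ]
    return channels, errors


if __name__ == "__main__":
    channels, errors = summarize_status(SAMPLE_RESPONSE)
    for name, state in channels.items():
        print(f"{name}: {state}")
    for ip, code, message in errors:
        print(f"controller {ip}: {code} {message}")
```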

The following error codes are supported:

1255602: Incomplete Controller Certificate
1255603: SSL Handshake Failure
1255604: Connection Refused
1255605: Keep-alive Timeout
1255606: SSL Exception
1255607: Bad Message
1255620: Unknown Error
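
When processing many hosts, it can help to keep the codes above in a lookup table. The following sketch maps each code to its description. The helper names are hypothetical. The grouping of certificate and SSL codes is only this example's reading of the descriptions, not a documented classification.

```python
ERROR_CODES = {
    1255602: "Incomplete Controller Certificate",
    1255603: "SSL Handshake Failure",
    1255604: "Connection Refused",
    1255605: "Keep-alive Timeout",
    1255606: "SSL Exception",
    1255607: "Bad Message",
    1255620: "Unknown Error",
}

# Assumption: codes whose description mentions a certificate or SSL point at
# the secure channel rather than at basic reachability.
SECURE_CHANNEL_CODES = {1255602, 1255603, 1255606}


def describe_error(code):
    """Return the description for an error code, given as int or string."""
    code = int(code)
    return ERROR_CODES.get(code, f"Unrecognized error code {code}")


def is_secure_channel_error(code):
    """True if the code belongs to the certificate/SSL group defined above."""
    return int(code) in SECURE_CHANNEL_CODES


if __name__ == "__main__":
    for code in ("1255603", "1255604"):
        print(code, describe_error(code), is_secure_channel_error(code))
```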

Determine the reason for the control plane agent being down as follows:

Check the control plane agent status by running the /etc/init.d/netcpad status command on the ESXi host.
```
[root@esx-01a:~] /etc/init.d/netcpad status
netCP agent service is running
```

Check the control plane agent configurations using the more /etc/vmware/netcpa/config-by-vsm.xml command. The IP addresses of the NSX Controllers should be listed.

[root@esx-01a:~] more /etc/vmware/netcpa/config-by-vsm.xml
<config>
  <connectionList>
    <connection id="0000">
      <port>1234</port>
      <server>192.168.110.31</server>
      <sslEnabled>true</sslEnabled>
      <thumbprint>A5:C6:A2:B2:57:97:36:F0:7C:13:DB:64:9B:86:E6:EF:1A:7E:5C:36</thumbprint>
    </connection>
    <connection id="0001">
      <port>1234</port>
      <server>192.168.110.32</server>
      <sslEnabled>true</sslEnabled>
      <thumbprint>12:E0:25:B2:E0:35:D7:84:90:71:CF:C7:53:97:FD:96:EE:ED:7C:DD</thumbprint>
    </connection>
    <connection id="0002">
      <port>1234</port>
      <server>192.168.110.33</server>
      <sslEnabled>true</sslEnabled>
      <thumbprint>BD:DB:BA:B0:DC:61:AD:94:C6:0F:7E:F5:80:19:44:51:BA:90:2C:8D</thumbprint>
    </connection>
  </connectionList>
 ...
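
To compare the configured controllers against the expected list, the connection entries can be read programmatically. The following sketch parses a complete, shortened stand-in for the file shown above. The function name `list_controllers` and the sample text are illustrative. Elements that the listing elides are simply ignored.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for /etc/vmware/netcpa/config-by-vsm.xml.
SAMPLE_CONFIG = """<config>
  <connectionList>
    <connection id="0000">
      <port>1234</port>
      <server>192.168.110.31</server>
      <sslEnabled>true</sslEnabled>
    </connection>
    <connection id="0001">
      <port>1234</port>
      <server>192.168.110.32</server>
      <sslEnabled>true</sslEnabled>
    </connection>
  </connectionList>
</config>"""


def list_controllers(config_xml):
    """Return one dict per <connection> entry in the netcpa configuration."""
    root = ET.fromstring(config_xml)
    controllers = []
    for conn in root.iter("connection"):
        controllers.append(
            {
                "id": conn.get("id"),
                "server": conn.findtext("server"),
                "port": int(conn.findtext("port")),
                "ssl": conn.findtext("sslEnabled") == "true",
            }
        )
    return controllers


if __name__ == "__main__":
    for entry in list_controllers(SAMPLE_CONFIG):
        print(entry["id"], entry["server"], entry["port"], entry["ssl"])
```

On a host, the same function could be given the contents of the real file, for example `open("/etc/vmware/netcpa/config-by-vsm.xml").read()`.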

Validate the connections from the control plane agent to the controllers by using the following command. The output should show one connection for each controller.

[root@esx-01a:~] esxcli network ip connection list | grep 1234
tcp     0   0  192.168.110.51:16594     192.168.110.31:1234   ESTABLISHED     36754  newreno  netcpa-worker
tcp     0   0  192.168.110.51:46917     192.168.110.33:1234   ESTABLISHED     36754  newreno  netcpa-worker
tcp     0   0  192.168.110.51:47891     192.168.110.32:1234   ESTABLISHED     36752  newreno  netcpa-worker

Check whether any connections from the control plane agent to the controllers are in the CLOSED or CLOSE_WAIT state by running the following command:
```
esxcli network ip connection list | grep "1234.*netcpa*" | egrep "CLOSED|CLOSE_WAIT"
```
If the control plane agent has been down for a significant time, the connections might not be present at all. To confirm that the connections are established, run the following command. The output should show one connection for each controller.
```
esxcli network ip connection list | grep "1234.*netcpa*" | grep ESTABLISHED
```
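The connection checks above can also be applied to saved command output. The following sketch assumes the column layout shown in the sample output earlier in this section: protocol, two queue counters, local address, remote address, state, world ID, congestion algorithm, and world name. The function names and the sample text are illustrative.

```python
SAMPLE_OUTPUT = """\
tcp  0  0  192.168.110.51:16594  192.168.110.31:1234  ESTABLISHED  36754  newreno  netcpa-worker
tcp  0  0  192.168.110.51:46917  192.168.110.33:1234  ESTABLISHED  36754  newreno  netcpa-worker
tcp  0  0  192.168.110.51:47891  192.168.110.32:1234  CLOSE_WAIT   36752  newreno  netcpa-worker
"""


def netcpa_connections(esxcli_output, port=1234):
    """Return (controller IP, TCP state) for each netcpa connection to port."""
    connections = []
    for line in esxcli_output.splitlines():
        fields = line.split()
        # Assumption: nine whitespace-separated columns, as in the sample.
        if len(fields) < 9 or fields[0] != "tcp":
            continue
        ip, _, remote_port = fields[4].rpartition(":")
        if remote_port == str(port) and fields[8].startswith("netcpa"):
            connections.append((ip, fields[5]))
    return connections


def controllers_without_session(connections, expected_ips):
    """Return the expected controllers that have no ESTABLISHED connection."""
    established = {ip for ip, state in connections if state == "ESTABLISHED"}
    return sorted(set(expected_ips) - established)


if __name__ == "__main__":
    conns = netcpa_connections(SAMPLE_OUTPUT)
    expected = ["192.168.110.31", "192.168.110.32", "192.168.110.33"]
    print(conns)
    print("no established session:", controllers_without_session(conns, expected))
```

The expected controller addresses would typically come from the config-by-vsm.xml file on the same host.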
Control Plane Agent (netcpa) auto-recovery mechanism: The automatic monitoring process detects when the control plane agent is in an unhealthy state, that is, when it has stopped responding, and then tries to recover it automatically.
1. When the control plane agent stops responding, a live core file is generated. You can find the core file as follows:
```
ls /var/core
netcpa-worker-zdump.000
```
2. A syslog error is reported in the vmkwarning.log file.
```
cat /var/run/log/vmkwarning.log | grep NETCPA
2017-08-11T06:32:17.994Z cpu1:1000044539)ALERT: Critical - NETCPA is hanged
Taking live-dump & restarting netcpa process!
```
Note:
If the control plane agent monitor experiences a temporary failure due to a delayed response to the status check, a warning message similar to the following might be reported in the VMkernel logs.
```
Warning - NETCPA getting netcpa status failed!
```
You can ignore this warning.
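
To separate real hang events from the ignorable warning when reviewing a log, the two messages can be matched by substring. The following sketch uses only the message text quoted above. The full format of the warning line is not shown in this section, so the timestamp in the sample warning and the function name `classify_netcpa_lines` are assumptions.

```python
SAMPLE_LOG = """\
2017-08-11T06:32:17.994Z cpu1:1000044539)ALERT: Critical - NETCPA is hanged
Taking live-dump & restarting netcpa process!
2017-08-11T07:01:02.000Z Warning - NETCPA getting netcpa status failed!
"""


def classify_netcpa_lines(log_text):
    """Split NETCPA log lines into (hang alerts, transient warnings)."""
    hangs, transient = [], []
    for line in log_text.splitlines():
        if "NETCPA" not in line:
            continue
        if "NETCPA is hanged" in line:
            # The agent stopped responding and auto-recovery was triggered.
            hangs.append(line)
        elif "getting netcpa status failed" in line:
            # Delayed status check; this warning can be ignored.
            transient.append(line)
    return hangs, transient


if __name__ == "__main__":
    hangs, transient = classify_netcpa_lines(SAMPLE_LOG)
    print(f"{len(hangs)} hang alert(s), {len(transient)} transient warning(s)")
```

On a host, the input would be the contents of /var/run/log/vmkwarning.log.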
If the control plane agent does not recover automatically, restart it as follows:
1. Log in as root to the ESXi host through SSH or through the console.
2. Run the /etc/init.d/netcpad restart command to restart the control plane agent on the ESXi host.