In NSX Data Center for vSphere, the control plane agent (netcpa) runs as a local daemon that communicates with NSX Manager and with the controller cluster. The Communication Channel Health feature is a health check that periodically reports the status of the channel between the central control plane and the local control plane to NSX Manager, where it is displayed in the NSX Manager UI. This report also serves as a heartbeat for detecting the operational status of the channel between NSX Manager and netcpa on the ESXi host. The feature provides error details during communication faults, generates an event when a channel goes into an unhealthy state, and generates heartbeat messages from NSX Manager to hosts.
Problem
Connectivity issues between the control plane agent and the controllers.
Cause
If any connection between the control plane agent and a controller is missing, the control plane agent might not be working properly.
Solution
- When the channel goes into an unhealthy state, validate the connection status by using the following API request:
GET https://<NSX_Manager_IP>/api/2.0/vdn/inventory/host/{hostId}/connection/status
For more information about using this API, see the NSX API Guide.
To find the "hostId", you can use either the NSX Manager CLI or the vCenter Managed Object Browser (MOB).
- To use the vCenter MOB, open a web browser and enter the URL of the vCenter MOB at http://vCenter-IP-Address/mob. See the instructions about finding cluster or host MOID in the NSX API Guide.
- To use the NSX Manager CLI, log in as an admin user, and run the show cluster all or show cluster clusterID command. For more information about these commands, see the NSX Command Line Quick Reference.
Following is an example of the API response:
<?xml version="1.0" encoding="UTF-8"?>
<hostConnStatus>
  <hostName>10.161.246.20</hostName>
  <hostId>host-21</hostId>
  <nsxMgrToFirewallAgentConn>UP</nsxMgrToFirewallAgentConn>
  <nsxMgrToControlPlaneAgentConn>UP</nsxMgrToControlPlaneAgentConn>
  <hostToControllerConn>DOWN</hostToControllerConn>
  <fullSyncCount>-1</fullSyncCount>
  <hostToControllerConnectionErrors>
    <hostToControllerConnectionError>
      <controllerIp>10.160.203.236</controllerIp>
      <errorCode>1255604</errorCode>
      <errorMessage>Connection Refused</errorMessage>
    </hostToControllerConnectionError>
    <hostToControllerConnectionError>
      <controllerIp>10.160.203.237</controllerIp>
      <errorCode>1255603</errorCode>
      <errorMessage>SSL Handshake Failure</errorMessage>
    </hostToControllerConnectionError>
  </hostToControllerConnectionErrors>
</hostConnStatus>
The following error codes are supported:
1255602: Incomplete Controller Certificate
1255603: SSL Handshake Failure
1255604: Connection Refused
1255605: Keep-alive Timeout
1255606: SSL Exception
1255607: Bad Message
1255620: Unknown Error
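The response format and error codes above can be processed with a short script. The following is a minimal sketch, not part of the NSX tooling: the function name `summarize_host_status` and the constants are hypothetical, the element names are taken from the example response, and the sketch assumes that you have already retrieved the response body as text.

```python
import xml.etree.ElementTree as ET

# Error codes exactly as listed in this topic.
ERROR_CODES = {
    1255602: "Incomplete Controller Certificate",
    1255603: "SSL Handshake Failure",
    1255604: "Connection Refused",
    1255605: "Keep-alive Timeout",
    1255606: "SSL Exception",
    1255607: "Bad Message",
    1255620: "Unknown Error",
}

# Channel elements that appear in the example response.
CHANNEL_TAGS = (
    "nsxMgrToFirewallAgentConn",
    "nsxMgrToControlPlaneAgentConn",
    "hostToControllerConn",
)

# Copy of the example response from this topic.
SAMPLE_RESPONSE = """
<?xml version="1.0" encoding="UTF-8"?>
<hostConnStatus>
  <hostName>10.161.246.20</hostName>
  <hostId>host-21</hostId>
  <nsxMgrToFirewallAgentConn>UP</nsxMgrToFirewallAgentConn>
  <nsxMgrToControlPlaneAgentConn>UP</nsxMgrToControlPlaneAgentConn>
  <hostToControllerConn>DOWN</hostToControllerConn>
  <fullSyncCount>-1</fullSyncCount>
  <hostToControllerConnectionErrors>
    <hostToControllerConnectionError>
      <controllerIp>10.160.203.236</controllerIp>
      <errorCode>1255604</errorCode>
      <errorMessage>Connection Refused</errorMessage>
    </hostToControllerConnectionError>
    <hostToControllerConnectionError>
      <controllerIp>10.160.203.237</controllerIp>
      <errorCode>1255603</errorCode>
      <errorMessage>SSL Handshake Failure</errorMessage>
    </hostToControllerConnectionError>
  </hostToControllerConnectionErrors>
</hostConnStatus>
"""


def summarize_host_status(xml_text):
    """Return (channel states, controller errors) from a hostConnStatus body."""
    # Strip surrounding whitespace so the XML declaration is the first token.
    root = ET.fromstring(xml_text.strip().encode("utf-8"))
    channels = {tag: root.findtext(tag) for tag in CHANNEL_TAGS}
    errors = []
    for err in root.iter("hostToControllerConnectionError"):
        code = int(err.findtext("errorCode"))
        # Prefer the documented name; fall back to the message in the response.
        message = ERROR_CODES.get(code, err.findtext("errorMessage"))
        errors.append((err.findtext("controllerIp"), code, message))
    return channels, errors
```

Calling `summarize_host_status(SAMPLE_RESPONSE)` returns a dictionary of the three channel states and a list of `(controllerIp, errorCode, message)` tuples, one for each controller that reported an error.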
- Determine the reason for the control plane agent being down as follows:
- Check the control plane agent status on hosts by running the /etc/init.d/netcpad status command on ESXi hosts.
[root@esx-01a:~] /etc/init.d/netcpad status
netCP agent service is running
- Check the control plane agent configurations using the more /etc/vmware/netcpa/config-by-vsm.xml command. The IP addresses of the NSX Controllers should be listed.
[root@esx-01a:~] more /etc/vmware/netcpa/config-by-vsm.xml
<config>
  <connectionList>
    <connection id="0000">
      <port>1234</port>
      <server>192.168.110.31</server>
      <sslEnabled>true</sslEnabled>
      <thumbprint>A5:C6:A2:B2:57:97:36:F0:7C:13:DB:64:9B:86:E6:EF:1A:7E:5C:36</thumbprint>
    </connection>
    <connection id="0001">
      <port>1234</port>
      <server>192.168.110.32</server>
      <sslEnabled>true</sslEnabled>
      <thumbprint>12:E0:25:B2:E0:35:D7:84:90:71:CF:C7:53:97:FD:96:EE:ED:7C:DD</thumbprint>
    </connection>
    <connection id="0002">
      <port>1234</port>
      <server>192.168.110.33</server>
      <sslEnabled>true</sslEnabled>
      <thumbprint>BD:DB:BA:B0:DC:61:AD:94:C6:0F:7E:F5:80:19:44:51:BA:90:2C:8D</thumbprint>
    </connection>
  </connectionList>
...
- Validate connections to the controllers from the control plane agent using the following command. The output should show one connection for each controller.
[root@esx-01a:~] esxcli network ip connection list | grep 1234
tcp 0 0 192.168.110.51:16594 192.168.110.31:1234 ESTABLISHED 36754 newreno netcpa-worker
tcp 0 0 192.168.110.51:46917 192.168.110.33:1234 ESTABLISHED 36754 newreno netcpa-worker
tcp 0 0 192.168.110.51:47891 192.168.110.32:1234 ESTABLISHED 36752 newreno netcpa-worker
- Check whether any connections from the control plane agent to the controllers are in the CLOSED or CLOSE_WAIT state by running the following command:
esxcli network ip connection list |grep "1234.*netcpa*" | egrep "CLOSED|CLOSE_WAIT"
- If the control plane agent has been down for a significant time, the connections might not be present at all. To validate the connection status, run the following command. The output should show one ESTABLISHED connection for each controller.
esxcli network ip connection list |grep "1234.*netcpa*" |grep ESTABLISHED
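The checks above amount to comparing the controllers listed in config-by-vsm.xml with the controllers that have an ESTABLISHED netcpa connection. The following minimal sketch shows that comparison. It assumes that you have copied the complete configuration file and the `esxcli network ip connection list` output off the host as text. The function names are hypothetical, the column positions are inferred from the sample output in this topic, and port 1234 is taken from the sample configuration. The sample data is abbreviated and altered for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated, well-formed stand-in for config-by-vsm.xml (illustrative only).
SAMPLE_CONFIG = """
<config>
  <connectionList>
    <connection id="0000"><port>1234</port><server>192.168.110.31</server></connection>
    <connection id="0001"><port>1234</port><server>192.168.110.32</server></connection>
    <connection id="0002"><port>1234</port><server>192.168.110.33</server></connection>
  </connectionList>
</config>
"""

# Connection list altered for illustration: .32 is closing and .33 is absent.
SAMPLE_CONNECTIONS = """
tcp 0 0 192.168.110.51:16594 192.168.110.31:1234 ESTABLISHED 36754 newreno netcpa-worker
tcp 0 0 192.168.110.51:47891 192.168.110.32:1234 CLOSE_WAIT 36752 newreno netcpa-worker
"""


def configured_controllers(config_xml):
    """Return the controller IPs found in <server> elements, in file order."""
    root = ET.fromstring(config_xml.strip())
    return [s.text.strip() for s in root.iter("server") if s.text]


def established_controllers(esxcli_output, port=1234):
    """Return the remote IPs that have an ESTABLISHED netcpa connection."""
    found = set()
    for line in esxcli_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, foreign, state, ..., name.
        if len(fields) < 6 or "netcpa" not in fields[-1]:
            continue
        remote_ip, _, remote_port = fields[4].rpartition(":")
        if remote_port == str(port) and fields[5] == "ESTABLISHED":
            found.add(remote_ip)
    return found


def missing_controllers(config_xml, esxcli_output):
    """Return configured controllers that lack an ESTABLISHED connection."""
    up = established_controllers(esxcli_output)
    return [ip for ip in configured_controllers(config_xml) if ip not in up]
```

An empty list from `missing_controllers` means that every configured controller has an established connection. Any IP address in the list is a controller to investigate further.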
- Control Plane Agent (netcpa) auto-recovery mechanism: An automatic monitoring process detects when the control plane agent is in an unhealthy state. In that state, the control plane agent stops responding, and the monitoring process automatically tries to recover it.
- When the control plane agent stops responding, a live core file is generated. You can find the core file as follows:
ls /var/core
netcpa-worker-zdump.000
- A syslog error is reported in the vmkwarning.log file.
cat /var/run/log/vmkwarning.log | grep NETCPA
2017-08-11T06:32:17.994Z cpu1:1000044539)ALERT: Critical - NETCPA is hanged Taking live-dump & restarting netcpa process!
Note: If the control plane agent monitor experiences a temporary failure due to a delayed response to the status check, a warning message similar to the following might be reported in the VMkernel logs.
Warning - NETCPA getting netcpa status failed!
You can ignore this warning.
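When reviewing a longer log, it can help to separate the critical alerts from the warning that can be ignored. The following is a minimal sketch that matches only on the message text quoted in this topic. The function name `classify_netcpa_lines` is hypothetical, and real log lines might carry prefixes or wording that differ from these samples.

```python
# Message text quoted in this topic for the warning that can be ignored.
IGNORABLE_TEXT = "NETCPA getting netcpa status failed"

SAMPLE_LOG = """\
2017-08-11T06:32:17.994Z cpu1:1000044539)ALERT: Critical - NETCPA is hanged Taking live-dump & restarting netcpa process!
Warning - NETCPA getting netcpa status failed!
"""


def classify_netcpa_lines(log_text):
    """Split NETCPA log lines into (critical alerts, ignorable warnings)."""
    critical, ignorable = [], []
    for line in log_text.splitlines():
        if "NETCPA" not in line:
            continue  # Not related to the control plane agent monitor.
        if IGNORABLE_TEXT in line:
            ignorable.append(line)
        elif "ALERT" in line:
            critical.append(line)
    return critical, ignorable
```

With `SAMPLE_LOG`, the first returned list holds the "NETCPA is hanged" alert and the second holds the status-check warning.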
- If the control plane agent does not recover automatically, restart it as follows:
- Log in as root to the ESXi host through SSH or through the console.
- Run the /etc/init.d/netcpad restart command to restart the control plane agent on the ESXi host.