The following tables describe the events that trigger alarms, including alarm messages and recommended actions to resolve them. Any event with a severity greater than LOW triggers an alarm.

Alarm Management Events

Alarm management events arise from the NSX Manager and Global Manager nodes.

Event Name Severity Alert Message Recommended Action
Alarm Service Overloaded Critical

The alarm service is overloaded.

When event detected: "Due to heavy volume of alarms reported, the alarm service is temporarily overloaded. The NSX UI and GET /api/v1/alarms NSX API have stopped reporting new alarms; however, syslog entries and SNMP traps (if enabled) are still being emitted reporting the underlying event details. When the underlying issues causing the heavy volume of alarms are addressed, the alarm service will start reporting new alarms again."

When event resolved: "The heavy volume of alarms has subsided and new alarms are being reported again."

Review all active alarms using the Alarms page in the NSX UI or using the GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED NSX API. For each active alarm investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service will start reporting new alarms again.
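For example, a quick way to list the active alarms from a shell is sketched below; the Manager address and admin credentials are placeholders:

    # List all open, acknowledged, and suppressed alarms.
    curl -k -u 'admin:<password>' "https://<nsx-mgr>/api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED"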

Heavy Volume of Alarms Critical

Heavy volume of a specific alarm type detected.

When event detected: "Due to heavy volume of {event_id} alarms, the alarm service has temporarily stopped reporting alarms of this type. The NSX UI and GET /api/v1/alarms NSX API are not reporting new instances of these alarms. Syslog entries and SNMP traps (if enabled) are still being emitted reporting the underlying event details. When the underlying issues causing the heavy volume of {event_id} alarms are addressed, the alarm service starts reporting new {event_id} alarms when new issues are detected again."

When event resolved: "The heavy volume of {event_id} alarms has subsided and new alarms of this type are being reported again."

Review all active alarms using the Alarms page in the NSX UI or using the GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED NSX API. For each active alarm, investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service will start reporting new {event_id} alarms again.

Audit Log Health Events

Audit log health events arise from the NSX Manager and Global Manager nodes.

Event Name Severity Alert Message Recommended Action

Audit Log Health

Critical

At least one of the monitored log files cannot be written to.

When event detected: "At least one of the monitored log files has read-only permissions or has incorrect user/group ownership or rsyslog.log is missing on Manager, Global Manager, Edge or Public Cloud Gateway nodes."

When event resolved: "All monitored log files have the correct file permissions and ownership and rsyslog.log exists on Manager, Global Manager, Edge or Public Cloud Gateway nodes."

  1. On all NSX appliances, for example, Manager nodes and Edge nodes, ensure the permissions for the /var/log directory are 775 and the ownership is root:syslog.
  2. On Manager and Global Manager nodes, ensure the file permissions for auth.log, nsx-audit.log, nsx-audit-write.log, rsyslog.log, and syslog.log under /var/log are 640 and the ownership is syslog:admin.
  3. On Edge and Public Cloud Gateway nodes, ensure the file permissions for rsyslog.log and syslog.log under /var/log are 640 and the ownership is syslog:admin.
  4. On ESXi host nodes, ensure the file permissions of auth.log, nsx-syslog.log and syslog.log under /var/log are 755 and the ownership is root:root.
  5. On KVM host nodes, ensure the file permissions of auth.log and syslog.log under /var/log are 775 and the ownership is root:syslog.
  6. If any of these files have incorrect permissions or ownership, invoke the commands chmod <mode> <path> and chown <user>:<group> <path>, as sketched after this list.
  7. If rsyslog.log is missing on Manager, Global Manager, Edge or Public Cloud Gateway nodes, invoke the NSX CLI command restart service syslog, which restarts the logging service and regenerates /var/log/rsyslog.log.
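For example, a minimal shell sketch of steps 1, 2, and 6 on a Manager node; adjust the paths, modes, and ownership per node type as listed above:

    # Fix the log directory permissions and ownership (step 1).
    chmod 775 /var/log
    chown root:syslog /var/log
    # Fix the monitored log files on a Manager node (step 2).
    for f in auth.log nsx-audit.log nsx-audit-write.log rsyslog.log syslog.log; do
      chmod 640 /var/log/$f
      chown syslog:admin /var/log/$f
    done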

Remote Logging Server Error

Critical

Log messages undeliverable due to incorrect remote logging server configuration.

When event detected: "Log messages to logging server {hostname_or_ip_address_with_port} ({entity_id}) cannot be delivered possibly due to an unresolvable FQDN, an invalid TLS certificate or missing NSX appliance iptables rule."

When event resolved: "Configuration for logging server {hostname_or_ip_address_with_port} ({entity_id}) appear correct."

  1. Ensure that {hostname_or_ip_address_with_port} is the correct hostname or IP address and port.
  2. If the logging server is specified using an FQDN, ensure the FQDN is resolvable from the NSX appliance using the NSX CLI command nslookup <fqdn>. If not resolvable, verify the correct FQDN is specified and the network DNS server has the required entry for the FQDN.
  3. If the logging server is configured to use TLS, verify the specified certificate is valid. For example, ensure the logging server is actually using the certificate or verify the certificate has not expired using the openssl command openssl x509 -in <cert-file-path> -noout -dates.
  4. NSX appliances use iptables rules to explicitly allow outgoing traffic. Verify the iptables rule for the logging server is configured properly by invoking the NSX CLI command verify logging-servers which re-configures logging server iptables rules as needed.
  5. If for any reason the logging server is misconfigured, it should be deleted using the NSX CLI command del logging-server <hostname-or-ip-address[:port]> proto <proto> level <level> and re-added with the correct configuration.
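For example, assuming a misconfigured UDP logging server at the placeholder address 10.1.1.50:514, the NSX CLI sequence might look like the following sketch:

    del logging-server 10.1.1.50:514 proto udp level info
    set logging-server 10.1.1.50:514 proto udp level info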

To learn more about how to configure NSX-T Data Center appliances and hypervisors to send log messages to a remote logging server, see Configure Remote Logging.

If logs are not received by the remote log server, see Troubleshooting Syslog Issues.

Capacity Events

The following events can trigger alarms when the current inventory of certain categories of objects reaches a certain level. For more information, see View the Usage and Capacity of Categories of Objects.

Event Name Severity Alert Message Recommended Action
Maximum Capacity Critical

A maximum capacity has been breached.

When event detected: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} which is at or above the maximum supported count of {max_supported_capacity_count}."

When event resolved: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} and is below the maximum supported count of {max_supported_capacity_count}.",

  1. Ensure that the number of NSX objects created is within the limits supported by NSX. If there are any unused objects, delete them using the respective NSX UI or API from the system.
  2. Consider increasing the form factor of all Manager nodes and/or Edge nodes. Note that the form factor of each node type should be the same. If not the same, the capacity limits for the lowest form factor deployed are used.
Maximum Capacity Threshold High

A maximum capacity threshold has been breached.

When event detected: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} which is at or above the maximum capacity threshold of {max_capacity_threshold}%."

When event resolved: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} and is below the maximum capacity threshold of {max_capacity_threshold}%."

Navigate to the capacity page in the NSX UI and review current usage versus threshold limits. If the current usage is expected, consider increasing the maximum threshold values. If the current usage is unexpected, review the network policies configured to decrease usage below the maximum threshold.
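The usage figures shown on the capacity page can also be retrieved programmatically. A minimal sketch, assuming admin credentials and a placeholder Manager address; verify the endpoint against your version's API reference:

    # Retrieve current capacity usage for all object categories.
    curl -k -u 'admin:<password>' "https://<nsx-mgr>/api/v1/capacity/usage"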

Minimum Capacity Threshold Medium

A minimum capacity threshold has been breached.

When event detected: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} which is at or above the minimum capacity threshold of {min_capacity_threshold}%."

When event resolved: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} and is below the minimum capacity threshold of {min_capacity_threshold}%."

Navigate to the capacity page in the NSX UI and review current usage versus threshold limits. If the current usage is expected, consider increasing the minimum threshold values. If the current usage is unexpected, review the network policies configured to decrease usage below the minimum threshold.

Certificate Events

Certificate events arise from the NSX Manager node.

Event Name Severity Alert Message Recommended Action
Certificate Expired Critical

A certificate has expired.

When event detected: "Certificate {entity-id} has expired."

When event resolved: "The expired certificate {entity-id} has been removed or is no longer expired.

Ensure services that are currently using the certificate are updated to use a new, non-expired certificate. For example, to apply a new certificate to the HTTP service, invoke the following API call:

POST /api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id>

where <cert-id> is the ID of a valid certificate reported by the API call GET /api/v1/trust-management/certificates.

After the expired certificate is no longer in use, it should be deleted with the following API call:

DELETE /api/v1/trust-management/certificates/{entity_id}
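For example, a minimal shell sketch of the two calls above, with placeholder Manager address, credentials, and certificate IDs:

    # Apply a valid replacement certificate to the HTTP service.
    curl -k -u 'admin:<password>' -X POST \
      "https://<nsx-mgr>/api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id>"
    # Delete the expired certificate once it is no longer in use.
    curl -k -u 'admin:<password>' -X DELETE \
      "https://<nsx-mgr>/api/v1/trust-management/certificates/<expired-cert-id>"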

Certificate Is About To Expire High

A certificate is about to expire.

When event detected: "Certificate {entity-id} is about to expire."

When event resolved: "The expiring certificate {entity-id} or is no longer about to expire."

Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. For example, to apply a new certificate to the HTTP service, invoke the following API call:

POST /api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id>

where <cert-id> is the ID of a valid certificate reported by the API call GET /api/v1/trust-management/certificates.

After the expiring certificate is no longer in use, it should be deleted using the API call:

DELETE /api/v1/trust-management/certificates/{entity_id}

Certificate Expiration Approaching Medium

A certificate is approaching expiration.

When event detected: "Certificate {entity-id} is approaching expiration."

When event resolved: "The expiring certificate {entity-id} or is no longer approaching expiration."

Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. For example, to apply a new certificate to the HTTP service, invoke the following API call:

POST /api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id>

where <cert-id> is the ID of a valid certificate reported by the API call GET /api/v1/trust-management/certificates.

After the expiring certificate is no longer in use, it should be deleted using the API call:

DELETE /api/v1/trust-management/certificates/{entity_id}

CNI Health Events

CNI health events arise from the ESXi and KVM nodes.

Event Name Severity Alert Message Recommended Action
Hyperbus Manager Connection Down Medium

Hyperbus cannot communicate with the Manager node.

When event detected: "Hyperbus cannot communicate with the Manager node."

When event resolved: "Hyperbus can communicate with the Manager node."

The hyperbus vmkernel interface (vmk50) may be missing. See Knowledge Base article 67432.
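As a first check, the following sketch confirms from the ESXi host shell whether vmk50 exists; output formatting varies by release:

    # List vmkernel interfaces and look for the hyperbus interface vmk50.
    esxcli network ip interface list | grep -A 3 vmk50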

DHCP Events

DHCP events arise from the NSX Edge and public gateway nodes.

Event Name Severity Alert Message Recommended Action
Pool Lease Allocation Failed High

IP addresses in an IP Pool have been exhausted.

When event detected: "The addresses in IP Pool {entity_id} of DHCP Server {dhcp_server_id} have been exhausted. The last DHCP request has failed and future requests will fail."

When event resolved: "IP Pool {entity_id} of DHCP Server {dhcp_server_id} is no longer exhausted. A lease is successfully allocated to the last DHCP request."

Review the DHCP pool configuration in the NSX UI or on the edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool.

Also review the current active leases on the edge node by invoking the NSX CLI command get dhcp lease.

Compare the leases to the number of active VMs. Consider reducing the lease time on the DHCP server configuration if the number of VMs is low compared to the number of active leases. Also consider expanding the pool range for the DHCP server by visiting the Networking > Segments > Segment page in the NSX UI.

Pool Overloaded Medium

An IP Pool is overloaded.

When event detected: "DHCP Server {dhcp_server_id} IP Pool {entity_id} usage is approaching exhaustion with {dhcp_pool_usage}% IPs allocated."

When event resolved: "The DHCP Server {dhcp_server_id} IP Pool {entity_id} has fallen below the high usage threshold."

Review the DHCP pool configuration in the NSX UI or on the edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool.

Also review the current active leases on the edge node by invoking the NSX CLI command get dhcp lease.

Compare the leases to the number of active VMs. Consider reducing the lease time on the DHCP server configuration if the number of VMs is low compared to the number of active leases. Also consider expanding the pool range for the DHCP server by visiting the Networking > Segments > Segment page in the NSX UI.

Distributed Firewall Events

Distributed firewall events arise from the NSX Manager or ESXi nodes.

Event Name Severity Alert Message Recommended Action

DFW CPU Usage Very High

Critical

DFW CPU usage is very high.

When event detected: "The DFW CPU usage on Transport node {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The DFW CPU usage on Transport node {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%."

Consider re-balancing the VM workloads on this host to other hosts.

Please review the security design for optimization. For example, use the apply-to configuration if the rules are not applicable to the entire datacenter.

DFW Memory Usage Very High

Critical

DFW Memory usage is very high.

When event detected: "The DFW memory usage {heap_type} on Transport Node {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The DFW memory usage {heap_type} on Transport Node {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%."

View the current DFW memory usage by invoking the NSX CLI command get firewall thresholds on the host.

Consider re-balancing the workloads on this host to other hosts.

Distributed IDS/IPS Events

Distributed IDS/IPS events arise from the NSX Manager or ESXi nodes.

Event Name Severity Alert Message Recommended Action

NSX IDPS Engine CPU Usage Very High

Critical

NSX-IDPS engine CPU usage is at or above 95%.

When event detected: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is at or above the very high threshold value of 95%."

When event resolved: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is below the very high threshold value of 95%."

Consider re-balancing the VM workloads on this host to other hosts.

NSX IDPS Engine Down

Critical

NSX IDPS is enabled via NSX Policy and IDPS rules are configured, but NSX-IDPS engine is down.

When event detected: "NSX IDPS is enabled via NSX policy and IDPS rules are configured, but NSX-IDPS engine is down."

When event resolved: "NSX IDPS is in one of the cases below. 1. NSX IDPS is disabled via NSX policy. 2. NSX IDPS engine is enabled, NSX-IDPS engine and vdpi are up, and NSX IDPS has been enabled and IDPS rules are configured via NSX Policy."

  1. Check /var/log/nsx-idps/nsx-idps.log and /var/log/nsx-syslog.log to see if there are errors reported.
  2. Invoke the following NSX CLI command to check whether NSX Distributed IDPS is in a disabled state.

    get ids engine status

    If it is disabled, invoke the following command on the host to start the service.

    /etc/init.d/nsx-idps start
  3. Invoke the following command on the host to check whether nsx-vdpi is running.

    /etc/init.d/nsx-vdpi status

    If it is not running, invoke the following command on the host to start the service.

    /etc/init.d/nsx-vdpi start

NSX IDPS Engine Memory Usage Very High

Critical

NSX-IDPS engine memory usage is at or above 95%.

When event detected: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is at or above the very high threshold value of 95%."

When event resolved: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is below the very high threshold value of 95%."

Consider re-balancing the VM workloads on this host to other hosts.

DNS Events

DNS events arise from the NSX Edge and public gateway nodes.

Event Name Severity Alert Message Recommended Action
Forwarder Down High

A DNS forwarder is down.

When event detected: "DNS forwarder {entity_id} is not running. This is impacting the identified DNS Forwarder that is currently enabled."

When event resolved: "DNS forwarder {entity_id} is running again."

  1. Invoke the NSX CLI command get dns-forwarders status to verify whether the DNS forwarder is in a down state.
  2. Check /var/log/syslog to see if there are errors reported.
  3. Collect a support bundle and contact the NSX support team.
Forwarder Disabled
Note: Alarm deprecated starting from NSX-T Data Center 3.2.
Low

A DNS forwarder is disabled.

When event detected: "DNS forwarder {entity_id} is disenabled."

When event resolved: ""DNS forwarder {entity_id} is enabled."

  1. Invoke the NSX CLI command get dns-forwarders status to verify whether the DNS forwarder is in a disabled state.
  2. If the DNS forwarder should not be disabled, use the NSX Policy API or Manager API to enable it, as sketched below.
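One possible Manager API call is sketched below; the action endpoint and the placeholder values should be verified against your version's API reference:

    # Enable the DNS forwarder identified by <forwarder-id> (placeholder).
    curl -k -u 'admin:<password>' -X POST \
      "https://<nsx-mgr>/api/v1/dns/forwarders/<forwarder-id>?action=enable"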

Edge Events

Edge events arise when some configuration values of an edge transport node differ between NSX and the Edge appliance.

Event Name Severity Alert Message Recommended Action

Edge Node Settings Mismatch

Critical

Edge node settings mismatch.

When event detected: "The edge node {entity_id} settings configuration does not match the policy intent configuration. The edge node configuration visible to the user on UI or API is not the same as what is realized. The realized edge node changes made by the user outside of NSX Manager are shown in the details of this alarm and any edits in UI or API will overwrite the realized configuration. Fields that differ for the edge node are listed in the runtime data."

When event resolved: "Edge node {entity_id} node settings are consistent with policy intent now."

Review the node settings of this edge transport node {entity_id}. Perform one of the following actions to resolve the alarm.
  • Manually update edge transport node setting policy intent using the API PUT https://<manager-ip>/api/v1/transport-nodes/<tn-id>.
  • Accept intent or realized edge node settings for this edge transport node through edge transport node resolver.
  • Accept the edge node settings configuration using the refresh API POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=refresh_node_configuration&resource_type=EdgeNode.
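For example, a minimal shell sketch of the refresh call above, with placeholder Manager address, credentials, and transport node ID:

    # Accept the realized edge node settings by refreshing the node configuration.
    curl -k -u 'admin:<password>' -X POST \
      "https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=refresh_node_configuration&resource_type=EdgeNode"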

Edge Vm vSphere Settings Mismatch

Critical

Edge VM vSphere settings mismatch.

When event detected: "The edge node {entity_id} configuration on vSphere does not match the policy intent configuration. The edge node configuration visible to the user on UI or API is not the same as what is realized. The realized edge node changes made by the user outside of NSX Manager are shown in the details of this alarm and any edits in UI or API will overwrite the realized configuration. Fields that differ for the edge node are listed in the runtime data."

When event resolved: "Edge node {entity_id} VM vSphere settings are consistent with policy intent now."

Review the vSphere configuration of this edge transport node {entity_id}. Perform one of the following actions to resolve the alarm.
  • Accept intent or vSphere realized edge node configuration for this edge transport node through edge transport node resolver.
  • Resolve alarm by accepting the edge node vSphere realized configuration using the refresh API POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=refresh_node_configuration&resource_type=EdgeNode.

Edge Node Settings And vSphere Settings Are Changed

Critical

Edge node settings and vSphere settings are changed.

When event detected: "The edge node {entity_id} settings and vSphere configuration are changed and does not match the policy intent configuration. The edge node configuration visible to the user on UI or API is not the same as what is realized. The realized edge node changes made by the user outside of NSX Manager are shown in the details of this alarm and any edits in UI or API will overwrite the realized configuration. Fields that differ for edge node settings and vSphere configuration are listed in the runtime data."

When event resolved: "Edge node {entity_id} node settings and vSphere settings are consistent with policy intent now."

Review the node settings and vSphere configuration of this edge transport node {entity_id}. Perform one of the following actions to resolve the alarm.
  • Manually update edge transport node setting policy intent using the API PUT https://<manager-ip>/api/v1/transport-nodes/<tn-id>.
  • Accept intent or vSphere realized edge node configuration or realized edge node settings for this edge transport node through edge transport node resolver.
  • Accept the edge node settings and vSphere realized configuration using the refresh API POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=refresh_node_configuration&resource_type=EdgeNode.

Edge vSphere Location Mismatch

High

Edge vSphere location mismatch.

When event detected: "The edge node {entity_id} has been moved using vMotion. The edge node {entity_id} configuration on vSphere does not match the policy intent configuration. The edge node configuration visible to the user on UI or API is not the same as what is realized. The realized edge node changes made by the user outside of NSX Manager are shown in the details of this alarm. Fields that differ for the edge node are listed in the runtime data"

When event resolved: "Edge node {entity_id} node vSphere settings are consistent with policy intent now."

Review the vSphere configuration of this edge transport node {entity_id}. Perform one of the following actions to resolve the alarm.
  • Accept the edge node vSphere realized config using the refresh API POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=refresh_node_configuration&resource_type=EdgeNode.
  • If you want to return to the previous location, use the NSX Redeploy API POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=redeploy. vMotion back to the original host is not supported.

Edge Health Events

Edge health events arise from the NSX Edge and public gateway nodes.

Event Name Severity Alert Message Recommended Action
Edge CPU Usage Very High Critical

Edge node CPU usage is very high.

When event detected: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this edge node. Consider adjusting the edge appliance form factor size or rebalancing services to other edge nodes for the applicable workload.
Edge CPU Usage High Medium

Edge node CPU usage is high.

When event detected: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%."

When event resolved: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this edge node. Consider adjusting the edge appliance form factor size or rebalancing services to other edge nodes for the applicable workload.
Edge Datapath Configuration Failure High

Edge node datapath configuration failed.

When event detected: "Failed to enable the datapath on the Edge node after three attempts."

When event resolved: "Datapath on the Edge node has been successfully enabled."

Ensure the edge node connection to the Manager node is healthy.

From the edge node NSX CLI, invoke the command get services to check the health of services.

If the dataplane service is stopped, invoke the command start service dataplane to restart it.

Edge Datapath CPU Very High

Critical

Edge node datapath CPU usage is very high.

When event detected: "The datapath CPU usage on Edge node {entity-id} has reached {datapath_resource_usage}% which is at or above the very high threshold for at least two minutes."

When event resolved: "Datapath CPU usage on Edge node {entity-id} has reduced below the maximum threshold."

Review the CPU statistics on the edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core.

Higher CPU usage is expected with higher packet rates.

Consider increasing the edge appliance form factor size and rebalancing services on this edge node to other edge nodes in the same cluster or other edge clusters.

Edge Datapath CPU Usage High Medium

Edge node datapath CPU usage is high.

When event detected: "The datapath CPU usage on Edge node {entity-id} has reached {datapath_resource_usage}% which is at or above the high threshold for at least two minutes."

When event resolved: "The CPU usage on Edge node {entity-id} has reached below the high threshold."

Review the CPU statistics on the edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core.

Higher CPU usage is expected with higher packet rates.

Consider increasing the edge appliance form factor size and rebalancing services on this edge node to other edge nodes in the same cluster or other edge clusters.

Edge Datapath Cryptodrv Down

Critical

Edge node crypto driver is down.

When event detected: "Edge node crypto driver {edge_crypto_drv_name} is down."

When event resolved: "Edge node crypto driver {edge_crypto_drv_name} is up."

Upgrade the edge node as needed.

Edge Datapath Mempool High

Medium

Edge node datapath mempool is high.

When event detected: "The datapath mempool usage for {mempool_name} on Edge node {entity-id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%."

When event resolved: "The datapath mempool usage for {mempool_name} on Edge node {entity-id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%."

Log in as the root user and invoke the commands edge-appctl -t /var/run/vmware/edge/dpd.ctl mempool/show and edge-appctl -t /var/run/vmware/edge/dpd.ctl memory/show malloc_heap to check DPDK memory usage.
Edge Disk Usage Very High Critical

Edge node disk usage is very high.

When event detected: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%."

Examine the partition with high usage and see if there are any unexpected large files that can be removed.
Edge Disk Usage High Medium

Edge node disk usage is high.

When event detected: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%."

Examine the partition with high usage and see if there are any unexpected large files that can be removed.
Edge Global ARP Table Usage High Medium

The Edge node global ARP table usage is high.

When event detected: "Global ARP table usage on edge node {entity-id} has reached {datapath_resource_usage}% which is above the high threshold for over two minutes."

When event resolved: "Global ARP table usage on Edge node {entity-id} has reached below the high threshold."

  1. Log in as the root user and invoke the following command to check whether the neigh cache usage is normal.

    edge-appctl -t /var/run/vmware/edge/dpd.ctl neigh/show
  2. If it is normal, invoke the following command to increase the ARP table size.

    edge-appctl -t /var/run/vmware/edge/dpd.ctl neigh/set_param max_entries
Edge Memory Usage Very High Critical

Edge node memory usage is very high.

When event detected: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this edge node. Consider adjusting the edge appliance form factor size or rebalancing services to other edge nodes for the applicable workload.
Edge Memory Usage High Medium

Edge node memory usage is high.

When event detected: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%."

When event resolved: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this edge node. Consider adjusting the edge appliance form factor size or rebalancing services to other edge nodes for the applicable workload.
Edge NIC Link Status Down Critical

Edge node NIC link is down.

When event detected: "Edge node NIC {edge_nic_name} link is down."

When event detected: "Edge node NIC {edge_nic_name} link is up."

On the edge node, confirm if the NIC link is physically down by invoking the NSX CLI command get interfaces.

If it is down, verify the cable connection.

Edge NIC Out of Receive Buffer Medium

Edge node NIC is out of RX ring buffers temporarily.

When event detected: "Edge NIC {edge_nic_name} receive ring buffer has overflowed by {rx_ring_buffer_overflow_percentage}% on Edge node {entity_id}. The missed packet count is {rx_misses} and processed packet count is {rx_processed}."

When event resolved: "Edge NIC {edge_nic_name} receive ring buffer usage on Edge node {entity_id} is no longer overflowing."

  1. Run the NSX CLI command get dataplane cpu stats on the edge node and check:
    1. If CPU usage is high, that is, > 90%, take a packet capture on the interface using the command start capture interface <interface-name> direction input, or start capture interface <interface-name> direction input core <core-id> to capture packets ingressing on a specific core whose usage is high. Then analyze the capture to check whether the majority of packets are fragmented or IPsec packets. If yes, this is expected behavior. If not, the datapath is probably busy with other operations. If this alarm lasts more than 2-3 minutes, contact VMware Support.
    2. If CPU usage is not high, that is, < 90%, check whether the rx pps is high using the command get dataplane cpu stats, to make sure the traffic rate is increasing. If so, increase the ring size by 1024 using the command set dataplane ring-size rx <ring-size>.
      Note: Continually increasing the ring size in increments of 1024 can lead to performance issues. If the issue persists even after increasing the ring size, it is an indication that the edge needs a larger form factor deployment to accommodate the traffic.
    3. If the alarm keeps flapping, that is, triggers and resolves very quickly, it is due to bursty traffic. In this case, check the rx pps as described above. If it is not high during the alarm active period, contact VMware Support. If the pps is high, that confirms bursty traffic; consider suppressing the alarm.
      Note: There is no specific benchmark for what counts as a high pps value. It depends on the infrastructure and the type of traffic. You can compare the values recorded while the alarm is inactive and while it is active.
Edge NIC Out of Transmit Buffer Critical

Edge node NIC is out of TX ring buffers temporarily.

When event detected: "Edge NIC {edge_nic_name} transmit ring buffer has overflowed by {tx_ring_buffer_overflow_percentage}% on Edge node {entity_id}. The missed packet count is {tx_misses} and processed packet count is {tx_processed}."

When event resolved: "Edge NIC {edge_nic_name} transmit ring buffer usage on Edge node {entity_id} is no longer overflowing."

  1. If the hypervisor hosts many VMs alongside the edge, the edge VM might not get enough time to run, and the hypervisor might not retrieve the packets in time. In that case, consider migrating the edge VM to a host with fewer VMs.
  2. Increase the ring size by 1024 using the command set dataplane ring-size tx <ring-size>. If the issue persists even after increasing the ring size, contact VMware Support, as the ESX-side transmit ring buffer might be set to a lower value. If there is no issue on the ESX side, it indicates that the edge needs to be scaled to a larger form factor deployment to accommodate the traffic.
  3. If the alarm keeps flapping, that is, triggers and resolves very quickly, it is due to bursty traffic. In this case, check the tx pps using the command get dataplane cpu stats. If it is not high during the alarm active period, contact VMware Support. If the pps is high, that confirms bursty traffic; consider suppressing the alarm.
    Note: There is no specific benchmark for what counts as a high pps value. It depends on the infrastructure and the type of traffic. You can compare the values recorded while the alarm is inactive and while it is active.
Storage Error Critical Starting in NSX-T Data Center 3.0.1.

One or more disk partitions on the Edge node are in read-only mode.

When event detected: "The following disk partitions on the Edge node are in read-only mode: {disk_partition_name}."

When event resolved: "The following disk partitions on the Edge node have recovered from read-only mode: {disk_partition_name}"

Examine the read-only partition to see if reboot resolves the issue or the disk needs to be replaced. Contact GSS for more information.

Edge Datapath NIC Throughput High

Medium

Edge node datapath NIC throughput is high.

When event detected: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is at or above the high threshold value of {nic_throughput_threshold}%."

When event resolved: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is below the high threshold value of {nic_throughput_threshold}%."

Examine the traffic throughput levels on the NIC, and determine whether configuration changes are needed. Run the following command to monitor throughput.

get dataplane throughput <seconds>

Edge Datapath NIC Throughput Very High

Critical

Edge node datapath NIC throughput is very high.

When event detected: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is at or above the very high threshold value of {nic_throughput_threshold}%."

When event resolved: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is below the very high threshold value of {nic_throughput_threshold}%."

Examine the traffic throughput levels on the NIC, and determine whether configuration changes are needed. Invoke the following NSX CLI command to monitor throughput.

get dataplane throughput <seconds>

Failure Domain Down

Critical

All members of failure domain are down.

When event detected: "All members of failure domain {transport_node_id} are down."

When event resolved: "All members of failure domain {transport_node_id} are reachable."
  1. On the edge node identified by {transport_node_id}, check the connectivity to the management and control planes by invoking the following NSX CLI commands.

    get managers
    get controllers
  2. Invoke the following NSX CLI command to check the management interface status.

    get interface eth0
  3. Invoke the following NSX CLI command to check the status of core services such as dataplane, local-controller, nestdb, and router.

    get services
  4. Inspect /var/log/syslog for suspect errors.
  5. Reboot the edge node.

Datapath Thread Deadlocked

Critical

Edge node's datapath thread is in a deadlock condition.

When event detected: "Edge node datapath thread {edge_thread_name} is deadlocked."

When event resolved: "Edge node datapath thread {edge_thread_name} is free from deadlock."

Restart the dataplane service by invoking the following NSX CLI command.

restart service dataplane

Endpoint Protection Events

Endpoint protection events arise from the NSX Manager or ESXi nodes.

Event Name Severity Alert Message Recommended Action
EAM Status Down Critical

ESX Agent Manager (EAM) service on a compute manager is down.

When event detected: "ESX Agent Manager (EAM) service on compute manager {entity_id} is down."

When event resolved: "ESX Agent Manager (EAM) service on compute manager {entity_id} is either up or compute manager {entity_id} has been removed."

Restart the ESX Agent Manager (EAM) service:
  • SSH into the vCenter node and run:
    service vmware-eam start
Partner Channel Down Critical

Host module and Partner SVM connection is down.

When event detected: "The connection between host module and Partner SVM {entity_id} is down."

When event resolved: "The connection between host module and Partner SVM {entity_id} is up."

See Knowledge Base article 2148821, Troubleshooting NSX Guest Introspection, and make sure that the Partner SVM identified by {entity_id} is reconnected to the host module.

Gateway Firewall Events

Gateway firewall events arise from NSX Edge nodes.

Event Name Severity Alert Message Recommended Action

ICMP Flow Count Exceeded

Critical Starting in NSX-T Data Center 3.1.3.

The gateway firewall flow table for ICMP traffic has exceeded the set threshold. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.

When event detected: “Gateway firewall flow table usage for ICMP traffic on logical router {entity_id} has reached {firewall_icmp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.”

When event resolved: “Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%.”

  1. Log in as administrator on the edge node and invoke the following NSX CLI command with the right interface UUID to check the flow table usage for ICMP flows.

    get firewall <LR_INT_UUID> interface stats | json
  2. Check that the traffic going through the gateway is not a DoS attack or an anomalous burst.
  3. If the traffic appears to be within the normal load, but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another edge node.
ICMP Flow Count High Medium Starting in NSX-T Data Center 3.1.3.

The gateway firewall flow table usage for ICMP traffic is high. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.

When event detected: “Gateway firewall flow table usage for ICMP on logical router {entity_id} has reached {firewall_icmp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.”

When event resolved: “Gateway firewall flow table usage for ICMP on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%.”

  1. Log in as administrator on the edge node and invoke the following NSX CLI command with the right interface UUID to check the flow table usage for ICMP flows.

    get firewall <LR_INT_UUID> interface stats | json
  2. Check that the traffic going through the gateway is not a DoS attack or an anomalous burst.
  3. If the traffic appears to be within the normal load, but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another edge node.
IP Flow Count Exceeded Critical Starting in NSX-T Data Center 3.1.3.

The gateway firewall flow table for IP traffic has exceeded the set threshold. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.

When event detected: “Gateway firewall flow table usage for IP traffic on logical router {entity_id} has reached {firewall_ip_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.”

When event resolved: “Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%."

  1. Log in as administrator on the edge node and invoke the following NSX CLI command with the right interface UUID to check the flow table usage for IP flows.

    get firewall <LR_INT_UUID> interface stats | json
  2. Check that the traffic going through the gateway is not a DoS attack or an anomalous burst.
  3. If the traffic appears to be within the normal load, but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another edge node.
IP Flow Count High Medium Starting in NSX-T Data Center 3.1.3.

The gateway firewall flow table usage for IP traffic is high. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.

When event detected: “Gateway firewall flow table usage for IP on logical router {entity_id} has reached {firewall_ip_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.”

When event resolved: “Gateway firewall flow table usage for non-IP flows on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%.”

  1. Log in as administrator on the edge node and invoke the following NSX CLI command with the right interface UUID to check the flow table usage for IP flows.

    get firewall <LR_INT_UUID> interface stats | json
  2. Check that the traffic going through the gateway is not a DoS attack or an anomalous burst.
  3. If the traffic appears to be within the normal load, but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another edge node.
TCP Flow Count Exceeded Critical Starting in NSX-T Data Center 3.1.3.

The gateway firewall flow table for TCP half-open traffic has exceeded the set threshold. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.

When event detected: “Gateway firewall flow table usage for TCP half-open traffic on logical router {entity_id} has reached {firewall_halfopen_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.”

When event resolved: “Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%.”

  1. Log in as administrator on the edge node and invoke the following NSX CLI command with the right interface UUID to check the flow table usage for TCP half-open flows.

    get firewall <LR_INT_UUID> interface stats | json
  2. Check that the traffic going through the gateway is not a DoS attack or an anomalous burst.
  3. If the traffic appears to be within the normal load, but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another edge node.
TCP Flow Count High Medium Starting in NSX-T Data Center 3.1.3.

The gateway firewall flow table usage for TCP half-open traffic is high. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.

When event detected: “Gateway firewall flow table usage for TCP on logical router {entity_id} has reached {firewall_halfopen_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.”

When event resolved: “Gateway firewall flow table usage for TCP half-open on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%.”

  1. Log in as administrator on the edge node and invoke the following NSX CLI command with the right interface UUID to check the flow table usage for TCP half-open flows.

    get firewall <LR_INT_UUID> interface stats | json
  2. Check that the traffic going through the gateway is not a DoS attack or an anomalous burst.
  3. If the traffic appears to be within the normal load, but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another edge node.
UDP Flow Count Exceeded Critical Starting in NSX-T Data Center 3.1.3.

The gateway firewall flow table for UDP traffic has exceeded the set threshold. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.

When event detected: “Gateway firewall flow table usage for UDP traffic on logical router {entity_id} has reached {firewall_udp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.”

When event resolved: “Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold.”

  1. Log in as administrator on the edge node and invoke the following NSX CLI command with the right interface UUID to check the flow table usage for UDP flows.

    get firewall <LR_INT_UUID> interface stats | json
  2. Check that the traffic going through the gateway is not a DoS attack or an anomalous burst.
  3. If the traffic appears to be within the normal load, but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another edge node.
UDP Flow Count High Medium Starting in NSX-T Data Center 3.1.3.

The gateway firewall flow table usage for UDP traffic is high. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.

When event detected: “Gateway firewall flow table usage for UDP on logical router {entity_id} has reached {firewall_udp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit.”

When event resolved: “Gateway firewall flow table usage for UDP on logical router {entity_id} has reached below the high threshold."

  1. Log in as administrator on the edge node and invoke the following NSX CLI command with the right interface UUID to check the flow table usage for UDP flows.

    get firewall <LR_INT_UUID> interface stats | json
  2. Check that the traffic going through the gateway is not a DoS attack or an anomalous burst.
  3. If the traffic appears to be within the normal load, but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another edge node.

High Availability Events

High availability events arise from the NSX Edge and public cloud gateway nodes.

Event Name Severity Alert Message Recommended Action
Tier0 Gateway Failover High

A tier0 gateway has failed over.

When event detected: "The tier0 gateway {entity-id} failover from {previous_gateway_state} to {current_gateway_state}."

When event resolved: "The tier0 gateway {entity-id} is now up."

  1. Invoke the NSX CLI command get logical-router <service_router_id> to identify the tier0 service-router vrf ID.
  2. Switch to the vrf context by invoking vrf <vrf-id> then invoke get high-availability status to determine the service that is down.
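For example, the diagnostic sequence above as a single NSX CLI session; the vrf ID is a placeholder read from the get logical-router output:

    get logical-router
    vrf <vrf-id>
    get high-availability status
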
Tier1 Gateway Failover High

A tier1 gateway has failed over.

When event detected: "The tier1 gateway {entity_id} failover from {previous_gateway_state} to {current_gateway_state}, service-router {service_router_id}."

When event resolved: "The tier1 gateway {entity-id} is now up."

  1. Invoke the NSX CLI command get logical-router <service_router_id> to identify the tier1 service-router vrf ID.
  2. Switch to the vrf context by invoking vrf <vrf-id>, and then invoke get high-availability status to determine the service that is down.

Identity Firewall Events

Event Name Severity Alert Message Recommended Action
Connectivity to LDAP Server Lost

Critical

Connectivity to LDAP server is lost.

When event detected: "The connectivity to LDAP server {ldap_server} is lost."

When event detected: "The connectivity to LDAP server {ldap_server} is restored.

Check the LDAP server connectivity as follows:

  1. Verify that the LDAP server is reachable from the NSX nodes.
  2. Verify that the LDAP server details are configured correctly in NSX.
  3. Verify that the LDAP server is running correctly.
  4. Verify that no firewalls are blocking access between the LDAP server and the NSX nodes.

After the issue is fixed, use TEST CONNECTION in the NSX UI under Identity Firewall AD to test the connection.
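As a quick first check for step 1, the following shell sketch tests basic reachability; the server address is a placeholder, port 389 is plain LDAP and 636 is LDAPS, and it assumes standard tools are available on the node:

    # Test basic reachability and the LDAP port.
    ping -c 3 <ldap-server>
    nc -zv <ldap-server> 389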

Error In Delta Sync

Critical

Errors occurred while performing delta sync.

When event detected: "Errors occurred while performing delta sync with {directory_domain}."

When event detected: "No errors occurred while performing delta sync with {directory_domain}."

  1. Check whether there are any Connectivity to LDAP Server Lost alarms.
  2. Find the error details in /var/log/syslog. Around the alarm trigger time, search for text: Error happened when synchronize LDAP objects.
  3. Check with AD administrator if there are any recent AD changes which may cause the errors.
  4. If the errors persist, collect the technical support bundle and contact VMware Support.

Infrastructure Communication Events

Infrastructure communication events arise from the NSX Edge, KVM, ESXi, and public gateway nodes.

Event Name Severity Alert Message Recommended Action
Edge Tunnels Down Critical

An Edge node's tunnel status is down.

When event detected: "The overall tunnel status of Edge node {entity_id} is down."

When event resolved: "The tunnels of Edge node {entity_id} have been restored."

  1. Invoke the following NSX CLI command to get all tunnel ports.

    get tunnel-ports
  2. Then check each tunnel's statistics for drops by invoking the following NSX CLI command.

    get tunnel-port <UUID> stats

    Also check /var/log/syslog if there are tunnel related errors.

Intelligence Communication Events

NSX Intelligence communication events arise from the NSX Manager node, ESXi node, and NSX Intelligence appliance.

Event Name Severity Alert Message Recommended Action
Transport Node Flow Exporter Disconnected High

A Transport node is disconnected from its Intelligence node's messaging broker. Data collection is affected.

When event detected: "The flow exporter on Transport node {entity-id} is disconnected from the Intelligence node's messaging broker. Data collection is affected."

When event resolved: "The flow exporter on Transport node {entity-id} has reconnected to the Intelligence node's messaging broker."

  1. Restart the messaging service if it is not running on the NSX Intelligence node.
  2. Resolve the network connection failure between the Transport node and the NSX Intelligence node.
Control Channel to Transport Node Down Medium

Controller service to transport node's connection is down.

When event detected: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) to Transport node {entity_id} down for at least three minutes from Controller service's point of view."

When event resolved: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) restores connection to Transport node {entity_id}."

  1. Check the connectivity from the Controller service {central_control_plane_id} to the Transport node {entity_id} interface by using the ping command. If they are not pingable, check the network connectivity.
  2. Check whether the TCP connections are established by examining the netstat output to see if the Controller service {central_control_plane_id} is listening for connections on port 1235 (see the sketch after this list). If not, check firewall or iptables rules to see if port 1235 is blocking Transport node {entity_id} connection requests. Ensure that no host firewalls or network firewalls in the underlay are blocking the required IP ports between Manager nodes and Transport nodes. The required ports are documented in the VMware Ports and Protocols tool: https://ports.vmware.com/.
  3. It is possible that the Transport node {entity_id} may still be in maintenance mode. You can check whether the Transport node is in maintenance mode via the following API:

    GET https://<nsx-mgr>/api/v1/transport-nodes/<tn-uuid>

    When maintenance mode is set, the Transport node will not be connected to the Controller service. This is usually the case when host upgrade is in progress. Wait for a few minutes and check connectivity again.
    Note: This alarm is not critical, but it should still be resolved. Contact VMware Support for this alarm only if it remains unresolved over an extended period of time.
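For steps 2 and 3, the following shell sketch shows one way to check the listener and the maintenance mode flag; the addresses, credentials, and IDs are placeholders:

    # On the Manager node, confirm the Controller service listens on port 1235.
    netstat -an | grep ':1235'
    # Check whether the transport node is in maintenance mode.
    curl -k -u 'admin:<password>' \
      "https://<nsx-mgr>/api/v1/transport-nodes/<tn-uuid>" | grep maintenance_mode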

Control Channel to Transport Node Down Long

Critical

Controller service to Transport node's connection is down for too long.

When event detected: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) to Transport node {entity_id} down for at least 15 minutes from Controller service's point of view."

When event resolved: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) restores connection to Transport node {entity_id}."

  1. Check the connectivity from the Controller service {central_control_plane_id} to the Transport node {entity_id} interface through a ping and traceroute. This can be done from the NSX Manager node admin CLI. The ping test should show no drops and consistent latency values. VMware recommends latency values of 150 ms or less.
  2. Navigate to System > Fabric > Nodes > Transport node {entity_id} in the NSX UI to check whether the TCP connection between the Controller service on Manager node {appliance_address} ({central_control_plane_id}) and Transport node {entity_id} is established. If not, check firewall rules on the network and the hosts to see if port 1235 is blocking Transport node {entity_id} connection requests. Ensure that no host firewalls or network firewalls in the underlay are blocking the required IP ports between Manager nodes and Transport nodes. The required ports are documented in the VMware Ports and Protocols tool: https://ports.vmware.com/.

Control Channel To Manager Node Down

Medium

Transport node's control plane connection to the Manager node is down.

When event detected: "The Transport node {entity_id} control plane connection to Manager node {appliance_address} is down for at least {timeout_in_minutes} minutes from the Transport node's point of view."

When event resolved: "The Transport node {entity_id} restores the control plane connection to Manager node {appliance_address}."

  1. Check the connectivity from transport node {entity_id} to the manager node {appliance_address} interface through a ping. If they are not pingable, check for flakiness in the network connectivity.
  2. Check whether the TCP connections are established by examining the netstat output to see if the Controller service on the manager node {appliance_address} is listening for connections on port 1235. If not, check firewall or iptables rules to see if port 1235 is blocking Transport node {entity_id} connection requests. Ensure that no host firewalls or network firewalls in the underlay are blocking the required IP ports between manager nodes and transport nodes. The required ports are documented in the VMware Ports and Protocols tool: https://ports.vmware.com/.
  3. It is possible that the transport node {entity_id} may still be in maintenance mode. You can check whether the transport node is in maintenance mode through the following API:

    GET https://<nsx-mgr>/api/v1/transport-nodes/<tn-uuid>

    When maintenance mode is set, the transport node will not be connected to the Controller service. This is usually the case when host upgrade is in progress. Wait for a few minutes and check connectivity again.
    Note: This alarm is not critical, but it should still be resolved. GSS need not be contacted for this alarm unless it remains unresolved over an extended period of time.

Control Channel To Manager Node Down Too Long

Critical

Transport node's control plane connection to the Manager node is down for too long.

When event detected: "The Transport node {entity_id} control plane connection to Manager node {appliance_address} is down for at least {timeout_in_minutes} minutes from the Transport node's point of view."

When event resolved: "The Transport node {entity_id} restores the control plane connection to Manager node {appliance_address}."

  1. Check the connectivity from Transport node {entity_id} to the Manager node {appliance_address} interface through a ping. If it is not pingable, check for flakiness in network connectivity.
  2. Check whether the TCP connection is established by using the netstat output to see if the Controller service on the Manager node {appliance_address} is listening for connections on port 1235. If not, check firewall or iptables rules to see if port 1235 is blocking connection requests from Transport node {entity_id}. Ensure that no host firewalls or network firewalls in the underlay are blocking the required IP ports between Manager nodes and Transport nodes; these are documented in the ports and protocols tool at https://ports.vmware.com/.
  3. The Transport node {entity_id} may still be in maintenance mode. You can check whether it is in maintenance mode through the following API:

    GET https://<nsx-mgr>/api/v1/transport-nodes/<tn-uuid>

    When maintenance mode is set, the Transport node will not be connected to the Controller service. This is usually the case when host upgrade is in progress. Wait for a few minutes and check connectivity again.

Management Channel To Transport Node Down

Medium

Management channel to Transport node is down.

When event detected: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is down for 5 minutes."

When event resolved: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is up."

  1. Ensure there is network connectivity between the manager nodes and transport node {transport_node_name} ({transport_node_address}) and no firewalls are blocking traffic between the nodes.
  2. On Windows transport nodes, ensure the nsx-proxy service is running on the transport node by invoking the following command in Windows PowerShell.

    C:\NSX\nsx-proxy\nsx-proxy.ps1 status

    If it is not running, restart it by invoking the following command.

    C:\NSX\nsx-proxy\nsx-proxy.ps1 restart
  3. On all other transport nodes, ensure the nsx-proxy service is running on the transport node by invoking the following command.

    /etc/init.d/nsx-proxy status

    If it is not running, restart it by invoking the following command (see the sketch after this list).

    /etc/init.d/nsx-proxy restart
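Where many transport nodes must be checked, this status-then-restart step can be scripted. The following is a minimal sketch, assuming it runs locally on a non-Windows transport node with sufficient privileges; it uses only the init script shown above:

    import subprocess

    # Query nsx-proxy via its init script; a non-zero exit status
    # typically indicates the service is not running.
    status = subprocess.run(
        ["/etc/init.d/nsx-proxy", "status"],
        capture_output=True,
        text=True,
    )
    print(status.stdout.strip())

    # Restart only if the status check failed; raise on restart errors.
    if status.returncode != 0:
        subprocess.run(["/etc/init.d/nsx-proxy", "restart"], check=True)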

Management Channel To Transport Node Down Long

Critical

Management channel to Transport node is down for too long.

When event detected: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is down for 15 minutes."

When event resolved: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is up."

  1. Ensure there is network connectivity between the manager nodes and transport node {transport_node_name} ({transport_node_address}) and no firewalls are blocking traffic between the nodes.
  2. On Windows transport nodes, ensure the nsx-proxy service is running on the transport node by invoking the following command in Windows PowerShell.

    C:\NSX\nsx-proxy\nsx-proxy.ps1 status

    If it is not running, restart it by invoking the following command.

    C:\NSX\nsx-proxy\nsx-proxy.ps1 restart
  3. On all other transport nodes, ensure the nsx-proxy service is running on the transport node by invoking the following command.

    /etc/init.d/nsx-proxy status

    If it is not running, restart it by invoking the following command.

    /etc/init.d/nsx-proxy restart

Manager Cluster Latency High

Medium

The average network latency between Manager nodes is high.

When event detected: "The average network latency between Manager nodes {manager_node_id} ({appliance_address}) and {remote_manager_node_id} ({remote_appliance_address}) is more than 10ms for the last 5 minutes."

When event resolved: "The average network latency between Manager nodes {manager_node_id} ({appliance_address}) and {remote_manager_node_id} ({remote_appliance_address}) is within 10ms."

Ensure there are no firewall rules blocking ping traffic between manager nodes. If there are other high bandwidth servers and applications sharing the local network, consider moving these to a different network.

Manager Control Channel Down

Critical

Manager to controller channel is down.

When event detected: "The communication between the management function and the control function has failed on Manager node {manager_node_name} ({appliance_address})."

When event resolved: "The communication between the management function and the control function has been restored on Manager node {manager_node_name} ({appliance_address})."

On the manager node {manager_node_name} ({appliance_address}), invoke the following two NSX CLI commands:

restart service mgmt-plane-bus

restart service manager

Manager FQDN Lookup Failure

Critical

DNS lookup failed for Manager node's FQDN.

When event detected: "DNS lookup failed for Manager node {entity_id} with FQDN {appliance_fqdn} and the publish_fqdns flag was set."

When event resolved: "FQDN lookup succeeded for Manager node {entity_id} with FQDN {appliance_fqdn} or the publish_fqdns flag was cleared."

  1. Assign correct FQDNs to all manager nodes and verify the DNS configuration is correct for successful lookup of all manager nodes' FQDNs.
  2. Alternatively, disable the use of FQDNs by invoking the following NSX API with publish_fqdns set to false in the request body.

    PUT /api/v1/configs/management

    After that, calls from Transport nodes and from Federation to Manager nodes in this cluster will use only IP addresses. A scripted sketch of this call follows.
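A minimal Python sketch of that call, assuming admin credentials (placeholders) and that, as with other NSX config APIs, the PUT should echo back the _revision returned by a prior GET:

    import requests

    NSX_MGR = "nsx-mgr.example.com"  # hypothetical Manager FQDN
    AUTH = ("admin", "<password>")

    url = f"https://{NSX_MGR}/api/v1/configs/management"

    # Read the current config first so the PUT carries _revision.
    cfg = requests.get(url, auth=AUTH, verify=False).json()
    cfg["publish_fqdns"] = False

    resp = requests.put(url, json=cfg, auth=AUTH, verify=False)
    resp.raise_for_status()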

Manager FQDN Reverse Lookup Failure

Critical

Reverse DNS lookup failed for Manager node's IP address.

When event detected: "Reverse DNS lookup failed for Manager node {entity_id} with IP address {appliance_address} and the publish_fqdns flag was set."

When event resolved: "Reverse DNS lookup succeeded for Manager node {entity_id} with IP address {appliance_address} or the publish_fqdns flag was cleared."

  1. Assign correct FQDNs to all manager nodes and verify the DNS configuration is correct for successful reverse lookup of the manager node's IP address.
  2. Alternatively, disable the use of FQDNs by invoking the following NSX API with publish_fqdns set to false in the request body.

    PUT /api/v1/configs/management

    After that, calls from Transport nodes and from Federation to Manager nodes in this cluster will use only IP addresses.
Management Channel To Manager Node Down Medium

Management channel to Manager node is down.

When event detected: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is down for 5 minutes."

When event resolved: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is up."

  • Ensure there is network connectivity between the transport node {transport_node_id} and the primary Manager node.
  • Also ensure that no firewalls are blocking traffic between the nodes.
  • Ensure that the messaging manager service is running on manager nodes by invoking the following command.

    /etc/init.d/messaging-manager status
  • If the messaging manager is not running, restart it by invoking the following command.

    /etc/init.d/messaging-manager restart
Management Channel To Manager Node Down Long Critical

Management channel to Manager node is down for too long.

When event detected: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is down for 15 minutes."

When event resolved: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is up."

  • Ensure there is network connectivity between the transport node {transport_node_id} and the primary Manager node.
  • Also ensure that no firewalls are blocking traffic between the nodes.
  • Ensure that the messaging manager service is running on manager nodes by invoking the following command.

    /etc/init.d/messaging-manager status
  • If the messaging manager is not running, restart it by invoking the following command.

    /etc/init.d/messaging-manager restart

Infrastructure Service Events

Infrastructure service events arise from the NSX Edge and public gateway nodes.

Event Name Severity Alert Message Recommended Action
Edge Service Status Down
Note: Alarm deprecated starting from NSX-T Data Center 3.2.
Critical

Edge service is down for at least one minute.

If the View Runtime Details link is available, you can click the link to view the reason the service is down.

When event detected: "The service {edge_service_name} is down for at least one minute."

When event resolved: "The service {edge_service_name} is up."

  1. On the edge node, verify the service hasn't exited due to an error by looking for core dump files in the /var/log/core directory.
  2. To confirm whether the service is stopped, invoke the NSX CLI command get services.
  3. If so, run start service <service-name> to restart the service.
Edge Service Status Changed Medium

Edge service status has changed.

If the View Runtime Details link is available, you can click the link to view the reason the service is down.

When event detected: "The service {edge_service_name} changed from {previous_service_state} to {current_service_state}."

When event resolved: "The service {edge_service_name} changed from {previous_service_state} to {current_service_state}."

  1. On the edge node, verify the service hasn't exited due to an error by looking for core files in the /var/log/core directory.
  2. In addition, invoke the NSX CLI command get services to confirm whether the service is stopped.
  3. If so, invoke start service <service-name> to restart the service.

Intelligence Health Events

NSX Intelligence health events arise from the NSX Manager node and NSX Intelligence appliance.

Event Name Severity Alert Message Recommended Action
CPU Usage Very High Critical

Intelligence node CPU usage is very high.

When event detected: "The CPU usage on NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The CPU usage on NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%."

Use the top command to check which processes have the highest CPU usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved.

CPU Usage High Medium

Intelligence node CPU usage is high.

When event detected: "The CPU usage on NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%."

When event resolved: "The CPU usage on NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%."

Use the top command to check which processes have the highest CPU usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved.

Memory Usage Very High Critical

Intelligence node memory usage is very high.

When event detected: "The memory usage on NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The memory usage on NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%."

Use the top command to check which processes have the highest memory usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved.

Memory Usage High Medium

Intelligence node memory usage is high.

When event detected: "The memory usage on NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%."

When event resolved: "The memory usage on NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%."

Use the top command to check which processes have the highest memory usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved.

Disk Usage Very High Critical

Intelligence node disk usage is very high.

When event detected: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%."

Examine disk partition {disk_partition_name} and see if there are any unexpected large files that can be removed.
Disk Usage High Medium

Intelligence node disk usage is high.

When event detected: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%."

Examine disk partition {disk_partition_name} and see if there are any unexpected large files that can be removed.
Data Disk Partition Usage Very High Critical

Intelligence node data disk partition usage is very high.

When event detected: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%."

Stop NSX Intelligence data collection until the disk usage is below the threshold.

In the NSX UI, navigate to System > Appliances > NSX Intelligence Appliance. Then select ACTIONS > Stop Collecting Data.

Data Disk Partition Usage High Medium

Intelligence node data disk partition usage is high.

When event detected: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%."

Stop NSX Intelligence data collection until the disk usage is below the threshold.

Examine the /data partition and see if there are any unexpected large files that can be removed.

Node Status Degraded High

Intelligence node status is degraded.

When event detected: "Service {service_name} on NSX Intelligence node {intelligence_node_id} is not running."

When event resolved: "Service {service_name} on NSX Intelligence node {intelligence_node_id} is running properly."

Examine service status and health information with the NSX CLI command get services on the NSX Intelligence node.

Restart unexpectedly stopped services with the NSX CLI command restart service <service-name>.

IP Address Management Events

IP address management (IPAM) events arise from the NSX Manager nodes.

Event Name Severity Alert Message Recommended Action
IP Block Usage Very High Medium

Starting in NSX-T Data Center 3.1.2.

IP subnet usage of an IP block has reached 90%.

When event detected: "IP block usage of <intent_path> is very high. IP block nearing its total capacity, creation of subnet using IP block might fail."

When event resolved:

No message.

  • Review the IP block usage. Use a new IP block for resource creation or delete an unused IP subnet from the IP block. To check which subnets are being used for an IP block:
    1. From the NSX UI, navigate to Networking > IP Address Pools > IP Address Pools tab.
    2. Select the IP pools where the IP block is being used and check the Subnets and Allocated IPs columns.
    3. Delete the subnet or the IP pool if none of the allocations are used and will not be used in the future.
  • Use the following APIs to check if the IP block is being used by the IP pool and also check for IP allocations.
    • To get configured subnets of an IP pool, invoke the following NSX API.

      GET /policy/api/v1/infra/ip-pools/<ip-pool>/ip-subnets

    • To get IP allocations, invoke the following NSX API.

      GET /policy/api/v1/infra/ip-pools/<ip-pool>/ip-allocations

Note: Delete an IP pool or subnet only if it does not have any allocated IPs and if it will not be used in future.
IP Pool Usage Very High Medium

Starting in NSX-T Data Center 3.1.2.

IP allocation usage of an IP pool has reached 90%.

When event detected: "IP pool usage of <intent_path> is very high. IP pool nearing its total capacity. Creation of entity/service depends on IP being allocated from IP pool might fail."

When event resolved:

No message.

Review IP pool usage. Release unused IP allocations from the IP pool or create a new IP pool.

  1. From the NSX UI, navigate to Networking > IP Address Pools > IP Address Pools tab.
  2. Select IP pools and check the Allocated IPs column to view IPs allocated from the IP pool.

You can release those IPs that are not used. To release unused IP allocations, invoke the following NSX API.

DELETE /policy/api/v1/infra/ip-pools/<ip-pool>/ip-allocations/<ip-allocation>
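As an illustrative sketch of the list-then-release flow, the following Python snippet uses the GET and DELETE calls above. The Manager FQDN, credentials, and pool/allocation IDs are placeholders, and the allocation_ip field name should be verified against the IP allocation schema for your NSX version:

    import requests

    NSX_MGR = "nsx-mgr.example.com"  # hypothetical Manager FQDN
    AUTH = ("admin", "<password>")
    POOL = "<ip-pool>"               # IP pool ID from the alarm

    base = f"https://{NSX_MGR}/policy/api/v1/infra/ip-pools/{POOL}"

    # List the pool's current allocations for review.
    allocs = requests.get(f"{base}/ip-allocations", auth=AUTH, verify=False).json()
    for alloc in allocs.get("results", []):
        print(alloc.get("id"), alloc.get("allocation_ip"))

    # Release one allocation that is confirmed unused.
    requests.delete(
        f"{base}/ip-allocations/<ip-allocation>", auth=AUTH, verify=False
    ).raise_for_status()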

License Events

License events arise from the NSX Manager node.

Event Name Severity Alert Message Recommended Action
License Expired Critical

A license has expired.

When event detected: "The license of type {license_edition_type} has expired."

When event resolved: "The expired license of type {license_edition_type} has been removed, updated, or is no longer expired."

Add a new, non-expired license:
  1. In the NSX UI, navigate to System > Licenses.
  2. Click Add and specify the key of the new license.
  3. Delete the expired license by selecting the check box and clicking Unassign.
License Is About to Expire Medium

"A license is about to expired.When event detected: "The license of type {license_edition_type} is about to expire."

When event resolved: "The expiring license identified by {license_edition_type}has been removed, updated, or is no longer about to expire."

Add a new, non-expired license:
  1. In the NSX UI, navigate to System > Licenses.
  2. Click Add and specify the key of the new license.
  3. Delete the expiring license by selecting the check box and clicking Unassign.

Load Balancer Events

Load balancer events arise from NSX Edge nodes or from NSX Manager nodes.

Event Name Severity Alert Message Recommended Action
LB CPU Very High Medium

Load balancer CPU usage is very high.

When event detected: "The CPU usage of load balancer {entity_id} is very high. The threshold is {system_usage_threshold}%."

When event resolved: "The CPU usage of load balancer {entity_id} is low enough. The threshold is {system_usage_threshold}%."

If the load balancer CPU utilization is higher than the system usage threshold, the workload is too high for this load balancer.

Rescale the load balancer service by changing the load balancer size from small to medium or from medium to large.

If the CPU utilization of this load balancer is still high, consider adjusting the edge appliance form factor size or moving load balancer services to other edge nodes for the applicable workload.

LB Status Down

Critical

Centralized load balancer service is down.

When event detected: "The centralized load balancer service {entity_id} is down."

When event resolved: "The centralized load balancer service {entity_id} is up."

  1. On the active edge node, check load balancer status by invoking the following NSX CLI command.

    get load-balancer <lb-uuid> status
  2. If the LB-State of load balancer service is not_ready or there is no output, make the Edge node enter maintenance mode and then exit maintenance mode.
Virtual Server Status Down Medium

Load balancer virtual service is down.

When event detected: "The load balancer virtual server {entity_id} is down."

When event resolved: "The load balancer virtual server {entity_id} is up."

Consult the load balancer pool to determine its status and verify its configuration.

If it is incorrectly configured, reconfigure it, remove the load balancer pool from the virtual server, and then re-add it to the virtual server.

Pool Status Down Medium

Load balancer pool is down.

When event detected: "The load balancer pool {entity_id} status is down."

When event resolved: "The load balancer pool {entity_id} status is up."

  1. Consult the load balancer pool to determine which members are down.
  2. Check network connectivity from the load balancer to the impacted pool members.
  3. Validate application health of each pool member.
  4. Validate the health of each pool member using the configured monitor.

When the health of the member is established, the pool member status is updated to healthy based on the 'Rise Count' configuration in the monitor.

LB Status Degraded

Medium

Starting in NSX-T Data Center 3.1.2.

Load balancer service is degraded.

When event detected: "The load balancer service {entity_id} is degraded."

When event resolved: "The load balancer service {entity_id} is not degraded."

  • For centralized load balancer:
    1. On the standby edge node, check the load balancer status by invoking the following NSX CLI command.

      get load-balancer <lb-uuid> status
    2. If the LB-State of load balancer service is 'not_ready' or if there is no output, make the edge node enter maintenance mode and then exit maintenance mode.
  • For distributed load balancer:
    1. Get a detailed status by invoking the following NSX API (a scripted sketch follows this entry).

      GET /policy/api/v1/infra/lb-services/<LBService>/detailed-status?source=realtime
    2. From the API output, find the ESXi host reporting a non-zero instance_number with status NOT_READY or CONFLICT.
    3. On the ESXi host node, invoke the following NSX CLI command.

      get load-balancer <lb-uuid> status

      If 'Conflict LSP' is reported, check whether this LSP is attached to any other load balancer service and whether this conflict is acceptable.

      If 'Not Ready LSP' is reported, check the status of this LSP by invoking the following NSX CLI command.

      get logical-switch-port status
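As a sketch of the detailed-status query above (Manager FQDN and credentials are placeholders; rather than assuming exact response field names, it dumps the payload and flags the status strings named in step 2):

    import json
    import requests

    NSX_MGR = "nsx-mgr.example.com"  # hypothetical Manager FQDN
    AUTH = ("admin", "<password>")
    LB_SERVICE = "<LBService>"       # LB service ID from the alarm

    url = (
        f"https://{NSX_MGR}/policy/api/v1/infra/lb-services/"
        f"{LB_SERVICE}/detailed-status?source=realtime"
    )
    status = requests.get(url, auth=AUTH, verify=False).json()

    # Dump the payload and flag NOT_READY / CONFLICT occurrences for review.
    text = json.dumps(status, indent=2)
    for marker in ("NOT_READY", "CONFLICT"):
        if marker in text:
            print(f"{marker} present in detailed status; inspect the output.")
    print(text)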

DLB Status Down

Critical

Starting in NSX-T Data Center 3.1.2.

Distributed load balancer service is down.

When event detected: "The distributed load balancer service {entity_id} is down."

When event resolved: "The distributed load balancer service {entity_id} is up."

  1. On the ESXi host node, invoke the following NSX CLI command.

    get load-balancer <lb-uuid> status
  2. If the report states 'Conflict LSP', check whether this LSP is attached to any other load balancer service and whether this conflict is acceptable. If the report states 'Not Ready LSP', check the status of this LSP by invoking the following NSX CLI command.

    get logical-switch-port status

LB Edge Capacity In Use High

Medium

Starting in NSX-T Data Center 3.1.2.

Load balancer usage is high.

When event detected: "The usage of load balancer service in Edge node {entity_id} is high. The threshold is {system_usage_threshold}%."

When event resolved: "The usage of load balancer service in Edge node {entity_id} is low enough. The threshold is {system_usage_threshold}%."

If multiple LB instances have been configured in this Edge node, deploy a new Edge node and move some LB instances to it. If only a single LB instance of a given size (small/medium/etc.) has been configured in an Edge node of the same size, deploy a new Edge node of a larger size and move the LB instance to that new Edge node.

LB Pool Member Capacity In Use Very High

Critical

Starting in NSX-T Data Center 3.1.2.

Load balancer pool member usage is very high.

When event detected: "The usage of pool members in Edge node {entity_id} is very high. The threshold is {system_usage_threshold}%."

When event resolved: "The usage of pool members in Edge node {entity_id} is low enough. The threshold is {system_usage_threshold}%."

Deploy a new edge node and move the load balancer service from existing edge nodes to the newly deployed edge node.

Load Balancing Configuration Not Realized Due To Lack Of Memory

Medium

Load balancer configuration is not realized due to high memory usage on the edge node.

When event detected: "The load balancer configuration {entity_id} is not realized, due to high memory usage on Edge node {transport_node_id}."

When event resolved: "The load balancer configuration {entity_id} is realized on {transport_node_id}."

  • Prefer defining small- and medium-sized load balancers over large-sized load balancers.
  • Spread load balancer services among the available Edge nodes.
  • Reduce the number of virtual servers defined.

Manager Health Events

NSX Manager health events arise from the NSX Manager node cluster.

Event Name Severity Alert Message Recommended Action
Duplicate IP Address Medium

Manager node's IP address is in use by another device.

When event detected: "Manager node {entity_id} IP address {duplicate_ip_address} is currently being used by another device in the network."

When event resolved: "The device using the IP address assigned to Manager node {entity_id} appears to no longer be using {duplicate_ip_address}."

  1. Determine which device is using the Manager's IP address and assign the device a new IP address.
    Note: Reconfiguring the Manager to use a new IP address is not supported.
  2. Verify if the static IP address pool/DHCP server is configured correctly.
  3. Correct the IP address of the device if it is manually assigned.
Manager CPU Usage Very High Critical

Manager node CPU usage is very high.

When event detected: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this Manager node.

Consider adjusting the Manager appliance form factor size.

Manager CPU Usage High Medium

Starting in NSX-T Data Center 3.0.1.

Manager node CPU usage is high.

When event detected: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%."

When event resolved: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this Manager node.

Consider adjusting the Manager appliance form factor size.

Manager Memory Usage Very High Critical

Starting in NSX-T Data Center 3.0.1.

Manager node memory usage is very high.

When event detected: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this Manager node.

Consider adjusting the Manager appliance form factor size.

Manager Memory Usage High Medium

Manager node memory usage is high.

When event detected: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%."

When event resolved: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this Manager node.

Consider adjusting the Manager appliance form factor size.

Manager Disk Usage Very High Critical

Manager node disk usage is very high.

When event detected: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%."

Examine the partition with high usage and see if there are any unexpected large files that can be removed.
Manager Disk Usage High Medium

Manager node disk usage is high.

When event detected: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%."

Examine the partition with high usage and see if there are any unexpected large files that can be removed.

Manager Config Disk Usage Very High

Critical

Manager node config disk usage is very high.

When event detected: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. This can be an indication of high disk usage by the NSX Datastore service under the /config/corfu directory."

When event resolved: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%."

Please run the following tool and contact GSS if any issues are reported.

/opt/vmware/tools/support/inspect_checkpoint_issues.py

Manager Config Disk Usage High Medium

Manager node config disk usage is high.

When event detected: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%. This can be an indication of rising disk usage by the NSX Datastore service under the /config/corfu directory."

When event resolved: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%."

Examine the /config partition and see if there are any unexpected large files that can be removed.

Operations DB Disk Usage High

Medium

Manager node nonconfig disk usage is high.

When event detected: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. This can be an indication of rising disk usage by the NSX Datastore service under the /nonconfig/corfu directory."

When event resolved: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%."

Please run the following tool and contact GSS if any issues are reported.

/opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig

Operations DB Disk Usage Very High Critical

Manager node nonconfig disk usage is very high.

When event detected: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. This can be an indication of high disk usage by the NSX Datastore service under the /nonconfig/corfu directory."

When event resolved: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%."

Please run the following tool and contact GSS if any issues are reported.

/opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig

NCP Events

NSX Container Plug-in (NCP) events arise from the ESXi and KVM nodes.

Event Name Severity Alert Message Recommended Action
NCP Plugin Down Critical

Manager Node has detected the NCP is down or unhealthy.

When event detected: "Manager Node has detected the NCP is down or unhealthy."

When event resolved: "Manager Node has detected the NCP is up or healthy again."

  • To find the clusters which are having issues, perform one of the following actions:
    • Use the NSX UI and navigate to the Alarms page. The Entity name value for this alarm instance identifies the cluster name.
    • Invoke the NSX API GET /api/v1/systemhealth/container-cluster/ncp/status to fetch all cluster statuses and determine the name of any clusters that report DOWN or UNKNOWN. Then, on the NSX UI Inventory > Container > Clusters page, find the cluster by name and click the Nodes tab, which lists all Kubernetes and PAS cluster members. A scripted sketch of this status query appears after this list.
  • For Kubernetes cluster:
    1. Check NCP Pod liveness by finding the K8s primary node among the cluster members, then log on to the primary node and invoke the kubectl command kubectl get pods --all-namespaces. If there is an issue with the NCP Pod, use the kubectl logs command to check the issue and fix the error.
    2. Check the connection between NCP and Kubernetes API server. The NSX CLI can be used inside the NCP Pod to check this connection status by invoking the following commands from the primary VM.

      kubectl exec -it <NCP-Pod-Name> -n nsx-system bash

      nsxcli

      get ncp-k8s-api-server status

      If there is an issue with the connection, please check both the network and NCP configurations.
    3. Check the connection between NCP and NSX Manager. The NSX CLI can be used inside the NCP Pod to check this connection status by invoking the following command from the primary VM.

      kubectl exec -it <NCP-Pod-Name> -n nsx-system bash

      nsxcli

      get ncp-nsx status

      If there is an issue with the connection, check both the network and NCP configurations.
  • For PAS cluster:
    1. Check the network connections between virtual machines and fix any network issues.
    2. Check the status of both nodes and services and fix crashed nodes or services. Invoke the commands bosh vms and bosh instances -p to check the status of nodes and services.
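To script the cluster status query from the first action above, a minimal Python sketch (Manager FQDN and credentials are placeholders; it simply prints the documented endpoint's payload so DOWN or UNKNOWN clusters can be spotted):

    import json
    import requests

    NSX_MGR = "nsx-mgr.example.com"  # hypothetical Manager FQDN
    AUTH = ("admin", "<password>")

    url = f"https://{NSX_MGR}/api/v1/systemhealth/container-cluster/ncp/status"
    statuses = requests.get(url, auth=AUTH, verify=False).json()

    # Print every cluster status; investigate any DOWN or UNKNOWN entries.
    print(json.dumps(statuses, indent=2))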

Node Agents Health Events

Node agent health events arise from the ESXi and KVM nodes.

Event Name Severity Alert Message Recommended Action
Node Agents Down High

The agents running inside the Node VM appear to be down.

When event detected: "The agents running inside the node VM appear to be down."

When event resolved: "The agents inside the Node VM are running."

For ESX:

  1. If Vmk50 is missing, see Knowledge Base article 67432.
  2. If Hyperbus 4094 is missing, restarting nsx-cfgagent or restarting the container host VM may help.
  3. If the container host VIF is blocked, check the connection to the controller and make sure all configurations are sent down.
  4. If nsx-cfgagent has stopped, restart nsx-cfgagent.

For KVM:

  1. If the Hyperbus namespace is missing, restarting nsx-opsagent may help recreate the namespace.
  2. If the Hyperbus interface is missing inside the Hyperbus namespace, restarting nsx-opsagent may help.
  3. If nsx-agent has stopped, restart nsx-agent.

For both ESX and KVM:

  1. If the node-agent package is missing, check whether the node-agent package has been successfully installed in the container host VM.
  2. If the interface for the node-agent in the container host VM is down, check the eth1 interface status inside the container host VM.

NSX Federation Events

NSX Federation events arise from the NSX Manager, NSX Edge, and the public gateway nodes.

Event Name Severity Alert Message Recommended Action

GM To GM Latency Warning

Medium

Latency between Global Managers is higher than expected for more than 2 minutes.

When event detected: "Latency is higher than expected between Global Managers {from_gm_path} and {to_gm_path}."

When event resolved: "Latency is below expected levels between Global Managers {from_gm_path} and {to_gm_path}."

Check the connectivity from Global Manager {from_gm_path}({site_id}) to the Global Manager {to_gm_path}({remote_site_id}) through ping. If they are not pingable, check for flakiness in WAN connectivity.

GM To GM Synchronization Error

High

Active Global Manager to Standby Global Manager cannot synchronize for more than 5 minutes.

When event detected: "Active Global Manager {from_gm_path} to Standby Global Manager {to_gm_path} cannot synchronize for more than 5 minutes."

When event resolved: "Synchronization from active Global Manager {from_gm_path} to standby {to_gm_path} is healthy."

Check the connectivity from Global Manager {from_gm_path}({site_id}) to the Global Manager {to_gm_path}({remote_site_id}) through ping.

GM To GM Synchronization Warning

Medium

Active Global Manager to Standby Global Manager cannot synchronize.

When event detected: "Active Global Manager {from_gm_path} to Standby Global Manager {to_gm_path} cannot synchronize."

When event resolved: "Synchronization from active Global Manager {from_gm_path} to Standby Global Manager {to_gm_path} is healthy."

Check the connectivity from Global Manager {from_gm_path}({site_id}) to the Global Manager {to_gm_path}({remote_site_id}) through ping.

LM to LM Synchronization Error

High

Starting in NSX-T Data Center 3.0.1.

Synchronization between remote locations failed for more than 5 minutes.

When event detected: "The synchronization between {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for more than 5 minutes."

When event resolved: "Remote sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized."

  1. Invoke the NSX CLI command get site-replicator remote-sites to get the connection state between remote locations. If a remote location is connected but not synchronized, it is possible that the location is still in the process of resolving the primary node. In this case, wait for approximately 10 seconds and try invoking the CLI again to check the state of the remote location. If a location is disconnected, try the next step.
  2. Check the connectivity from the Local Manager (LM) in location {site_name}({site_id}) to the LMs in location {remote_site_name}({remote_site_id}) via ping. If they are not pingable, check for flakiness in WAN connectivity. If there are no physical network connectivity issues, try the next step.
  3. Check the /var/log/cloudnet/nsx-ccp.log file on the Manager nodes in the local cluster in location {site_name}({site_id}) that triggered the alarm to see if there are any cross-site communication errors. In addition, look for errors logged by the nsx-appl-proxy subcomponent in /var/log/syslog.
LM to LM Synchronization Warning Medium

Starting in NSX-T Data Center 3.0.1.

Synchronization between remote locations failed.

When event detected: "The synchronization between {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed."

When event resolved: "Remote locations {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized."

  1. Invoke the NSX CLI command get site-replicator remote-sites to get the connection state between remote locations. If a remote location is connected but not synchronized, it is possible that the location is still in the process of resolving the primary node. In this case, wait for approximately 10 seconds and try invoking the CLI again to check the state of the remote location. If a location is disconnected, try the next step.
  2. Check the connectivity from the Local Manager (LM) in location {site_name}({site_id}) to the LMs in location {remote_site_name}({remote_site_id}) via ping. If they are not pingable, check for flakiness in WAN connectivity. If there are no physical network connectivity issues, try the next step.
  3. Check the /var/log/cloudnet/nsx-ccp.log file on the Manager nodes in the local cluster in location {site_name}({site_id}) that triggered the alarm to see if there are any cross-site communication errors. In addition, look for errors logged by the nsx-appl-proxy subcomponent in /var/log/syslog.
RTEP BGP Down High

Starting in NSX-T Data Center 3.0.1.

RTEP BGP neighbor down.

When event detected: "RTEP (Remote Tunnel Endpoint) BGP session from source IP {bgp_source_ip} to remote location {remote_site_name} neighbor IP {bgp_neighbor_ip} is down."

When event resolved: "RTEP (Remote Tunnel Endpoint) BGP session from source IP {bgp_source_ip} to remote location {remote_site_name} neighbor IP {bgp_neighbor_ip} is established."

  1. Invoke the NSX CLI command get logical-routers on the affected edge node.
  2. Switch to the REMOTE_TUNNEL_VRF context.
  3. Invoke the NSX CLI command get bgp neighbor to check the BGP neighbor.
  4. Alternatively, invoke the NSX API GET /api/v1/transport-nodes/<transport-node-id>/inter-site/bgp/summary to get the BGP neighbor status.
  5. Invoke the NSX CLI command get interfaces and check if the correct RTEP IP address is assigned to the interface named remote-tunnel-endpoint.
  6. Check if ping works between the assigned RTEP IP address {bgp_source_ip} and the remote location {remote_site_name} neighbor IP {bgp_neighbor_ip}.
  7. Check /var/log/syslog for any errors related to BGP.
  8. Invoke the NSX API GET or PUT /api/v1/transport-nodes/<transport-node-id> to get or update the remote_tunnel_endpoint configuration on the edge node. This updates the RTEP IP assigned to the affected edge node.

GM To LM Synchronization Warning

Medium

Data synchronization between Global Manager (GM) and Local Manager (LM) failed.

When event detected: "Data synchronization between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for the {flow_identifier}."

When event resolved: "Sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized for {flow_identifier}."

  1. Check the network connectivity between remote site and local site through ping.
  2. Ensure port TCP/1236 traffic is allowed between the local and remote sites.
  3. Ensure the async-replicator service is running on both local and remote sites. Invoke the GET /api/v1/node/services/async_replicator/status NSX API or the get service async_replicator NSX CLI command to determine if the service is running.

    If the service is not running, invoke the POST /api/v1/node/services/async_replicator?action=restart NSX API or the restart service async_replicator NSX CLI command to restart the service (a scripted sketch follows this list).
  4. Check /var/log/async-replicator/ar.log to see if there are errors reported.
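A minimal Python sketch of the status check and conditional restart from step 3 (Manager FQDN and credentials are placeholders; the runtime_state field name should be verified against the node service status schema for your NSX version):

    import requests

    NSX_MGR = "nsx-mgr.example.com"  # hypothetical Manager FQDN
    AUTH = ("admin", "<password>")

    base = f"https://{NSX_MGR}/api/v1/node/services/async_replicator"

    # Fetch the service status via the documented status endpoint.
    status = requests.get(f"{base}/status", auth=AUTH, verify=False).json()
    print("async_replicator state:", status.get("runtime_state"))

    # Restart the service only if it is not running.
    if status.get("runtime_state") != "running":
        requests.post(f"{base}?action=restart", auth=AUTH, verify=False)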

GM To LM Synchronization Error

High

Data synchronization between Global Manager (GM) and Local Manager (LM) failed for an extended period.

When event detected: "Data synchronization between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for the {flow_identifier} for an extended period."

When event resolved: "Sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized for {flow_identifier}."

  1. Check the network connectivity between remote site and local site through ping.
  2. Ensure port TCP/1236 traffic is allowed between the local and remote sites.
  3. Ensure the async-replicator service is running on both local and remote sites. Invoke the GET /api/v1/node/services/async_replicator/status NSX API or the get service async_replicator NSX CLI command to determine if the service is running.

    If the service is not running, invoke the POST /api/v1/node/services/async_replicator?action=restart NSX API or the restart service async_replicator NSX CLI command to restart the service.
  4. Check /var/log/async-replicator/ar.log to see if there are errors reported.
  5. Collect a support bundle and contact VMware Support.

Queue Occupancy Threshold Exceeded

Medium

Queue occupancy size threshold exceeded warning.

When event detected: "Queue ({queue_name}) used for syncing data between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached size {queue_size} which is at or above the maximum threshold of {queue_size_threshold}%."

When event resolved: "Queue ({queue_name}) used for syncing data between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached size {queue_size} which is below the maximum threshold of {queue_size_threshold}%."

Queue size can exceed the threshold due to a communication issue with the remote site or an overloaded system. Please check system performance and /var/log/async-replicator/ar.log to see if there are any errors reported.

GM To LM Latency Warning Medium

Latency between global manager and local manager is higher than expected for more than 2 minutes.

When event detected: "Latency between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached {latency_value} which is above the threshold value of {latency_threshold}."

When event resolved: "Latency between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached {latency_value} which is below the threshold value of {latency_threshold}."

  1. Check the network connectivity between remote site and local site through ping.
  2. Ensure port TCP/1236 traffic is allowed between the local and remote sites.
  3. Check /var/log/async-replicator/ar.log to see if there are errors reported.

Cluster Degraded

Medium

Group member is down.

When event detected: "Group member {manager_node_id} of service {group_type} is down."

When event resolved: "Group member {manager_node_id} of {group_type} is up."

  1. Invoke the NSX CLI command get cluster status to view the status of group members of the cluster.
  2. Ensure the service for {group_type} is running on the node. Invoke the GET /api/v1/node/services/<service_name>/status NSX API or the get service <service_name> NSX CLI command to determine if the service is running.

    If the service is not running, invoke the POST /api/v1/node/services/<service_name>?action=restart NSX API or the restart service <service_name> NSX CLI command to restart the service.
  3. Check /var/log/ of service {group_type} to see if there are errors reported.

Cluster Unavailable

High

All the group members of the service are down.

When event detected: "All group members {manager_node_ids} of service {group_type} are down."

When event resolved: "All group members {manager_node_ids} of service {group_type} are up."

  1. Ensure the service for {group_type} is running on the node. Invoke the GET /api/v1/node/services/<service_name>/status NSX API or the get service <service_name> NSX CLI command to determine if the service is running.

    If it is not running, invoke the POST /api/v1/node/services/<service_name>?action=restart NSX API or the restart service <service_name> NSX CLI command to restart the service.
  2. Check /var/log/ of service {group_type} to see if there are errors reported.

Password Management Events

Password management events arise from the NSX Manager, NSX Edge, and the public gateway nodes.

Event Name Severity Alert Message Recommended Action
Password Expired Critical

User password has expired.

When event detected: "The password for user {username} has expired."

When event resolved: "The password for the user {username} has been changed successfully or is no longer expired."

The password for the user {username} must be changed now to access the system. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body:

PUT /api/v1/node/users/<userid>

where <userid> is the ID of the user. If the password of the admin user (with <userid> 10000) has expired, admin must log in to the system via SSH (if enabled) or the console in order to change the password. Upon entering the current expired password, admin will be prompted to enter a new password.
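For illustration, a minimal Python sketch of that API call (the Manager FQDN, credentials, and new password are placeholders; the password field name should be verified against the node users API schema for your NSX version):

    import requests

    NSX_MGR = "nsx-mgr.example.com"  # hypothetical Manager FQDN
    AUTH = ("admin", "<current-password>")
    USER_ID = "10000"                # e.g. the admin user, per the text above

    # Apply the new password via the documented node users endpoint.
    resp = requests.put(
        f"https://{NSX_MGR}/api/v1/node/users/{USER_ID}",
        json={"password": "<new-strong-password>"},
        auth=AUTH,
        verify=False,
    )
    resp.raise_for_status()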

Password About To Expire High

User password is about to expire.

When event detected: "The password for user {username} is about to expire in {password_expiration_days} days."

When event resolved: "The password for the user {username} has been changed successfully or is no longer about to expire."

Ensure the password for the user identified by {username} is changed immediately. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body:

PUT /api/v1/node/users/<userid>

where <userid> is the ID of the user.

Password Expiration Approaching Medium

User password is approaching expiration.

When event detected: "The password for user {username} is about to expire in {password_expiration_days} days."

When event resolved: "The password for the user {username} has been changed successfully or is no longer about to expire."

The password for the user identified by {username} needs to be changed soon. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body:

PUT /api/v1/node/users/<userid>

where <userid> is the ID of the user.

Routing Events

Event Name Severity Alert Message Recommended Action
BGP Down High

BGP neighbor down.

When event detected: "In Router {entity_id}, BGP neighbor {bgp_neighbor_ip} is down, reason: {failure_reason}."

When event resolved: "In Router {entity_id}, BGP neighbor {bgp_neighbor_ip} is up."

  1. SSH into the Edge node.
  2. Invoke the NSX CLI command: get logical-routers
  3. Switch to the service router {sr_id}.
  4. Check /var/log/syslog to see if there are any errors related to BGP connectivity.

BFD Down On External Interface

High

BFD session is down.

When event detected: "In router {entity_id}, BFD session for peer {peer_address} is down."

When event resolved: "In router {entity_id}, BFD session for peer {peer_address} is up."

  1. Invoke the NSX CLI command get logical-routers.
  2. Switch to the service router {sr_id}.
  3. Invoke the NSX CLI command ping {peer_address} to verify the connectivity.
Routing Down High

All BGP/BFD sessions are down.

When event detected: "All BGP/BFD sessions are down."

When event resolved: "At least one BGP/BFD session is up."

  1. Invoke the NSX CLI command get logical-routers to get the Tier0 service router.
  2. Switch to the Tier0 service router VRF, then invoke the following NSX CLI commands:
    • Verify connectivity: ping <BFD peer IP address>
    • Check BFD health:
      get bfd-config
      get bfd-sessions
    • Check BGP health: get bgp neighbor summary
  3. Check /var/log/syslog to see if there are any errors related to BGP connectivity.
Static Routing Removed High

Static route removed.

When event detected: "In router {entity_id}, static route {static_address} was removed because BFD was down."

When event resolved: "In router {entity_id}, static route {static_address} was re-added as BFD recovered."

  1. SSH into the Edge node.
  2. Invoke the NSX CLI command: get logical-routers
  3. Switch to the service router {sr_id}.
  4. Verify the connectivity by invoking the NSX CLI command:
    get bgp neighbor summary
  5. Also, verify the configuration in both NSX and the BFD peer to ensure that timers have not been changed.
MTU Mismatch Within Transport Zone High MTU configuration mismatch between Transport Nodes (ESXi, KVM and Edge) attached to the same Transport Zone. Inconsistent MTU values on switches attached to the same Transport Zone will cause connectivity issues.
  1. In the NSX UI, navigate to System > Fabric > Settings and click Inconsistent from MTU Configuration Check to see more mismatch details.
  2. Set the same MTU value on all switches attached to the same Transport Zone by invoking the following NSX API with mtu in the request body:

    PUT /api/v1/host-switch-profiles/<host-switch-profile-id>

    or the following API with physical_uplink_mtu in the request body (a scripted sketch of this call follows):

    PUT /api/v1/global-configs/SwitchingGlobalConfig
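As an illustration of the second call, a minimal Python sketch (Manager FQDN and credentials are placeholders; it assumes, as with other NSX global configs, that the PUT should echo back the object, including resource_type and _revision, obtained from a prior GET):

    import requests

    NSX_MGR = "nsx-mgr.example.com"  # hypothetical Manager FQDN
    AUTH = ("admin", "<password>")

    url = f"https://{NSX_MGR}/api/v1/global-configs/SwitchingGlobalConfig"

    # Read the current config so the PUT carries resource_type and _revision.
    cfg = requests.get(url, auth=AUTH, verify=False).json()
    cfg["physical_uplink_mtu"] = 1700  # example value; must match your underlay

    requests.put(url, json=cfg, auth=AUTH, verify=False).raise_for_status()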
Global Router MTU Too Big Medium The global router MTU configuration is bigger than the MTU of the switches in the overlay Transport Zone which connects to the Tier0 or Tier1. The global router MTU value should be at least 100 less than the MTU value of all switches, because 100 bytes are required for Geneve encapsulation.
  1. In the NSX UI, navigate to System > Fabric > Settings and click Inconsistent from MTU Configuration Check to see more mismatch details.
  2. Set a bigger MTU value on the switches by invoking the following NSX API with mtu in the request body:

    PUT /api/v1/host-switch-profiles/<host-switch-profile-id>

    or the following API with physical_uplink_mtu in the request body:

    PUT /api/v1/global-configs/SwitchingGlobalConfig

  3. Alternatively, set a smaller MTU value in the global router configuration by invoking the following NSX API with logical_uplink_mtu in the request body:

    PUT /api/v1/global-configs/RoutingGlobalConfig

Transport Node Health

Transport node health events arise from the KVM and ESXi nodes.

Event Name Severity Alert Message Recommended Action
LAG Member Down Medium

LACP reporting member down.

When event detected: "LACP reporting member down."

When event resolved: "LACP reporting member up."

Check the connection status of LAG members on hosts.
  1. In the NSX UI, navigate to Fabric > Nodes > Transport Nodes > Host Transport Nodes.
  2. In the Host Transport Nodes list, check the Node Status column.

    Find the Transport node with the degraded or down Node Status.

  3. Select <transport node> > Monitor.

    Find the bond (uplink) which is reporting degraded or down.

  4. Check the LACP member status details by logging into the failed host and running the appropriate command:
    • ESXi: esxcli network vswitch dvs vmware lacp status get
    • KVM: ovs-appctl bond/show and ovs-appctl lacp/show

Transport Node Uplink Down

Medium

Uplink is going down.

When event detected: "Uplink is going down."

When event resolved: "Uplink is going up."

Check the physical NICs status of uplinks on hosts.
  1. In the NSX UI, navigate to Fabric > Nodes > Transport Nodes > Host Transport Nodes.
  2. In the Host Transport Nodes list, check the Node Status column.

    Find the Transport node with the degraded or down Node Status.

  3. Select <transport node> > Monitor.

    Check the status details of the bond (uplink) which is reporting degraded or down.

    To avoid a degraded state, ensure all uplink interfaces are connected and up regardless of whether they are in use or not.

VPN Events

VPN events arise from the NSX Edge and public gateway nodes.

Event Name Severity Alert Message Recommended Action
IPsec Policy-Based Session Down Medium

Policy-based IPsec VPN session is down.

When event detected: "The policy-based IPsec VPN session {entity_id} is down. Reason: {session_down_reason}."

When event resolved: "The policy-based IPsec VPN session {entity_id} is up."

Check IPsec VPN session configuration and resolve errors based on the session down reason.

IPsec Route-Based Session Down Medium

Route-based IPsec VPN session is down.

When event detected: "The route-based IPsec VPN session {entity_id} is down. Reason: {session_down_reason}."

When event resolved: "The route-based IPsec VPN session {entity_id} is up."

Check IPsec VPN session configuration and resolve errors based on the session down reason.

IPsec Policy-Based Tunnel Down Medium

Policy-based IPsec VPN tunnels are down.

When event detected: "One or more policy-based IPsec VPN tunnels in session {entity_id} are down."

When event resolved: "All policy-based IPsec VPN tunnels in session {entity_id} are up."

Check IPsec VPN session configuration and resolve errors based on the tunnel down reason.

IPsec Route-Based Tunnel Down Medium

Route-based IPsec VPN tunnels are down.

When event detected: "One or more route-based IPsec VPN tunnels in session {entity_id} are down."

When event resolved: "All route-based IPsec VPN tunnels in session {entity_id} are up."

Check IPsec VPN session configuration and resolve errors based on the tunnel down reason.

L2VPN Session Down Medium

L2VPN session is down.

When event detected: "The L2VPN session {entity_id} is down."

When event resolved: "The L2VPN session {entity_id} is up."

Check IPsec VPN session configuration and resolve errors based on the session down reason.

IPsec Service Down

Medium

IPsec service is down. To view the reason why the service is down, click the View Runtime Details link.

When event detected: "The IPsec service {entity_id} is down."

When event resolved: "The IPsec service {entity_id} is up."

  1. Disable and enable the IPsec service from NSX Manager UI.
  2. If the issue still persists, contact VMware support.