The following tables describe the events that trigger alarms, including alarm messages and recommended actions to resolve them. Any event with a severity greater than LOW triggers an alarm.
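The severity rule above can be sketched as an ordered comparison. The following Python is a minimal illustration, assuming the four severity levels that appear in these tables (Low, Medium, High, Critical) are ordered in that sequence. The Severity enum and the triggers_alarm function are hypothetical names and are not part of NSX.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels used in the tables, lowest to highest (assumed ordering)."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def triggers_alarm(severity: Severity) -> bool:
    """Return True when an event of this severity raises an alarm (greater than LOW)."""
    return severity > Severity.LOW


if __name__ == "__main__":
    for level in Severity:
        print(f"{level.name}: alarm={triggers_alarm(level)}")
```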

Alarm Management Events

Alarm management events arise from the NSX Manager and Global Manager nodes.

Event Name Severity Alert Message Recommended Action
Alarm Service Overloaded Critical

The alarm service is overloaded.

When event detected: "Due to heavy volume of alarms reported, the alarm service is temporarily overloaded. The NSX UI and GET /api/v1/alarms NSX API have stopped reporting new alarms. Syslog entries and SNMP traps (if enabled) are still being emitted reporting the underlying event details. When the underlying issues causing the heavy volume of alarms are addressed, the alarm service starts reporting new alarms again."

When event resolved: "The heavy volume of alarms has subsided and new alarms are being reported again."

Review all active alarms using the Alarms page in the NSX UI or using the GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED NSX API. For each active alarm, investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service will start reporting new alarms again.

Heavy Volume of Alarms Critical

Heavy volume of a specific alarm type detected.

When event detected: "Due to heavy volume of {event_id} alarms, the alarm service has temporarily stopped reporting alarms of this type. The NSX UI and GET /api/v1/alarms NSX API are not reporting new instances of these alarms. Syslog entries and SNMP traps (if enabled) are still being emitted reporting the underlying event details. When the underlying issues causing the heavy volume of {event_id} alarms are addressed, the alarm service starts reporting new {event_id} alarms when new issues are detected again."

When event resolved: "The heavy volume of {event_id} alarms has subsided and new alarms of this type are being reported again."

Review all active alarms using the Alarms page in the NSX UI or using the GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED NSX API. For each active alarm, investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service will start reporting new {event_id} alarms again.
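When many alarms are active, it can help to see which event types dominate before working through them. The following Python sketch builds the query URL named in the recommended action and counts alarms per event type in a response body. It is an illustration only: the host name nsx.example.com is a placeholder, and the response field names results and event_type are assumptions about the response shape that should be checked against the NSX API reference. Authentication and the HTTP call itself are omitted.

```python
import json
from collections import Counter
from urllib.parse import urlencode

ACTIVE_STATUSES = ("OPEN", "ACKNOWLEDGED", "SUPPRESSED")


def active_alarms_url(manager: str) -> str:
    """Build the active-alarms query URL for an NSX Manager host (placeholder host)."""
    # safe="," keeps the comma-separated status list readable in the query string.
    query = urlencode({"status": ",".join(ACTIVE_STATUSES)}, safe=",")
    return f"https://{manager}/api/v1/alarms?{query}"


def count_by_event_type(response_body: str) -> Counter:
    """Count alarms per event type; 'results' and 'event_type' are assumed field names."""
    payload = json.loads(response_body)
    return Counter(
        alarm.get("event_type", "unknown") for alarm in payload.get("results", [])
    )


if __name__ == "__main__":
    print(active_alarms_url("nsx.example.com"))
```

The event types with the highest counts are reasonable candidates to investigate first, because resolving them reduces the alarm volume fastest.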

Certificates Events

Certificate events arise from the NSX Manager node.

Event Name Severity Alert Message Recommended Action
Certificate Expired Critical

A certificate has expired.

When event detected: "Certificate {entity-id} has expired."

When event resolved: "The expired certificate {entity-id} has been removed or is no longer expired."

Ensure services that are currently using the certificate are updated to use a new, non-expired certificate. For example, to apply a new certificate to the HTTP service, invoke the following API call:

POST /api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id>

where <cert-id> is the ID of a valid certificate reported by the API call GET /api/v1/trust-management/certificates.

After the expired certificate is no longer in use, it should be deleted with the following API call:

DELETE /api/v1/trust-management/certificates/{entity_id}
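The two API calls above can be prepared in a script. The following Python sketch only constructs the requests and does not send them. The host name and certificate IDs are placeholders, the function names are hypothetical, and authentication and TLS settings are deliberately left out.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def apply_certificate_request(manager: str, cert_id: str) -> Request:
    """Prepare the POST that applies certificate cert_id to the HTTP service."""
    query = urlencode({"action": "apply_certificate", "certificate_id": cert_id})
    return Request(
        f"https://{manager}/api/v1/node/services/http?{query}", method="POST"
    )


def delete_certificate_request(manager: str, cert_id: str) -> Request:
    """Prepare the DELETE that removes a certificate that is no longer in use."""
    path_id = quote(cert_id, safe="")  # escape any reserved characters in the ID
    return Request(
        f"https://{manager}/api/v1/trust-management/certificates/{path_id}",
        method="DELETE",
    )


if __name__ == "__main__":
    req = apply_certificate_request("nsx.example.com", "new-cert-id")
    print(req.get_method(), req.full_url)
```

Apply the new certificate first and delete the old one only afterwards, so that no service is left referencing a certificate that no longer exists.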

Certificate About to Expire High

A certificate is about to expire.

When event detected: "Certificate {entity-id} is about to expire."

When event resolved: "The expiring certificate {entity-id} has been removed or is no longer about to expire."

Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. For example, to apply a new certificate to the HTTP service, invoke the following API call:

POST /api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id>

where <cert-id> is the ID of a valid certificate reported by the API call GET /api/v1/trust-management/certificates.

After the expiring certificate is no longer in use, it should be deleted using the API call:

DELETE /api/v1/trust-management/certificates/{entity_id}

Certificate Expiration Approaching Medium

A certificate is approaching expiration.

When event detected: "Certificate {entity-id} is approaching expiration."

When event resolved: "The expiring certificate {entity-id} has been removed or is no longer approaching expiration."

Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. For example, to apply a new certificate to the HTTP service, invoke the following API call:

POST /api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id>

where <cert-id> is the ID of a valid certificate reported by the API call GET /api/v1/trust-management/certificates.

After the expiring certificate is no longer in use, it should be deleted using the API call:

DELETE /api/v1/trust-management/certificates/{entity_id}

CNI Health Events

CNI health events arise from the ESXi and KVM nodes.

Event Name Severity Alert Message Recommended Action
Hyperbus Manager Connection Down Medium

Hyperbus cannot communicate with the Manager node.

When event detected: "Hyperbus cannot communicate with the Manager node."

When event resolved: "Hyperbus can communicate with the Manager node."

The hyperbus vmkernel interface (vmk50) may be missing. See Knowledge Base article 67432.

DHCP Events

DHCP events arise from the NSX Edge and public gateway nodes.

Event Name Severity Alert Message Recommended Action
Pool Lease Allocation Failed High

IP addresses in an IP Pool have been exhausted.

When event detected: "The addresses in IP Pool {entity_id} of DHCP Server {dhcp_server_id} have been exhausted. The last DHCP request has failed and future requests will fail."

When event resolved: "IP Pool {entity_id} of DHCP Server {dhcp_server_id} is no longer exhausted. A lease is successfully allocated to the last DHCP request."

Review the DHCP pool configuration in the NSX UI or on the Edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool.

Also review the current active leases on the Edge node by invoking the NSX CLI command get dhcp lease.

Compare the leases to the number of active VMs. Consider reducing the lease time on the DHCP server configuration if the number of VMs is low compared to the number of active leases. Also consider expanding the pool range for the DHCP server by visiting the Networking > Segments > Segment page in the NSX UI.

Pool Overloaded Medium

An IP Pool is overloaded.

When event detected: "DHCP Server {dhcp_server_id} IP Pool {entity_id} usage is approaching exhaustion with {dhcp_pool_usage}% IPs allocated."

When event resolved: "The DHCP Server {dhcp_server_id} IP Pool {entity_id} has fallen below the high usage threshold."

Review the DHCP pool configuration in the NSX UI or on the Edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool.

Also review the current active leases on the Edge node by invoking the NSX CLI command get dhcp lease.

Compare the leases to the number of active VMs. Consider reducing the lease time on the DHCP server configuration if the number of VMs is low compared to the number of active leases. Also consider expanding the pool range for the DHCP server by visiting the Networking > Segments > Segment page in the NSX UI.
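The comparison described above can be sketched as a small calculation. The following Python is illustrative only: the function names are hypothetical, the 80% usage threshold is an assumed value rather than an NSX default, and the lease, VM, and pool counts would come from the CLI output and your inventory.

```python
def pool_usage_percent(active_leases: int, pool_size: int) -> float:
    """Return the percentage of pool addresses currently leased."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return 100.0 * active_leases / pool_size


def suggest_actions(active_leases: int, active_vms: int, pool_size: int,
                    usage_threshold: float = 80.0) -> list:
    """Return the recommended actions that apply (threshold is an assumption)."""
    actions = []
    # More leases than running VMs suggests stale leases that a shorter
    # lease time would reclaim sooner.
    if active_leases > active_vms:
        actions.append("reduce lease time")
    # A heavily used pool may need a larger address range.
    if pool_usage_percent(active_leases, pool_size) >= usage_threshold:
        actions.append("expand pool range")
    return actions


if __name__ == "__main__":
    print(suggest_actions(active_leases=90, active_vms=40, pool_size=100))
```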

Distributed Firewall Events

Distributed firewall events arise from the NSX Manager or ESXi nodes.

Event Name Severity Alert Message Recommended Action
Distributed Firewall CPU Usage Very High Critical

Distributed firewall CPU usage is very high.

When event detected: "The DFW CPU usage on Transport node {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%."

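When event resolved: "The DFW CPU usage on Transport node {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%."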

Consider re-balancing the VM workloads on this host to other hosts.

Please review the security design for optimization. For example, use the apply-to configuration if the rules are not applicable to the entire datacenter.

Distributed Firewall Memory Usage Very High Critical

Distributed firewall memory usage is very high.

When event detected: "The DFW memory usage {heap_type} on Transport Node {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The DFW memory usage {heap_type} on Transport Node {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%."

View the current DFW memory usage by invoking the NSX CLI command get firewall thresholds on the host.

Consider re-balancing the workloads on this host to other hosts.

DNS Events

DNS events arise from the NSX Edge and public gateway nodes.

Event Name Severity Alert Message Recommended Action
Forwarder Down High

A DNS forwarder is down.

When event detected: "DNS forwarder {entity_id} is not running. This is impacting all configured DNS Forwarders that are currently enabled."

When event resolved: "DNS forwarder {entity_id} is running again."

  1. Invoke the NSX CLI command get dns-forwarders status to verify if the DNS forwarder is in down state.
  2. Check /var/log/syslog to see if there are errors reported.
  3. Collect a support bundle and contact the NSX support team.
Forwarder Disabled High

A DNS forwarder is disabled.

When event detected: "DNS forwarder {entity_id} is disabled."

When event resolved: "DNS forwarder {entity_id} is enabled."

  1. Invoke the NSX CLI command get dns-forwarders status to verify if the DNS forwarder is in a disabled state.
  2. Use the NSX Policy API or Manager API to enable the DNS forwarder if it should not be in the disabled state.

Edge Health Events

Edge health events arise from the NSX Edge and public gateway nodes.

Event Name Severity Alert Message Recommended Action
Edge CPU Usage Very High Critical

Edge node CPU usage is very high.

When event detected: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload.
Edge CPU Usage High Medium

Edge node CPU usage is high.

When event detected: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%."

When event resolved: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload.
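The threshold comparisons in the CPU alert messages above can be sketched as follows. This Python is an illustration only: the function name is hypothetical and the thresholds are passed in by the caller, because this table does not state the default values. The severities returned are the ones listed for the Edge CPU rows.

```python
from typing import Optional


def classify_usage(usage_percent: float, high: float,
                   very_high: float) -> Optional[str]:
    """Map a usage reading to the severity of the alarm it would raise, if any.

    A reading at or above the very high threshold maps to Critical, a reading
    at or above the high threshold maps to Medium, and anything lower raises
    no alarm (None).
    """
    if high > very_high:
        raise ValueError("high threshold must not exceed very high threshold")
    if usage_percent >= very_high:
        return "Critical"
    if usage_percent >= high:
        return "Medium"
    return None


if __name__ == "__main__":
    # 75 and 90 are example thresholds chosen for this sketch.
    print(classify_usage(82.0, high=75.0, very_high=90.0))
```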
Edge Datapath Configuration Failure High

Edge node datapath configuration has failed.

When event detected: "Failed to enable the datapath on the Edge node after three attempts."

When event resolved: "Datapath on the Edge node has been successfully enabled."

Ensure the Edge node connection to the Manager node is healthy.

From the Edge node NSX CLI, invoke the command get services to check the health of services.

If the dataplane service is stopped, invoke the command start service dataplane to restart it.

Edge Datapath CPU Usage Very High Critical

Edge node datapath CPU usage is very high.

When event detected: "The datapath CPU usage on Edge node {entity-id} has reached {datapath_resource_usage}% which is at or above the very high threshold for at least two minutes."

When event resolved: "Datapath CPU usage on Edge node {entity-id} has reduced below the maximum threshold."

Review the CPU statistics on the Edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core.

Higher CPU usage is expected with higher packet rates.

Consider increasing the Edge appliance form factor size and rebalancing services on this Edge node to other Edge nodes in the same cluster or other Edge clusters.

Edge Datapath CPU Usage High Medium

Edge node datapath CPU usage is high.

When event detected: "The datapath CPU usage on Edge node {entity-id} has reached {datapath_resource_usage}% which is at or above the high threshold for at least two minutes."

When event resolved: "The CPU usage on Edge node {entity-id} has reached below the high threshold."

Review the CPU statistics on the Edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core.

Higher CPU usage is expected with higher packet rates.

Consider increasing the Edge appliance form factor size and rebalancing services on this Edge node to other Edge nodes in the same cluster or other Edge clusters.
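The datapath CPU alarms fire only when usage stays at or above the threshold for at least two minutes. The following Python sketch shows one way to check that condition over a series of readings, for example readings collected by a monitoring script. How NSX samples and evaluates the window internally is not described here, so the sampling model and the function name are assumptions.

```python
from typing import Sequence, Tuple


def sustained_at_or_above(samples: Sequence[Tuple[float, float]],
                          threshold: float,
                          window_s: float = 120.0) -> bool:
    """Return True if usage stayed at or above threshold for window_s seconds.

    samples holds (timestamp_seconds, usage_percent) pairs sorted by time.
    Only the unbroken run of readings ending at the newest sample counts.
    """
    if not samples:
        return False
    newest = samples[-1][0]
    run_start = None
    # Walk backwards until the first reading that drops below the threshold.
    for timestamp, usage in reversed(samples):
        if usage < threshold:
            break
        run_start = timestamp
    return run_start is not None and newest - run_start >= window_s


if __name__ == "__main__":
    readings = [(0.0, 95.0), (60.0, 96.0), (120.0, 97.0)]
    print(sustained_at_or_above(readings, threshold=90.0))
```

A single reading below the threshold resets the run, which avoids flagging short spikes.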

Edge Datapath Crypto Driver Down Critical

The Edge node datapath crypto driver is down.

When event detected: "Edge node crypto driver is down."

When event resolved: "Edge node crypto driver is up."

Upgrade the Edge node as needed.

Edge Datapath Memory Pool is High Medium

The Edge node datapath memory pool is high.

When event detected: "The datapath mempool usage for {mempool_name} on Edge node {entity-id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%."

When event resolved: "The datapath mempool usage for {mempool_name} on Edge node {entity-id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%."

Log in as the root user and invoke the commands edge-appctl -t /var/run/vmware/edge/dpd.ctl mempool/show and edge-appctl -t /var/run/vmware/edge/dpd.ctl memory/show malloc_heap to check DPDK memory usage.
Edge Disk Usage Very High Critical

Edge node disk usage is very high.

When event detected: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%."

Examine the partition with high usage and see if there are any unexpectedly large files that can be removed.
Edge Disk Usage High Medium

Edge node disk usage is high.

When event detected: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%."

Examine the partition with high usage and see if there are any unexpectedly large files that can be removed.
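Finding the largest files in a partition can be scripted. The following Python sketch walks a directory tree and returns the biggest files it finds. It is a generic illustration, not an NSX tool: the function name is hypothetical, and whether Python is available and permitted on a given appliance is an assumption to verify first. Review each file before removing anything.

```python
import os
from typing import List, Tuple


def largest_files(root: str, limit: int = 10) -> List[Tuple[int, str]]:
    """Return up to limit (size_in_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # skip symlinks so targets are not counted twice
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable; ignore it
    sizes.sort(reverse=True)
    return sizes[:limit]


if __name__ == "__main__":
    for size, path in largest_files("/var/log", limit=5):
        print(f"{size:>12}  {path}")
```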
Edge Global ARP Table Usage High Medium

The Edge node global ARP table usage is high.

When event detected: "Global ARP table usage on Edge node {entity-id} has reached {datapath_resource_usage}% which is above the high threshold for over two minutes."

When event resolved: "Global arp table usage on Edge node {entity-id} has reached below the high threshold."

Increase the ARP table size:
  1. Log in as the root user.
  2. Invoke the command edge-appctl -t /var/run/vmware/edge/dpd.ctl neigh/show.
  3. Check if neigh cache usage is normal.
    1. If it is normal, invoke the command edge-appctl -t /var/run/vmware/edge/dpd.ctl neigh/set_param max_entries to increase the ARP table size.
Edge Memory Usage Very High Critical

Edge node memory usage is very high.

When event detected: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload.
Edge Memory Usage High Medium

Edge node memory usage is high.

When event detected: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%."

When event resolved: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload.
Edge NIC Link Status Down Critical

Edge node NIC link is down.

When event detected: "Edge node NIC {edge_nic_name} link is down."

When event resolved: "Edge node NIC {edge_nic_name} link is up."

On the Edge node, confirm if the NIC link is physically down by invoking the NSX CLI command get interfaces.

If it is down, verify the cable connection.

Edge NIC Out of Receive Buffer Critical

Edge node NIC receive descriptor ring buffer has no space left.

When event detected: "Edge NIC {edge_nic_name} receive ring buffer has overflowed by {rx_ring_buffer_overflow_percentage}% on Edge node {entity-id} for over 60 seconds."

When event resolved: "Edge NIC {edge_nic_name} receive ring buffer usage on Edge node {entity-id} is no longer overflowing."

Invoke the NSX CLI command get dataplane, and check the following:
  1. If PPS and CPU usage are high, check the rx ring size by invoking get dataplane ring-size rx.
    • If PPS and CPU usage are high and the rx ring size is low, invoke set dataplane ring-size rx <ring-size>, and set <ring-size> to a higher value to accommodate incoming packets.
    • If the above condition is not satisfied, and the ring size is high and CPU usage is still high, the cause may be a dataplane processing overhead delay.
Edge NIC Out of Transmit Buffer Critical

Edge node NIC transmit descriptor ring buffer has no space left.

When event detected: "Edge node NIC {edge_nic_name} transmit ring buffer has overflowed by {tx_ring_buffer_overflow_percentage}% on Edge node {entity-id} for over 60 seconds."

When event resolved: "Edge node NIC {edge_nic_name} transmit ring buffer usage on Edge node {entity-id} is no longer overflowing."

Invoke the NSX CLI command get dataplane, and check the following:
  1. If PPS and CPU usage are high, check the tx ring size by invoking get dataplane ring-size tx.
    • If PPS and CPU usage are high and the tx ring size is low, invoke set dataplane ring-size tx <ring-size>, and set <ring-size> to a higher value to accommodate outgoing packets.
    • If the above condition is not satisfied, and the ring size is high and CPU usage is low or nominal, the cause may be the transmit ring size setting on the hypervisor.
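The checks for the receive and transmit buffer alarms follow the same decision pattern, which can be sketched as below. This Python is illustrative only: the function name is hypothetical, and deciding whether PPS, CPU usage, or ring size counts as high or low is left to the operator, because this table gives no numeric cutoffs.

```python
def ring_buffer_diagnosis(direction: str, pps_high: bool, cpu_high: bool,
                          ring_size_low: bool) -> str:
    """Return the likely next step for an rx or tx ring buffer overflow alarm."""
    if direction not in ("rx", "tx"):
        raise ValueError("direction must be 'rx' or 'tx'")
    # High packet rate and CPU with a small ring: enlarge the ring.
    if pps_high and cpu_high and ring_size_low:
        return f"increase ring size: set dataplane ring-size {direction} <ring-size>"
    if not ring_size_low:
        # Receive side: a large ring with CPU still high points at the dataplane.
        if direction == "rx" and cpu_high:
            return "possible dataplane processing overhead delay"
        # Transmit side: a large ring with low or nominal CPU points at the hypervisor.
        if direction == "tx" and not cpu_high:
            return "check the transmit ring size setting on the hypervisor"
    return "inconclusive"


if __name__ == "__main__":
    print(ring_buffer_diagnosis("rx", pps_high=True, cpu_high=True, ring_size_low=True))
```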
Storage Error Critical

Starting in NSX-T Data Center 3.0.1.

The following disk partitions on the Edge node are in read-only mode: {disk_partition_name}.

Examine the read-only partition to see whether a reboot resolves the issue or whether the disk needs to be replaced. Refer to KB article https://kb.vmware.com/s/article/2146870.

Endpoint Protection Events

Endpoint protection events arise from the NSX Manager or ESXi nodes.

Event Name Severity Alert Message Recommended Action
EAM Status Down Critical

ESX Agent Manager (EAM) service on a compute manager is down.

When event detected: "ESX Agent Manager (EAM) service on compute manager {entity_id} is down."

When event resolved: "ESX Agent Manager (EAM) service on compute manager {entity_id} is either up or compute manager {entity_id} has been removed."

Restart the ESX Agent Manager (EAM) service:
  • SSH into the vCenter node and run:
    service vmware-eam start
Partner Channel Down Critical

Host module and Partner SVM connection is down.

When event detected: "The connection between host module and Partner SVM {entity_id} is down."

When event resolved: "The connection between host module and Partner SVM {entity_id} is up."

See Knowledge Base article 2148821 Troubleshooting NSX Guest Introspection and make sure that the Partner SVM identified by {entity_id} is reconnected to the host module.

Federation Events

Federation events arise from the NSX Manager, NSX Edge, and the public gateway nodes.

Event Name Severity Alert Message Recommended Action

LM to LM Synchronization Error High

Starting in NSX-T Data Center 3.0.1.

The synchronization between {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for more than 5 minutes.

  1. Invoke the NSX CLI command get site-replicator remote-sites to get the connection state between remote locations. If a remote location is connected but not synchronized, it is possible that the location is still in the process of master resolution. In this case, wait for approximately 10 seconds and try invoking the CLI again to check for the state of the remote location. If a location is disconnected, try the next step.

  2. Check the connectivity from the Local Manager (LM) in location {site_name}({site_id}) to the LMs in location {remote_site_name}({remote_site_id}) via ping. If they are not pingable, check for flakiness in WAN connectivity. If there are no physical network connectivity issues, try the next step.

  3. Check the /var/log/cloudnet/nsx-ccp.log file on the Manager nodes in the local cluster in location {site_name}({site_id}) that triggered the alarm to see if there are any cross-site communication errors. In addition, look for errors logged by the nsx-appl-proxy subcomponent within /var/log/syslog.

LM to LM Synchronization Warning Medium

Starting in NSX-T Data Center 3.0.1.

The synchronization between {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed.

  1. Invoke the NSX CLI command get site-replicator remote-sites to get the connection state between remote locations. If a remote location is connected but not synchronized, it is possible that the location is still in the process of master resolution. In this case, wait for approximately 10 seconds and try invoking the CLI again to check for the state of the remote location. If a location is disconnected, try the next step.

  2. Check the connectivity from the Local Manager (LM) in location {site_name}({site_id}) to the LMs in location {remote_site_name}({remote_site_id}) via ping. If they are not pingable, check for flakiness in WAN connectivity. If there are no physical network connectivity issues, try the next step.

  3. Check the /var/log/cloudnet/nsx-ccp.log file on the Manager nodes in the local cluster in location {site_name}({site_id}) that triggered the alarm to see if there are any cross-site communication errors. In addition, look for errors logged by the nsx-appl-proxy subcomponent within /var/log/syslog.

RTEP BGP Down High

Starting in NSX-T Data Center 3.0.1.

RTEP BGP session from source IP {bgp_source_ip} to remote location {remote_site_name} neighbor IP {bgp_neighbor_ip} is down. Reason: {failure_reason}.

  1. Invoke the NSX CLI command get logical-routers on the affected edge node.

  2. Switch to the REMOTE_TUNNEL_VRF context.
  3. Invoke the NSX CLI command get bgp neighbor to check the BGP neighbor.
  4. Alternatively, invoke the NSX API GET /api/v1/transport-nodes/<transport-node-id>/inter-site/bgp/summary to get the BGP neighbor status.
  5. Invoke the NSX CLI command get interfaces and check if the correct RTEP IP address is assigned to the interface with name remote-tunnel-endpoint.
  6. Check whether ping succeeds between the assigned RTEP IP address {bgp_source_ip} and the remote location {remote_site_name} neighbor IP {bgp_neighbor_ip}.
  7. Check /var/log/syslog for any errors related to BGP.
  8. Invoke the API GET or PUT /api/v1/transport-nodes/<transport-node-id> to get/update remote_tunnel_endpoint configuration on the edge node. This will update the RTEP IP assigned to the affected edge node.

High Availability Events

High availability events arise from the NSX Edge and public cloud gateway nodes.

Event Name Severity Alert Message Recommended Action
Tier0 Gateway Failover High

A tier0 gateway has failed over.

When event detected: "The tier0 gateway {entity-id} failover from {previous_gateway_state} to {current_gateway_state}."

When event resolved: "The tier0 gateway {entity-id} is now up."

Determine the service that is down and restart it.
  1. Identify the tier0 VRF ID by running the NSX CLI command get logical-routers.
  2. Switch to the VRF context by running vrf <vrf-id>.
  3. View which service is down by running get high-availability status.
Tier1 Gateway Failover High

A tier1 gateway has failed over.

When event detected: "The tier1 gateway {entity-id} failover from {previous_gateway_state} to {current_gateway_state}."

When event resolved: "The tier1 gateway {entity-id} is now up."

Determine the service that is down and restart it.
  1. Identify the tier1 VRF ID by running the NSX CLI command get logical-routers.
  2. Switch to the VRF context by running vrf <vrf-id>.
  3. View which service is down by running get high-availability status.

Infrastructure Communication Events

Infrastructure communication events arise from the NSX Edge, KVM, ESXi, and public gateway nodes.

Event Name Severity Alert Message Recommended Action
Edge Tunnels Down Critical

An Edge node's tunnel status is down.

When event detected: "Overall tunnel status of Edge node {entity_id} is down."

When event resolved: "The tunnels of Edge node {entity_id} have been restored."

  1. Using SSH, log into the Edge node.
  2. Get the status.
    nsxcli get tunnel-ports
  3. On each tunnel, check the stats for any drops.
    get tunnel-port <UUID> stats
  4. Check the syslog file for any tunnel related errors.

Infrastructure Service Events

Infrastructure service events arise from the NSX Edge and public gateway nodes.

Event Name Severity Alert Message Recommended Action
Edge Service Status Down Critical

Edge service is down for at least one minute.

When event detected: "The service {edge_service_name} is down for at least one minute."

When event resolved: "The service {edge_service_name} is up."

On the Edge node, verify the service hasn't exited due to an error by looking for core dump files in the /var/log/core directory.

To confirm whether the service is stopped, invoke the NSX CLI command get services.

If so, run start service <service-name> to restart the service.
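The core dump check above can be scripted. The following Python sketch lists the files in the core dump directory, newest first. It is a generic illustration: the function name is hypothetical, and whether Python is available and permitted on a given Edge node is an assumption to verify first.

```python
from pathlib import Path
from typing import List


def find_core_dumps(core_dir: str = "/var/log/core") -> List[Path]:
    """Return the files in core_dir sorted by modification time, newest first."""
    directory = Path(core_dir)
    if not directory.is_dir():
        return []  # no core directory means there is nothing to report
    files = [entry for entry in directory.iterdir() if entry.is_file()]
    return sorted(files, key=lambda entry: entry.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    dumps = find_core_dumps()
    if not dumps:
        print("No core dump files found.")
    for dump in dumps:
        print(dump)
```

An empty result suggests the service did not crash with a core dump, in which case confirm its state with get services as described above.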

Edge Service Status Changed Low

Edge service status has changed.

When event detected: "The service {edge_service_name} changed from {previous_service_state} to {current_service_state}."

When event resolved: "The service {edge_service_name} changed from {previous_service_state} to {current_service_state}."

On the Edge node, verify the service hasn't exited due to an error by looking for core dump files in the /var/log/core directory.

To confirm whether the service is stopped, invoke the NSX CLI command get services.

If so, run start service <service-name> to restart the service.

Intelligence Communication Events

NSX Intelligence communication events arise from the NSX Manager node, ESXi node, and NSX Intelligence appliance.

Event Name Severity Alert Message Recommended Action
Transport node flow exporter disconnected High

A Transport node is disconnected from its Intelligence node's messaging broker. Data collection is affected.

When event detected: "The flow exporter on Transport node {entity-id} is disconnected from the Intelligence node's messaging broker. Data collection is affected."

When event resolved: "The flow exporter on Transport node {entity-id} has reconnected to the Intelligence node's messaging broker."

  1. Restart the messaging service if it is not running on the NSX Intelligence node.
  2. Resolve the network connection failure between the transport node and the NSX Intelligence node.

Intelligence Health Events

NSX Intelligence health events arise from the NSX Manager node and NSX Intelligence appliance.

Event Name Severity Alert Message Recommended Action
CPU Usage Very High Critical

Intelligence node CPU usage is very high.

When event detected: "The CPU usage on NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The CPU usage on NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%."

Use the top command to check which processes have the highest CPU usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved.

CPU Usage High Medium

Intelligence node CPU usage is high.

When event detected: "The CPU usage on NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%."

When event resolved: "The CPU usage on NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%."

Use the top command to check which processes have the highest CPU usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved.

Memory Usage Very High Critical

Intelligence node memory usage is very high.

When event detected: "The memory usage on NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The memory usage on NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%."

Use the top command to check which processes have the highest memory usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved.

Memory Usage High Medium

Intelligence node memory usage is high.

When event detected: "The memory usage on NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%."

When event resolved: "The memory usage on NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%."

Use the top command to check which processes have the highest memory usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved.

Disk Usage Very High Critical

Intelligence node disk usage is very high.

When event detected: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%."

Examine disk partition {disk_partition_name} and see if there are any unexpectedly large files that can be removed.
Disk Usage High Medium

Intelligence node disk usage is high.

When event detected: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%."

Examine disk partition {disk_partition_name} and see if there are any unexpectedly large files that can be removed.
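
One possible way to look for unexpectedly large files is sketched below in Python. The helper name largest_files is an assumption made for illustration. The walk does not stop at file system boundaries, so results for a mount point may include files from nested mounts.

```python
import os
from typing import List, Tuple


def largest_files(root: str, limit: int = 10) -> List[Tuple[int, str]]:
    """Return up to `limit` (size_in_bytes, path) pairs under root, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Symbolic links are skipped so a link target is not counted twice.
                if os.path.islink(path):
                    continue
                found.append((os.path.getsize(path), path))
            except OSError:
                # The file disappeared or is unreadable; ignore it.
                continue
    found.sort(reverse=True)
    return found[:limit]
```

For example, largest_files("/data", limit=20) would list the twenty largest files below /data. Review each candidate before removing anything.
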
Data disk partition usage very high Critical

Intelligence node data disk partition usage is very high.

When event detected: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%."

Stop NSX Intelligence data collection until the disk usage is below the threshold.

In the NSX UI, navigate to System > Appliances > NSX Intelligence Appliance. Then select ACTIONS > Stop Collecting Data.

Data disk partition usage high Medium

Intelligence node data disk partition usage is high.

When event detected: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%."

Stop NSX Intelligence data collection until the disk usage is below the threshold.

Examine the /data partition and see if there are any unexpectedly large files that can be removed.

Node status degraded High

Intelligence node status is degraded.

When event detected: "Service {service_name} on NSX Intelligence node {intelligence_node_id} is not running."

When event resolved: "Service {service_name} on NSX Intelligence node {intelligence_node_id} is running properly."

Examine service status and health information with the NSX CLI command get services on the NSX Intelligence node.

Restart any unexpectedly stopped services with the NSX CLI command restart service <service-name>.

License Events

License events arise from the NSX Manager node.

Event Name Severity Alert Message Recommended Action
License Expired Critical

A license has expired.

When event detected: "The license of type {license_edition_type} has expired."

When event resolved: "The expired license of type {license_edition_type} has been removed, updated, or is no longer expired."

Add a new, non-expired license:
  1. In the NSX UI, navigate to System > Licenses.
  2. Click Add and specify the key of the new license.
  3. Delete the expired license by selecting the check box and clicking Unassign.
License About to Expire Medium

When event detected: "The license of type {license_edition_type} is about to expire."

When event resolved: "The expiring license identified by {license_edition_type} has been removed, updated, or is no longer about to expire."

Add a new, non-expired license:
  1. In the NSX UI, navigate to System > Licenses.
  2. Click Add and specify the key of the new license.
  3. Delete the expired license by selecting the check box and clicking Unassign.

Load Balancer Events

Load balancer events arise from the NSX Edge node.

Event Name Severity Alert Message Recommended Action
Load Balancer CPU Very High Medium

Load balancer CPU usage is very high.

When event detected: "The CPU usage of load balancer {entity_id} is {system_resource_usage}%, which is higher than the very high threshold of {system_usage_threshold}%."

When event resolved: "The CPU utilization of load balancer {entity_id} is {system_resource_usage}%, which is lower than the very high threshold of {system_usage_threshold}%."

If the CPU utilization of the load balancer is higher than {system_usage_threshold}%, the workload is too high for this load balancer.

Rescale the load balancer service by changing the load balancer size from small to medium or from medium to large.

If the CPU utilization of this load balancer is still high, consider adjusting the Edge appliance form factor size or moving load balancer services to other Edge nodes for the applicable workload.

Load Balancer Status Down Medium

Load balancer service is down.

When event detected: "The load balancer service {entity_id} is down."

When event resolved: "The load balancer service {entity_id} is up."

Verify whether the load balancer service in the Edge node is running.

If the status of the load balancer service is not ready, move the Edge node into maintenance mode, then exit maintenance mode.

If the load balancer service still has not recovered, check syslog for error logs.

Virtual Server Status Down Medium

Load balancer virtual service is down.

When event detected: "The load balancer virtual server {entity_id} is down."

When event resolved: "The load balancer virtual server {entity_id} is up."

Consult the load balancer pool to determine its status and verify its configuration.

If the pool is incorrectly configured, reconfigure it, then remove the load balancer pool from the virtual server and re-add it.

Pool Status Down Medium

When event detected: "The load balancer pool {entity_id} status is down."

When event resolved: "The load balancer pool {entity_id} status is up."

  1. Consult the load balancer pool to determine which members are down.
  2. Check network connectivity from the load balancer to the impacted pool members.
  3. Validate application health of each pool member.
  4. Validate the health of each pool member using the configured monitor.

When the health of the member is established, the pool member status is updated to healthy based on the Rise Count.
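
The role of the Rise Count can be illustrated with a small Python sketch. It assumes that Rise Count means the number of consecutive successful monitor checks required before a down member is marked healthy. The function name and sample values are hypothetical and do not reflect the load balancer's internal implementation.

```python
from typing import Iterable, Optional


def checks_until_healthy(results: Iterable[bool], rise_count: int) -> Optional[int]:
    """Return the 1-based index of the check at which a down member would be
    marked healthy, or None if the streak never reaches rise_count.

    Assumption: any failed check resets the streak of successes to zero.
    """
    streak = 0
    for index, succeeded in enumerate(results, start=1):
        streak = streak + 1 if succeeded else 0
        if streak >= rise_count:
            return index
    return None


# One success, one failure, then three successes in a row, with Rise Count 3.
print(checks_until_healthy([True, False, True, True, True], rise_count=3))  # 5
```

Under this assumption, a member that intermittently fails its checks can remain down even though most individual checks succeed.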

Manager Health Events

NSX Manager health events arise from the NSX Manager node cluster.

Event Name Severity Alert Message Recommended Action
Duplicate IP Address Medium

Manager node's IP address is in use by another device.

When event detected: "Manager node {entity_id} IP address {duplicate_ip_address} is currently being used by another device in the network."

When event resolved: "Manager node {entity_id} appears to no longer be using {duplicate_ip_address}."

  1. Determine which device is using the Manager's IP address and assign the device a new IP address.
    Note: Reconfiguring the Manager to use a new IP address is not supported.
  2. Verify if the static IP address pool/DHCP server is configured correctly.
  3. Correct the IP address of the device if it is manually assigned.
Manager CPU Usage Very High Critical

Manager node CPU usage is very high.

When event detected: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this Manager node.

Consider adjusting the Manager appliance form factor size.

Manager CPU Usage High Medium

Starting in NSX-T Data Center 3.0.1.

Manager node CPU usage is high.

When event detected: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%."

When event resolved: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this Manager node.

Consider adjusting the Manager appliance form factor size.

Manager Memory Usage Very High Critical

Starting in NSX-T Data Center 3.0.1.

Manager node memory usage is very high.

When event detected: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this Manager node.

Consider adjusting the Manager appliance form factor size.

Manager Memory Usage High Medium

Manager node memory usage is high.

When event detected: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%."

When event resolved: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%."

Please review the configuration, running services and sizing of this Manager node.

Consider adjusting the Manager appliance form factor size.

Manager Disk Usage Very High Critical

Manager node disk usage is very high.

When event detected: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%."

Examine the partition with high usage and see if there are any unexpectedly large files that can be removed.
Manager Disk Usage High Medium

Manager node disk usage is high.

When event detected: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%."

When event resolved: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%."

Examine the partition with high usage and see if there are any unexpectedly large files that can be removed.
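
To double-check a partition against the "at or above" wording of the disk usage alerts, a minimal Python sketch such as the following may help. The threshold values passed in are placeholders chosen for illustration, since the actual values are the configured {system_usage_threshold} settings. The percentage computed here may also differ slightly from what the Manager node itself reports.

```python
import shutil


def usage_percent(path: str) -> float:
    """Return used space as a percentage of the total size of the file system at path."""
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total


def classify_usage(percent: float, high: float, very_high: float) -> str:
    """Classify a usage percentage; a value equal to a threshold counts as reaching it."""
    if percent >= very_high:
        return "very high"
    if percent >= high:
        return "high"
    return "normal"


# Placeholder thresholds (75 and 85), for illustration only.
print(classify_usage(usage_percent("."), high=75.0, very_high=85.0))
```
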
Manager Configuration Disk Usage Very High Critical

Manager node config disk usage is very high.

When event detected: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%. This can be an indication of high disk usage by the NSX Datastore service under the /config/corfu directory."

When event resolved: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%."

Examine the /config partition and see if there are any unexpectedly large files that can be removed.
Manager Configuration Disk Usage High Medium

Manager node config disk usage is high.

When event detected: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%. This can be an indication of rising disk usage by the NSX Datastore service under the /config/corfu directory."

When event resolved: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%."

Examine the /config partition and see if there are any unexpectedly large files that can be removed.

Operations DB Disk Usage High Medium

The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%. This can be an indication of rising disk usage by the NSX Datastore service under the /nonconfig/corfu directory.

Run the following tool and contact GSS if any issues are reported: /opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig

Operations DB Disk Usage Very High Critical

The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%. This can be an indication of rising disk usage by the NSX Datastore service under the /nonconfig/corfu directory.

Run the following tool and contact GSS if any issues are reported: /opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig

NCP Events

NSX Container Plug-in (NCP) events arise from the ESXi and KVM nodes.

Event Name Severity Alert Message Recommended Action
NCP Plugin Down Critical

Manager Node has detected the NCP is down or unhealthy.

When event detected: "Manager node has detected the NCP is down or unhealthy."

When event resolved: "Manager Node has detected the NCP is up or healthy again."

To find the clusters which are having issues, invoke the NSX API: GET /api/v1/systemhealth/container-cluster/ncp/status to fetch all cluster statuses and determine the name of any clusters that report DOWN or UNKNOWN.
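
The filtering step can be sketched in Python as follows. The response layout used here (a results list whose entries carry cluster_name and status keys) is an assumption made purely for illustration and may not match the real payload. Consult the NSX API reference for the actual schema before relying on any field name.

```python
from typing import Any, Dict, List

UNHEALTHY_STATUSES = {"DOWN", "UNKNOWN"}


def unhealthy_clusters(response: Dict[str, Any]) -> List[str]:
    """Return names of clusters whose status is DOWN or UNKNOWN.

    Assumption: the parsed JSON has a "results" list and each entry has
    hypothetical "cluster_name" and "status" keys.
    """
    names = []
    for entry in response.get("results", []):
        if str(entry.get("status", "")).upper() in UNHEALTHY_STATUSES:
            names.append(entry.get("cluster_name", "<unnamed>"))
    return names


# Hypothetical parsed response, for illustration only.
sample = {
    "results": [
        {"cluster_name": "cluster-a", "status": "HEALTHY"},
        {"cluster_name": "cluster-b", "status": "DOWN"},
        {"cluster_name": "cluster-c", "status": "UNKNOWN"},
    ]
}
print(unhealthy_clusters(sample))
```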

In the NSX UI, go to the Inventory > Container > Clusters page and find the clusters that reported a DOWN or UNKNOWN status. Then click the Nodes tab, which lists all Kubernetes and PAS cluster members.

For Kubernetes cluster:
  1. Check NCP Pod liveness by finding the K8s master node among the cluster members and logging in to it.

    Then invoke the command kubectl get pods --all-namespaces. If there is an issue with the NCP Pod, use the kubectl logs command to investigate and fix the error.

  2. Check the connection between NCP and Kubernetes API server.
    The NSX CLI can be used inside the NCP Pod to check this connection status by invoking the following commands from the master VM.
    kubectl exec -it <NCP-Pod-Name> -n nsx-system bash
    nsxcli
    get ncp-k8s-api-server status
    If there is an issue with the connection, please check both the network and NCP configurations.
  3. Check the connection between NCP and NSX Manager.
    The NSX CLI can be used inside the NCP Pod to check this connection status by invoking the following commands from the master VM.
    kubectl exec -it <NCP-Pod-Name> -n nsx-system bash
    nsxcli
    get ncp-nsx status
    If there is an issue with the connection, please check both the network and NCP configurations.
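
For the pod liveness check in step 1, the output of kubectl get pods --all-namespaces can be scanned with a short Python sketch like the one below. The assumption that NCP Pod names contain the string "ncp" is hypothetical, and the pod names in the sample are invented for illustration.

```python
from typing import List, Tuple


def ncp_pods_not_running(kubectl_output: str, name_hint: str = "ncp") -> List[Tuple[str, str, str]]:
    """Return (namespace, pod_name, status) for matching pods whose STATUS is not Running.

    Expects the default column layout: NAMESPACE NAME READY STATUS RESTARTS AGE.
    """
    problems = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        namespace, name, _ready, status = fields[:4]
        if name_hint in name and status != "Running":
            problems.append((namespace, name, status))
    return problems


# Invented sample in the default kubectl column layout.
SAMPLE = """\
NAMESPACE     NAME              READY   STATUS             RESTARTS   AGE
kube-system   coredns-example   1/1     Running            0          3d
nsx-system    nsx-ncp-example   0/1     CrashLoopBackOff   12         3d
"""
print(ncp_pods_not_running(SAMPLE))
```

A pod reported here is a natural candidate for kubectl logs. Note that this sketch does not flag a pod whose STATUS is Running but whose READY count is incomplete.
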
For PAS cluster:
  1. Check the network connections between virtual machines and fix any network issues.
  2. Check the status of both nodes and services and fix crashed nodes or services.

    Invoke the commands bosh vms and bosh instances -p to check the status of nodes and services.

Node Agents Health Events

Node agent health events arise from the ESXi and KVM nodes.

Event Name Severity Alert Message Recommended Action
Node Agents Down High

The agents running inside the Node VM appear to be down.

When event detected: "The agents running inside the node VM appear to be down."

When event resolved: "The agents inside the Node VM are running."

For ESX:

  1. If Vmk50 is missing, see Knowledge Base article 67432.
  2. If Hyperbus 4094 is missing, restarting nsx-cfgagent or restarting the container host VM may help.
  3. If the container host VIF is blocked, check the connection to the controller and make sure all configurations are sent down.
  4. If nsx-cfgagent has stopped, restart nsx-cfgagent.

For KVM:

  1. If the Hyperbus namespace is missing, restarting the nsx-opsagent may help recreate the namespace.
  2. If the Hyperbus interface is missing inside the hyperbus namespace, restarting the nsx-opsagent may help.
  3. If the nsx-agent has stopped, restart nsx-agent.

For both ESX and KVM:

  1. If the node-agent package is missing: check whether the node-agent package has been successfully installed in the container host VM.
  2. If the interface for the node-agent in the container host VM is down: check the eth1 interface status inside the container host VM.

Password Management Events

Password management events arise from the NSX Manager, NSX Edge, and the public gateway nodes.

Event Name Severity Alert Message Recommended Action
Password expired Critical

User password has expired.

When event detected: "The password for user {username} has expired."

When event resolved: "The password for the user {username} has been changed successfully or is no longer expired."

The password for the user {username} must be changed now to access the system. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body:

PUT /api/v1/node/users/<userid>

where <userid> is the ID of the user. If the password of the admin user (with <userid> 10000) has expired, admin must log in to the system via SSH (if enabled) or the console to change the password. Upon entering the current expired password, admin is prompted to enter a new password.
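
A minimal Python sketch of building such a request is shown below. The Manager host name, the user ID 10001, and the body field names password and old_password are assumptions made for illustration. Verify the exact request schema in the NSX API reference. The request is only constructed, not sent, and authentication is deliberately left out.

```python
import json
import urllib.request


def build_password_change_request(
    manager: str, userid: int, new_password: str, old_password: str
) -> urllib.request.Request:
    """Build (but do not send) a PUT request for /api/v1/node/users/<userid>."""
    # Assumed body fields; confirm them against the API reference.
    body = json.dumps({"password": new_password, "old_password": old_password})
    return urllib.request.Request(
        url=f"https://{manager}/api/v1/node/users/{userid}",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host, user ID, and placeholder passwords.
request = build_password_change_request(
    "nsx-manager.example.com", 10001, "<new-password>", "<old-password>"
)
print(request.get_method(), request.full_url)
```

Sending the request would additionally require credentials and certificate handling appropriate to the environment.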

Password about to expire High

User password is about to expire.

When event detected: "The password for user {username} is about to expire in {password_expiration_days} days."

When event resolved: "The password for the user {username} has been changed successfully or is no longer about to expire."

Ensure the password for the user identified by {username} is changed immediately. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body:

PUT /api/v1/node/users/<userid>

where <userid> is the ID of the user.

Password expiration approaching Medium

User password is approaching expiration.

When event detected: "The password for user {username} is about to expire in {password_expiration_days} days."

When event resolved: "The password for the user {username} has been changed successfully or is no longer about to expire."

The password for the user identified by {username} needs to be changed soon. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body:

PUT /api/v1/node/users/<userid>

where <userid> is the ID of the user.

Routing Events

Event Name Severity Alert Message Recommended Action
BGP Down High

BGP neighbor down.

When event detected: "In Router {entity_id}, BGP neighbor {bgp_neighbor_ip} is down, reason: {failure_reason}."

When event resolved: "In Router {entity_id}, BGP neighbor {bgp_neighbor_ip} is up."

  1. SSH into the Edge node.
  2. Invoke the NSX CLI command: get logical-routers
  3. Switch to the service router {sr_id}.
  4. Check /var/log/syslog to see if there are any errors related to BGP connectivity.

Bidirectional Forwarding Detection (BFD) Down on External Interface High

BFD session is down.

When event detected: "In router {entity_id}, BFD session for peer {peer_address} is down."

When event resolved: "In router {entity_id}, BFD session for peer {peer_address} is up."

  1. SSH into the Edge node.
  2. Invoke the NSX CLI command: get logical-routers
  3. Switch to the service router {sr_id}.
  4. Verify the connectivity by invoking the NSX CLI command: ping <peer_address>.
Routing Down High

All BGP/BFD sessions are down.

When event detected: "All BGP/BFD sessions are down."

When event resolved: "At least one BGP/BFD session is up."

  1. Invoke the NSX CLI command get logical-routers to get the Tier0 service router.
  2. Switch to the Tier0 service router VRF, then invoke the following NSX CLI commands:
    • Verify connectivity: ping <BFD peer IP address>
    • Check BFD health:
      get bfd-config 
      get bfd-sessions
    • Check BGP health: get bgp neighbor summary
  3. Check /var/log/syslog to see if there are any errors related to BGP connectivity.
Static Routing Removed High

Static route removed.

When event detected: "In router {entity_id}, static route {static_address} was removed because BFD was down."

When event resolved: "In router {entity_id}, static route {static_address} was re-added as BFD recovered."

  1. SSH into the Edge node.
  2. Invoke the NSX CLI command: get logical-routers
  3. Switch to the service router {sr_id}.
  4. Verify the connectivity by invoking the NSX CLI command:
    get bgp neighbor summary
  5. Also, verify the configuration in both NSX and the BFD peer to ensure that timers have not been changed.

Transport Node Health

Transport node health events arise from the KVM and ESXi nodes.

Event Name Severity Alert Message Recommended Action
LAG Member Down Medium

LACP reporting member down.

When event detected: "LACP reporting member down."

When event resolved: "LACP reporting member up."

Check the connection status of LAG members on hosts.
  1. In the NSX UI, navigate to Fabric > Nodes > Transport Nodes > Host Transport Nodes.
  2. In the Host Transport Nodes list, check the Node Status column.

    Find the Transport node with the degraded or down Node Status.

  3. Select <transport node> > Monitor.

    Find the bond (uplink) which is reporting degraded or down.

  4. Check the LACP member status details by logging into the failed host and running the appropriate command:
    • ESXi: esxcli network vswitch dvs vmware lacp status get
    • KVM: ovs-appctl bond/show and ovs-appctl lacp/show
N-VDS Uplink Down Medium

Uplink is going down.

When event detected: "Uplink is going down."

When event resolved: "Uplink is going up."

Check the physical NIC status of the uplinks on hosts.
  1. In the NSX UI, navigate to Fabric > Nodes > Transport Nodes > Host Transport Nodes.
  2. In the Host Transport Nodes list, check the Node Status column.

    Find the Transport node with the degraded or down Node Status.

  3. Select <transport node> > Monitor.

    Check the status details of the bond (uplink) which is reporting degraded or down.

    To avoid a degraded state, ensure all uplink interfaces are connected and up regardless of whether they are in use or not.

VPN Events

VPN events arise from the NSX Edge and public gateway nodes.

Event Name Severity Alert Message Recommended Action
IPsec Policy-Based Session Down Medium

Policy-based IPsec VPN session is down.

When event detected: "The policy-based IPsec VPN session {entity_id} is down. Reason: {session_down_reason}."

When event resolved: "The policy-based IPsec VPN session {entity_id} is up."

Check IPsec VPN session configuration and resolve errors based on the session down reason.

IPsec Route-Based Session Down Medium

Route-based IPsec VPN session is down.

When event detected: "The route-based IPsec VPN session {entity_id} is down. Reason: {session_down_reason}."

When event resolved: "The route-based IPsec VPN session {entity_id} is up."

Check IPsec VPN session configuration and resolve errors based on the session down reason.

IPsec Policy-Based Tunnel Down Medium

Policy-based IPsec VPN tunnels are down.

When event detected: "One or more policy-based IPsec VPN tunnels in session {entity_id} are down."

When event resolved: "All policy-based IPsec VPN tunnels in session {entity_id} are up."

Check IPsec VPN session configuration and resolve errors based on the tunnel down reason.

IPsec Route-Based Tunnel Down Medium

Route-based IPsec VPN tunnels are down.

When event detected: "One or more route-based IPsec VPN tunnels in session {entity_id} are down."

When event resolved: "All route-based IPsec VPN tunnels in session {entity_id} are up."

Check IPsec VPN session configuration and resolve errors based on the tunnel down reason.

L2VPN Session Down Medium

L2VPN session is down.

When event detected: "The L2VPN session {entity_id} is down."

When event resolved: "The L2VPN session {entity_id} is up."

Check the L2VPN session configuration and resolve errors based on the reason.