Event Catalog

Alarm Management Events

Alarm management events arise from the NSX Manager and Global Manager nodes.

Event Name	Severity	Alert Message	Recommended Action
Alarm Service Overloaded	Critical	The alarm service is overloaded. When event detected: "Due to heavy volume of alarms reported, the alarm service is temporarily overloaded. The NSX UI and GET /api/v1/alarms NSX API have stopped reporting new alarms. Syslog entries and SNMP traps (if enabled) are still being emitted reporting the underlying event details. When the underlying issues causing the heavy volume of alarms are addressed, the alarm service starts reporting new alarms again." When event resolved: "The heavy volume of alarms has subsided and new alarms are being reported again."	Review all active alarms using the Alarms page in the NSX UI or using the GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED NSX API. For each active alarm, investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service will start reporting new alarms again.
Heavy Volume of Alarms	Critical	Heavy volume of a specific alarm type detected. When event detected: "Due to heavy volume of `{event_id}` alarms, the alarm service has temporarily stopped reporting alarms of this type. The NSX UI and GET /api/v1/alarms NSX API are not reporting new instances of these alarms. Syslog entries and SNMP traps (if enabled) are still being emitted reporting the underlying event details. When the underlying issues causing the heavy volume of `{event_id}` alarms are addressed, the alarm service starts reporting new `{event_id}` alarms when new issues are detected again." When event resolved: "The heavy volume of `{event_id}` alarms has subsided and new alarms of this type are being reported again."	Review all active alarms using the Alarms page in the NSX UI or using the GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED NSX API. For each active alarm, investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service will start reporting new `{event_id}` alarms again.
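
The query named in the recommended action can be assembled as in the following minimal sketch. The Manager hostname is a placeholder, and authentication and the HTTP request itself are deliberately left out; only the path and the status values come from the table above.

```python
from urllib.parse import urlencode

# Statuses named in the recommended action above.
ACTIVE_STATUSES = ("OPEN", "ACKNOWLEDGED", "SUPPRESSED")


def active_alarms_url(manager_host: str) -> str:
    """Build the URL for listing active alarms.

    manager_host is a placeholder; substitute your own NSX Manager address.
    """
    # safe="," keeps the comma-separated status list unescaped in the URL.
    query = urlencode({"status": ",".join(ACTIVE_STATUSES)}, safe=",")
    return f"https://{manager_host}/api/v1/alarms?{query}"


if __name__ == "__main__":
    print(active_alarms_url("nsx-mgr.example.com"))
```

The resulting URL can be passed to whichever HTTP client and credentials your environment already uses.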

Certificates Events

Certificate events arise from the NSX Manager node.

Event Name	Severity	Alert Message	Recommended Action
Certificate Expired	Critical	A certificate has expired. When event detected: "Certificate `{entity-id}` has expired." When event resolved: "The expired certificate `{entity-id}` has been removed or is no longer expired."	Ensure services that are currently using the certificate are updated to use a new, non-expired certificate. For example, to apply a new certificate to the HTTP service, invoke the following API call: `POST /api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id>` where <cert-id> is the ID of a valid certificate reported by the API call `GET /api/v1/trust-management/certificates`. After the expired certificate is no longer in use, it should be deleted with the following API call: `DELETE /api/v1/trust-management/certificates/{entity_id}`
Certificate About to Expire	High	A certificate is about to expire. When event detected: "Certificate `{entity-id}` is about to expire." When event resolved: "The expiring certificate `{entity-id}` has been removed or is no longer about to expire."	Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. For example, to apply a new certificate to the HTTP service, invoke the following API call: `POST /api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id>` where <cert-id> is the ID of a valid certificate reported by the API call `GET /api/v1/trust-management/certificates`. After the expiring certificate is no longer in use, it should be deleted using the API call: `DELETE /api/v1/trust-management/certificates/{entity_id}`
Certificate Expiration Approaching	Medium	A certificate is approaching expiration. When event detected: "Certificate `{entity-id}` is approaching expiration." When event resolved: "The expiring certificate `{entity-id}` has been removed or is no longer approaching expiration."	Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. For example, to apply a new certificate to the HTTP service, invoke the following API call: `POST /api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id>` where <cert-id> is the ID of a valid certificate reported by the API call `GET /api/v1/trust-management/certificates`. After the expiring certificate is no longer in use, it should be deleted using the API call: `DELETE /api/v1/trust-management/certificates/{entity_id}`
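
The order of API calls in the recommended actions above can be laid out as in the sketch below. It only builds the method and path for each step; the certificate IDs are hypothetical placeholders, and sending the requests (host, authentication, error handling) is left to the reader's own tooling.

```python
from urllib.parse import quote, urlencode


def certificate_rotation_calls(new_cert_id: str, old_cert_id: str) -> list[tuple[str, str]]:
    """Return (method, path) pairs in the order the recommended action lists them.

    new_cert_id and old_cert_id are placeholders supplied by the caller.
    """
    apply_query = urlencode(
        {"action": "apply_certificate", "certificate_id": new_cert_id}
    )
    return [
        # 1. List certificates to find the ID of a valid replacement.
        ("GET", "/api/v1/trust-management/certificates"),
        # 2. Apply the replacement to the HTTP service.
        ("POST", f"/api/v1/node/services/http?{apply_query}"),
        # 3. Delete the old certificate once it is no longer in use.
        ("DELETE", f"/api/v1/trust-management/certificates/{quote(old_cert_id, safe='')}"),
    ]


if __name__ == "__main__":
    for method, path in certificate_rotation_calls("new-cert-id", "old-cert-id"):
        print(method, path)
```

Keeping the delete step last reflects the table's wording: the old certificate is removed only after no service is using it.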

CNI Health Events

CNI health events arise from the ESXi and KVM nodes.

Event Name	Severity	Alert Message	Recommended Action
Hyperbus Manager Connection Down	Medium	Hyperbus cannot communicate with the Manager node. When event detected: "Hyperbus cannot communicate with the Manager node." When event resolved: "Hyperbus can communicate with the Manager node."	The hyperbus vmkernel interface (vmk50) may be missing. See Knowledge Base article 67432.

DHCP Events

DHCP events arise from the NSX Edge and public gateway nodes.

Event Name	Severity	Alert Message	Recommended Action
Pool Lease Allocation Failed	High	IP addresses in an IP Pool have been exhausted. When event detected: "The addresses in IP Pool `{entity_id}` of DHCP Server `{dhcp_server_id}` have been exhausted. The last DHCP request has failed and future requests will fail." When event resolved: "IP Pool `{entity_id}` of DHCP Server `{dhcp_server_id}` is no longer exhausted. A lease is successfully allocated to the last DHCP request."	Review the DHCP pool configuration in the NSX UI or on the Edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool. Also review the current active leases on the Edge node by invoking the NSX CLI command get dhcp lease. Compare the leases to the number of active VMs. Consider reducing the lease time on the DHCP server configuration if the number of VMs is low compared to the number of active leases. Also consider expanding the pool range for the DHCP server by visiting the Networking > Segments > Segment page in the NSX UI.
Pool Overloaded	Medium	An IP Pool is overloaded. When event detected: "DHCP Server `{dhcp_server_id}` IP Pool `{entity_id}` usage is approaching exhaustion with `{dhcp_pool_usage}`% IPs allocated." When event resolved: "The DHCP Server `{dhcp_server_id}` IP Pool `{entity_id}` has fallen below the high usage threshold."	Review the DHCP pool configuration in the NSX UI or on the Edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool. Also review the current active leases on the Edge node by invoking the NSX CLI command get dhcp lease. Compare the leases to the number of active VMs. Consider reducing the lease time on the DHCP server configuration if the number of VMs is low compared to the number of active leases. Also consider expanding the pool range for the DHCP server by visiting the Networking > Segments > Segment page in the NSX UI.
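
The comparison described in the recommended actions above can be sketched as follows. This is an illustration only: the function names are hypothetical, the 80% threshold is an assumed value (the source does not state one), and reading "low compared to" as "fewer VMs than leases" is a simplification. The lease and VM counts would come from your own review of `get dhcp lease` output and inventory.

```python
def pool_usage_percent(active_leases: int, pool_size: int) -> float:
    """Share of the pool's addresses that are currently leased."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return 100.0 * active_leases / pool_size


def suggest_dhcp_action(active_leases: int, active_vms: int, pool_size: int,
                        high_usage_threshold: float = 80.0) -> str:
    """Pick one of the two remedies named in the table.

    high_usage_threshold is an assumed value, not taken from the source.
    """
    usage = pool_usage_percent(active_leases, pool_size)
    if usage < high_usage_threshold:
        return "no action needed"
    # Fewer VMs than leases suggests stale leases are holding addresses.
    if active_vms < active_leases:
        return "reduce the lease time"
    # Leases match real demand, so the pool itself is too small.
    return "expand the pool range"


if __name__ == "__main__":
    print(suggest_dhcp_action(active_leases=95, active_vms=40, pool_size=100))
```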

Distributed Firewall Events

Distributed firewall events arise from the NSX Manager or ESXi nodes.

Event Name	Severity	Alert Message	Recommended Action
Distributed Firewall CPU Usage Very High	Critical	Distributed firewall CPU usage is very high. When event detected: "The DFW CPU usage on Transport node `{entity_id}` has reached `{system_resource_usage}`% which is at or above the very high threshold value of `{system_usage_threshold}`%." When event resolved: "The DFW CPU usage on Transport node `{entity_id}` has reached `{system_resource_usage}`% which is below the very high threshold value of `{system_usage_threshold}`%."	Consider re-balancing the VM workloads on this host to other hosts. Please review the security design for optimization. For example, use the apply-to configuration if the rules are not applicable to the entire datacenter.
Distributed Firewall Memory Usage Very High	Critical	Distributed firewall memory usage is very high. When event detected: "The DFW memory usage `{heap_type}` on Transport Node `{entity_id}` has reached `{system_resource_usage}`% which is at or above the very high threshold value of `{system_usage_threshold}`%." When event resolved: "The DFW memory usage `{heap_type}` on Transport Node `{entity_id}` has reached `{system_resource_usage}`% which is below the very high threshold value of `{system_usage_threshold}`%."	View the current DFW memory usage by invoking the NSX CLI command get firewall thresholds on the host. Consider re-balancing the workloads on this host to other hosts.

DNS Events

DNS events arise from the NSX Edge and public gateway nodes.

Event Name	Severity	Alert Message	Recommended Action
Forwarder Down	High	A DNS forwarder is down. When event detected: "DNS forwarder `{entity_id}` is not running. This is impacting all configured DNS Forwarders that are currently enabled." When event resolved: "DNS forwarder `{entity_id}` is running again."	Invoke the NSX CLI command get dns-forwarders status to verify if the DNS forwarder is in the down state. Check /var/log/syslog to see if there are errors reported. Collect a support bundle and contact the NSX support team.
Forwarder Disabled	High	A DNS forwarder is disabled. When event detected: "DNS forwarder `{entity_id}` is disabled." When event resolved: "DNS forwarder `{entity_id}` is enabled."	Invoke the NSX CLI command get dns-forwarders status to verify if the DNS forwarder is in a disabled state. Use the NSX Policy API or Manager API to enable the DNS forwarder if it should not be in the disabled state.

Edge Health Events

Edge health events arise from the NSX Edge and public gateway nodes.

Event Name	Severity	Alert Message	Recommended Action
Edge CPU Usage Very High	Critical	Edge node CPU usage is very high. When event detected: "The CPU usage on the Edge node `{entity-id}` has reached `{system_resource_usage}`%, which is at or above the very high threshold value of `{system_usage_threshold}`%." When event resolved: "The CPU usage on the Edge node `{entity-id}` has reached `{system_resource_usage}`%, which is below the very high threshold value of `{system_usage_threshold}`%."	Please review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload.
Edge CPU Usage High	Medium	Edge node CPU usage is high. When event detected: "The CPU usage on the Edge node `{entity-id}` has reached `{system_resource_usage}`%, which is at or above the high threshold value of `{system_usage_threshold}`%." When event resolved: "The CPU usage on the Edge node `{entity-id}` has reached `{system_resource_usage}`%, which is below the high threshold value of `{system_usage_threshold}`%."	Please review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload.
Edge Datapath Configuration Failure	High	Edge node datapath configuration has failed. When event detected: "Failed to enable the datapath on the Edge node after three attempts." When event resolved: "Datapath on the Edge node has been successfully enabled."	Ensure the Edge node connection to the Manager node is healthy. From the Edge node NSX CLI, invoke the command get services to check the health of services. If the dataplane service is stopped, invoke the command start service dataplane to restart it.
Edge Datapath CPU Usage Very High	Critical	Edge node datapath CPU usage is very high. When event detected: "The datapath CPU usage on Edge node `{entity-id}` has reached `{datapath_resource_usage}`% which is at or above the very high threshold for at least two minutes." When event resolved: "Datapath CPU usage on Edge node `{entity-id}` has reduced below the maximum threshold."	Review the CPU statistics on the Edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core. Higher CPU usage is expected with higher packet rates. Consider increasing the Edge appliance form factor size and rebalancing services on this Edge node to other Edge nodes in the same cluster or other Edge clusters.
Edge Datapath CPU Usage High	Medium	Edge node datapath CPU usage is high. When event detected: "The datapath CPU usage on Edge node `{entity-id}` has reached `{datapath_resource_usage}`% which is at or above the high threshold for at least two minutes." When event resolved: "The CPU usage on Edge node `{entity-id}` has reached below the high threshold."	Review the CPU statistics on the Edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core. Higher CPU usage is expected with higher packet rates. Consider increasing the Edge appliance form factor size and rebalancing services on this Edge node to other Edge nodes in the same cluster or other Edge clusters.
Edge Datapath Crypto Driver Down	Critical	The Edge node datapath crypto driver is down. When event detected: "Edge node crypto driver is down." When event resolved: "Edge node crypto driver is up."	Upgrade the Edge node as needed.
Edge Datapath Memory Pool is High	Medium	The Edge node datapath memory pool is high. When event detected: "The datapath mempool usage for `{mempool_name}` on Edge node `{entity-id}` has reached `{system_resource_usage}`% which is at or above the high threshold value of `{system_usage_threshold}`%." When event resolved: "The datapath mempool usage for `{mempool_name}` on Edge node `{entity-id}` has reached `{system_resource_usage}`% which is below the high threshold value of `{system_usage_threshold}`%."	Log in as the root user and invoke the commands `edge-appctl -t /var/run/vmware/edge/dpd.ctl mempool/show` and `edge-appctl -t /var/run/vmware/edge/dpd.ctl memory/show malloc_heap` to check DPDK memory usage.
Edge Disk Usage Very High	Critical	Edge node disk usage is very high. When event detected: "The disk usage for the Edge node disk partition `{disk_partition_name}` has reached `{system_resource_usage}`%, which is at or above the very high threshold value of `{system_usage_threshold}`%." When event resolved: "The disk usage for the Edge node disk partition `{disk_partition_name}` has reached `{system_resource_usage}`%, which is below the very high threshold value of `{system_usage_threshold}`%."	Examine the partition with high usage and see if there are any unexpected large files that can be removed.
Edge Disk Usage High	Medium	Edge node disk usage is high. When event detected: "The disk usage for the Edge node disk partition `{disk_partition_name}` has reached `{system_resource_usage}`%, which is at or above the high threshold value of `{system_usage_threshold}`%." When event resolved: "The disk usage for the Edge node disk partition `{disk_partition_name}` has reached `{system_resource_usage}`%, which is below the high threshold value of `{system_usage_threshold}`%."	Examine the partition with high usage and see if there are any unexpected large files that can be removed.
Edge Global ARP Table Usage High	Medium	The Edge node global ARP table usage is high. When event detected: "Global ARP table usage on Edge node `{entity-id}` has reached `{datapath_resource_usage}`% which is above the high threshold for over two minutes." When event resolved: "Global arp table usage on Edge node `{entity-id}` has reached below the high threshold."	Increase the ARP table size: Log in as the root user. Invoke the command edge-appctl -t /var/run/vmware/edge/dpd.ctl neigh/show. Check if neigh cache usage is normal. If it is normal, invoke the command edge-appctl -t /var/run/vmware/edge/dpd.ctl neigh/set_param max_entries to increase the ARP table size.
Edge Memory Usage Very High	Critical	Edge node memory usage is very high. When event detected: "The memory usage on the Edge node `{entity-id}` has reached `{system_resource_usage}`%, which is at or above the very high threshold value of `{system_usage_threshold}`%." When event resolved: "The memory usage on the Edge node `{entity-id}` has reached `{system_resource_usage}`%, which is below the very high threshold value of `{system_usage_threshold}`%."	Please review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload.
Edge Memory Usage High	Medium	Edge node memory usage is high. When event detected: "The memory usage on the Edge node `{entity-id}` has reached `{system_resource_usage}`%, which is at or above the high threshold value of `{system_usage_threshold}`%." When event resolved: "The memory usage on the Edge node `{entity-id}` has reached `{system_resource_usage}`%, which is below the high threshold value of `{system_usage_threshold}`%."	Please review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload.
Edge NIC Link Status Down	Critical	Edge node NIC link is down. When event detected: "Edge node NIC `{edge_nic_name}` link is down." When event resolved: "Edge node NIC `{edge_nic_name}` link is up."	On the Edge node, confirm if the NIC link is physically down by invoking the NSX CLI command get interfaces. If it is down, verify the cable connection.
Edge NIC Out of Receive Buffer	Critical	Edge node NIC receive descriptor ring buffer has no space left. When event detected: "Edge NIC `{edge_nic_name}` receive ring buffer has overflowed by `{rx_ring_buffer_overflow_percentage}`% on Edge node `{entity-id}` for over 60 seconds." When event resolved: "Edge NIC `{edge_nic_name}` receive ring buffer usage on Edge node `{entity-id}` is no longer overflowing."	Invoke the NSX CLI command `get dataplane` and check the following: If PPS and CPU usage are high, check the rx ring size by invoking `get dataplane | find ring-size rx`. If PPS and CPU usage are high and the rx ring size is low, invoke `set dataplane ring-size rx <ring-size>`, setting <ring-size> to a higher value to accommodate incoming packets. If the above condition is not satisfied, and the ring size is high and CPU usage is still high, the cause may be a dataplane processing overhead delay.
Edge NIC Out of Transmit Buffer	Critical	Edge node NIC transmit descriptor ring buffer has no space left. When event detected: "Edge node NIC `{edge_nic_name}` transmit ring buffer has overflowed by `{tx_ring_buffer_overflow_percentage}`% on Edge node `{entity-id}` for over 60 seconds." When event resolved: "Edge node NIC `{edge_nic_name}` transmit ring buffer usage on Edge node `{entity-id}` is no longer overflowing."	Invoke the NSX CLI command `get dataplane` and check the following: If PPS and CPU usage are high, check the tx ring size by invoking `get dataplane | find ring-size tx`. If PPS and CPU usage are high and the tx ring size is low, invoke `set dataplane ring-size tx <ring-size>`, setting <ring-size> to a higher value to accommodate outgoing packets. If the above condition is not satisfied, and the ring size is high and CPU usage is low or nominal, the cause may be the transmit ring size setting on the hypervisor.
Storage Error	Critical	Starting in NSX-T Data Center 3.0.1. The following disk partitions on the Edge node are in read-only mode: `{disk_partition_name}`.	Examine the read-only partition to see whether a reboot resolves the issue or the disk needs to be replaced. Refer to KB article https://kb.vmware.com/s/article/2146870.
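
The decision steps in the Edge NIC Out of Receive Buffer and Edge NIC Out of Transmit Buffer rows can be written out as a small lookup, sketched below. The function name and the boolean inputs are hypothetical: the source does not define numeric cut-offs for "high" or "low", so those judgments are left to the operator reading the `get dataplane` output.

```python
def triage_ring_buffer(direction: str, pps_high: bool, cpu_high: bool,
                       ring_size_low: bool) -> str:
    """Map operator observations to the next step named in the table.

    direction is "rx" or "tx"; the booleans are the operator's own
    reading of the dataplane statistics (no thresholds are assumed here).
    """
    if direction not in ("rx", "tx"):
        raise ValueError("direction must be 'rx' or 'tx'")

    # Busy datapath with a small ring: enlarge the ring.
    if pps_high and cpu_high and ring_size_low:
        return f"set dataplane ring-size {direction} <ring-size>"

    # Ring already large: the two directions point to different causes.
    if not ring_size_low:
        if direction == "rx" and cpu_high:
            return "possible dataplane processing overhead delay"
        if direction == "tx" and not cpu_high:
            return "check the transmit ring size setting on the hypervisor"

    return "no documented match; investigate further"


if __name__ == "__main__":
    print(triage_ring_buffer("rx", pps_high=True, cpu_high=True, ring_size_low=True))
```

Combinations the table does not cover fall through to the last return rather than guessing at a cause.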

Endpoint Protection Events

Endpoint protection events arise from the NSX Manager or ESXi nodes.

Event Name	Severity	Alert Message	Recommended Action
EAM Status Down	Critical	ESX Agent Manager (EAM) service on a compute manager is down. When event detected: "ESX Agent Manager (EAM) service on compute manager `{entity_id}` is down." When event resolved: "ESX Agent Manager (EAM) service on compute manager `{entity_id}` is either up or compute manager `{entity_id}` has been removed."	Restart the ESX Agent Manager (EAM) service: SSH into the vCenter node and run: service vmware-eam start
Partner Channel Down	Critical	Host module and Partner SVM connection is down. When event detected: "The connection between host module and Partner SVM `{entity_id}` is down." When event resolved: "The connection between host module and Partner SVM `{entity_id}` is up."	See Knowledge Base article 2148821 Troubleshooting NSX Guest Introspection and make sure that the Partner SVM identified by `{entity_id}` is reconnected to the host module.

Federation Events

Federation events arise from the NSX Manager, NSX Edge, and the public gateway nodes.

Event Name	Severity	Alert Message	Recommended Action
LM to LM Synchronization Error	High	Starting in NSX-T Data Center 3.0.1. The synchronization between `{site_name}` (`{site_id}`) and `{remote_site_name}` (`{remote_site_id}`) failed for more than 5 minutes.	Invoke the NSX CLI command get site-replicator remote-sites to get the connection state between remote locations. If a remote location is connected but not synchronized, it is possible that the location is still in the process of master resolution. In this case, wait for approximately 10 seconds and try invoking the CLI again to check for the state of the remote location. If a location is disconnected, try the next step. Check the connectivity from the Local Manager (LM) in location `{site_name}` (`{site_id}`) to the LMs in location `{remote_site_name}` (`{remote_site_id}`) via ping. If they are not pingable, check for flakiness in WAN connectivity. If there are no physical network connectivity issues, try the next step. Check the /var/log/cloudnet/nsx-ccp.log file on the Manager nodes in the local cluster in location `{site_name}` (`{site_id}`) that triggered the alarm to see if there are any cross-site communication errors. Also look for errors being logged by the nsx-appl-proxy subcomponent within /var/log/syslog.
LM to LM Synchronization Warning	Medium	Starting in NSX-T Data Center 3.0.1. The synchronization between `{site_name}` (`{site_id}`) and `{remote_site_name}` (`{remote_site_id}`) failed.	Invoke the NSX CLI command get site-replicator remote-sites to get the connection state between remote locations. If a remote location is connected but not synchronized, it is possible that the location is still in the process of master resolution. In this case, wait for approximately 10 seconds and try invoking the CLI again to check for the state of the remote location. If a location is disconnected, try the next step. Check the connectivity from the Local Manager (LM) in location `{site_name}` (`{site_id}`) to the LMs in location `{remote_site_name}` (`{remote_site_id}`) via ping. If they are not pingable, check for flakiness in WAN connectivity. If there are no physical network connectivity issues, try the next step. Check the /var/log/cloudnet/nsx-ccp.log file on the Manager nodes in the local cluster in location `{site_name}` (`{site_id}`) that triggered the alarm to see if there are any cross-site communication errors. Also look for errors being logged by the nsx-appl-proxy subcomponent within /var/log/syslog.
RTEP BGP Down	High	Starting in NSX-T Data Center 3.0.1. RTEP BGP session from source IP `{bgp_source_ip}` to remote location `{remote_site_name}` neighbor IP `{bgp_neighbor_ip}` is down. Reason: `{failure_reason}`.	Invoke the NSX CLI command get logical-routers on the affected edge node. Switch to REMOTE_TUNNEL_VRF context Invoke the NSX CLI command get bgp neighbor to check the BGP neighbor. Alternatively, invoke the NSX API GET /api/v1/transport-nodes/<transport-node-id>/inter-site/bgp/summary to get the BGP neighbor status. Invoke the NSX CLI command get interfaces and check if the correct RTEP IP address is assigned to the interface with name remote-tunnel-endpoint. . Check if the ping is working successfully between assigned RTEP IP address `{bgp_source_ip}` and the remote location `{remote_site_name}` neighbor IP `{bgp_neighbor_ip}`. Check /var/log/syslog for any errors related to BGP. Invoke the API GET or PUT /api/v1/transport-nodes/<transport-node-id> to get/update remote_tunnel_endpoint configuration on the edge node. This will update the RTEP IP assigned to the affected edge node.
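
The "wait approximately 10 seconds and check again" step in the synchronization actions can be scripted as a small retry loop. The following is a minimal sketch, not part of NSX: the function name `wait_until_synchronized` and the `check` callable are hypothetical, and `check` stands in for whatever you use to read the remote location state (for example, inspecting the output of `get site-replicator remote-sites` by hand or through your own tooling).

```python
import time
from typing import Callable


def wait_until_synchronized(
    check: Callable[[], bool],
    attempts: int = 3,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` up to `attempts` times, pausing between tries.

    `check` is a hypothetical callable that returns True once the remote
    location reports as synchronized. Returns True as soon as it succeeds,
    or False if every attempt fails.
    """
    for attempt in range(attempts):
        if check():
            return True
        # Skip the pause after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

If the function returns False after a few attempts, continue with the connectivity and log checks described in the Recommended Action column.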

High Availability Events

High availability events arise from the NSX Edge and public cloud gateway nodes.

Event Name	Severity	Alert Message	Recommended Action
Tier0 Gateway Failover	High	A tier0 gateway has failed over. When event detected: "The tier0 gateway `{entity-id}` failover from `{previous_gateway_state}` to `{current_gateway_state}`." When event resolved: "The tier0 gateway `{entity-id}` is now up."	Determine the service that is down and restart it. Identify the tier0 VRF ID by running the NSX CLI command get logical-routers. Switch to the VRF context by running vrf <vrf-id>. View which service is down by running get high-availability status.
Tier1 Gateway Failover	High	A tier1 gateway has failed over. When event detected: "The tier1 gateway `{entity-id}` failover from `{previous_gateway_state}` to `{current_gateway_state}`." When event resolved: "The tier1 gateway `{entity-id}` is now up."	Determine the service that is down and restart it. Identify the tier1 VRF ID by running the NSX CLI command get logical-routers. Switch to the VRF context by running vrf <vrf-id>. View which service is down by running get high-availability status.

Infrastructure Communication Events

Infrastructure communication events arise from the NSX Edge, KVM, ESXi, and public gateway nodes.

Event Name	Severity	Alert Message	Recommended Action
Edge Tunnels Down	Critical	An Edge node's tunnel status is down. When event detected: "Overall tunnel status of Edge node `{entity_id}` is down." When event resolved: "The tunnels of Edge node `{entity_id}` have been restored."	Using SSH, log in to the Edge node. Get the tunnel status by running `nsxcli get tunnel-ports`. On each tunnel, check the stats for any drops by running `get tunnel-port <UUID> stats`. Check the syslog file for any tunnel-related errors.

Infrastructure Service Events

Infrastructure service events arise from the NSX Edge and public gateway nodes.

Event Name	Severity	Alert Message	Recommended Action
Edge Service Status Down	Critical	Edge service is down for at least one minute. When event detected: "The service `{edge_service_name}` is down for at least one minute." When event resolved: "The service `{edge_service_name}` is up."	On the Edge node, verify the service hasn't exited due to an error by looking for core dump files in the /var/log/core directory. To confirm whether the service is stopped, invoke the NSX CLI command get services. If so, run `start service <service-name>` to restart the service.
Edge Service Status Changed	Low	Edge service status has changed. When event detected: "The service `{edge_service_name}` changed from `{previous_service_state}` to `{current_service_state}`." When event resolved: "The service `{edge_service_name}` changed from `{previous_service_state}` to `{current_service_state}`."	On the Edge node, verify the service hasn't exited due to an error by looking for core dump files in the /var/log/core directory. To confirm whether the service is stopped, invoke the NSX CLI command get services. If so, run `start service <service-name>` to restart the service.

Intelligence Communication Events

NSX Intelligence communication events arise from the NSX Manager node, ESXi node, and NSX Intelligence appliance.

Event Name	Severity	Alert Message	Recommended Action
Transport node flow exporter disconnected	High	A Transport node is disconnected from its Intelligence node's messaging broker. Data collection is affected. When event detected: "The flow exporter on Transport node `{entity-id}` is disconnected from the Intelligence node's messaging broker. Data collection is affected." When event resolved: "The flow exporter on Transport node `{entity-id}` has reconnected to the Intelligence node's messaging broker."	Restart the messaging service if it is not running on the NSX Intelligence node. Resolve the network connection failure between the transport node and the NSX Intelligence node.
Control Channel to Transport Node Down	Critical	Control channel to Transport Node Down. When event detected: Controller service `{central_control_plane_id}` to Transport node `{entity-id}` down for at least three minutes from the Controller service's point of view. When event resolved: Controller service `{central_control_plane_id}` restores connection to Transport node `{entity-id}`.	Check the connectivity between the Controller service `{central_control_plane_id}` and the Transport node `{entity-id}` interface by using the ping command. If they are not pingable, check for flakiness in network connectivity. Check whether the TCP connections are established by using the netstat output to see if the Controller service `{central_control_plane_id}` is listening for connections on port 1235. If not, check the firewall or iptables rules to see if connection requests from Transport node `{entity_id}` to port 1235 are being blocked. Ensure that no host firewalls or network firewalls in the underlay are blocking the required IP ports between Manager nodes and Transport nodes. These ports are documented in the ports and protocols tool at https://ports.vmware.com/. The Transport node `{entity_id}` may still be in maintenance mode. You can check whether the Transport node is in maintenance mode by using the following API: `GET https://<nsx-mgr>/api/v1/transport-nodes/<tn-uuid>`. When maintenance mode is set, the Transport node is not connected to the Controller service. This is usually the case when a host upgrade is in progress. Wait a few minutes and check connectivity again. Note: This alarm is not critical and should be resolved. GSS need not be contacted for the notification of this alarm unless the alarm remains unresolved over an extended period of time.
Control Channel to Transport Node Down for too long	Warning	Control channel to Transport Node Down for too long. When event detected: Controller service `{central_control_plane_id}` to Transport node `{entity-id}` down for at least 15 minutes from the Controller service's point of view. When event resolved: Controller service `{central_control_plane_id}` restores connection to Transport node `{entity-id}`.	Check the connectivity between the Controller service `{central_control_plane_id}` and the Transport node `{entity-id}` interface by using the ping command. If they are not pingable, check for flakiness in network connectivity. Check whether the TCP connections are established by using the netstat output to see if the Controller service `{central_control_plane_id}` is listening for connections on port 1235. If not, check the firewall or iptables rules to see if connection requests from Transport node `{entity_id}` to port 1235 are being blocked. Ensure that no host firewalls or network firewalls in the underlay are blocking the required IP ports between Manager nodes and Transport nodes. These ports are documented in the ports and protocols tool at https://ports.vmware.com/. The Transport node `{entity_id}` may still be in maintenance mode. You can check whether the Transport node is in maintenance mode by using the following API: `GET https://<nsx-mgr>/api/v1/transport-nodes/<tn-uuid>`. When maintenance mode is set, the Transport node is not connected to the Controller service. This is usually the case when a host upgrade is in progress. Wait a few minutes and check connectivity again.
Management Channel To Transport Node Down	Critical	Disconnection from Manager node to transport node. When event detected: When event resolved	Ensure that there is network connectivity between the Manager nodes and the Transport node `nodename (IP)` and that no firewalls are blocking the traffic between these nodes. Ensure that the nsx-proxy service is running on the Transport node by invoking the following command: `/etc/init.d/nsx-proxy status`. If the nsx-proxy service is not running, restart it by invoking the following command: `/etc/init.d/nsx-proxy restart`.
Manager Control Channel Down	Critical	Manager to controller channel is down. When event detected: When event resolved:	On the Manager node `managernode (IP)`, invoke the following two NSX CLI commands, one after the other: `restart service mgmt-plane-bus` and `restart service manage`
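
The port 1235 check in the control channel actions can be complemented by a quick TCP probe from the Transport node side. This is a minimal sketch using only the Python standard library; the function name `can_connect` is hypothetical, and the host you pass in is whatever address your Controller service is reachable on. A successful connection only shows that the port is reachable, not that the control channel itself is healthy.

```python
import socket


def can_connect(host: str, port: int = 1235, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Port 1235 is the Controller service port named in the recommended
    action; `host` is supplied by the caller.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A False result points toward the firewall, iptables, or underlay checks listed in the Recommended Action column.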

Intelligence Health Events

NSX Intelligence health events arise from the NSX Manager node and NSX Intelligence appliance.

Event Name	Severity	Alert Message	Recommended Action
CPU Usage Very High	Critical	Intelligence node CPU usage is very high. When event detected: "The CPU usage on NSX Intelligence node `{intelligence_node_id}` is above the very high threshold value of `{system_usage_threshold}`%." When event resolved: "The CPU usage on NSX Intelligence node `{intelligence_node_id}` is below the very high threshold value of `{system_usage_threshold}`%."	Use the top command to check which processes have the most CPU usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved.
CPU Usage High	Medium	Intelligence node CPU usage is high. When event detected: "The CPU usage on NSX Intelligence node `{intelligence_node_id}` is above the high threshold value of `{system_usage_threshold}`%." When event resolved: "The CPU usage on NSX Intelligence node `{intelligence_node_id}` is below the high threshold value of `{system_usage_threshold}`%."	Use the top command to check which processes have the most CPU usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved.
Memory Usage Very High	Critical	Intelligence node memory usage is very high. When event detected: "The memory usage on NSX Intelligence node `{intelligence_node_id}` is above the very high threshold value of `{system_usage_threshold}`%." When event resolved: "The memory usage on NSX Intelligence node `{intelligence_node_id}` is below the very high threshold value of `{system_usage_threshold}`%."	Use the top command to check which processes have the most memory usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved.
Memory Usage High	Medium	Intelligence node memory usage is high. When event detected: "The memory usage on NSX Intelligence node `{intelligence_node_id}` is above the high threshold value of `{system_usage_threshold}`%." When event resolved: "The memory usage on NSX Intelligence node `{intelligence_node_id}` is below the high threshold value of `{system_usage_threshold}`%."	Use the top command to check which processes have the most memory usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved.
Disk Usage Very High	Critical	Intelligence node disk usage is very high. When event detected: "The disk usage of disk partition `{disk_partition_name}` on the NSX Intelligence node `{intelligence_node_id}` is above the very high threshold value of `{system_usage_threshold}`%." When event resolved: "The disk usage of disk partition `{disk_partition_name}` on the NSX Intelligence node `{intelligence_node_id}` is below the very high threshold value of `{system_usage_threshold}`%."	Examine disk partition `{disk_partition_name}` and see if there are any unexpected large files that can be removed.
Disk Usage High	Medium	Intelligence node disk usage is high. When event detected: "The disk usage of disk partition `{disk_partition_name}` on the NSX Intelligence node `{intelligence_node_id}` is above the high threshold value of `{system_usage_threshold}`%." When event resolved: "The disk usage of disk partition `{disk_partition_name}` on the NSX Intelligence node `{intelligence_node_id}` is below the high threshold value of `{system_usage_threshold}`%."	Examine disk partition `{disk_partition_name}` and see if there are any unexpected large files that can be removed.
Data disk partition usage very high	Critical	Intelligence node data disk partition usage is very high. When event detected: "The disk usage of disk partition /data on NSX Intelligence node `{intelligence_node_id}` is above the very high threshold value of `{system_usage_threshold}`%." When event resolved: "The disk usage of disk partition /data on NSX Intelligence node `{intelligence_node_id}` is below the very high threshold value of `{system_usage_threshold}`%."	Stop NSX Intelligence data collection until the disk usage is below the threshold. In the NSX UI, navigate to System > Appliances > NSX Intelligence Appliance. Then select ACTIONS > Stop Collecting Data.
Data disk partition usage high	Medium	Intelligence node data disk partition usage is high. When event detected: "The disk usage of disk partition /data on NSX Intelligence node `{intelligence_node_id}` is above the high threshold value of `{system_usage_threshold}`%." When event resolved: "The disk usage of disk partition /data on NSX Intelligence node `{intelligence_node_id}` is below the high threshold value of `{system_usage_threshold}`%."	Stop NSX Intelligence data collection until the disk usage is below the threshold. Examine the /data partition and see if there are any unexpected large files that can be removed.
Node status degraded	High	Intelligence node status is degraded. When event detected: "Service `{service_name}` on NSX Intelligence node `{intelligence_node_id}` is not running." When event resolved: "Service `{service_name}` on NSX Intelligence node `{intelligence_node_id}` is running properly."	Examine service status and health information with the NSX CLI command `get services` on the NSX Intelligence node. Restart unexpectedly stopped services with the NSX CLI command `restart service <service-name>`.
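
The High and Very High events in this table differ only in which threshold the usage value has crossed. The following minimal sketch illustrates that relationship; it is not NSX code. The function name `classify_usage` is hypothetical, and the threshold values are supplied by the caller because this catalog only refers to them as `{system_usage_threshold}`. The strict comparison follows the "is above" wording of the alert messages.

```python
def classify_usage(usage_percent: float, high: float, very_high: float) -> str:
    """Map a usage percentage to 'very_high', 'high', or 'normal'.

    `high` and `very_high` are caller-supplied thresholds (assumed values,
    not taken from this catalog), with high < very_high.
    """
    if usage_percent > very_high:
        return "very_high"
    if usage_percent > high:
        return "high"
    return "normal"
```

For example, with illustrative thresholds of 75 and 90, a reading of 80 would correspond to the High event and a reading of 95 to the Very High event.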

License Events

License events arise from the NSX Manager node.

Event Name	Severity	Alert Message	Recommended Action
License Expired	Critical	A license has expired. When event detected: "The license of type `{license_edition_type}` has expired." When event resolved: "The expired license of type `{license_edition_type}` has been removed, updated, or is no longer expired."	Add a new, non-expired license: In the NSX UI, navigate to System > Licenses. Click Add and specify the key of the new license. Delete the expired license by selecting the check box and clicking Unassign.
License About to Expire	Medium	When event detected: "The license of type `{license_edition_type}` is about to expire." When event resolved: "The expiring license identified by `{license_edition_type}` has been removed, updated, or is no longer about to expire."	Add a new, non-expired license: In the NSX UI, navigate to System > Licenses. Click Add and specify the key of the new license. Delete the expired license by selecting the check box and clicking Unassign.

Load Balancer Events

Load balancer events arise from the NSX Edge node.

Event Name	Severity	Alert Message	Recommended Action
Load Balancer CPU Very High	Medium	Load balancer CPU usage is very high. When event detected: "The CPU usage of load balancer `{entity_id}` is `{system_resource_usage}`%, which is higher than the very high threshold of `{system_usage_threshold}`%." When event resolved: "The CPU utilization of load balancer `{entity_id}` is `{system_resource_usage}`%, which is lower than the very high threshold of `{system_usage_threshold}`%."	If the CPU utilization of the load balancer is higher than `{system_usage_threshold}`%, the workload is too high for this load balancer. Rescale the load balancer service by changing the load balancer size from small to medium or from medium to large. If the CPU utilization of this load balancer is still high, consider adjusting the Edge appliance form factor size or moving load balancer services to other Edge nodes for the applicable workload.
Load Balancer Status Down	Medium	Load balancer service is down. When event detected: "The load balancer service `{entity_id}` is down." When event resolved: "The load balancer service `{entity_id}` is up."	Verify whether the load balancer service in the Edge node is running. If the status of the load balancer service is not ready, move the Edge node into maintenance mode, then exit maintenance mode. If the load balancer service still does not recover, check the syslog for error logs.
Virtual Server Status Down	Medium	Load balancer virtual service is down. When event detected: "The load balancer virtual server `{entity_id}` is down." When event resolved: "The load balancer virtual server `{entity_id}` is up."	Consult the load balancer pool to determine its status and verify its configuration. If it is incorrectly configured, reconfigure it, remove the load balancer pool from the virtual server, and then re-add it to the virtual server.
Pool Status Down	Medium	When event detected: "The load balancer pool `{entity_id}` status is down." When event resolved: "The load balancer pool `{entity_id}` status is up."	Consult the load balancer pool to determine which members are down. Check network connectivity from the load balancer to the impacted pool members. Validate application health of each pool member. Validate the health of each pool member using the configured monitor. When the health of the member is established, the pool member status is updated to healthy based on the Rise Count.
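
The Pool Status Down action mentions that a member is marked healthy "based on the Rise Count". The general idea of a rise count is that a member must pass a number of consecutive health checks before it is treated as up. The sketch below illustrates that idea only; it is not the NSX load balancer's implementation. The function name `member_is_up` is hypothetical, and as a simplification a single failed check marks the member down again.

```python
from typing import Iterable


def member_is_up(check_results: Iterable[bool], rise_count: int) -> bool:
    """Return the member state after replaying a sequence of check results.

    A member becomes up once `rise_count` consecutive checks succeed.
    Simplification (assumption): any failed check resets the streak and
    marks the member down immediately.
    """
    streak = 0
    up = False
    for ok in check_results:
        if ok:
            streak += 1
            if streak >= rise_count:
                up = True
        else:
            streak = 0
            up = False
    return up
```

With a rise count of 3, a member that has just recovered stays down until the third successful check in a row, which is why the pool status may lag briefly behind the application's actual recovery.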

Manager Health Events

NSX Manager health events arise from the NSX Manager node cluster.

Event Name	Severity	Alert Message	Recommended Action
Duplicate IP Address	Medium	Manager node's IP address is in use by another device. When event detected: "Manager node `{entity_id}` IP address `{duplicate_ip_address}` is currently being used by another device in the network." When event resolved: "Manager node `{entity_id}` appears to no longer be using `{duplicate_ip_address}`."	Determine which device is using the Manager's IP address and assign the device a new IP address. Note: Reconfiguring the Manager to use a new IP address is not supported. Verify that the static IP address pool or DHCP server is configured correctly. Correct the IP address of the device if it is manually assigned.
Manager CPU Usage Very High	Critical	Manager node CPU usage is very high. When event detected: "The CPU usage on the Manager node `{entity_id}` has reached `{system_resource_usage}`%, which is at or above the very high threshold value of `{system_usage_threshold}`%." When event resolved: "The CPU usage on the Manager node `{entity_id}` has reached `{system_resource_usage}`%, which is below the very high threshold value of `{system_usage_threshold}`%."	Please review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size.
Manager CPU Usage High	Medium	Starting in NSX-T Data Center 3.0.1. Manager node CPU usage is high. When event detected: "The CPU usage on the Manager node `{entity_id}` has reached `{system_resource_usage}`%, which is at or above the high threshold value of `{system_usage_threshold}`%." When event resolved: "The CPU usage on the Manager node `{entity_id}` has reached `{system_resource_usage}`%, which is below the high threshold value of `{system_usage_threshold}`%."	Please review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size.
Manager Memory Usage Very High	Critical	Starting in NSX-T Data Center 3.0.1. Manager node memory usage is very high. When event detected: "The memory usage on the Manager node `{entity_id}` has reached `{system_resource_usage}`%, which is at or above the very high threshold value of `{system_usage_threshold}`%." When event resolved: "The memory usage on the Manager node `{entity_id}` has reached `{system_resource_usage}`%, which is below the very high threshold value of `{system_usage_threshold}`%."	Please review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size.
Manager Memory Usage High	Medium	Manager node memory usage is high. When event detected: "The memory usage on the Manager node `{entity_id}` has reached `{system_resource_usage}`%, which is at or above the high threshold value of `{system_usage_threshold}`%." When event resolved: "The memory usage on the Manager node `{entity_id}` has reached `{system_resource_usage}`%, which is below the high threshold value of `{system_usage_threshold}`%."	Please review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size.
Manager Disk Usage Very High	Critical	Manager node disk usage is very high. When event detected: "The disk usage for the Manager node disk partition `{disk_partition_name}` has reached `{system_resource_usage}`%, which is at or above the very high threshold value of `{system_usage_threshold}`%." When event resolved: "The disk usage for the Manager node disk partition `{disk_partition_name}` has reached `{system_resource_usage}`%, which is below the very high threshold value of `{system_usage_threshold}`%."	Examine the partition with high usage and see if there are any unexpected large files that can be removed.
Manager Disk Usage High	Medium	Manager node disk usage is high. When event detected: "The disk usage for the Manager node disk partition `{disk_partition_name}` has reached `{system_resource_usage}`%, which is at or above the high threshold value of `{system_usage_threshold}`%." When event resolved: "The disk usage for the Manager node disk partition `{disk_partition_name}` has reached `{system_resource_usage}`%, which is below the high threshold value of `{system_usage_threshold}`%."	Examine the partition with high usage and see if there are any unexpected large files that can be removed.
Manager Configuration Disk Usage Very High	Critical	Manager node config disk usage is very high. When event detected: "The disk usage for the Manager node disk partition /config has reached `{system_resource_usage}`%, which is at or above the very high threshold value of `{system_usage_threshold}`%. This can be an indication of high disk usage by the NSX Datastore service under the /config/corfu directory." When event resolved: "The disk usage for the Manager node disk partition /config has reached `{system_resource_usage}`%, which is below the very high threshold value of `{system_usage_threshold}`%."	Examine the /config partition and see if there are any unexpected large files that can be removed.
Manager Configuration Disk Usage High	Medium	Manager node config disk usage is high. When event detected: "The disk usage for the Manager node disk partition /config has reached `{system_resource_usage}`%, which is at or above the high threshold value of `{system_usage_threshold}`%. This can be an indication of rising disk usage by the NSX Datastore service under the /config/corfu directory." When event resolved: "The disk usage for the Manager node disk partition /config has reached `{system_resource_usage}`%, which is below the high threshold value of `{system_usage_threshold}`%."	Examine the /config partition and see if there are any unexpected large files that can be removed.
Operations DB Disk Usage High	Medium	The disk usage for the Manager node disk partition /nonconfig has reached `{system_resource_usage}`%, which is at or above the high threshold value of `{system_usage_threshold}`%. This can be an indication of rising disk usage by the NSX Datastore service under the /nonconfig/corfu directory.	Run the following tool and contact GSS if it reports any issues: `/opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig`
Operations DB Disk Usage Very High	Critical	The disk usage for the Manager node disk partition /nonconfig has reached `{system_resource_usage}`%, which is at or above the very high threshold value of `{system_usage_threshold}`%. This can be an indication of rising disk usage by the NSX Datastore service under the /nonconfig/corfu directory.	Run the following tool and contact GSS if it reports any issues: `/opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig`
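
The recommended action for the Manager disk usage events is to look for unexpectedly large files on the affected partition. The following Python sketch shows one way to list the largest files under a directory. The helper name `largest_files` and the `top_n` parameter are illustrative and not part of NSX, and the sketch assumes a Python 3 interpreter is available wherever you run it.

```python
import heapq
import os


def largest_files(root, top_n=10):
    """Return up to top_n (size_in_bytes, path) tuples under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat reports the size of a symlink itself, not its target.
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # Files can disappear or be unreadable during the scan; skip them.
                continue
    return heapq.nlargest(top_n, sizes)


if __name__ == "__main__":
    # "/config" is one of the partitions named in the alert messages above.
    for size, path in largest_files("/config"):
        print(f"{size / (1024 * 1024):10.1f} MiB  {path}")
```

Review the listed files before removing anything; the sketch only reports sizes and does not decide whether a file is safe to delete.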

NCP Events

NSX Container Plug-in (NCP) events arise from the ESXi and KVM nodes.

Event Name	Severity	Alert Message	Recommended Action
NCP Plugin Down	Critical	Manager Node has detected the NCP is down or unhealthy. When event detected: "Manager node has detected the NCP is down or unhealthy." When event resolved: "Manager Node has detected the NCP is up or healthy again."	To find the clusters that are having issues, invoke the NSX API `GET /api/v1/systemhealth/container-cluster/ncp/status` to fetch all cluster statuses and determine the names of any clusters that report DOWN or UNKNOWN. Then go to the NSX UI Inventory > Container > Clusters page, find the clusters that reported DOWN or UNKNOWN status by name, and click the Nodes tab, which lists all Kubernetes and PAS cluster members. For a Kubernetes cluster: Check NCP Pod liveness by finding the K8s master node among the cluster members and logging on to it, then invoke `kubectl get pods --all-namespaces`. If there is an issue with the NCP Pod, use the `kubectl logs` command to investigate and fix the error. Check the connection between NCP and the Kubernetes API server. The NSX CLI can be used inside the NCP Pod to check this connection status by invoking the following commands in order from the master VM: `kubectl exec -it <NCP-Pod-Name> -n nsx-system bash`, then `nsxcli`, then `get ncp-k8s-api-server status`. If there is an issue with the connection, check both the network and NCP configurations. Check the connection between NCP and NSX Manager. The NSX CLI can be used inside the NCP Pod to check this connection status by invoking the following commands in order from the master VM: `kubectl exec -it <NCP-Pod-Name> -n nsx-system bash`, then `nsxcli`, then `get ncp-nsx status`. If there is an issue with the connection, check both the network and NCP configurations. For a PAS cluster: Check the network connections between virtual machines and fix any network issues. Check the status of both nodes and services and fix any crashed nodes or services. Invoke the commands `bosh vms` and `bosh instances -p` to check the status of nodes and services.
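
The first step of the recommended action, picking out clusters that report DOWN or UNKNOWN from the status API response, can be sketched in Python as follows. The response shape used here (a `results` list whose entries carry `cluster_id` and `status` fields) is an assumption made for illustration; check the API reference for your NSX version before relying on those field names. The sample payload is hypothetical.

```python
import json

# Statuses that the recommended action asks you to look for.
UNHEALTHY = {"DOWN", "UNKNOWN"}


def unhealthy_clusters(response_text):
    """Return (cluster_id, status) pairs for clusters reporting DOWN or UNKNOWN."""
    payload = json.loads(response_text)
    flagged = []
    for entry in payload.get("results", []):
        # Treat a missing status as UNKNOWN so it is not silently ignored.
        status = str(entry.get("status", "UNKNOWN")).upper()
        if status in UNHEALTHY:
            flagged.append((entry.get("cluster_id"), status))
    return flagged


# Hypothetical response body, for illustration only.
sample = json.dumps({
    "results": [
        {"cluster_id": "cluster-a", "status": "HEALTHY"},
        {"cluster_id": "cluster-b", "status": "DOWN"},
        {"cluster_id": "cluster-c", "status": "UNKNOWN"},
    ]
})
print(unhealthy_clusters(sample))
```

The flagged identifiers can then be matched against the cluster names shown on the Inventory > Container > Clusters page.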

Node Agents Health Events

Node agent health events arise from the ESXi and KVM nodes.

Event Name	Severity	Alert Message	Recommended Action
Node Agents Down	High	The agents running inside the Node VM appear to be down. When event detected: "The agents running inside the node VM appear to be down." When event resolved: "The agents inside the Node VM are running."	For ESX: If Vmk50 is missing, see Knowledge Base article 67432. If Hyperbus 4094 is missing, restarting `nsx-cfgagent` or restarting the container host VM may help. If the container host VIF is blocked, check the connection to the controller and make sure all configurations are sent down. If `nsx-cfgagent` has stopped, restart `nsx-cfgagent`. For KVM: If the Hyperbus namespace is missing, restarting `nsx-opsagent` may help recreate the namespace. If the Hyperbus interface is missing inside the Hyperbus namespace, restarting `nsx-opsagent` may help. If `nsx-agent` has stopped, restart `nsx-agent`. For both ESX and KVM: If the `node-agent` package is missing, check whether the `node-agent` package has been successfully installed in the container host VM. If the interface for `node-agent` in the container host VM is down, check the eth1 interface status inside the container host VM.

Password Management Events

Password management events arise from the NSX Manager, NSX Edge, and the public gateway nodes.

Event Name	Severity	Alert Message	Recommended Action
Password expired	Critical	User password has expired. When event detected: "The password for user `{username}` has expired." When event resolved: "The password for the user `{username}` has been changed successfully or is no longer expired."	The password for the user `{username}` must be changed now to access the system. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body: `PUT /api/v1/node/users/<userid>` where `<userid>` is the ID of the user. If the password of the admin user (with `<userid>` 10000) has expired, admin must log in to the system via SSH (if enabled) or the console to change the password. Upon entering the current expired password, admin is prompted to enter a new password.
Password about to expire	High	User password is about to expire. When event detected: "The password for user `{username}` is about to expire in `{password_expiration_days}` days." When event resolved: "The password for the user `{username}` has been changed successfully or is no longer about to expire."	Ensure the password for the user identified by `{username}` is changed immediately. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body: `PUT /api/v1/node/users/<userid>` where `<userid>` is the ID of the user.
Password expiration approaching	Medium	User password is approaching expiration. When event detected: "The password for user `{username}` is about to expire in `{password_expiration_days}` days." When event resolved: "The password for the user `{username}` has been changed successfully or is no longer about to expire."	The password for the user identified by `{username}` needs to be changed soon. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body: `PUT /api/v1/node/users/<userid>` where `<userid>` is the ID of the user.
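
As an illustration of the `PUT /api/v1/node/users/<userid>` call named in the recommended actions, the following Python sketch builds the request without sending it. Several details are assumptions rather than facts taken from this catalog: the request body field names `password` and `old_password`, the use of HTTP basic authentication, and the helper name `build_password_change_request`. Verify them against the API reference for your NSX version.

```python
import base64
import json
import urllib.request


def build_password_change_request(manager, userid, new_password, old_password,
                                  auth_user, auth_password):
    """Build (but do not send) a PUT request that sets a node user's password."""
    url = f"https://{manager}/api/v1/node/users/{userid}"
    # Assumed body fields; confirm the names in the NSX API reference.
    body = json.dumps({"password": new_password, "old_password": old_password})
    # Assumed authentication scheme: HTTP basic authentication.
    token = base64.b64encode(f"{auth_user}:{auth_password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request. Building it separately makes it easy to inspect the URL and body first, and keeps credentials out of shell history.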

Routing Events

Event Name	Severity	Alert Message	Recommended Action
BGP Down	High	BGP neighbor down. When event detected: "In Router `{entity_id}`, BGP neighbor `{bgp_neighbor_ip}` is down, reason: `{failure_reason}`." When event resolved: "In Router `{entity_id}`, BGP neighbor `{bgp_neighbor_ip}` is up."	SSH into the Edge node. Invoke the NSX CLI command `get logical-routers`, then switch to the service router `{sr_id}`. Check /var/log/syslog to see if there are any errors related to BGP connectivity.
Bidirectional Forwarding Detection Down (BFD) on External Interface	High	BFD session is down. When event detected: "In router `{entity_id}`, BFD session for peer `{peer_address}` is down." When event resolved: "In router `{entity_id}`, BFD session for peer `{peer_address}` is up."	SSH into the Edge node. Invoke the NSX CLI command `get logical-routers`, then switch to the service router `{sr_id}`. Verify the connectivity by invoking the NSX CLI command `ping <peer_address>`.
Routing Down	High	All BGP/BFD sessions are down. When event detected: "All BGP/BFD sessions are down." When event resolved: "At least one BGP/BFD sessions up."	Invoke the NSX CLI command `get logical-routers` to get the Tier0 service router. Switch to the Tier0 service router VRF, then invoke the following NSX CLI commands. Verify connectivity: `ping <BFD peer IP address>`. Check BFD health: `get bfd-config` and `get bfd-sessions`. Check BGP health: `get bgp neighbor summary`, `get bfd-config`, and `get bfd-sessions`. Check /var/log/syslog to see if there are any errors related to BGP connectivity.
Static Routing Removed	High	Static route removed. When event detected: "In router `{entity_id}`, static route `{static_address}` was removed because BFD was down." When event resolved: "In router `{entity_id}`, static route `{static_address}` was re-added as BFD recovered."	SSH into the Edge node. Invoke the NSX CLI command `get logical-routers`, then switch to the service router `{sr_id}`. Verify the connectivity by invoking the NSX CLI command `get bgp neighbor summary`. Also, verify the configuration in both NSX and the BFD peer to ensure that timers have not been changed.
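
Several of the routing actions above end with checking /var/log/syslog for BGP-related errors. A rough Python sketch of such a filter is shown below. The keyword list and the helper names are assumptions chosen for illustration, since this catalog does not document the syslog message format; adjust them to match the entries you actually see.

```python
# Words that often accompany a problem report; an illustrative guess, not an NSX list.
KEYWORDS = ("error", "fail", "down", "notification", "timeout")


def bgp_error_lines(lines):
    """Return the lines that mention BGP together with one of KEYWORDS."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if "bgp" in lowered and any(word in lowered for word in KEYWORDS):
            hits.append(line.rstrip("\n"))
    return hits


def scan_syslog(path="/var/log/syslog"):
    """Apply bgp_error_lines to a log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as handle:
        return bgp_error_lines(handle)
```

Calling `scan_syslog()` on the Edge node returns the matching lines in file order, which makes it easier to line them up with the time the alarm was raised.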

Transport Node Health

Transport node health events arise from the KVM and ESXi nodes.

Event Name	Severity	Alert Message	Recommended Action
LAG Member Down	Medium	LACP reporting member down. When event detected: "LACP reporting member down." When event resolved: "LACP reporting member up."	Check the connection status of LAG members on hosts. In the NSX UI, navigate to Fabric > Nodes > Transport Nodes > Host Transport Nodes. In the Host Transport Nodes list, check the Node Status column. Find the transport node with the degraded or down Node Status. Select <transport node> > Monitor. Find the bond (uplink) that is reporting degraded or down. Check the LACP member status details by logging in to the failed host and running the appropriate command. ESXi: `esxcli network vswitch dvs vmware lacp status get`. KVM: `ovs-appctl bond/show` and `ovs-appctl lacp/show`.
N-VDS Uplink Down	Medium	Uplink is going down. When event detected: "Uplink is going down." When event resolved: "Uplink is going up."	Check the physical NIC status of uplinks on hosts. In the NSX UI, navigate to Fabric > Nodes > Transport Nodes > Host Transport Nodes. In the Host Transport Nodes list, check the Node Status column. Find the transport node with the degraded or down Node Status. Select <transport node> > Monitor. Check the status details of the bond (uplink) that is reporting degraded or down. To avoid a degraded state, ensure all uplink interfaces are connected and up regardless of whether they are in use or not.
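
The LAG Member Down action uses a different command depending on the host type. A small Python sketch of that selection follows. The command strings are taken from the table above, while the lookup table and the function name `lacp_status_commands` are illustrative helpers, not part of NSX.

```python
import shlex

# Commands from the LAG Member Down recommended action, keyed by host type.
LACP_STATUS_COMMANDS = {
    "ESXi": ["esxcli network vswitch dvs vmware lacp status get"],
    "KVM": ["ovs-appctl bond/show", "ovs-appctl lacp/show"],
}


def lacp_status_commands(host_type):
    """Return the LACP status commands for host_type as argv lists."""
    try:
        commands = LACP_STATUS_COMMANDS[host_type]
    except KeyError:
        raise ValueError(f"unsupported host type: {host_type!r}") from None
    return [shlex.split(command) for command in commands]
```

Each returned list can be passed to `subprocess.run` on the failed host itself. Returning argv lists rather than strings avoids shell quoting issues.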

VPN Events

VPN events arise from the NSX Edge and public gateway nodes.

Event Name	Severity	Alert Message	Recommended Action
IPsec Policy-Based Session Down	Medium	Policy-based IPsec VPN session is down. When event detected: "The policy-based IPsec VPN session `{entity_id}` is down. Reason: `{session_down_reason}`." When event resolved: "The policy-based IPsec VPN session `{entity_id}` is up."	Check IPsec VPN session configuration and resolve errors based on the session down reason.
IPsec Route-Based Session Down	Medium	Route-based IPsec VPN session is down. When event detected: "The route-based IPsec VPN session `{entity_id}` is down. Reason: `{session_down_reason}`." When event resolved: "The route-based IPsec VPN session `{entity_id}` is up."	Check IPsec VPN session configuration and resolve errors based on the session down reason.
IPsec Policy-Based Tunnel Down	Medium	Policy-based IPsec VPN tunnels are down. When event detected: "One or more policy-based IPsec VPN tunnels in session `{entity_id}` are down." When event resolved: "All policy-based IPsec VPN tunnels in session `{entity_id}` are up."	Check IPsec VPN session configuration and resolve errors based on the tunnel down reason.
IPsec Route-Based Tunnel Down	Medium	Route-based IPsec VPN tunnels are down. When event detected: "One or more route-based IPsec VPN tunnels in session `{entity_id}` are down." When event resolved: "All route-based IPsec VPN tunnels in session `{entity_id}` are up."	Check IPsec VPN session configuration and resolve errors based on the tunnel down reason.
L2VPN Session Down	Medium	L2VPN session is down. When event detected: "The L2VPN session `{entity_id}` is down." When event resolved: "The L2VPN session `{entity_id}` is up."	Check IPsec VPN session configuration and resolve errors based on the reason.

Identity Firewall Events

Event Name	Severity	Alert Message	Recommended Action
Connectivity to AD server	Critical	Connectivity to AD server is lost. When event detected: "Connectivity to Identity Firewall AD server is down." When event resolved: "Connectivity to Identity Firewall AD server is up."	Verify the following: The AD server is reachable from NSX nodes. The AD server details are configured correctly in NSX. The AD server is running correctly. There are no firewalls blocking access between the AD server and NSX nodes. After fixing the connection issue, use "TEST CONNECTION" in the LDAP server UI to test the connection to the AD server.
Errors during Delta Sync	Critical	Failure to Sync AD server `error description`. When event detected: "Failure during selective sync of Identity Firewall AD server: `error details`." When event resolved: "Selective sync errors of Identity Firewall AD server fixed."	Verify whether the load balancer service in the Edge node is running. If the status of the load balancer service is not ready, move the Edge node into maintenance mode, then exit maintenance mode. If the load balancer service still has not recovered, check whether there are any error logs in syslog.
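
For the Connectivity to AD server event, the first check is whether the AD server is reachable from the NSX nodes. The following Python sketch tests plain TCP reachability. The function name `can_connect` is illustrative, and the port to probe is an assumption about your deployment: LDAP conventionally listens on 389 and LDAPS on 636, but your AD server may be configured differently.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A `True` result only shows that the port accepts connections. It does not validate credentials or the directory configuration, so the "TEST CONNECTION" step in the UI is still needed.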
