The following tables describe the events that trigger alarms, including alarm messages and recommended actions to resolve them. Any event with a severity greater than LOW triggers an alarm.
Alarm Management Events
Alarm management events arise from the NSX Manager and Global Manager nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Alarm Service Overloaded | Critical | The alarm service is overloaded. When event detected: "Due to heavy volume of alarms reported, the alarm service is temporarily overloaded. The NSX UI and GET /api/v1/alarms NSX API have stopped reporting new alarms. Syslog entries and SNMP traps (if enabled) are still being emitted reporting the underlying event details. When the underlying issues causing the heavy volume of alarms are addressed, the alarm service starts reporting new alarms again." When event resolved: "The heavy volume of alarms has subsided and new alarms are being reported again." |
Review all active alarms using the Alarms page in the NSX UI or using the GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED NSX API. For each active alarm, investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service will start reporting new alarms again. |
Heavy Volume of Alarms | Critical | Heavy volume of a specific alarm type detected. When event detected: "Due to heavy volume of {event_id} alarms, the alarm service has temporarily stopped reporting alarms of this type. The NSX UI and GET /api/v1/alarms NSX API are not reporting new instances of these alarms. Syslog entries and SNMP traps (if enabled) are still being emitted reporting the underlying event details. When the underlying issues causing the heavy volume of {event_id} alarms are addressed, the alarm service starts reporting new {event_id} alarms when new issues are detected again." When event resolved: "The heavy volume of {event_id} alarms has subsided and new alarms of this type are being reported again." |
Review all active alarms using the Alarms page in the NSX UI or using the GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED NSX API. For each active alarm, investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service will start reporting new {event_id} alarms again. See the example query after this table. |
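As a quick reference, the active-alarm query named in the actions above can be issued directly against the NSX Manager API. The endpoint is quoted from the actions; the curl invocation, hostname, and credentials are illustrative placeholders.

```
# List all open, acknowledged, and suppressed alarms (endpoint as quoted above;
# replace <nsx-manager> and the credentials with values for your environment).
curl -k -u 'admin:<password>' \
  "https://<nsx-manager>/api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED"
```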
Certificates Events
Certificate events arise from the NSX Manager node.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Certificate Expired | Critical | A certificate has expired. When event detected: "Certificate {entity-id} has expired." When event resolved: "The expired certificate {entity-id} has been removed or is no longer expired." |
Ensure services that are currently using the certificate are updated to use a new, non-expired certificate. For example, to apply a new certificate to the HTTP service, invoke the certificate apply API call, where <cert-id> is the ID of a valid certificate reported by the certificate listing API call. After the expired certificate is no longer in use, delete it with the certificate deletion API call. See the sketch after this table. |
Certificate About to Expire | High | A certificate is about to expire. When event detected: "Certificate {entity-id} is about to expire." When event resolved: "The expiring certificate {entity-id} has been removed or is no longer about to expire." |
Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. For example, to apply a new certificate to the HTTP service, invoke the certificate apply API call, where <cert-id> is the ID of a valid certificate reported by the certificate listing API call. After the expiring certificate is no longer in use, delete it with the certificate deletion API call. |
Certificate Expiration Approaching | Medium | A certificate is approaching expiration. When event detected: "Certificate {entity-id} is approaching expiration." When event resolved: "The expiring certificate {entity-id} has been removed or is no longer approaching expiration." |
Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. For example, to apply a new certificate to the HTTP service, invoke the certificate apply API call, where <cert-id> is the ID of a valid certificate reported by the certificate listing API call. After the expiring certificate is no longer in use, delete it with the certificate deletion API call. |
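The certificate workflow referenced in the actions above can be driven through the NSX API. The following is a minimal sketch; the trust-management and node-service endpoints shown here are assumed from the standard NSX-T REST API and should be verified against the API guide for your release.

```
# List certificates and note the ID of a valid, non-expired certificate (assumed endpoint).
curl -k -u 'admin:<password>' "https://<nsx-manager>/api/v1/trust-management/certificates"

# Apply the new certificate to the HTTP (reverse proxy) service (assumed endpoint).
curl -k -u 'admin:<password>' -X POST \
  "https://<nsx-manager>/api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id>"

# After the old certificate is no longer in use, delete it (assumed endpoint).
curl -k -u 'admin:<password>' -X DELETE \
  "https://<nsx-manager>/api/v1/trust-management/certificates/<old-cert-id>"
```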
CNI Health Events
CNI health events arise from the ESXi and KVM nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Hyperbus Manager Connection Down | Medium | Hyperbus cannot communicate with the Manager node. When event detected: "Hyperbus cannot communicate with the Manager node." When event resolved: "Hyperbus can communicate with the Manager node." |
The hyperbus vmkernel interface (vmk50) may be missing. See Knowledge Base article 67432. |
DHCP Events
DHCP events arise from the NSX Edge and public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Pool Lease Allocation Failed | High | IP addresses in an IP Pool have been exhausted. When event detected: "The addresses in IP Pool {entity_id} of DHCP Server {dhcp_server_id} have been exhausted. The last DHCP request has failed and future requests will fail." When event resolved: "IP Pool {entity_id} of DHCP Server {dhcp_server_id} is no longer exhausted. A lease is successfully allocated to the last DHCP request." |
Review the DHCP pool configuration in the NSX UI or on the Edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool. Also review the current active leases on the Edge node by invoking the NSX CLI command get dhcp lease, and compare the leases to the number of active VMs. Consider reducing the lease time in the DHCP server configuration if the number of VMs is low compared to the number of active leases, or expanding the pool range for the DHCP server in the NSX UI. See the CLI sketch after this table. |
Pool Overloaded | Medium | An IP Pool is overloaded. When event detected: "DHCP Server {dhcp_server_id} IP Pool {entity_id} usage is approaching exhaustion with {dhcp_pool_usage}% IPs allocated." When event resolved: "The DHCP Server {dhcp_server_id} IP Pool {entity_id} has fallen below the high usage threshold." |
Review the DHCP pool configuration in the NSX UI or on the Edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool. Also review the current active leases on the Edge node by invoking the NSX CLI command get dhcp lease, and compare the leases to the number of active VMs. Consider reducing the lease time in the DHCP server configuration if the number of VMs is low compared to the number of active leases, or expanding the pool range for the DHCP server in the NSX UI. |
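The DHCP checks above map to the following NSX CLI commands on the Edge node hosting the DHCP server; the commands are the ones named in the actions, collected here for convenience.

```
# NSX CLI on the Edge node running the DHCP server.
get dhcp ip-pool    # review configured pool ranges and current usage
get dhcp lease      # review currently active leases to compare against active VMs
```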
Distributed Firewall Events
Distributed firewall events arise from the NSX Manager or ESXi nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Distributed Firewall CPU Usage Very High | Critical | Distributed firewall CPU usage is very high. When event detected: "The DFW CPU usage on Transport node {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The DFW CPU usage on Transport node {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%." |
Consider re-balancing the VM workloads on this host to other hosts. Please review the security design for optimization. For example, use the apply-to configuration if the rules are not applicable to the entire datacenter. |
Distributed Firewall Memory Usage Very High | Critical | Distributed firewall memory usage is very high. When event detected: "The DFW memory usage {heap_type} on Transport Node {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The DFW memory usage {heap_type} on Transport Node {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%." |
View the current DFW memory usage by invoking the NSX CLI command get firewall thresholds on the host. Consider re-balancing the workloads on this host to other hosts. |
DNS Events
DNS events arise from the NSX Edge and public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Forwarder Down | High | A DNS forwarder is down. When event detected: "DNS forwarder {entity_id} is not running. This is impacting all configured DNS Forwarders that are currently enabled." When event resolved: "DNS forwarder {entity_id} is running again." |
|
Forwarder Disabled | High | A DNS forwarder is disabled. When event detected: "DNS forwarder {entity_id} is disabled." When event resolved: "DNS forwarder {entity_id} is enabled." |
|
Edge Health Events
Edge health events arise from the NSX Edge and public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Edge CPU Usage Very High | Critical | Edge node CPU usage is very high. When event detected: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload. |
Edge CPU Usage High | Medium | Edge node CPU usage is high. When event detected: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%." When event resolved: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload. |
Edge Datapath Configuration Failure | High | Edge node datapath configuration has failed. When event detected: "Failed to enable the datapath on the Edge node after three attempts." When event resolved: "Datapath on the Edge node has been successfully enabled." |
Ensure the Edge node connection to the Manager node is healthy. From the Edge node NSX CLI, invoke the command get services to check the health of services. If the dataplane service is stopped, invoke the command start service dataplane to restart it. |
Edge Datapath CPU Usage Very High | Critical | Edge node datapath CPU usage is very high. When event detected: "The datapath CPU usage on Edge node {entity-id} has reached {datapath_resource_usage}% which is at or above the very high threshold for at least two minutes." When event resolved: "Datapath CPU usage on Edge node {entity-id} has reduced below the maximum threshold." |
Review the CPU statistics on the Edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core. Higher CPU usage is expected with higher packet rates. Consider increasing the Edge appliance form factor size and rebalancing services on this Edge node to other Edge nodes in the same cluster or other Edge clusters. |
Edge Datapath CPU Usage High | Medium | Edge node datapath CPU usage is high. When event detected: "The datapath CPU usage on Edge node {entity-id} has reached {datapath_resource_usage}% which is at or above the high threshold for at least two minutes." When event resolved: "The CPU usage on Edge node {entity-id} has reached below the high threshold." |
Review the CPU statistics on the Edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core. Higher CPU usage is expected with higher packet rates. Consider increasing the Edge appliance form factor size and rebalancing services on this Edge node to other Edge nodes in the same cluster or other Edge clusters. |
Edge Datapath Crypto Driver Down | Critical | The Edge node datapath crypto driver is down. When event detected: "Edge node crypto driver is down." When event resolved: "Edge node crypto driver is up." |
Upgrade the Edge node as needed. |
Edge Datapath Memory Pool is High | Medium | The Edge node datapath memory pool is high. When event detected: "The datapath mempool usage for {mempool_name} on Edge node {entity-id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%." When event resolved: "The datapath mempool usage for {mempool_name} on Edge node {entity-id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%." |
Log in as the root user and invoke the commands edge-appctl -t /var/run/vmware/edge/dpd.ctl mempool/show and edge-appctl -t /var/run/vmware/edge/dpd.ctl memory/show malloc_heap to check DPDK memory usage. |
Edge Disk Usage Very High | Critical | Edge node disk usage is very high. When event detected: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%." |
Examine the partition with high usage and see if there are any unexpected large files that can be removed. |
Edge Disk Usage High | Medium | Edge node disk usage is high. When event detected: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%." |
Examine the partition with high usage and see if there are any unexpected large files that can be removed. |
Edge Global ARP Table Usage High | Medium | The Edge node global ARP table usage is high. When event detected: "Global ARP table usage on Edge node {entity-id} has reached {datapath_resource_usage}% which is above the high threshold for over two minutes." When event resolved: "Global ARP table usage on Edge node {entity-id} has reached below the high threshold." |
Increase the ARP table size. |
Edge Memory Usage Very High | Critical | Edge node memory usage is very high. When event detected: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload. |
Edge Memory Usage High | Medium | Edge node memory usage is high. When event detected: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%." When event resolved: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload. |
Edge NIC Link Status Down | Critical | Edge node NIC link is down. When event detected: "Edge node NIC {edge_nic_name} link is down." When event resolved: "Edge node NIC {edge_nic_name} link is up." |
On the Edge node, confirm if the NIC link is physically down by invoking the NSX CLI command get interfaces. If it is down, verify the cable connection. |
Edge NIC Out of Receive Buffer | Critical | Edge node NIC receive descriptor ring buffer has no space left. When event detected: "Edge NIC {edge_nic_name} receive ring buffer has overflowed by {rx_ring_buffer_overflow_percentage}% on Edge node {entity-id} for over 60 seconds." When event resolved: "Edge NIC {edge_nic_name} receive ring buffer usage on Edge node {entity-id} is no longer overflowing." |
Invoke the NSX CLI command get dataplane and check the receive ring buffer statistics for the affected NIC. See the CLI sketch after this table. |
Edge NIC Out of Transmit Buffer | Critical | Edge node NIC transmit descriptor ring buffer has no space left. When event detected: "Edge node NIC {edge_nic_name} transmit ring buffer has overflowed by {tx_ring_buffer_overflow_percentage}% on Edge node {entity-id} for over 60 seconds." When event resolved: "Edge node NIC {edge_nic_name} transmit ring buffer usage on Edge node {entity-id} is no longer overflowing." |
Invoke the NSX CLI command get dataplane and check the transmit ring buffer statistics for the affected NIC. |
Storage Error | Critical | Starting in NSX-T Data Center 3.0.1. The following disk partitions on the Edge node are in read-only mode: {disk_partition_name}. |
Examine the read-only partition and check whether a reboot resolves the issue or the disk needs to be replaced. Refer to KB article https://kb.vmware.com/s/article/2146870. |
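Several of the Edge health actions above reference the same NSX CLI and root-shell commands; they are collected here as a convenience, exactly as named in the actions.

```
# NSX CLI on the Edge node.
get dataplane cpu stats    # per-core packet rates, for the datapath CPU alarms
get dataplane              # ring buffer and packet statistics, for the NIC buffer alarms
get interfaces             # NIC link status, for the NIC link down alarm
get services               # service health, for the datapath configuration failure alarm

# As the root user, for the datapath memory pool alarm.
edge-appctl -t /var/run/vmware/edge/dpd.ctl mempool/show
edge-appctl -t /var/run/vmware/edge/dpd.ctl memory/show malloc_heap
```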
Endpoint Protection Events
Endpoint protection events arise from the NSX Manager or ESXi nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
EAM Status Down | Critical | ESX Agent Manager (EAM) service on a compute manager is down. When event detected: "ESX Agent Manager (EAM) service on compute manager {entity_id} is down." When event resolved: "ESX Agent Manager (EAM) service on compute manager {entity_id} is either up or compute manager {entity_id} has been removed." |
Restart the ESX Agent Manager (EAM) service on the compute manager. See the sketch after this table. |
Partner Channel Down | Critical | Host module and Partner SVM connection is down. When event detected: "The connection between host module and Partner SVM {entity_id} is down." When event resolved: "The connection between host module and Partner SVM {entity_id} is up." |
See Knowledge Base article 2148821 Troubleshooting NSX Guest Introspection and make sure that the Partner SVM identified by {entity_id} is reconnected to the host module. |
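One way to restart the EAM service is from the vCenter Server Appliance shell. This sketch assumes a VCSA deployment and the standard vmware-eam service name; adjust for your vCenter version, or restart EAM from the vSphere Client instead.

```
# On the vCenter Server Appliance shell (assumed service name: vmware-eam).
service-control --status vmware-eam   # check the current EAM service status
service-control --stop vmware-eam     # stop the ESX Agent Manager service
service-control --start vmware-eam    # start it again
```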
Federation Events
Federation events arise from the NSX Manager, NSX Edge, and the public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
LM to LM Synchronization Error | High | Starting in NSX-T Data Center 3.0.1. The synchronization between {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for more than 5 minutes. |
|
LM to LM Synchronization Warning | Medium | Starting in NSX-T Data Center 3.0.1. The synchronization between {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed. |
|
RTEP BGP Down | High | Starting in NSX-T Data Center 3.0.1. RTEP BGP session from source IP {bgp_source_ip} to remote location {remote_site_name} neighbor IP {bgp_neighbor_ip} is down. Reason: {failure_reason}. |
|
High Availability Events
High availability events arise from the NSX Edge and public cloud gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Tier0 Gateway Failover | High | A tier0 gateway has failed over. When event detected: "The tier0 gateway {entity-id} failover from {previous_gateway_state} to {current_gateway_state}." When event resolved: "The tier0 gateway {entity-id} is now up." |
Determine the service that is down and restart it. See the sketch after this table. |
Tier1 Gateway Failover | High | A tier1 gateway has failed over. When event detected: "The tier1 gateway {entity-id} failover from {previous_gateway_state} to {current_gateway_state}." When event resolved: "The tier1 gateway {entity-id} is now up." |
Determine the service that is down and restart it. |
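A minimal sketch of the restart workflow referenced above, using NSX CLI commands that appear elsewhere in this event catalog; <service-name> stands for whichever service get services reports as stopped.

```
# NSX CLI on the Edge node that reported the gateway failover.
get services                  # identify which service is down
start service <service-name>  # restart the stopped service
```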
Infrastructure Communication Events
Infrastructure communication events arise from the NSX Edge, KVM, ESXi, and public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Edge Tunnels Down | Critical | An Edge node's tunnel status is down. When event detected: "Overall tunnel status of Edge node {entity_id} is down." When event resolved: "The tunnels of Edge node {entity_id} have been restored." |
|
Infrastructure Service Events
Infrastructure service events arise from the NSX Edge and public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Edge Service Status Down | Critical | Edge service is down for at least one minute. When event detected: "The service {edge_service_name} is down for at least one minute." When event resolved: "The service {edge_service_name} is up." |
On the Edge node, verify the service hasn't exited due to an error by looking for core dump files in the /var/log/core directory. To confirm whether the service is stopped, invoke the NSX CLI command get services. If the service is stopped, run start service <service-name> to restart it. |
Edge Service Status Changed | Low | Edge service status has changed. When event detected: "The service {edge_service_name} changed from {previous_service_state} to {current_service_state}." When event resolved: "The service {edge_service_name} changed from {previous_service_state} to {current_service_state}." |
On the Edge node, verify the service hasn't exited due to an error by looking for core dump files in the /var/log/core directory. To confirm whether the service is stopped, invoke the NSX CLI command get services. If the service is stopped, run start service <service-name> to restart it. |
Intelligence Communication Events
NSX Intelligence communication events arise from the NSX Manager node, ESXi node, and NSX Intelligence appliance.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Transport node flow exporter disconnected | High | A Transport node is disconnected from its Intelligence node's messaging broker. Data collection is affected. When event detected: "The flow exporter on Transport node {entity-id} is disconnected from the Intelligence node's messaging broker. Data collection is affected." When event resolved: "The flow exporter on Transport node {entity-id} has reconnected to the Intelligence node's messaging broker." |
|
Control Channel to Transport Node Down | Critical | Control channel to Transport node is down. When event detected: Controller service central_control_plane_id to Transport node {entity-id} has been down for at least three minutes from the Controller service's point of view. When event resolved: Controller service central_control_plane_id restores the connection to Transport node {entity-id}. |
|
Control Channel to Transport Node Down for Too Long | Warning | Control channel to Transport node has been down for too long. When event detected: Controller service central_control_plane_id to Transport node {entity-id} has been down for at least 15 minutes from the Controller service's point of view. When event resolved: Controller service central_control_plane_id restores the connection to Transport node {entity-id}. |
|
Management Channel To Transport Node Down | Critical | The management channel from the Manager node to the transport node is down. |
|
Manager Control Channel Down | Critical | The Manager to controller channel is down. |
On the Manager node, invoke the NSX CLI command get services to check the status of the affected service, and restart it if it is stopped. |
Intelligence Health Events
NSX Intelligence health events arise from the NSX Manager node and NSX Intelligence appliance.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
CPU Usage Very High | Critical | Intelligence node CPU usage is very high. When event detected: "The CPU usage on NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%." When event resolved: "The CPU usage on NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%." |
Use the top command to check which processes have the highest CPU usage, then check /var/log/syslog and the processes' local logs to see if there are any outstanding errors to be resolved. |
CPU Usage High | Medium | Intelligence node CPU usage is high. When event detected: "The CPU usage on NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%." When event resolved: "The CPU usage on NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%." |
Use the top command to check which processes have the highest CPU usage, then check /var/log/syslog and the processes' local logs to see if there are any outstanding errors to be resolved. |
Memory Usage Very High | Critical | Intelligence node memory usage is very high. When event detected: "The memory usage on NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%." When event resolved: "The memory usage on NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%." |
Use the top command to check which processes have the highest memory usage, then check /var/log/syslog and the processes' local logs to see if there are any outstanding errors to be resolved. |
Memory Usage High | Medium | Intelligence node memory usage is high. When event detected: "The memory usage on NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%." When event resolved: "The memory usage on NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%." |
Use the top command to check which processes have the highest memory usage, then check /var/log/syslog and the processes' local logs to see if there are any outstanding errors to be resolved. |
Disk Usage Very High | Critical | Intelligence node disk usage is very high. When event detected: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%." |
Examine disk partition {disk_partition_name} and see if there are any unexpected large files that can be removed. |
Disk Usage High | Medium | Intelligence node disk usage is high. When event detected: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%." |
Examine disk partition {disk_partition_name} and see if there are any unexpected large files that can be removed. |
Data disk partition usage very high | Critical | Intelligence node data disk partition usage is very high. When event detected: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%." |
Stop NSX Intelligence data collection until the disk usage is below the threshold. In the NSX UI, navigate to System > Appliances > NSX Intelligence Appliance and select the action to stop data collection. |
Data disk partition usage high | Medium | Intelligence node data disk partition usage is high. When event detected: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%." |
Stop NSX Intelligence data collection until the disk usage is below the threshold. Examine the /data partition and see if there are any unexpected large files that can be removed. |
Node status degraded | High | Intelligence node status is degraded. When event detected: "Service {service_name} on NSX Intelligence node {intelligence_node_id} is not running." When event resolved: "Service {service_name} on NSX Intelligence node {intelligence_node_id} is running properly." |
Examine the service status and health information with the NSX CLI command get services on the NSX Intelligence node. Restart any unexpectedly stopped services with the NSX CLI command restart service <service-name>. See the sketch after this table. |
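The Intelligence node checks above can be combined into a short triage pass. The top and df commands are standard Linux tools on the appliance, and the NSX CLI commands are the ones named in the actions.

```
# On the NSX Intelligence appliance.
top                             # identify processes with the highest CPU or memory usage
df -h                           # check disk partition usage, including /data

# NSX CLI on the Intelligence node.
get services                    # list service status
restart service <service-name>  # restart an unexpectedly stopped service
```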
License Events
License events arise from the NSX Manager node.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
License Expired | Critical | A license has expired. When event detected: "The license of type {license_edition_type} has expired." When event resolved: "The expired license of type {license_edition_type} has been removed, updated, or is no longer expired." |
Add a new, non-expired license using the NSX UI or the license API. See the example after this table. |
License About to Expire | Medium | A license is about to expire. When event detected: "The license of type {license_edition_type} is about to expire." When event resolved: "The expiring license identified by {license_edition_type} has been removed, updated, or is no longer about to expire." |
Add a new, non-expired license using the NSX UI or the license API. |
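Licenses can be added from the NSX UI or through the API. The following is a minimal sketch; the /api/v1/licenses endpoint and request body are assumed from the standard NSX-T REST API and should be verified for your release.

```
# Add a new license key (assumed endpoint and body).
curl -k -u 'admin:<password>' -X POST "https://<nsx-manager>/api/v1/licenses" \
  -H 'Content-Type: application/json' \
  -d '{"license_key": "<new-license-key>"}'
```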
Load Balancer Events
Load balancer events arise from the NSX Edge node.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Load Balancer CPU Very High | Medium | Load balancer CPU usage is very high. When event detected: "The CPU usage of load balancer {entity_id} is {system_resource_usage}%, which is higher than the very high threshold of {system_usage_threshold}%." When event resolved: "The CPU utilization of load balancer {entity_id} is {system_resource_usage}%, which is lower than the very high threshold of {system_usage_threshold}%." |
If the CPU utilization of the load balancer is higher than {system_usage_threshold}%, the workload is too high for this load balancer. Rescale the load balancer service by changing the load balancer size from small to medium or from medium to large. If the CPU utilization of this load balancer is still high, consider adjusting the Edge appliance form factor size or moving load balancer services to other Edge nodes for the applicable workload. |
Load Balancer Status Down | Medium | Load balancer service is down. When event detected: "The load balancer service {entity_id} is down." When event resolved: "The load balancer service {entity_id} is up." |
Verify whether the load balancer service on the Edge node is running. If the status of the load balancer service is not ready, place the Edge node into maintenance mode and then exit maintenance mode. If the load balancer service still does not recover, check whether there are any error logs in the syslog. |
Virtual Server Status Down | Medium | Load balancer virtual service is down. When event detected: "The load balancer virtual server {entity_id} is down." When event resolved: "The load balancer virtual server {entity_id} is up." |
Consult the load balancer pool to determine its status and verify its configuration. If it is incorrectly configured, reconfigure it, remove the load balancer pool from the virtual server, and then re-add it to the virtual server. |
Pool Status Down | Medium | Load balancer pool is down. When event detected: "The load balancer pool {entity_id} status is down." When event resolved: "The load balancer pool {entity_id} status is up." |
Check the load balancer pool members to determine why they are down and resolve the underlying issue. When the health of a member is re-established, the pool member status is updated to healthy based on the Rise Count. |
Manager Health Events
NSX Manager health events arise from the NSX Manager node cluster.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Duplicate IP Address | Medium | Manager node's IP address is in use by another device. When event detected: "Manager node {entity_id} IP address {duplicate_ip_address} is currently being used by another device in the network." When event resolved: "Manager node {entity_id} appears to no longer be using {duplicate_ip_address}." |
|
Manager CPU Usage Very High | Critical | Manager node CPU usage is very high. When event detected: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
Manager CPU Usage High | Medium | Starting in NSX-T Data Center 3.0.1. Manager node CPU usage is high. When event detected: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%." When event resolved: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
Manager Memory Usage Very High | Critical | Starting in NSX-T Data Center 3.0.1. Manager node memory usage is very high. When event detected: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
Manager Memory Usage High | Medium | Manager node memory usage is high. When event detected: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%." When event resolved: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
Manager Disk Usage Very High | Critical | Manager node disk usage is very high. When event detected: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%." |
Examine the partition with high usage and see if there are any unexpected large files that can be removed. |
Manager Disk Usage High | Medium | Manager node disk usage is high. When event detected: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%." |
Examine the partition with high usage and see if there are any unexpected large files that can be removed. |
Manager Configuration Disk Usage Very High | Critical | Manager node config disk usage is very high. When event detected: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%. This can be an indication of high disk usage by the NSX Datastore service under the /config/corfu directory." When event resolved: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%." |
Examine the /config partition and see if there are any unexpected large files that can be removed. |
Manager Configuration Disk Usage High | Medium | Manager node config disk usage is high. When event detected: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%. This can be an indication of rising disk usage by the NSX Datastore service under the /config/corfu directory." When event resolved: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%." |
Examine the /config partition and see if there are any unexpected large files that can be removed. |
Operations DB Disk Usage High | Medium | The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%. This can be an indication of rising disk usage by the NSX Datastore service under the /nonconfig/corfu directory. |
Run the following tool and contact GSS if any issues are reported: /opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig. |
Operations DB Disk Usage Very High | Critical | The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%. This can be an indication of rising disk usage by the NSX Datastore service under the /nonconfig/corfu directory. |
Run the following tool and contact GSS if any issues are reported: /opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig. |
NCP Events
NSX Container Plug-in (NCP) events arise from the ESXi and KVM nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
NCP Plugin Down | Critical | Manager Node has detected the NCP is down or unhealthy. When event detected: "Manager node has detected the NCP is down or unhealthy." When event resolved: "Manager Node has detected the NCP is up or healthy again." |
To find the clusters that have issues, invoke the NSX API GET /api/v1/systemhealth/container-cluster/ncp/status to fetch all cluster statuses and determine the name of any cluster that reports DOWN or UNKNOWN. In the NSX UI, locate the clusters that reported DOWN or UNKNOWN status and click the Nodes tab, which lists all Kubernetes and PAS cluster members. See the example query after this table.
For Kubernetes cluster:
For PAS cluster:
|
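The cluster status query quoted in the action above can be issued directly; the endpoint is taken from the action, and the curl form, hostname, and credentials are illustrative.

```
# Fetch NCP status for all container clusters and look for DOWN or UNKNOWN entries.
curl -k -u 'admin:<password>' \
  "https://<nsx-manager>/api/v1/systemhealth/container-cluster/ncp/status"
```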
Node Agents Health Events
Node agent health events arise from the ESXi and KVM nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Node Agents Down | High | The agents running inside the Node VM appear to be down. When event detected: "The agents running inside the node VM appear to be down." When event resolved: "The agents inside the Node VM are running." |
For ESX:
For KVM:
For both ESX and KVM:
|
Password Management Events
Password management events arise from the NSX Manager, NSX Edge, and the public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Password expired | Critical | User password has expired. When event detected: "The password for user {username} has expired." When event resolved: "The password for the user {username} has been changed successfully or is no longer expired." |
The password for the user {username} must be changed now to access the system. For example, to apply a new password to a user, invoke the NSX node user API with a valid password in the request body. See the sketch after this table. |
Password about to expire | High | User password is about to expire. When event detected: "The password for user {username} is about to expire in {password_expiration_days} days." When event resolved: "The password for the user {username} has been changed successfully or is no longer about to expire." |
Ensure the password for the user identified by {username} is changed immediately. For example, to apply a new password to a user, invoke the NSX node user API with a valid password in the request body. |
Password expiration approaching | Medium | User password is approaching expiration. When event detected: "The password for user {username} is about to expire in {password_expiration_days} days." When event resolved: "The password for the user {username} has been changed successfully or is no longer about to expire." |
The password for the user identified by {username} needs to be changed soon. For example, to apply a new password to a user, invoke the NSX node user API with a valid password in the request body. |
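A minimal sketch of the password-change call referenced above, assuming the standard NSX node user API; <userid> is the numeric user ID reported by the node users API, and the exact endpoint and body fields should be verified against the API guide for your release.

```
# Look up the numeric ID of the user whose password must change (assumed endpoint).
curl -k -u 'admin:<password>' "https://<nsx-manager>/api/v1/node/users"

# Apply a new password for that user (assumed endpoint and body fields).
curl -k -u 'admin:<password>' -X PUT "https://<nsx-manager>/api/v1/node/users/<userid>" \
  -H 'Content-Type: application/json' \
  -d '{"password": "<new-password>", "old_password": "<current-password>"}'
```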
Routing Events
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
BGP Down | High | BGP neighbor down. When event detected: "In Router {entity_id}, BGP neighbor {bgp_neighbor_ip} is down, reason: {failure_reason}." When event resolved: "In Router {entity_id}, BGP neighbor {bgp_neighbor_ip} is up." |
|
Bidirectional Forwarding Detection (BFD) Down on External Interface | High | BFD session is down. When event detected: "In router {entity_id}, BFD session for peer {peer_address} is down." When event resolved: "In router {entity_id}, BFD session for peer {peer_address} is up." |
|
Routing Down | High | All BGP/BFD sessions are down. When event detected: "All BGP/BFD sessions are down." When event resolved: "At least one BGP/BFD session is up." |
|
Static Routing Removed | High | Static route removed. When event detected: "In router {entity_id}, static route {static_address} was removed because BFD was down." When event resolved: "In router {entity_id}, static route {static_address} was re-added as BFD recovered." |
|
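A sketch of the BGP checks that typically accompany these alarms, using NSX Edge CLI commands; the VRF workflow shown here is assumed and should be confirmed against the NSX CLI reference for your release.

```
# NSX CLI on the Edge node hosting the tier-0 service router (assumed workflow).
get logical-routers         # find the tier-0 service router and note its VRF ID
vrf <vrf-id>                # enter the VRF context of that service router
get bgp neighbor summary    # check BGP neighbor state and the reported down reason
```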
Transport Node Health
Transport node health events arise from the KVM and ESXi nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
LAG Member Down | Medium | LACP reporting member down. When event detected: "LACP reporting member down." When event resolved: "LACP reporting member up." |
Check the connection status of the LAG members on the hosts. See the sketch after this table. |
N-VDS Uplink Down | Medium | Uplink is going down. When event detected: "Uplink is going down." When event resolved: "Uplink is going up." |
Check the physical NIC status of the uplinks on the hosts. |
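On ESXi transport nodes, the uplink and LAG checks above can start from the following commands; esxcli network nic list is a standard ESXi command, while the LACP status command is assumed and may vary by vSphere version.

```
# On the ESXi host shell (or over SSH).
esxcli network nic list                             # physical NIC link status for uplinks
esxcli network vswitch dvs vmware lacp status get   # LACP/LAG member status (assumed syntax)
```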
VPN Events
VPN events arise from the NSX Edge and public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
IPsec Policy-Based Session Down | Medium | Policy-based IPsec VPN session is down. When event detected: "The policy-based IPsec VPN session {entity_id} is down. Reason: {session_down_reason}." When event resolved: "The policy-based IPsec VPN session {entity_id} is up." |
Check IPsec VPN session configuration and resolve errors based on the session down reason. |
IPsec Route-Based Session Down | Medium | Route-based IPsec VPN session is down. When event detected: "The route-based IPsec VPN session {entity_id} is down. Reason: {session_down_reason}." When event resolved: "The route-based IPsec VPN session {entity_id} is up." |
Check IPsec VPN session configuration and resolve errors based on the session down reason. |
IPsec Policy-Based Tunnel Down | Medium | Policy-based IPsec VPN tunnels are down. When event detected: "One or more policy-based IPsec VPN tunnels in session {entity_id} are down." When event resolved: "All policy-based IPsec VPN tunnels in session {entity_id} are up." |
Check IPsec VPN session configuration and resolve errors based on the tunnel down reason. |
IPsec Route-Based Tunnel Down | Medium | Route-based IPsec VPN tunnels are down. When event detected: "One or more route-based IPsec VPN tunnels in session {entity_id} are down." When event resolved: "All route-based IPsec VPN tunnels in session {entity_id} are up." |
Check IPsec VPN session configuration and resolve errors based on the tunnel down reason. |
L2VPN Session Down | Medium | L2VPN session is down. When event detected: "The L2VPN session {entity_id} is down." When event resolved: "The L2VPN session {entity_id} is up." |
Check IPsec VPN session configuration and resolve errors based on the reason. |
Identity Firewall Events
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Connectivity to AD server | Critical | Connectivity to the AD server is lost. When event detected: Connectivity to Identity Firewall AD server is down. When event resolved: Connectivity to Identity Firewall AD server is up. |
After fixing the connection issue, use TEST CONNECTION in the LDAP server UI to test the connection to the AD server. |
Errors during Delta Sync | Critical | Failure to sync the AD server. When event detected: Failure during selective sync of Identity Firewall AD server: error details. When event resolved: Selective sync errors of Identity Firewall AD server fixed. |
|