The following tables describe the events that trigger alarms, including alarm messages and recommended actions to resolve them. Any event with a severity greater than LOW triggers an alarm.
Alarm Management Events
Alarm management events arise from the NSX Manager and Global Manager nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Alarm Service Overloaded | Critical | The alarm service is overloaded. When event detected: "Due to heavy volume of alarms reported, the alarm service is temporarily overloaded. The NSX UI and GET /api/v1/alarms NSX API are not reporting new alarms. Syslog entries and SNMP traps (if enabled) are still being emitted reporting the underlying event details. When the underlying issues causing the heavy volume of alarms are addressed, the alarm service starts reporting new alarms again." When event resolved: "The heavy volume of alarms has subsided and new alarms are being reported again." |
Review all active alarms using the Alarms page in the NSX UI or using the GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED NSX API. For each active alarm, investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service starts reporting new alarms again. |
Heavy Volume of Alarms | Critical | Heavy volume of a specific alarm type detected. When event detected: "Due to heavy volume of {event_id} alarms, the alarm service has temporarily stopped reporting alarms of this type. The NSX UI and GET /api/v1/alarms NSX API are not reporting new instances of these alarms. Syslog entries and SNMP traps (if enabled) are still being emitted reporting the underlying event details. When the underlying issues causing the heavy volume of {event_id} alarms are addressed, the alarm service starts reporting new {event_id} alarms when new issues are detected again." When event resolved: "The heavy volume of {event_id} alarms has subsided and new alarms of this type are being reported again." |
Review all active alarms using the Alarms page in the NSX UI or using the GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED NSX API. For each active alarm, investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service will start reporting new {event_id} alarms again. |
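For reference, a minimal invocation of the same API with curl; the manager address and the use of basic authentication are placeholders to adapt to your environment:

```
# -k skips certificate verification; prefer a trusted CA certificate in production
curl -k -u admin 'https://nsx-mgr.example.com/api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED'
```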
Audit Log Health Events
Audit log health events arise from the NSX Manager and Global Manager nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Audit Log Health | Critical | At least one of the monitored log files cannot be written to. When event detected: "At least one of the monitored log files has read-only permissions or has incorrect user/group ownership or rsyslog.log is missing on Manager, Global Manager, Edge or Public Cloud Gateway nodes." When event resolved: "All monitored log files have the correct file permissions and ownership and rsyslog.log exists on Manager, Global Manager, Edge or Public Cloud Gateway nodes." |
|
Remote Logging Server Error | Critical | Log messages undeliverable due to incorrect remote logging server configuration. When event detected: "Log messages to logging server {hostname_or_ip_address_with_port} ({entity_id}) cannot be delivered, possibly due to an unresolvable FQDN, an invalid TLS certificate, or a missing NSX appliance iptables rule." When event resolved: "The configuration for logging server {hostname_or_ip_address_with_port} ({entity_id}) appears correct." |
To learn more about how to configure NSX-T Data Center appliances and hypervisors to send log messages to a remote logging server, see Configure Remote Logging. If logs are not received by the remote log server, see Troubleshooting Syslog Issues. |
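As a quick first check, the remote logging configuration can be listed from the affected node's NSX CLI; verify the output matches your intended server, port, protocol, and certificate settings:

```
get logging-servers
```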
Capacity Events
The following events can trigger alarms when the current inventory of certain categories of objects reaches a defined threshold. For more information, see View the Usage and Capacity of Categories of Objects.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Maximum Capacity | Critical | A maximum capacity has been breached. When event detected: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} which is at or above the maximum supported count of {max_supported_capacity_count}." When event resolved: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} and is below the maximum supported count of {max_supported_capacity_count}." |
|
Maximum Capacity Threshold | High | A maximum capacity threshold has been breached. When event detected: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} which is at or above the maximum capacity threshold of {max_capacity_threshold}%." When event resolved: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} and is below the maximum capacity threshold of {max_capacity_threshold}%." |
Navigate to the capacity page in the NSX UI and review current usage versus threshold limits. If the current usage is expected, consider increasing the maximum threshold values. If the current usage is unexpected, review the network policies configured to decrease usage below the maximum threshold. |
Minimum Capacity Threshold | Medium | A minimum capacity threshold has been breached. When event detected: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} which is at or above the minimum capacity threshold of {min_capacity_threshold}%." When event resolved: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} and is below the minimum capacity threshold of {min_capacity_threshold}%." |
Navigate to the capacity page in the NSX UI and review current usage versus threshold limits. If the current usage is expected, consider increasing the minimum threshold values. If the current usage is unexpected, review the network policies configured to decrease usage below the minimum threshold. |
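The same usage and threshold data can be retrieved programmatically. The endpoints below exist in recent NSX-T releases, but confirm them against your version's API reference:

```
GET /api/v1/capacity/usage       # current object counts per category
GET /api/v1/capacity/threshold   # configured threshold percentages
```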
Certificates Events
Certificate events arise from the NSX Manager node.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Certificate Expired | Critical | A certificate has expired. When event detected: "Certificate {entity-id} has expired." When event resolved: "The expired certificate {entity-id} has been removed or is no longer expired." |
Ensure services that are currently using the certificate are updated to use a new, non-expired certificate. For example, apply a new certificate to the HTTP service with the certificate apply API call, where <cert-id> is the ID of a valid certificate reported by the certificate listing API call. After the expired certificate is no longer in use, delete it with the certificate delete API call (see the example calls after this table). |
Certificate Is About To Expire | High | A certificate is about to expire. When event detected: "Certificate {entity-id} is about to expire." When event resolved: "The expiring certificate {entity-id} has been removed or is no longer about to expire." |
Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. For example, apply a new certificate to the HTTP service with the certificate apply API call, where <cert-id> is the ID of a valid certificate reported by the certificate listing API call. After the expiring certificate is no longer in use, delete it with the certificate delete API call (see the example calls after this table). |
Certificate Expiration Approaching | Medium | A certificate is approaching expiration. When event detected: "Certificate {entity-id} is approaching expiration." When event resolved: "The expiring certificate {entity-id} has been removed or is no longer approaching expiration." |
Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. For example, apply a new certificate to the HTTP service with the certificate apply API call, where <cert-id> is the ID of a valid certificate reported by the certificate listing API call. After the expiring certificate is no longer in use, delete it with the certificate delete API call (see the example calls after this table). |
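The API calls referenced in the recommended actions above follow this pattern in recent NSX-T releases; treat the exact paths as something to confirm against your version's API reference:

```
# List certificates to find the ID (<cert-id>) of a valid replacement:
GET /api/v1/trust-management/certificates

# Apply the new certificate to the HTTP service:
POST /api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id>

# After the old certificate is no longer in use, delete it:
DELETE /api/v1/trust-management/certificates/<cert-id>
```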
CNI Health Events
CNI health events arise from the ESXi and KVM nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Hyperbus Manager Connection Down | Medium | Hyperbus cannot communicate with the Manager node. When event detected: "Hyperbus cannot communicate with the Manager node." When event resolved: "Hyperbus can communicate with the Manager node." |
The hyperbus vmkernel interface (vmk50) may be missing. See Knowledge Base article 67432. |
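To confirm whether vmk50 is present, standard ESXi commands (not NSX CLI) can list the vmkernel interfaces on the affected host:

```
esxcfg-vmknic -l                  # list vmkernel NICs; look for vmk50
esxcli network ip interface list  # alternative listing
```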
DHCP Events
DHCP events arise from the NSX Edge and public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Pool Lease Allocation Failed | High | IP addresses in an IP Pool have been exhausted. When event detected: "The addresses in IP Pool {entity_id} of DHCP Server {dhcp_server_id} have been exhausted. The last DHCP request has failed and future requests will fail." When event resolved: "IP Pool {entity_id} of DHCP Server {dhcp_server_id} is no longer exhausted. A lease is successfully allocated to the last DHCP request." |
Review the DHCP pool configuration in the NSX UI or on the edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool. Also review the current active leases on the edge node by invoking the NSX CLI command get dhcp lease. Compare the leases to the number of active VMs. Consider reducing the lease time in the DHCP server configuration if the number of VMs is low compared to the number of active leases. Also consider expanding the pool range for the DHCP server in the NSX UI. |
Pool Overloaded | Medium | An IP Pool is overloaded. When event detected: "DHCP Server {dhcp_server_id} IP Pool {entity_id} usage is approaching exhaustion with {dhcp_pool_usage}% IPs allocated." When event resolved: "The DHCP Server {dhcp_server_id} IP Pool {entity_id} has fallen below the high usage threshold." |
Review the DHCP pool configuration in the NSX UI or on the edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool. Also review the current active leases on the edge node by invoking the NSX CLI command get dhcp lease. Compare the leases to the number of active VMs. Consider reducing the lease time in the DHCP server configuration if the number of VMs is low compared to the number of active leases. Also consider expanding the pool range for the DHCP server in the NSX UI. |
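The review described above, as entered on the edge node's NSX CLI:

```
get dhcp ip-pool   # pool ranges and current usage
get dhcp lease     # active leases, to compare against active VMs
```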
Distributed Firewall Events
Distributed firewall events arise from the NSX Manager or ESXi nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
DFW CPU Usage Very High | Critical | DFW CPU usage is very high. When event detected: "The DFW CPU usage on Transport node {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The DFW CPU usage on Transport node {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%." |
Consider re-balancing the VM workloads on this host to other hosts. Please review the security design for optimization. For example, use the apply-to configuration if the rules are not applicable to the entire datacenter. |
DFW Memory Usage Very High | Critical | DFW memory usage is very high. When event detected: "The DFW memory usage {heap_type} on Transport Node {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The DFW memory usage {heap_type} on Transport Node {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%." |
View the current DFW memory usage by invoking the NSX CLI command get firewall thresholds on the host. Consider re-balancing the workloads on this host to other hosts. |
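The check referenced above, as entered on the host's NSX CLI:

```
get firewall thresholds   # current DFW usage versus configured thresholds
```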
Distributed IDS/IPS Events
Distributed IDS/IPS events arise from the NSX Manager or ESXi nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
NSX IDPS Engine CPU Usage Very High | Critical | NSX-IDPS engine CPU usage has reached 95% or above. When event detected: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is at or above the very high threshold value of 95%." When event resolved: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is below the very high threshold value of 95%." |
Consider re-balancing the VM workloads on this host to other hosts. |
NSX IDPS Engine Down | Critical | NSX IDPS is enabled via NSX Policy and IDPS rules are configured, but the NSX-IDPS engine is down. When event detected: "NSX IDPS is enabled via NSX Policy and IDPS rules are configured, but the NSX-IDPS engine is down." When event resolved: "NSX IDPS is in one of the following states: 1. NSX IDPS is disabled via NSX Policy. 2. The NSX-IDPS engine and vdpi are up, and NSX IDPS is enabled and IDPS rules are configured via NSX Policy." |
|
NSX IDPS Engine Memory Usage Very High | Critical | NSX-IDPS engine memory usage has reached 95% or above. When event detected: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is at or above the very high threshold value of 95%." When event resolved: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is below the very high threshold value of 95%." |
Consider re-balancing the VM workloads on this host to other hosts. |
DNS Events
DNS events arise from the NSX Edge and public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Forwarder Down | High | A DNS forwarder is down. When event detected: "DNS forwarder {entity_id} is not running. This is impacting the identified DNS Forwarder that is currently enabled." When event resolved: "DNS forwarder {entity_id} is running again." |
|
Forwarder Disabled (Note: Alarm deprecated starting from NSX-T Data Center 3.2.) | Low | A DNS forwarder is disabled. When event detected: "DNS forwarder {entity_id} is disabled." When event resolved: "DNS forwarder {entity_id} is enabled." |
|
Edge Events
Edge events arise when there is a mismatch between NSX and the Edge appliance for certain configuration values of an edge transport node.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Edge Node Settings Mismatch | Critical | Edge node settings mismatch. When event detected: "The edge node {entity_id} settings configuration does not match the policy intent configuration. The edge node configuration visible to the user on UI or API is not the same as what is realized. The realized edge node changes made by the user outside of NSX Manager are shown in the details of this alarm and any edits in UI or API will overwrite the realized configuration. Fields that differ for the edge node are listed in the runtime data." When event resolved: "Edge node {entity_id} node settings are consistent with policy intent now." |
Review the node settings of this edge transport node {entity_id}. Perform one of the following actions to resolve the alarm. |
Edge VM vSphere Settings Mismatch | Critical | Edge VM vSphere settings mismatch. When event detected: "The edge node {entity_id} configuration on vSphere does not match the policy intent configuration. The edge node configuration visible to the user on UI or API is not the same as what is realized. The realized edge node changes made by the user outside of NSX Manager are shown in the details of this alarm and any edits in UI or API will overwrite the realized configuration. Fields that differ for the edge node are listed in the runtime data." When event resolved: "Edge node {entity_id} VM vSphere settings are consistent with policy intent now." |
Review the vSphere configuration of this edge transport node {entity_id}. Perform one of the following actions to resolve the alarm. |
Edge Node Settings And vSphere Settings Are Changed | Critical | Edge node settings and vSphere settings are changed. When event detected: "The edge node {entity_id} settings and vSphere configuration have changed and do not match the policy intent configuration. The edge node configuration visible to the user on UI or API is not the same as what is realized. The realized edge node changes made by the user outside of NSX Manager are shown in the details of this alarm and any edits in UI or API will overwrite the realized configuration. Fields that differ for edge node settings and vSphere configuration are listed in the runtime data." When event resolved: "Edge node {entity_id} node settings and vSphere settings are consistent with policy intent now." |
Review the node settings and vSphere configuration of this edge transport node {entity_id}. Perform one of the following actions to resolve the alarm. |
Edge vSphere Location Mismatch | High | Edge vSphere location mismatch. When event detected: "The edge node {entity_id} has been moved using vMotion. The edge node {entity_id} configuration on vSphere does not match the policy intent configuration. The edge node configuration visible to the user on UI or API is not the same as what is realized. The realized edge node changes made by the user outside of NSX Manager are shown in the details of this alarm. Fields that differ for the edge node are listed in the runtime data." When event resolved: "Edge node {entity_id} node vSphere settings are consistent with policy intent now." |
Review the vSphere configuration of this edge transport node {entity_id}. Perform one of the following actions to resolve the alarm. |
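Where the realized state should be re-read into NSX after review, one option in recent NSX-T releases is the transport-node refresh action; confirm the action and parameters against your version's API reference before use:

```
POST /api/v1/transport-nodes/<node-id>?action=refresh_node_configuration&resource_type=EdgeNode
```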
Edge Health Events
Edge health events arise from the NSX Edge and public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Edge CPU Usage Very High | Critical | Edge node CPU usage is very high. When event detected: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this edge node. Consider adjusting the edge appliance form factor size or rebalancing services to other edge nodes for the applicable workload. |
Edge CPU Usage High | Medium | Edge node CPU usage is high. When event detected: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%." When event resolved: "The CPU usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this edge node. Consider adjusting the edge appliance form factor size or rebalancing services to other edge nodes for the applicable workload. |
Edge Datapath Configuration Failure | High | Edge node datapath configuration failed. When event detected: "Failed to enable the datapath on the Edge node after three attempts." When event resolved: "Datapath on the Edge node has been successfully enabled." |
Ensure the edge node connection to the Manager node is healthy. From the edge node NSX CLI, invoke the command get services to check the health of services. If the dataplane service is stopped, invoke the command start service dataplane to restart it. |
Edge Datapath CPU Very High | Critical | Edge node datapath CPU usage is very high. When event detected: "The datapath CPU usage on Edge node {entity-id} has reached {datapath_resource_usage}% which is at or above the very high threshold for at least two minutes." When event resolved: "Datapath CPU usage on Edge node {entity-id} has reduced below the maximum threshold." |
Review the CPU statistics on the edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core. Higher CPU usage is expected with higher packet rates. Consider increasing the edge appliance form factor size and rebalancing services on this edge node to other edge nodes in the same cluster or other edge clusters. |
Edge Datapath CPU Usage High | Medium | Edge node datapath CPU usage is high. When event detected: "The datapath CPU usage on Edge node {entity-id} has reached {datapath_resource_usage}% which is at or above the high threshold for at least two minutes." When event resolved: "The CPU usage on Edge node {entity-id} has reached below the high threshold." |
Review the CPU statistics on the edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core. Higher CPU usage is expected with higher packet rates. Consider increasing the edge appliance form factor size and rebalancing services on this edge node to other edge nodes in the same cluster or other edge clusters. |
Edge Datapath Cryptodrv Down | Critical | Edge node crypto driver is down. When event detected: "Edge node crypto driver {edge_crypto_drv_name} is down." When event resolved: "Edge node crypto driver {edge_crypto_drv_name} is up." |
Upgrade the edge node as needed. |
Edge Datapath Mempool High | Medium | Edge node datapath mempool is high. When event detected: "The datapath mempool usage for {mempool_name} on Edge node {entity-id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%." When event resolved: "The datapath mempool usage for {mempool_name} on Edge node {entity-id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%." |
Log in as the root user and invoke the commands edge-appctl -t /var/run/vmware/edge/dpd.ctl mempool/show and edge-appctl -t /var/run/vmware/edge/dpd.ctl memory/show malloc_heap to check DPDK memory usage. |
Edge Disk Usage Very High | Critical | Edge node disk usage is very high. When event detected: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%." |
Examine the partition with high usage and see if there are any unexpected large files that can be removed. |
Edge Disk Usage High | Medium | Edge node disk usage is high. When event detected: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%." |
Examine the partition with high usage and see if there are any unexpected large files that can be removed. |
Edge Global ARP Table Usage High | Medium | The Edge node global ARP table usage is high. When event detected: "Global ARP table usage on edge node {entity-id} has reached {datapath_resource_usage}% which is above the high threshold for over two minutes." When event resolved: "Global ARP table usage on Edge node {entity-id} has reached below the high threshold." |
|
Edge Memory Usage Very High | Critical | Edge node memory usage is very high. When event detected: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this edge node. Consider adjusting the edge appliance form factor size or rebalancing services to other edge nodes for the applicable workload. |
Edge Memory Usage High | Medium | Edge node memory usage is high. When event detected: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%." When event resolved: "The memory usage on the Edge node {entity-id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this edge node. Consider adjusting the edge appliance form factor size or rebalancing services to other edge nodes for the applicable workload. |
Edge NIC Link Status Down | Critical | Edge node NIC link is down. When event detected: "Edge node NIC {edge_nic_name} link is down." When event resolved: "Edge node NIC {edge_nic_name} link is up." |
On the edge node, confirm if the NIC link is physically down by invoking the NSX CLI command get interfaces. If it is down, verify the cable connection. |
Edge NIC Out of Receive Buffer | Medium | Edge node NIC is out of RX ring buffers temporarily. When event detected: "Edge NIC {edge_nic_name} receive ring buffer has overflowed by {rx_ring_buffer_overflow_percentage}% on Edge node {entity_id}. The missed packet count is {rx_misses} and processed packet count is {rx_processed}." When event resolved: "Edge NIC {edge_nic_name} receive ring buffer usage on Edge node {entity_id} is no longer overflowing." |
|
Edge NIC Out of Transmit Buffer | Critical | Edge node NIC is out of TX ring buffers temporarily. When event detected: "Edge NIC {edge_nic_name} transmit ring buffer has overflowed by {tx_ring_buffer_overflow_percentage}% on Edge node {entity_id}. The missed packet count is {tx_misses} and processed packet count is {tx_processed}." When event resolved: "Edge NIC {edge_nic_name} transmit ring buffer usage on Edge node {entity_id} is no longer overflowing." |
|
Storage Error | Critical | Starting in NSX-T Data Center 3.0.1. One or more disk partitions on the Edge node are in read-only mode. When event detected: "The following disk partitions on the Edge node are in read-only mode: {disk_partition_name}." When event resolved: "The following disk partitions on the Edge node have recovered from read-only mode: {disk_partition_name}." |
Examine the read-only partition to see if reboot resolves the issue or the disk needs to be replaced. Contact GSS for more information. |
Edge Datapath NIC Throughput High | Medium | Edge node datapath NIC throughput is high. When event detected: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is at or above the high threshold value of {nic_throughput_threshold}%." When event resolved: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is below the high threshold value of {nic_throughput_threshold}%." |
Examine the traffic throughput levels on the NIC, and determine whether configuration changes are needed. Run the following command to monitor throughput. get dataplane throughput <seconds> |
Edge Datapath NIC Throughput Very High | Critical | Edge node datapath NIC throughput is very high. When event detected: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is at or above the very high threshold value of {nic_throughput_threshold}%." When event resolved: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is below the very high threshold value of {nic_throughput_threshold}%." |
Examine the traffic throughput levels on the NIC, and determine whether configuration changes are needed. Invoke the following NSX CLI command to monitor throughput. get dataplane throughput <seconds> |
Failure Domain Down | Critical | All members of a failure domain are down. When event detected: "All members of failure domain {transport_node_id} are down." When event resolved: "All members of failure domain {transport_node_id} are reachable." |
|
Datapath Thread Deadlocked | Critical | Edge node's datapath thread is in a deadlock condition. When event detected: "Edge node datapath thread {edge_thread_name} is deadlocked." When event resolved: "Edge node datapath thread {edge_thread_name} is free from deadlock." |
Restart the dataplane service by invoking the following NSX CLI command. restart service dataplane |
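A consolidated sketch of the NSX CLI checks referenced throughout this table, run on the affected edge node; the 30-second sampling interval is only an example:

```
get interfaces                # physical NIC link status
get dataplane cpu stats       # per-core CPU usage and packet rates
get dataplane throughput 30   # NIC throughput, sampled over 30 seconds
get services                  # health of edge services such as dataplane
restart service dataplane     # restart the datapath; expect a brief traffic disruption
```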
Endpoint Protection Events
Endpoint protection events arise from the NSX Manager or ESXi nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
EAM Status Down | Critical | ESX Agent Manager (EAM) service on a compute manager is down. When event detected: "ESX Agent Manager (EAM) service on compute manager {entity_id} is down." When event resolved: "ESX Agent Manager (EAM) service on compute manager {entity_id} is either up or compute manager {entity_id} has been removed." |
Restart the ESX Agent Manager (EAM) service (see the example after this table). |
Partner Channel Down | Critical | Host module and Partner SVM connection is down. When event detected: "The connection between host module and Partner SVM {entity_id} is down." When event resolved: "The connection between host module and Partner SVM {entity_id} is up." |
See Knowledge Base article 2148821 Troubleshooting NSX Guest Introspection and make sure that the Partner SVM identified by {entity_id} is reconnected to the host module. |
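On a vCenter Server Appliance, one common way to restart EAM is through vmon-cli from the appliance shell; this is a sketch, and the correct procedure may differ for your vCenter version:

```
vmon-cli --restart eam   # restart the ESX Agent Manager service
vmon-cli --status eam    # confirm the service is running again
```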
Gateway Firewall Events
Gateway firewall events arise from NSX Edge nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
ICMP Flow Count Exceeded | Critical | Starting in NSX-T Data Center 3.1.3. The gateway firewall flow table for ICMP traffic has exceeded the set threshold. New flows will be dropped by the gateway firewall when usage reaches the maximum limit. When event detected: "Gateway firewall flow table usage for ICMP traffic on logical router {entity_id} has reached {firewall_icmp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit." When event resolved: "Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%." |
|
ICMP Flow Count High | Medium | Starting in NSX-T Data Center 3.1.3. The gateway firewall flow table usage for ICMP traffic is high. New flows will be dropped by the gateway firewall when usage reaches the maximum limit. When event detected: "Gateway firewall flow table usage for ICMP on logical router {entity_id} has reached {firewall_icmp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit." When event resolved: "Gateway firewall flow table usage for ICMP on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%." |
|
IP Flow Count Exceeded | Critical | Starting in NSX-T Data Center 3.1.3. The gateway firewall flow table for IP traffic has exceeded the set threshold. New flows will be dropped by the gateway firewall when usage reaches the maximum limit. When event detected: "Gateway firewall flow table usage for IP traffic on logical router {entity_id} has reached {firewall_ip_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit." When event resolved: "Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%." |
|
IP Flow Count High | Medium | Starting in NSX-T Data Center 3.1.3. The gateway firewall flow table usage for IP traffic is high. New flows will be dropped by the gateway firewall when usage reaches the maximum limit. When event detected: "Gateway firewall flow table usage for IP on logical router {entity_id} has reached {firewall_ip_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit." When event resolved: "Gateway firewall flow table usage for IP on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%." |
|
TCP Flow Count Exceeded | Critical | Starting in NSX-T Data Center 3.1.3. The gateway firewall flow table for TCP half-open traffic has exceeded the set threshold. New flows will be dropped by the gateway firewall when usage reaches the maximum limit. When event detected: "Gateway firewall flow table usage for TCP half-open traffic on logical router {entity_id} has reached {firewall_halfopen_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit." When event resolved: "Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%." |
|
TCP Flow Count High | Medium | Starting in NSX-T Data Center 3.1.3. The gateway firewall flow table usage for TCP half-open traffic is high. New flows will be dropped by the gateway firewall when usage reaches the maximum limit. When event detected: "Gateway firewall flow table usage for TCP half-open on logical router {entity_id} has reached {firewall_halfopen_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit." When event resolved: "Gateway firewall flow table usage for TCP half-open on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%." |
|
UDP Flow Count Exceeded | Critical | Starting in NSX-T Data Center 3.1.3. The gateway firewall flow table for UDP traffic has exceeded the set threshold. New flows will be dropped by the gateway firewall when usage reaches the maximum limit. When event detected: "Gateway firewall flow table usage for UDP traffic on logical router {entity_id} has reached {firewall_udp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit." When event resolved: "Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold." |
|
UDP Flow Count High | Medium | Starting in NSX-T Data Center 3.1.3. The gateway firewall flow table usage for UDP traffic is high. New flows will be dropped by the gateway firewall when usage reaches the maximum limit. When event detected: "Gateway firewall flow table usage for UDP on logical router {entity_id} has reached {firewall_udp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by the gateway firewall when usage reaches the maximum limit." When event resolved: "Gateway firewall flow table usage for UDP on logical router {entity_id} has reached below the high threshold." |
|
High Availability Events
High availability events arise from the NSX Edge and public cloud gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Tier0 Gateway Failover | High | A tier0 gateway has failed over. When event detected: "The tier0 gateway {entity-id} has failed over from {previous_gateway_state} to {current_gateway_state}." When event resolved: "The tier0 gateway {entity-id} is now up." |
|
Tier1 Gateway Failover | High | A tier1 gateway has failed over. When event detected: "The tier1 gateway {entity_id} has failed over from {previous_gateway_state} to {current_gateway_state}, service-router {service_router_id}." When event resolved: "The tier1 gateway {entity-id} is now up." |
|
Identity Firewall Events
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Connectivity to LDAP Server Lost | Critical | Connectivity to the LDAP server is lost. When event detected: "The connectivity to LDAP server {ldap_server} is lost." When event resolved: "The connectivity to LDAP server {ldap_server} is restored." |
Perform the following steps to check the LDAP server connectivity (a sketch of typical checks appears after this table):
After the issue is fixed, use TEST CONNECTION in NSX UI under Identity Firewall AD to test the connection. |
Error In Delta Sync | Critical | Errors occurred while performing delta sync. When event detected: "Errors occurred while performing delta sync with {directory_domain}." When event resolved: "No errors occurred while performing delta sync with {directory_domain}." |
|
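A minimal sketch of typical LDAP connectivity checks; the host name is a placeholder, 389 and 636 are the standard LDAP and LDAPS ports, and tool availability on your appliance may vary:

```
ping ldap.example.com                            # basic reachability
nc -zv ldap.example.com 636                      # TCP reachability of the LDAPS port
openssl s_client -connect ldap.example.com:636   # inspect the TLS certificate
```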
Infrastructure Communication Events
Infrastructure communication events arise from the NSX Edge, KVM, ESXi, and public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Edge Tunnels Down | Critical | An Edge node's tunnel status is down. When event detected: "The overall tunnel status of Edge node {entity_id} is down." When event resolved: "The tunnels of Edge node {entity_id} have been restored." |
|
Intelligence Communication Events
NSX Intelligence communication events arise from the NSX Manager node, ESXi node, and NSX Intelligence appliance.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Transport Node Flow Exporter Disconnected | High | A Transport node is disconnected from its Intelligence node's messaging broker. Data collection is affected. When event detected: "The flow exporter on Transport node {entity-id} is disconnected from the Intelligence node's messaging broker. Data collection is affected." When event resolved: "The flow exporter on Transport node {entity-id} has reconnected to the Intelligence node's messaging broker." |
|
Control Channel to Transport Node Down | Medium | Controller service to transport node's connection is down. When event detected: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) to Transport node {entity_id} down for at least three minutes from Controller service's point of view." When event resolved: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) restores connection to Transport node {entity_id}." |
|
Control Channel to Transport Node Down Long | Critical | Controller service to Transport node's connection is down for too long. When event detected: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) to Transport node {entity_id} down for at least 15 minutes from Controller service's point of view." When event resolved: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) restores connection to Transport node {entity_id}." |
|
Control Channel To Manager Node Down | Medium | Transport node's control plane connection to the Manager node is down. When event detected: "The Transport node {entity_id} control plane connection to Manager node {appliance_address} is down for at least {timeout_in_minutes} minutes from the Transport node's point of view." When event resolved: "The Transport node {entity_id} restores the control plane connection to Manager node {appliance_address}." |
|
Control Channel To Manager Node Down Too Long | Critical | Transport node's control plane connection to the Manager node is down for too long. When event detected: "The Transport node {entity_id} control plane connection to Manager node {appliance_address} is down for at least {timeout_in_minutes} minutes from the Transport node's point of view." When event resolved: "The Transport node {entity_id} restores the control plane connection to Manager node {appliance_address}." |
|
Management Channel To Transport Node Down | Medium | Management channel to Transport node is down. When event detected: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is down for 5 minutes." When event resolved: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is up." |
|
Management Channel To Transport Node Down Long | Critical | Management channel to Transport node is down for too long. When event detected: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is down for 15 minutes." When event resolved: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is up." |
|
Manager Cluster Latency High | Medium | The average network latency between Manager nodes is high. When event detected: "The average network latency between Manager nodes {manager_node_id} ({appliance_address}) and {remote_manager_node_id} ({remote_appliance_address}) is more than 10ms for the last 5 minutes." When event resolved: "The average network latency between Manager nodes {manager_node_id} ({appliance_address}) and {remote_manager_node_id} ({remote_appliance_address}) is within 10ms." |
Ensure there are no firewall rules blocking ping traffic between manager nodes. If there are other high bandwidth servers and applications sharing the local network, consider moving these to a different network. |
Manager Control Channel Down | Critical | Manager to controller channel is down. When event detected: "The communication between the management function and the control function has failed on Manager node {manager_node_name} ({appliance_address})." When event resolved: "The communication between the management function and the control function has been restored on Manager node {manager_node_name} ({appliance_address})." |
On the manager node {manager_node_name} ({appliance_address}), invoke the following two NSX CLI commands: restart service mgmt-plane-bus, then restart service manager (see the example after this table). |
Manager FQDN Lookup Failure | Critical | DNS lookup failed for the Manager node's FQDN. When event detected: "DNS lookup failed for Manager node {entity_id} with FQDN {appliance_fqdn} and the publish_fqdns flag was set." When event resolved: "FQDN lookup succeeded for Manager node {entity_id} with FQDN {appliance_fqdn} or the publish_fqdns flag was cleared." |
|
Manager FQDN Reverse Lookup Failure | Critical | Reverse DNS lookup failed for the Manager node's IP address. When event detected: "Reverse DNS lookup failed for Manager node {entity_id} with IP address {appliance_address} and the publish_fqdns flag was set." When event resolved: "Reverse DNS lookup succeeded for Manager node {entity_id} with IP address {appliance_address} or the publish_fqdns flag was cleared." |
|
Management Channel To Manager Node Down | Medium | Management channel to Manager node is down. When event detected: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is down for 5 minutes." When event resolved: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is up." |
|
Management Channel To Manager Node Down Long | Critical | Management channel to Manager node is down for too long. When event detected: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is down for 15 minutes." When event resolved: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is up." |
|
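The restart sequence mentioned for Manager Control Channel Down, as entered on the affected manager node's NSX CLI:

```
restart service mgmt-plane-bus
restart service manager
```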
Infrastructure Service Events
Infrastructure service events arise from the NSX Edge and public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Edge Service Status Down (Note: Alarm deprecated starting from NSX-T Data Center 3.2.) | Critical | Edge service is down for at least one minute. If the View Runtime Details link is available, you can click the link to view the reason the service is down. When event detected: "The service {edge_service_name} is down for at least one minute." When event resolved: "The service {edge_service_name} is up." |
|
Edge Service Status Changed | Medium | Edge service status has changed. If the View Runtime Details link is available, you can click the link to view the reason the service is down. When event detected: "The service {edge_service_name} changed from {previous_service_state} to {current_service_state}." When event resolved: "The service {edge_service_name} changed from {previous_service_state} to {current_service_state}." |
|
Intelligence Health Events
NSX Intelligence health events arise from the NSX Manager node and NSX Intelligence appliance.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
CPU Usage Very High | Critical | Intelligence node CPU usage is very high. When event detected: "The CPU usage on NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%." When event resolved: "The CPU usage on NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%." |
Use the top command to check which processes have the most CPU usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved. |
CPU Usage High | Medium | Intelligence node CPU usage is high. When event detected: "The CPU usage on NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%." When event resolved: "The CPU usage on NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%." |
Use the top command to check which processes have the most CPU usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved. |
Memory Usage Very High | Critical | Intelligence node memory usage is very high. When event detected: "The memory usage on NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%." When event resolved: "The memory usage on NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%." |
Use the top command to check which processes have the most memory usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved. |
Memory Usage High | Medium | Intelligence node memory usage is high. When event detected: "The memory usage on NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%." When event resolved: "The memory usage on NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%." |
Use the top command to check which processes have the most memory usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved. |
Disk Usage Very High | Critical | Intelligence node disk usage is very high. When event detected: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%." |
Examine disk partition {disk_partition_name} and see if there are any unexpected large files that can be removed. |
Disk Usage High | Medium | Intelligence node disk usage is high. When event detected: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage of disk partition {disk_partition_name} on the NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%." |
Examine disk partition {disk_partition_name} and see if there are any unexpected large files that can be removed. |
Data Disk Partition Usage Very High | Critical | Intelligence node data disk partition usage is very high. When event detected: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%." |
Stop NSX Intelligence data collection until the disk usage is below the threshold. In the NSX UI, navigate to System > Appliances > NSX Intelligence Appliance, then select the option to stop data collection. |
Data Disk Partition Usage High | Medium | Intelligence node data disk partition usage is high. When event detected: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage of disk partition /data on NSX Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%." |
Stop NSX Intelligence data collection until the disk usage is below the threshold. Examine the /data partition and see if there are any unexpected large files that can be removed. |
Node Status Degraded | High | Intelligence node status is degraded. When event detected: "Service {service_name} on NSX Intelligence node {intelligence_node_id} is not running." When event resolved: "Service {service_name} on NSX Intelligence node {intelligence_node_id} is running properly." |
Examine service status and health information with the NSX CLI command get services on the NSX Intelligence node. Restart unexpectedly stopped services with the NSX CLI command restart service <service-name>. |
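The commands above, as entered on the NSX Intelligence node; the service name is a placeholder:

```
get services                     # status and health of all services
restart service <service-name>   # restart an unexpectedly stopped service
```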
IP Address Management Events
IP address management (IPAM) events arise from the NSX Manager nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
IP Block Usage Very High | Medium | Starting in NSX-T Data Center 3.1.2. IP subnet usage of an IP block has reached 90%. When event detected: "IP block usage of <intent_path> is very high. The IP block is nearing its total capacity; creation of a subnet using this IP block might fail." When event resolved: No message. |
Note: Delete an IP pool or subnet only if it does not have any allocated IPs and if it will not be used in the future. |
IP Pool Usage Very High | Medium | Starting in NSX-T Data Center 3.1.2. IP allocation usage of an IP pool has reached 90%. When event detected: "IP pool usage of <intent_path> is very high. The IP pool is nearing its total capacity; creation of an entity or service that depends on an IP being allocated from the IP pool might fail." When event resolved: No message. |
Review IP pool usage. Release unused IP allocations from the IP pool, or create a new IP pool. To release unused IP allocations, invoke the NSX API that deletes an IP allocation (a sketch follows this table). |
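A sketch of releasing an allocation through the Policy API, assuming allocations were created under the pool's ip-allocations child path; confirm the exact paths in your version's API reference:

```
# List allocations under the pool (path segments are illustrative):
GET /policy/api/v1/infra/ip-pools/<ip-pool-id>/ip-allocations

# Release an unused allocation by deleting it:
DELETE /policy/api/v1/infra/ip-pools/<ip-pool-id>/ip-allocations/<allocation-id>
```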
License Events
License events arise from the NSX Manager node.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
License Expired | Critical | A license has expired. When event detected: "The license of type {license_edition_type} has expired." When event resolved: "The expired license of type {license_edition_type} has been removed, updated, or is no longer expired." |
Add a new, non-expired license (see the example after this table). |
License Is About to Expire | Medium | A license is about to expire. When event detected: "The license of type {license_edition_type} is about to expire." When event resolved: "The expiring license identified by {license_edition_type} has been removed, updated, or is no longer about to expire." |
Add a new, non-expired license (see the example after this table). |
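A license can be added in the NSX UI or via the licenses API; the request below uses a placeholder key, and the endpoint should be confirmed against your version's API reference:

```
POST /api/v1/licenses
{
  "license_key": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX"
}
```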
Load Balancer Events
Load balancer events arise from NSX Edge nodes or from NSX Manager nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
LB CPU Very High | Medium | Load balancer CPU usage is very high. When event detected: "The CPU usage of load balancer {entity_id} is very high. The threshold is {system_usage_threshold}%." When event resolved: "The CPU usage of load balancer {entity_id} is low enough. The threshold is {system_usage_threshold}%." |
If the load balancer CPU utilization is higher than the system usage threshold, the workload is too high for this load balancer. Rescale the load balancer service by changing the load balancer size from small to medium or from medium to large. If the CPU utilization of this load balancer is still high, consider adjusting the edge appliance form factor size or moving load balancer services to other edge nodes for the applicable workload. |
LB Status Down | Critical | Centralized load balancer service is down. When event detected: "The centralized load balancer service {entity_id} is down." When event resolved: "The centralized load balancer service {entity_id} is up." |
|
Virtual Server Status Down | Medium | Load balancer virtual service is down. When event detected: "The load balancer virtual server {entity_id} is down." When event resolved: "The load balancer virtual server {entity_id} is up." |
Consult the load balancer pool to determine its status and verify its configuration. If it is incorrectly configured, reconfigure it, remove the load balancer pool from the virtual server, and then re-add it to the virtual server. |
Pool Status Down | Medium | Load balancer pool is down. When event detected: "The load balancer pool {entity_id} status is down." When event resolved: "The load balancer pool {entity_id} status is up." |
When the health of the member is established, the pool member status is updated to healthy based on the 'Rise Count' configuration in the monitor. |
LB Status Degraded | Medium | Starting in NSX-T Data Center 3.1.2. Load balancer service is degraded. When event detected: "The load balancer service {entity_id} is degraded." When event resolved: "The load balancer service {entity_id} is not degraded." |
|
DLB Status Down | Critical | Starting in NSX-T Data Center 3.1.2. Distributed load balancer service is down. When event detected: "The distributed load balancer service {entity_id} is down." When event resolved: "The distributed load balancer service {entity_id} is up." |
|
LB Edge Capacity In Use High | Medium | Starting in NSX-T Data Center 3.1.2. Load balancer usage is high. When event detected: "The usage of load balancer service in Edge node {entity_id} is high. The threshold is {system_usage_threshold}%." When event resolved: "The usage of load balancer service in Edge node {entity_id} is low enough. The threshold is {system_usage_threshold}%." |
If multiple LB instances have been configured on this edge node, deploy a new edge node and move some LB instances to it. If only a single LB instance of a given size (small, medium, and so on) has been configured on an edge node of the same size, deploy a new edge node of a bigger size and move the LB instance to it. |
LB Pool Member Capacity In Use Very High | Critical | Starting in NSX-T Data Center 3.1.2. Load balancer pool member usage is very high. When event detected: "The usage of pool members in Edge node {entity_id} is very high. The threshold is {system_usage_threshold}%." When event resolved: "The usage of pool members in Edge node {entity_id} is low enough. The threshold is {system_usage_threshold}%." |
Deploy a new edge node and move the load balancer service from existing edge nodes to the newly deployed edge node. |
Load Balancing Configuration Not Realized Due To Lack Of Memory | Medium | Load balancer configuration is not realized due to high memory usage on the edge node. When event detected: "The load balancer configuration {entity_id} is not realized, due to high memory usage on Edge node {transport_node_id}." When event resolved: "The load balancer configuration {entity_id} is realized on {transport_node_id}." |
|
Manager Health Events
NSX Manager health events arise from the NSX Manager node cluster.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Duplicate IP Address | Medium | Manager node's IP address is in use by another device. When event detected: "Manager node {entity_id} IP address {duplicate_ip_address} is currently being used by another device in the network." When event resolved: "The device using the IP address assigned to Manager node {entity_id} appears to no longer be using {duplicate_ip_address}." |
|
Manager CPU Usage Very High | Critical | Manager node CPU usage is very high. When event detected: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
Manager CPU Usage High | Medium | Starting in NSX-T Data Center 3.0.1. Manager node CPU usage is high. When event detected: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%." When event resolved: "The CPU usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
Manager Memory Usage Very High | Critical | Starting in NSX-T Data Center 3.0.1. Manager node memory usage is very high. When event detected: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
Manager Memory Usage High | Medium | Manager node memory usage is high. When event detected: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%." When event resolved: "The memory usage on the Manager node {entity_id} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%." |
Please review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
Manager Disk Usage Very High | Critical | Manager node disk usage is very high. When event detected: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%." |
Examine the partition with high usage and see if there are any unexpected large files that can be removed (see the sketch after this table). |
Manager Disk Usage High | Medium | Manager node disk usage is high. When event detected: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%." When event resolved: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%." |
Examine the partition with high usage and see if there are any unexpected large files that can be removed (see the sketch after this table). |
Manager Config Disk Usage Very High |
Critical | Manager node config disk usage is very high. When event detected: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%. This can be an indication of high disk usage by the NSX Datastore service under the /config/corfu directory." When event resolved: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%." |
Please run /opt/vmware/tools/support/inspect_checkpoint_issues.py and contact GSS if any issues are reported (see the sketch after this table). |
Manager Config Disk Usage High | Medium | Manager node config disk usage is high. When event detected: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%. This can be an indication of rising disk usage by the NSX Datastore service under the /config/corfu directory." When event resolved: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%." |
Examine the /config partition and see if there are any unexpected large files that can be removed. |
Operations DB Disk Usage High |
Medium | Manager node nonconfig disk usage is high. When event detected: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}%, which is at or above the high threshold value of {system_usage_threshold}%. This can be an indication of rising disk usage by the NSX Datastore service under the /nonconfig/corfu directory." When event resolved: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}%, which is below the high threshold value of {system_usage_threshold}%." |
Please run /opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig and contact GSS if any issues are reported (see the sketch after this table). |
Operations DB Disk Usage Very High | Critical | Manager node nonconfig disk usage is very high. When event detected: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}%, which is at or above the very high threshold value of {system_usage_threshold}%. This can be an indication of high disk usage by the NSX Datastore service under the /nonconfig/corfu directory." When event resolved: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}%, which is below the very high threshold value of {system_usage_threshold}%." |
Please run /opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig and contact GSS if any issues are reported. |
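The disk-related recommended actions above can be carried out from a root shell on the Manager node. A minimal sketch follows; the partition names come from the alarm text, while the 100M size threshold is an arbitrary illustration:

```
# Check current usage of the partitions named in the alarms.
df -h /config /nonconfig

# Look for unexpectedly large files on an affected partition
# (the 100M threshold is an illustrative assumption).
find /config -xdev -type f -size +100M -exec ls -lh {} \;

# For the /config and /nonconfig datastore alarms, run the inspection tool
# from the recommended action and contact GSS if it reports any issues.
/opt/vmware/tools/support/inspect_checkpoint_issues.py
/opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig
```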
NCP Events
NSX Container Plug-in (NCP) events arise from the ESXi and KVM nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
NCP Plugin Down | Critical | Manager Node has detected the NCP is down or unhealthy. When event detected: "Manager Node has detected the NCP is down or unhealthy." When event resolved: "Manager Node has detected the NCP is up or healthy again." |
See the sketch after this table. |
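When NCP is reported down, a reasonable first check is whether the NCP pod is running in the container cluster. A hedged sketch, assuming a Kubernetes deployment with NCP installed in the nsx-system namespace (the namespace and deployment name are assumptions; adjust them for your installation):

```
# Check NCP pod status (the nsx-system namespace is an assumption).
kubectl get pods -n nsx-system

# Review recent NCP logs for errors (the deployment name is an assumption).
kubectl logs -n nsx-system deployment/nsx-ncp --tail=50
```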
Node Agents Health Events
Node agent health events arise from the ESXi and KVM nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Node Agents Down | High | The agents running inside the Node VM appear to be down. When event detected: "The agents running inside the node VM appear to be down." When event resolved: "The agents inside the Node VM are running." |
For ESX:
For KVM:
For both ESX and KVM:
See the sketch after this table. |
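The platform-specific steps above are not reproduced here. As a hedged starting point, verify that the NSX node agents are running on the hypervisor; the service name below (nsx-opsagent) is an assumption and may differ by release:

```
# On an ESXi host (service name is an assumption; list /etc/init.d to confirm):
/etc/init.d/nsx-opsagent status

# On a KVM host, check the same service through systemd:
systemctl status nsx-opsagent
```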
NSX Federation Events
NSX Federation events arise from the NSX Manager, NSX Edge, and the public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
GM To GM Latency Warning |
Medium | Latency between Global Managers is higher than expected for more than 2 minutes. When event detected: "Latency is higher than expected between Global Managers {from_gm_path} and {to_gm_path}." When event resolved: "Latency is below expected levels between Global Managers {from_gm_path} and {to_gm_path}." |
Check the connectivity from Global Manager {from_gm_path}({site_id}) to Global Manager {to_gm_path}({remote_site_id}) using ping (see the sketch after this table). If they are not pingable, check for flakiness in WAN connectivity. |
GM To GM Synchronization Error |
High | Active Global Manager to Standby Global Manager cannot synchronize for more than 5 minutes. When event detected: "Active Global Manager {from_gm_path} to Standby Global Manager {to_gm_path} cannot synchronize for more than 5 minutes." When event resolved: "Synchronization from active Global Manager {from_gm_path} to standby {to_gm_path} is healthy." |
Check the connectivity from Global Manager {from_gm_path}({site_id}) to Global Manager {to_gm_path}({remote_site_id}) using ping. |
GM To GM Synchronization Warning |
Medium | Active Global Manager to Standby Global Manager cannot synchronize. When event detected: "Active Global Manager {from_gm_path} to Standby Global Manager {to_gm_path} cannot synchronize." When event resolved: "Synchronization from active Global Manager {from_gm_path} to Standby Global Manager {to_gm_path} is healthy." |
Check the connectivity from Global Manager {from_gm_path}({site_id}) to Global Manager {to_gm_path}({remote_site_id}) using ping. |
LM to LM Synchronization Error |
High | Starting in NSX-T Data Center 3.0.1. Synchronization between remote locations failed for more than 5 minutes. When event detected: "The synchronization between {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for more than 5 minutes." When event resolved: "Remote sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized." |
|
LM to LM Synchronization Warning | Medium | Starting in NSX-T Data Center 3.0.1. Synchronization between remote locations failed. When event detected: "The synchronization between {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed." When event resolved: "Remote locations {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized." |
|
RTEP BGP Down | High | Starting in NSX-T Data Center 3.0.1. RTEP BGP neighbor down. When event detected: "RTEP (Remote Tunnel Endpoint) BGP session from source IP {bgp_source_ip} to remote location {remote_site_name} neighbor IP {bgp_neighbor_ip} is down." When event resolved: "RTEP (Remote Tunnel Endpoint) BGP session from source IP {bgp_source_ip} to remote location {remote_site_name} neighbor IP {bgp_neighbor_ip} is established." |
|
GM To LM Synchronization Warning |
Medium | Data synchronization between Global Manager (GM) and Local Manager (LM) failed. When event detected: "Data synchronization between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for the {flow_identifier}." When event resolved: "Sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized for {flow_identifier}." |
|
GM To LM Synchronization Error |
High | Data synchronization between Global Manager (GM) and Local Manager (LM) failed for an extended period. When event detected: "Data synchronization between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for the {flow_identifier} for an extended period." When event resolved: "Sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized for {flow_identifier}." |
|
Queue Occupancy Threshold Exceeded |
Medium | Queue occupancy size threshold exceeded warning. When event detected: "Queue ({queue_name}) used for syncing data between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached size {queue_size} which is at or above the maximum threshold of {queue_size_threshold}%." When event resolved: "Queue ({queue_name}) used for syncing data between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached size {queue_size} which is below the maximum threshold of {queue_size_threshold}%." |
Queue size can exceed the threshold due to a communication issue with the remote site or an overloaded system. Check system performance and /var/log/async-replicator/ar.log for any reported errors (see the sketch after this table). |
GM To LM Latency Warning | Medium | Latency between the Global Manager and Local Manager is higher than expected for more than 2 minutes. When event detected: "Latency between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached {latency_value} which is above the threshold value of {latency_threshold}." When event resolved: "Latency between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached {latency_value} which is below the threshold value of {latency_threshold}." |
|
Cluster Degraded |
Medium | Group member is down. When event detected: "Group member {manager_node_id} of service {group_type} is down." When event resolved: "Group member {manager_node_id} of {group_type} is up." |
|
Cluster Unavailable |
High | All the group members of the service are down. When event detected: "All group members {manager_node_ids} of service {group_type} are down." When event resolved: "All group members {manager_node_ids} of service {group_type} are up." |
|
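For the connectivity and synchronization alarms in this table, a quick first pass is a reachability check followed by a scan of the async-replicator log cited in the Queue Occupancy action. A minimal sketch; the peer Manager address is a placeholder:

```
# Verify the remote Global/Local Manager is reachable
# (peer-gm.example.com is a placeholder for the address in {to_gm_path}).
ping -c 5 peer-gm.example.com

# Scan the async-replicator log for reported errors.
grep -i error /var/log/async-replicator/ar.log | tail -n 20
```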
Password Management Events
Password management events arise from the NSX Manager, NSX Edge, and the public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
Password Expired | Critical | User password has expired. When event detected: "The password for user {username} has expired." When event resolved: "The password for the user {username} has been changed successfully or is no longer expired." |
The password for the user {username} must be changed now to access the system. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body: PUT /api/v1/node/users/<userid>
where <userid> is the ID of the user mentioned above (see the sketch after this table). |
Password About To Expire | High | User password is about to expire. When event detected: "The password for user {username} is about to expire in {password_expiration_days} days." When event resolved: "The password for the user {username} has been changed successfully or is no longer about to expire." |
Ensure the password for the user identified by {username} is changed immediately. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body: PUT /api/v1/node/users/<userid>
where <userid> is the ID of the user mentioned above. |
Password Expiration Approaching | Medium | User password is approaching expiration. When event detected: "The password for user {username} is about to expire in {password_expiration_days} days." When event resolved: "The password for the user {username} has been changed successfully or is no longer about to expire." |
The password for the user identified by {username} needs to be changed soon. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body: PUT /api/v1/node/users/<userid>
where <userid> is the ID of the user mentioned above. |
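As a concrete illustration of the API call referenced above, the following curl invocation applies a new password to a node user. This is a sketch: the Manager address, credentials, user ID, and new password are all placeholder assumptions:

```
# Apply a new password to node user 10002 (all values are placeholders).
curl -ks -u 'admin:CurrentPassw0rd!' -X PUT \
  https://nsx-mgr.example.com/api/v1/node/users/10002 \
  -H 'Content-Type: application/json' \
  -d '{"password": "NewStr0ngPassw0rd!"}'
```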
Routing Events
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
BGP Down | High | BGP neighbor down. When event detected: "In Router {entity_id}, BGP neighbor {bgp_neighbor_ip} is down, reason: {failure_reason}." When event resolved: "In Router {entity_id}, BGP neighbor {bgp_neighbor_ip} is up." |
See the sketch after this table. |
BFD Down On External Interface |
High | BFD session is down. When event detected: "In router {entity_id}, BFD session for peer {peer_address} is down." When event resolved: "In router {entity_id}, BFD session for peer {peer_address} is up." |
|
Routing Down | High | All BGP/BFD sessions are down. When event detected: "All BGP/BFD sessions are down." When event resolved: "At least one BGP/BFD session is up." |
|
Static Routing Removed | High | Static route removed. When event detected: "In router {entity_id}, static route {static_address} was removed because BFD was down." When event resolved: "In router {entity_id}, static route {static_address} was re-added as BFD recovered." |
|
MTU Mismatch Within Transport Zone | High | MTU configuration mismatch between Transport Nodes (ESXi, KVM, and Edge) attached to the same Transport Zone. Inconsistent MTU values on the switches attached to the same Transport Zone cause connectivity issues. |
|
Global Router MTU Too Big | Medium | The global router MTU configuration is bigger than the MTU of the switches in the overlay Transport Zone that connects to the Tier-0 or Tier-1 gateway. The global router MTU value should be at least 100 less than the MTU value of all switches, because 100 bytes of headroom are required for Geneve encapsulation. For example, if the switch MTU is 1600, the global router MTU should be at most 1500. |
|
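For the BGP and BFD alarms in this table, session state can be inspected from the NSX Edge node CLI. A hedged sketch; the VRF number is a placeholder and exact command availability can vary by release:

```
# On the NSX Edge node CLI: find the service router and enter its VRF context.
get logical-routers
vrf 1

# Check BGP neighbor state and BFD sessions (command forms assumed
# from the NSX Edge CLI; verify in your release).
get bgp neighbor summary
get bfd-sessions
```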
Transport Node Health
Transport node health events arise from the KVM and ESXi nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
LAG Member Down | Medium | LACP reporting member down. When event detected: "LACP reporting member down." When event resolved: "LACP reporting member up." |
Check the connection status of LAG members on hosts (see the sketch after this table).
|
Transport Node Uplink Down |
Medium | Uplink is going down. When event detected: "Uplink is going down." When event resolved: "Uplink is going up." |
Check the physical NIC status of the uplinks on hosts (see the sketch after this table).
|
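On ESXi hosts, the physical NIC and LACP checks above can be performed with esxcli. A minimal sketch; the lacp namespace invocation is an assumption and may vary by ESXi version:

```
# List physical NICs and their link status on the ESXi host.
esxcli network nic list

# Check LACP status for LAG members (invocation assumed; verify on your host).
esxcli network vswitch dvs vmware lacp status get
```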
VPN Events
VPN events arise from the NSX Edge and public gateway nodes.
Event Name | Severity | Alert Message | Recommended Action |
---|---|---|---|
IPsec Policy-Based Session Down | Medium | Policy-based IPsec VPN session is down. When event detected: "The policy-based IPsec VPN session {entity_id} is down. Reason: {session_down_reason}." When event resolved: "The policy-based IPsec VPN session {entity_id} is up." |
Check the IPsec VPN session configuration and resolve errors based on the session down reason (see the sketch after this table). |
IPsec Route-Based Session Down | Medium | Route-based IPsec VPN session is down. When event detected: "The route-based IPsec VPN session {entity_id} is down. Reason: {session_down_reason}." When event resolved: "The route-based IPsec VPN session {entity_id} is up." |
Check IPsec VPN session configuration and resolve errors based on the session down reason. |
IPsec Policy-Based Tunnel Down | Medium | Policy-based IPsec VPN tunnels are down. When event detected: "One or more policy-based IPsec VPN tunnels in session {entity_id} are down." When event resolved: "All policy-based IPsec VPN tunnels in session {entity_id} are up." |
Check IPsec VPN session configuration and resolve errors based on the tunnel down reason. |
IPsec Route-Based Tunnel Down | Medium | Route-based IPsec VPN tunnels are down. When event detected: "One or more route-based IPsec VPN tunnels in session {entity_id} are down." When event resolved: "All route-based IPsec VPN tunnels in session {entity_id} are up." |
Check IPsec VPN session configuration and resolve errors based on the tunnel down reason. |
L2VPN Session Down | Medium | L2VPN session is down. When event detected: "The L2VPN session {entity_id} is down." When event resolved: "The L2VPN session {entity_id} is up." |
Check IPsec VPN session configuration and resolve errors based on the reason. |
IPsec Service Down |
Medium | IPsec service is down. To view the reason why the service is down, click the View Runtime Details link. When event detected: "The IPsec service {entity_id} is down." When event resolved: "The IPsec service {entity_id} is up." |
|
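For the IPsec alarms in this table, the session state and down reason can also be inspected from the NSX Edge node CLI; the View Runtime Details link in the UI surfaces the same information. A hedged sketch; the exact command form is an assumption and varies by release:

```
# On the NSX Edge node CLI (command form is an assumption; use tab
# completion or `list` to confirm the syntax in your release).
get ipsecvpn session summary
```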