NSX Event Catalog
The following tables describe events that trigger alarms in VMware NSX™, including alarm messages and recommended actions to resolve them. Any event with a severity greater than LOW triggers an alarm. Alarm information is displayed in several locations within the NSX Manager interface. Alarm and event information is also included with other notifications in the Notifications drop-down menu in the title bar. To view alarms, navigate to the Home page and click the Alarms tab. For more information on alarms and events, see "Working with Events and Alarms" in the NSX Administration Guide.
Alarm Management Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Alarm Service Overloaded | Critical | global-manager, manager | The alarm service is overloaded. |
Review all active alarms using the Alarms page in the NSX UI or using the GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED NSX API. For each active alarm investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service will start reporting new alarms again. |
3.0.0 |
Heavy Volume Of Alarms | Critical | global-manager, manager | Heavy volume of a specific alarm type detected. |
Review all active alarms of type {event_id} using the Alarms page in the NSX UI or using the NSX API GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED. For each active alarm investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service will start reporting new {event_id} alarms again. |
3.0.0 |
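The recommended actions in this table reference the GET /api/v1/alarms API. The following is a minimal sketch of listing active alarms and grouping them by event type using Python and the requests library; the Manager address and credentials are placeholders, and certificate verification is disabled only for illustration.
```python
# List active alarms and group them by feature/event type.
# NSX_MANAGER and AUTH are placeholders; adjust for your environment.
from collections import Counter

import requests

NSX_MANAGER = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")

resp = requests.get(
    f"{NSX_MANAGER}/api/v1/alarms",
    params={"status": "OPEN,ACKNOWLEDGED,SUPPRESSED"},
    auth=AUTH,
    verify=False,  # self-signed certificate; supply a CA bundle in production
)
resp.raise_for_status()
alarms = resp.json().get("results", [])

counts = Counter(f"{a.get('feature_name')}.{a.get('event_type')}" for a in alarms)
for event_id, count in counts.most_common():
    print(f"{event_id}: {count} active alarm(s)")
```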
Audit Log Health Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Audit Log File Update Error | Critical | global-manager, manager, edge, public-cloud-gateway, esx, kvm, bms | At least one of the monitored log files cannot be written to. |
1. On Manager and Global Manager nodes, Edge and Public Cloud Gateway nodes, and Ubuntu KVM host nodes, ensure the permissions for the /var/log directory are 775 and the ownership is root:syslog. On RHEL KVM and BMS host nodes, ensure the permissions for the /var/log directory are 755 and the ownership is root:root. |
3.1.0 |
Remote Logging Server Error | Critical | global-manager, manager, edge, public-cloud-gateway | Log messages undeliverable due to incorrect remote logging server configuration. |
1. Ensure that {hostname_or_ip_address_with_port} is the correct hostname or IP address and port. |
3.1.0 |
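The Audit Log File Update Error action above can be verified programmatically. The following is a small sketch, using only the Python standard library, that checks the /var/log mode and ownership against the values given above; run it on the node in question and adjust the expected group per node type.
```python
# Verify /var/log permissions and ownership for the Audit Log File Update Error
# alarm: 775 root:syslog on NSX appliances and Ubuntu KVM hosts,
# 755 root:root on RHEL KVM and BMS hosts.
import grp
import os
import pwd
import stat

def check_var_log(expected_mode: int, expected_group: str) -> None:
    st = os.stat("/var/log")
    mode = stat.S_IMODE(st.st_mode)
    owner = pwd.getpwuid(st.st_uid).pw_name
    group = grp.getgrgid(st.st_gid).gr_name
    ok = mode == expected_mode and owner == "root" and group == expected_group
    print(f"/var/log mode={oct(mode)} owner={owner}:{group} -> "
          f"{'OK' if ok else 'needs correction'}")

# On an NSX appliance or Ubuntu KVM host:
check_var_log(0o775, "syslog")
# On a RHEL KVM or BMS host, use: check_var_log(0o755, "root")
```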
Capacity Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Minimum Capacity Threshold | Medium | manager | A minimum capacity threshold has been breached. |
Navigate to the capacity page under the NSX overview UI of the respective feature (i.e., Networking, Security, Inventory, and System) and review current usage versus threshold limits. If the current usage is expected, consider increasing the minimum threshold values. If the current usage is unexpected, review the network policies configured to decrease usage to at or below the minimum threshold. |
3.1.0 |
Maximum Capacity Threshold | High | manager | A maximum capacity threshold has been breached. |
Navigate to the capacity page under the NSX overview UI of the respective feature (i.e., Networking, Security, Inventory, and System) and review current usage versus threshold limits. If the current usage is expected, consider increasing the maximum threshold values. If the current usage is unexpected, review the network policies configured to decrease usage to at or below the maximum threshold. |
3.1.0 |
Maximum Capacity | Critical | manager | A maximum capacity has been breached. |
Ensure that the number of NSX objects created is within the limits supported by NSX. If there are any unused objects, delete them using the respective NSX UI or API from the system. Consider increasing the form factor of all Manager nodes and/or Edge nodes. Note that the form factor of each node type should be the same. If not the same, the capacity limits for the lowest form factor deployed are used. |
3.1.0 |
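Current usage versus thresholds can also be read programmatically. The sketch below assumes a capacity usage endpoint at /api/v1/capacity/usage and the field names shown; both are assumptions to confirm against the NSX API reference for your release, and the Manager address and credentials are placeholders.
```python
# Hypothetical sketch: read capacity usage and flag object types at or above
# their minimum threshold. Endpoint path and field names are assumptions.
import requests

NSX_MANAGER = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")

resp = requests.get(f"{NSX_MANAGER}/api/v1/capacity/usage", auth=AUTH, verify=False)
resp.raise_for_status()
for item in resp.json().get("capacity_usage", []):
    used = item.get("current_usage_count", 0)
    maximum = item.get("max_supported_count", 0) or 1
    pct = 100.0 * used / maximum
    if pct >= item.get("min_threshold_percentage", 100):
        print(f"{item.get('display_name')}: {used}/{maximum} ({pct:.1f}%)")
```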
Certificates Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Certificate Expired | Critical | global-manager, manager | A certificate has expired. |
Ensure services that are currently using the certificate are updated to use a new, non-expired certificate. Once the expired certificate is no longer in use, it should be deleted by invoking the DELETE {api_collection_path}{entity_id} NSX API. If the expired certificate is used by NAPP Platform, the connection is broken between NSX and NAPP Platform. Check the NAPP Platform troubleshooting document to use a self-signed NAPP CA certificate for recovering the connection. |
3.0.0 |
Certificate Is About To Expire | High | global-manager, manager | A certificate is about to expire. |
Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. Once the expiring certificate is no longer in use, it should be deleted by invoking the DELETE {api_collection_path}{entity_id} NSX API. |
3.0.0 |
Certificate Expiration Approaching | Medium | global-manager, manager | A certificate is approaching expiration. |
Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. Once the expiring certificate is no longer in use, it should be deleted by invoking the DELETE {api_collection_path}{entity_id} NSX API. |
3.0.0 |
CA Bundle Update Recommended | High | global-manager, manager | The update for a trusted CA bundle is recommended. |
Ensure services that are currently using the trusted CA bundle are updated to use a recently-updated trusted CA bundle. Unless it is a system-provided bundle, the bundle can be updated using the PUT /policy/api/v1/infra/cabundles/{entity_id} NSX API. Once the expired bundle is no longer in use, it should be deleted (if not system-provided) by invoking the DELETE /policy/api/v1/infra/cabundles/{entity_id} NSX API. |
3.2.0 |
CA Bundle Update Suggested | Medium | global-manager, manager | The update for a trusted CA bundle is suggested. |
Ensure services that are currently using the trusted CA bundle are updated to use a recently-updated trusted CA bundle. Unless it is a system-provided bundle, the bundle can be updated using the PUT /policy/api/v1/infra/cabundles/{entity_id} NSX API. Once the expired bundle is no longer in use, it should be deleted (if not system-provided) by invoking the DELETE /policy/api/v1/infra/cabundles/{entity_id} NSX API. |
3.2.0 |
Transport Node Certificate Expired | Critical | bms, edge, esx, kvm, public-cloud-gateway | A certificate has expired. |
Replace the Transport node {entity_id} certificate with a non-expired certificate. The expired certificate should be replaced by invoking the POST /api/v1/trust-management/certificates/action/replace-host-certificate/{entity_id} NSX API. If the expired certificate is used by Transport node, the connection is broken between Transport node and Manager node. |
4.1.0 |
Transport Node Certificate Is About To Expire | High | bms, edge, esx, kvm, public-cloud-gateway | A certificate is about to expire. |
Replace the Transport node {entity_id} certificate with a non-expired certificate. The expired certificate should be replaced by invoking the POST /api/v1/trust-management/certificates/action/replace-host-certificate/{entity_id} NSX API. If the certificate is not replaced, when the certificate expires the connection between the Transport node and the Manager node will be broken. |
4.1.0 |
Transport Node Certificate Expiration Approaching | Medium | bms, edge, esx, kvm, public-cloud-gateway | A certificate is approaching expiration. |
Replace the Transport node {entity_id} certificate with a non-expired certificate. The expired certificate should be replaced by invoking the POST /api/v1/trust-management/certificates/action/replace-host-certificate/{entity_id} NSX API. If the certificate is not replaced, when the certificate expires the connection between the Transport node and the Manager node will be broken. |
4.1.0 |
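The certificate actions above list and delete certificates via the trust-management APIs. The following is a hedged sketch that reports days until expiry for Manager certificates and shows the documented DELETE pattern; the Manager address and credentials are placeholders, and field names such as pem_encoded are assumptions to confirm in the NSX API reference.
```python
# Sketch: list Manager certificates, report days until expiry, and (once no
# service references it) delete an expired certificate as described above.
from datetime import datetime

import requests
from cryptography import x509

NSX_MANAGER = "https://nsx-mgr.example.com"   # placeholder
AUTH = ("admin", "VMware1!VMware1!")          # placeholder credentials

resp = requests.get(f"{NSX_MANAGER}/api/v1/trust-management/certificates",
                    auth=AUTH, verify=False)
resp.raise_for_status()
for cert in resp.json().get("results", []):
    pem = cert.get("pem_encoded")
    if not pem:
        continue
    parsed = x509.load_pem_x509_certificate(pem.encode())
    days_left = (parsed.not_valid_after - datetime.utcnow()).days
    print(f"{cert['id']} {cert.get('display_name', '')}: {days_left} day(s) until expiry")

# Once an expired certificate is no longer in use, delete it:
# requests.delete(f"{NSX_MANAGER}/api/v1/trust-management/certificates/<cert-id>",
#                 auth=AUTH, verify=False)
```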
Clustering Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Cluster Degraded | Medium | global-manager, manager | Group member is down. |
1. Invoke the NSX CLI command 'get cluster status' to view the status of group members of the cluster. |
3.2.0 |
Cluster Unavailable | High | global-manager, manager | All the group members of the service are down. |
1. Ensure the service for {group_type} is running on the node. Invoke the GET /api/v1/node/services/<service_name>/status NSX API or the get service <service_name> NSX CLI command to determine if the service is running. If it is not running, invoke the POST /api/v1/node/services/<service_name>?action=restart NSX API or the restart service <service_name> NSX CLI command to restart the service. |
3.2.0 |
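The Cluster Unavailable action above names the node service status and restart APIs. A minimal sketch of those two calls follows; the Manager address, credentials, and service name are placeholders, and the runtime_state field is assumed from the node service status payload.
```python
# Sketch: check a node service and restart it if it is not running,
# per the Cluster Unavailable action above.
import requests

NSX_MANAGER = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")
SERVICE = "manager"  # substitute the <service_name> named in the alarm

status = requests.get(f"{NSX_MANAGER}/api/v1/node/services/{SERVICE}/status",
                      auth=AUTH, verify=False).json()
print(status.get("runtime_state"))  # expected to be "running"

if status.get("runtime_state") != "running":
    requests.post(f"{NSX_MANAGER}/api/v1/node/services/{SERVICE}",
                  params={"action": "restart"}, auth=AUTH, verify=False)
```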
Cni Health Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Hyperbus Manager Connection Down On DPU | Medium | dpu | Detect and report when the container network infrastructure channel is unhealthy on DPU. |
On the DPU {dpu_id} where nsx-cfgagent is running: |
4.0.0 |
Hyperbus Manager Connection Down | Medium | esx, kvm | Detect and report when the container network infrastructure channel is unhealthy. |
On the ESXi node where cfgAgent is running: |
3.0.0 |
Communication Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Limited Reachability On DPU | Medium | dpu | The given collector can not be reached via vmknic(s) on given DVS on DPU. |
This warning does not mean the collector is unreachable. Exported flows generated by the vertical based on DVS {dvs_alias} can still reach the collector {collector_ip} via vmknic(s) on DVSes other than DVS {dvs_alias}. The alarm is a risk reminder: the DVS on which those vmknic(s) are based might later migrate to another DPU or move back to the host. It is therefore better to create vmknic(s) with stack {stack_alias} on DVS {dvs_alias}, configure them with appropriate IPv4/IPv6 addresses from vCenter, and then check whether the {vertical_name} collector {collector_ip} can be reached via the newly created vmknic(s) on DPU {dpu_id} by invoking vmkping {collector_ip} -S {stack_alias} -I vmkX on the DPU. |
4.0.1 |
Unreachable Collector On DPU | Critical | dpu | The given collector can not be reached via existing vmknic(s) on DPU at all. |
To make the collector reachable for the given vertical on the DVS, ensure that vmknic(s) with the expected stack {stack_alias} are created and configured with appropriate IPv4/IPv6 addresses, and that the network connection to the {vertical_name} collector {collector_ip} is healthy. Perform these checks on DPU {dpu_id} and apply any required configuration. If vmkping {collector_ip} -S {stack_alias} succeeds on the DPU, the problem is resolved. |
4.0.1 |
Manager Cluster Latency High | Medium | manager | The average network latency between Manager nodes is high. |
Ensure there are no firewall rules blocking ping traffic between the Manager nodes. If there are other high bandwidth servers and applications sharing the local network, consider moving these to a different network. |
3.1.0 |
Control Channel To Manager Node Down Too Long | Critical | bms, edge, esx, kvm, public-cloud-gateway | Transport node's control plane connection to the Manager node is down for long. |
1. Check the connectivity from Transport node {entity_id} to Manager node {appliance_address} interface via ping. If they are not pingable, check for flakiness in network connectivity. |
3.1.0 |
Control Channel To Manager Node Down | Medium | bms, edge, esx, kvm, public-cloud-gateway | Transport node's control plane connection to the Manager node is down. |
1. Check the connectivity from Transport node {entity_id} to Manager node {appliance_address} interface via ping. If they are not pingable, check for flakiness in network connectivity. |
3.1.0 |
Control Channel To Transport Node Down | Medium | manager | Controller service to Transport node's connection is down. |
1. Check the connectivity between the Controller service {central_control_plane_id} and the Transport node {entity_id} interface via ping and traceroute. This can be done on the NSX Manager node admin CLI. The ping test should not see drops and should have consistent latency values. VMware recommends latency values of 150ms or less. |
3.1.0 |
Control Channel To Transport Node Down Long | Critical | manager | Controller service to Transport node's connection is down for too long. |
1. Check the connectivity between the Controller service {central_control_plane_id} and the Transport node {entity_id} interface via ping and traceroute. This can be done on the NSX Manager node admin CLI. The ping test should not see drops and should have consistent latency values. VMware recommends latency values of 150ms or less. |
3.1.0 |
Control Channel To Antrea Cluster Down | Medium | manager | Controller service to Antrea cluster's connection is down. |
1. Check if the Antrea Kubernetes cluster is deleted. |
4.1.1 |
Control Channel To Antrea Cluster Down Long | Critical | manager | Controller service to Antrea cluster's connection is down for too long. |
1. Check if the Antrea Kubernetes cluster is deleted. |
4.1.1 |
Manager Control Channel Down | Critical | manager | Manager to controller channel is down. |
1. On Manager node {manager_node_name} ({appliance_address}), invoke the following NSX CLI command: get service applianceproxy to check the status of the service periodically for 60 minutes. |
3.0.2 |
Management Channel To Transport Node Down | Medium | manager | Management channel to Transport node is down. |
Ensure there is network connectivity between the Manager nodes and Transport node {transport_node_name} ({transport_node_address}) and no firewalls are blocking traffic between the nodes. On Windows Transport nodes, ensure the nsx-proxy service is running on the Transport node by invoking the command C:\NSX\nsx-proxy\nsx-proxy.ps1 status in the Windows PowerShell. If it is not running, restart it by invoking the command C:\NSX\nsx-proxy\nsx-proxy.ps1 restart. On all other Transport nodes, ensure the nsx-proxy service is running on the Transport node by invoking the command /etc/init.d/nsx-proxy status. If it is not running, restart it by invoking the command /etc/init.d/nsx-proxy restart. |
3.0.2 |
Management Channel To Transport Node Down Long | Critical | manager | Management channel to Transport node is down for too long. |
Ensure there is network connectivity between the Manager nodes and Transport node {transport_node_name} ({transport_node_address}) and no firewalls are blocking traffic between the nodes. On Windows Transport nodes, ensure the nsx-proxy service is running on the Transport node by invoking the command C:\NSX\nsx-proxy\nsx-proxy.ps1 status in the Windows PowerShell. If it is not running, restart it by invoking the command C:\NSX\nsx-proxy\nsx-proxy.ps1 restart. On all other Transport nodes, ensure the nsx-proxy service is running on the Transport node by invoking the command /etc/init.d/nsx-proxy status. If it is not running, restart it by invoking the command /etc/init.d/nsx-proxy restart. |
3.0.2 |
Manager FQDN Lookup Failure | Critical | global-manager, bms, edge, esx, kvm, manager, public-cloud-gateway | DNS lookup failed for Manager node's FQDN. |
1. Assign correct FQDNs to all Manager nodes and verify the DNS configuration is correct for successful lookup of all Manager nodes' FQDNs. |
3.1.0 |
Manager FQDN Reverse Lookup Failure | Critical | global-manager, manager | Reverse DNS lookup failed for Manager node's IP address. |
1. Assign correct FQDNs to all Manager nodes and verify the DNS configuration is correct for successful reverse lookup of the Manager node's IP address. |
3.1.0 |
Management Channel To Manager Node Down | Medium | bms, edge, esx, kvm, public-cloud-gateway | Management channel to Manager node is down. |
Ensure there is network connectivity between the Transport node {transport_node_id} and leader Manager node. Also ensure no firewalls are blocking traffic between the nodes. Ensure the messaging manager service is running on Manager nodes by invoking the command /etc/init.d/messaging-manager status. If the messaging manager is not running, restart it by invoking the command /etc/init.d/messaging-manager restart. |
3.2.0 |
Management Channel To Manager Node Down Long | Critical | bms, edge, esx, kvm, public-cloud-gateway | Management channel to Manager node is down for too long. |
Ensure there is network connectivity between the Transport node {transport_node_id} and leader Manager nodes. Also ensure no firewalls are blocking traffic between the nodes. Ensure the messaging manager service is running on Manager nodes by invoking the command /etc/init.d/messaging-manager status. If the messaging manager is not running, restart it by invoking the command /etc/init.d/messaging-manager restart. |
3.2.0 |
Network Latency High | Medium | manager | Management to Transport node network latency is high. |
1. Wait for 5 minutes to see if the alarm automatically gets resolved. |
4.0.0 |
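The Manager FQDN Lookup Failure and Reverse Lookup Failure actions above amount to forward and reverse DNS checks. A small sketch using only the Python standard library follows; the FQDN and IP address are placeholders to replace with your Manager node values.
```python
# Quick forward and reverse DNS check for a Manager node, matching the
# FQDN lookup actions above. fqdn and expected_ip are placeholders.
import socket

fqdn = "nsx-mgr-01.example.com"
expected_ip = "10.0.0.11"

try:
    resolved = socket.gethostbyname(fqdn)
    print(f"forward lookup: {fqdn} -> {resolved} "
          f"({'OK' if resolved == expected_ip else 'MISMATCH'})")
except socket.gaierror as err:
    print(f"forward lookup failed: {err}")

try:
    reverse_fqdn, _, _ = socket.gethostbyaddr(expected_ip)
    print(f"reverse lookup: {expected_ip} -> {reverse_fqdn} "
          f"({'OK' if reverse_fqdn == fqdn else 'MISMATCH'})")
except socket.herror as err:
    print(f"reverse lookup failed: {err}")
```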
DHCP Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Pool Lease Allocation Failed | High | edge, autonomous-edge, public-cloud-gateway | IP addresses in an IP Pool have been exhausted. |
Review the DHCP pool configuration in the NSX UI or on the Edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool. Also review the current active leases on the Edge node by invoking the NSX CLI command get dhcp lease. Compare the leases to the number of active VMs. Consider reducing the lease time on the DHCP server configuration if the number of VMs is low compared to the number of active leases. Also consider expanding the pool range for the DHCP server by visiting the Networking | Segments | Segment page in the NSX UI. |
3.0.0 |
Pool Overloaded | Medium | edge, autonomous-edge, public-cloud-gateway | An IP Pool is overloaded. |
Review the DHCP pool configuration in the NSX UI or on the Edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool. Also review the current active leases on the Edge node by invoking the NSX CLI command get dhcp lease. Compare the leases to the number of active VMs. Consider reducing the lease time on the DHCP server configuration if the number of VMs is low compared to the number of active leases. Also consider expanding the pool range for the DHCP server by visiting the Networking | Segments | Segment page in the NSX UI. |
3.0.0 |
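The DHCP actions above run the get dhcp ip-pool and get dhcp lease NSX CLI commands on the Edge node. The following is a hedged sketch of collecting both outputs over SSH with paramiko (an assumption about tooling; any SSH client works), with a placeholder Edge address and credentials.
```python
# Sketch: run the DHCP CLI checks referenced above on an Edge node over SSH.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("edge-01.example.com",                 # placeholder Edge node
               username="admin", password="VMware1!VMware1!")

for cmd in ("get dhcp ip-pool", "get dhcp lease"):
    _, stdout, _ = client.exec_command(cmd)
    print(f"### {cmd}\n{stdout.read().decode()}")

client.close()
```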
Distributed Firewall Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
DFW CPU Usage Very High | Critical | esx | DFW CPU usage is very high. |
Consider re-balancing the VM workloads on this host to other hosts. Review the security design for optimization. For example, use the apply-to configuration if the rules are not applicable to the entire datacenter. |
3.0.0 |
DFW CPU Usage Very High On DPU | Critical | dpu | DFW CPU usage is very high on DPU. |
Consider re-balancing the VM workloads on this host to other hosts. Review the security design for optimization. For example, use the apply-to configuration if the rules are not applicable to the entire datacenter. |
4.0.0 |
DFW Memory Usage Very High | Critical | esx | DFW Memory usage is very high. |
View the current DFW memory usage by invoking the NSX CLI command get firewall thresholds on the host. Consider re-balancing the workloads on this host to other hosts. |
3.0.0 |
DFW Memory Usage Very High On DPU | Critical | dpu | DFW Memory usage is very high on DPU. |
View the current DFW memory usage by invoking the NSX CLI command get firewall thresholds on the DPU. Consider re-balancing the workloads on this host to other hosts. |
4.0.0 |
DFW Vmotion Failure | Critical | esx | DFW vMotion failed, port disconnected. |
Check VMs on the host in NSX Manager, and manually repush the DFW configuration through the NSX Manager UI. The DFW policy to be repushed can be traced by the DFW LSP {entity_id}. Also consider finding the VM to which the DFW LSP is attached and restarting it. |
3.2.0 |
DFW Flood Limit Warning | Medium | esx | DFW flood limit has reached warning level. |
Check VMs on the host in NSX Manager, and check the configured flood warning level of the DFW filter {entity_id} for protocol {protocol_name}. |
4.1.0 |
DFW Flood Limit Critical | Critical | esx | DFW flood limit has reached critical level. |
Check VMs on the host in NSX Manager, and check the configured flood critical level of the DFW filter {entity_id} for protocol {protocol_name}. |
4.1.0 |
DFW Session Count High | Critical | esx | DFW session count is high. |
Review the network traffic load level of the workloads on the host. Consider re-balancing the workloads on this host to other hosts. |
3.2.0 |
DFW Rules Limit Per vNIC Exceeded | Critical | esx | DFW rules limit per vNIC is about to exceed the maximum limit. |
Log in to the ESX host {transport_node_name} and invoke the NSX CLI command get firewall <VIF_UUID> ruleset rules to get the rule statistics for rules configured on the corresponding VIF. Reduce the number of rules configured for VIF {entity_id}. |
4.0.0 |
DFW Rules Limit Per vNIC Approaching | Medium | esx | DFW rules limit per vNIC is approaching the maximum limit. |
Log in to the ESX host {transport_node_name} and invoke the NSX CLI command get firewall <VIF_UUID> ruleset rules to get the rule statistics for rules configured on the corresponding VIF. Reduce the number of rules configured for VIF {entity_id}. |
4.0.0 |
DFW Rules Limit Per Host Exceeded | Critical | esx | DFW rules limit per host is about to exceed the maximum limit. |
Log in to the ESX host {transport_node_name} and invoke the NSX CLI command get firewall rule-stats total to get the rule statistics for rules configured on the ESX host {transport_node_name}. Reduce the number of rules configured for host {transport_node_name}. Check the number of rules configured for various VIFs by using the NSX CLI command get firewall <VIF_UUID> ruleset rules. Reduce the number of rules configured for various VIFs. |
4.0.0 |
DFW Rules Limit Per Host Approaching | Medium | esx | DFW rules limit per host is approaching the maximum limit. |
Log in to the ESX host {transport_node_name} and invoke the NSX CLI command get firewall rule-stats total to get the rule statistics for rules configured on the ESX host {transport_node_name}. Reduce the number of rules configured for host {transport_node_name}. Check the number of rules configured for various VIFs by using the NSX CLI command get firewall <VIF_UUID> ruleset rules. Reduce the number of rules configured for various VIFs. |
4.0.0 |
Distributed IDS IPS Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Max Events Reached | Medium | manager | Max number of intrusion events reached. |
There is no manual intervention required. A purge job will kick in automatically every 3 minutes and delete 10% of the older records to bring the total intrusion events count in the system below the threshold value. |
3.1.0 |
NSX IDPS Engine Memory Usage High | Medium | esx | NSX-IDPS engine memory usage reaches 75% or above. |
Consider re-balancing the VM workloads on this host to other hosts. |
3.1.0 |
NSX IDPS Engine Memory Usage High On DPU | Medium | dpu | NSX-IDPS engine memory usage reaches 75% or above on DPU. |
Consider re-balancing the VM workloads on this host to other hosts. |
4.0.0 |
NSX IDPS Engine Memory Usage Medium High | High | esx | NSX-IDPS Engine memory usage reaches 85% or above. |
Consider re-balancing the VM workloads on this host to other hosts. |
3.1.0 |
NSX IDPS Engine Memory Usage Medium High On DPU | High | dpu | NSX-IDPS Engine memory usage reaches 85% or above on DPU. |
Consider re-balancing the VM workloads on this host to other hosts. |
4.0.0 |
NSX IDPS Engine Memory Usage Very High | Critical | esx | NSX-IDPS engine memory usage reaches 95% or above. |
Consider re-balancing the VM workloads on this host to other hosts. |
3.1.0 |
NSX IDPS Engine Memory Usage Very High On DPU | Critical | dpu | NSX-IDPS engine memory usage reaches 95% or above on DPU. |
Consider re-balancing the VM workloads on this host to other hosts. |
4.0.0 |
NSX IDPS Engine CPU Usage High (deprecated) | Medium | esx | NSX-IDPS engine CPU usage reaches 75% or above. |
Consider re-balancing the VM workloads on this host to other hosts. |
3.1.0 |
NSX IDPS Engine CPU Usage Medium High (deprecated) | High | esx | NSX-IDPS engine CPU usage reaches 85% or above. |
Consider re-balancing the VM workloads on this host to other hosts. |
3.1.0 |
NSX IDPS Engine CPU Usage Very High (deprecated) | Critical | esx | NSX-IDPS engine CPU usage reaches 95% or above. |
Consider re-balancing the VM workloads on this host to other hosts. |
3.1.0 |
NSX IDPS Engine Down | Critical | esx | NSX IDPS is activated via NSX Policy and IDPS rules are configured, but NSX-IDPS engine is down. |
1. Check /var/log/nsx-syslog.log and /var/log/syslog.log to see if there are errors reported. |
3.1.0 |
NSX IDPS Engine Down On DPU | Critical | dpu | NSX IDPS is activated via NSX Policy and IDPS rules are configured, but NSX-IDPS engine is down on DPU. |
1. Check /var/log/nsx-syslog.log and /var/log/syslog.log to see if there are errors reported. |
4.0.0 |
IDPS Engine CPU Oversubscription High | Medium | esx | CPU utilization for distributed IDPS engine is high. |
Review the reason for oversubscription. Move certain applications to a different host. |
4.0.0 |
IDPS Engine CPU Oversubscription Very High | High | esx | CPU utilization for distributed IDPS engine is very high. |
Review the reason for oversubscription. Move certain applications to a different host. |
4.0.0 |
IDPS Engine Network Oversubscription High | Medium | esx | Network utilization for distributed IDPS engine is high. |
Review the reason for oversubscription. Review the IDPS rules to reduce the amount of traffic subject to the IDPS service. |
4.0.0 |
IDPS Engine Network Oversubscription Very High | High | esx | Network utilization for distributed IDPS engine is very high. |
Review the reason for oversubscription. Review the IDPS rules to reduce the amount of traffic subject to the IDPS service. |
4.0.0 |
IDPS Engine Dropped Traffic CPU Oversubscribed | Critical | esx | Distributed IDPS Engine Dropped Traffic due to CPU Oversubscription. |
Review the reason for oversubscription. Move certain applications to a different host. |
4.0.0 |
IDPS Engine Dropped Traffic Network Oversubscribed | Critical | esx | Distributed IDPS Engine Dropped Traffic due to Network Oversubscription. |
Review the reason for oversubscription. Review the IDPS rules to reduce the amount of traffic subject to the IDPS service. |
4.0.0 |
IDPS Engine Bypassed Traffic CPU Oversubscribed | Critical | esx | Distributed IDPS Engine Bypassed Traffic due to CPU Oversubscription. |
Review the reason for oversubscription. Move certain applications to a different host. |
4.0.0 |
IDPS Engine Bypassed Traffic Network Oversubscribed | Critical | esx | Distributed IDPS Engine Bypassed Traffic due to Network Oversubscription. |
Review the reason for oversubscription. Review the IDPS rules to reduce the amount of traffic subject to the IDPS service. |
4.0.0 |
Site Connection Loss (deprecated) | Medium | manager | Connection between IDPS Reporting Service and NSX+ failed. |
Restart IDPS Reporting Service on all manager nodes. |
4.1.1 |
Host Pcap Partition Full | Medium | esx | The partition for PCAPs on the host is full. |
Check the NSX IDS/IPS policies/rules and reduce the number of rules and policies with PCAP-enabled profiles. |
4.2.0 |
Message Transmission Failed | Medium | esx | Failed to send data from host to NSX Application Platform. |
Check for connectivity issues between the ESX host and messaging broker. |
4.2.0 |
DNS Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Forwarder Down | High | edge, autonomous-edge, public-cloud-gateway | A DNS forwarder is down. |
1. Invoke the NSX CLI command get dns-forwarders status to verify if the DNS forwarder is in down state. |
3.0.0 |
Forwarder Disabled (deprecated) | Info | edge, autonomous-edge, public-cloud-gateway | A DNS forwarder is deactivated. |
1. Invoke the NSX CLI command get dns-forwarders status to verify if the DNS forwarder is in the deactivated state. |
3.0.0 |
Forwarder Upstream Server Timeout | High | edge, autonomous-edge, public-cloud-gateway | One DNS forwarder upstream server has timed out. |
1. Invoke the NSX API GET /api/v1/dns/forwarders/{dns_id}/nslookup?address=<address>&server_ip={dns_upstream_ip}&source_ip=<source_ip>. This API request triggers a DNS lookup to the upstream server in the DNS forwarder's network namespace. <address> is the IP address or FQDN in the same domain as the upstream server. <source_ip> is an IP address in the upstream server's zone. If the API returns a connection timed out response, there is likely a network error or upstream server problem. Check why DNS lookups are not reaching the upstream server or why the upstream server is not returning a response. If the API response indicates the upstream server is answering, proceed to step 2. |
3.1.3 |
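The Forwarder Upstream Server Timeout action above describes the nslookup API call. A minimal sketch of that request follows; the Manager address, credentials, forwarder ID, and lookup addresses are placeholders to substitute from the alarm.
```python
# Sketch of the DNS forwarder nslookup API call referenced above.
import requests

NSX_MANAGER = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")
DNS_ID = "<dns-forwarder-id>"  # {dns_id} from the alarm

resp = requests.get(
    f"{NSX_MANAGER}/api/v1/dns/forwarders/{DNS_ID}/nslookup",
    params={
        "address": "host.example.com",   # FQDN or IP in the upstream's domain
        "server_ip": "10.0.0.53",        # {dns_upstream_ip} from the alarm
        "source_ip": "10.0.0.100",       # IP in the upstream server's zone
    },
    auth=AUTH,
    verify=False,
)
print(resp.status_code, resp.text)
```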
Edge Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Edge Node Settings Mismatch | Critical | manager | Edge node settings mismatch. |
Review the node settings of this Edge transport node {entity_id}. Follow one of the following actions to resolve the alarm - |
3.2.0 |
Edge VM vSphere Settings Mismatch | Critical | manager | Edge VM vSphere settings mismatch. |
Review the vSphere configuration of this Edge Transport Node {entity_id}. Follow one of the following actions to resolve the alarm - |
3.2.0 |
Edge Node Settings And vSphere Settings Are Changed | Critical | manager | Edge node settings and vSphere settings are changed. |
Review the node settings and vSphere configuration of this Edge Transport Node {entity_id}. Follow one of the following actions to resolve the alarm - |
3.2.0 |
Edge vSphere Location Mismatch | High | manager | Edge vSphere Location Mismatch. |
Review the vSphere configuration of this Edge Transport Node {entity_id}. Follow one of the following actions to resolve the alarm - |
3.2.0 |
Edge VM Present In NSX Inventory Not Present In vCenter | Critical | manager | Auto Edge VM is present in NSX inventory but not present in vCenter. |
The managed object reference (moref) id of a VM has the form vm-number, and is visible in the URL when you select the Edge VM in the vCenter UI, for example vm-12011 in https://<vc-url>/ui/app/vm;nav=h/urn:vmomi:VirtualMachine:vm-12011:164ff798-c4f1-495b-a0be-adfba337e5d2/summary. Find the VM {policy_edge_vm_name} with moref id {vm_moref_id} in vCenter for this Edge Transport Node {entity_id}. If the Edge VM is present in vCenter with a different moref id, follow the action below. Use the NSX add or update placement API with JSON request payload properties vm_id and vm_deployment_config to update the new VM moref id and vSphere deployment parameters: POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=addOrUpdatePlacementReferences. If the Edge VM with name {policy_edge_vm_name} is not present in vCenter, use the NSX Redeploy API to deploy a new VM for the Edge node: POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=redeploy. |
3.2.1 |
Edge VM Not Present In Both NSX Inventory And vCenter (deprecated) | Critical | manager | Auto Edge VM is not present in both NSX inventory and in vCenter. |
The managed object reference (moref) id of a VM has the form vm-number, and is visible in the URL when you select the Edge VM in the vCenter UI, for example vm-12011 in https://<vc-url>/ui/app/vm;nav=h/urn:vmomi:VirtualMachine:vm-12011:164ff798-c4f1-495b-a0be-adfba337e5d2/summary. Find the VM {policy_edge_vm_name} with moref id {vm_moref_id} in vCenter for this Edge Transport Node {entity_id}. Follow the action below to resolve the alarm - Check if the VM has been deleted in vSphere or is present with a different moref id. |
3.2.1 |
Failed To Delete The Old VM In vCenter During Redeploy | Critical | manager | Power off and delete operation failed for old Edge VM in vCenter during Redeploy. |
The managed object reference (moref) id of a VM has the form vm-number, and is visible in the URL when you select the Edge VM in the vCenter UI, for example vm-12011 in https://<vc-url>/ui/app/vm;nav=h/urn:vmomi:VirtualMachine:vm-12011:164ff798-c4f1-495b-a0be-adfba337e5d2/summary. Find the VM {policy_edge_vm_name} with moref id {vm_moref_id} in vCenter for this Edge Transport Node {entity_id}. Power off and delete the old Edge VM {policy_edge_vm_name} with moref id {vm_moref_id} in vCenter. |
3.2.1 |
Edge Hardware Version Mismatch | Medium | manager | Edge node has hardware version mismatch. |
Please follow the KB article to resolve the hardware version mismatch alarm for Edge node {transport_node_name}. |
4.0.1 |
Stale Edge Node Entry Found | Critical | manager | Stale entries found for Edge Node. |
Please follow the KB article to clear the stale entries for the Edge Node {transport_node_name} with UUID {entity_id}. |
4.1.1 |
Uplink fp-eth Interface Mismatch During Replacement | Critical | manager | Uplinks to fp-eth interfaces mismatch. |
Update the mapping of uplinks to fp-eth interfaces {old_fp_eth_list} via UI or API - PUT https://<manager-ip>/api/v1/transport-nodes/<tn-id> as per the new fp-eth interfaces {new_fp_eth_list}. |
4.1.1 |
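Several actions above invoke the transport-node redeploy API. The following is a hedged sketch of that call; the Manager address, credentials, and transport node ID are placeholders, and the assumption that the redeploy action takes the current transport node configuration as its request body should be confirmed against the NSX API reference for your release.
```python
# Sketch of the Edge redeploy call referenced above.
import requests

NSX_MANAGER = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")
TN_ID = "<edge-transport-node-id>"  # the alarm's {entity_id}

# Fetch the current transport node configuration to use as the request body
# (an assumption; verify the required payload for your NSX release).
node = requests.get(f"{NSX_MANAGER}/api/v1/transport-nodes/{TN_ID}",
                    auth=AUTH, verify=False).json()

resp = requests.post(f"{NSX_MANAGER}/api/v1/transport-nodes/{TN_ID}",
                     params={"action": "redeploy"},
                     json=node, auth=AUTH, verify=False)
print(resp.status_code)
```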
Edge Cluster Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Edge Cluster Member Relocate Failure | Critical | manager | Edge cluster member relocate failure alarm |
Review the available capacity for the Edge cluster. If more capacity is required, scale your Edge cluster. Retry the relocate Edge cluster member operation. |
4.0.0 |
Edge Health Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Edge CPU Usage Very High | Critical | edge, public-cloud-gateway | Edge node CPU usage is very high. |
Review the configuration, running services and sizing of this Edge node. Invoke the 'get processes monitor' command to identify which non-datapath processes have very high CPU usage. Note that any process with the name 'dp-fp' is the datapath process and can have high CPU usage based on traffic load. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload. |
3.0.0 |
Edge CPU Usage High | Medium | edge, public-cloud-gateway | Edge node CPU usage is high. |
Review the configuration, running services and sizing of this Edge node. Invoke the 'get processes monitor' command to identify which non-datapath processes have high CPU usage. Note that any process with the name 'dp-fp' is the datapath process and can have high CPU usage based on traffic load. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload. |
3.0.0 |
Edge Memory Usage Very High | Critical | edge, public-cloud-gateway | Edge node memory usage is very high. |
Review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload. |
3.0.0 |
Edge Memory Usage High | Medium | edge, public-cloud-gateway | Edge node memory usage is high. |
Review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload. |
3.0.0 |
Edge Disk Usage Very High | Critical | edge, public-cloud-gateway | Edge node disk usage is very high. |
Examine the partition with high usage and see if there are any unexpected large files that can be removed. |
3.0.0 |
Edge Disk Usage High | Medium | edge, public-cloud-gateway | Edge node disk usage is high. |
Examine the partition with high usage and see if there are any unexpected large files that can be removed. |
3.0.0 |
Edge Datapath CPU Very High | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node datapath CPU usage is very high. |
Review the CPU statistics on the Edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core. Higher CPU usage is expected with higher packet rates. If the packet rate is low while CPU usage is high, check if flow-cache is disabled by invoking the NSX CLI command get dataplane flow-cache config. If it is disabled, consider re-enabling it using the command set dataplane flow-cache enabled followed by restart service dataplane (note: this command will cause momentary disruption in traffic). Consider increasing the Edge appliance form factor size and rebalancing services on this Edge node to other Edge nodes in the same cluster or other Edge clusters. |
3.0.0 |
Edge Datapath CPU High | Medium | edge, autonomous-edge, public-cloud-gateway | Edge node datapath CPU usage is high. |
Review the CPU statistics on the Edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core. Higher CPU usage is expected with higher packet rates. If the packet rate is low while CPU usage is high, check if flow-cache is disabled by invoking the NSX CLI command get dataplane flow-cache config. If it is disabled, consider re-enabling it using the command set dataplane flow-cache enabled followed by restart service dataplane (note: this command will cause momentary disruption in traffic). Consider increasing the Edge appliance form factor size and rebalancing services on this Edge node to other Edge nodes in the same cluster or other Edge clusters. |
3.0.0 |
Edge Datapath Configuration Failure | High | edge, autonomous-edge, public-cloud-gateway | Edge node datapath configuration failed. |
Ensure the Edge node's connectivity to the Manager node is healthy. Invoke the get services NSX Edge CLI command to check the health of the dataplane service. If the dataplane service is stopped, use start service dataplane to start the service. |
3.0.0 |
Edge Datapath Cryptodrv Down | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node crypto driver is down. |
This alarm is raised when the datapath DPDK crypto module fails the integrity test or known-answer tests at bring-up, which implies the crypto binaries are corrupted. The recommended action is to redeploy this Edge node to load the correct binaries. |
3.0.0 |
Edge Datapath Mempool High | Medium | edge, autonomous-edge, public-cloud-gateway | Edge node datapath mempool is high. |
Get the mempool usage using the get dataplane memory stats NSX CLI command. |
3.0.0 |
Edge Global ARP Table Usage High | Medium | edge, autonomous-edge, public-cloud-gateway | Edge node global ARP table usage is high. |
Log in as the root user and invoke the command edge-appctl -t /var/run/vmware/edge/dpd.ctl neigh/show and check if neigh cache usage is normal. If it is normal, invoke the command edge-appctl -t /var/run/vmware/edge/dpd.ctl neigh/set_param max_entries to increase the ARP table size. |
3.0.0 |
Edge NIC Out Of Receive Buffer | Medium | edge, autonomous-edge, public-cloud-gateway | Edge node NIC is out of RX ring buffers temporarily. |
Run the NSX CLI command get dataplane cpu stats on the edge node and check: |
3.0.0 |
Edge NIC Out Of Transmit Buffer | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node NIC is out of TX ring buffers temporarily. |
1. If the hypervisor accommodates many VMs alongside the Edge, the Edge VM might not get enough time to run, so packets might not be retrieved by the hypervisor. In that case, consider migrating the Edge VM to a host with fewer VMs. |
3.0.0 |
Edge NIC Transmit Queue Overflow | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node NIC transmit queue has overflowed temporarily. |
1. If the hypervisor accommodates many VMs alongside the Edge, the Edge VM might not get enough time to run, so packets might not be retrieved by the hypervisor. In that case, consider migrating the Edge VM to a host with fewer VMs. |
4.1.1 |
Edge NIC Link Status Down | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node NIC link is down. |
On the Edge node, confirm whether the NIC link is physically down by invoking the NSX CLI command get physical-port <port-name>. If a Bare Metal Edge NIC is down, verify the cable connection. If a VM Edge NIC is down, check whether the NIC is associated with a DVS. |
3.0.0 |
Storage Error | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node disk is read-only. |
Examine the read-only partition to see if reboot resolves the issue or the disk needs to be replaced. Contact GSS for more information. |
3.0.1 |
Datapath Thread Deadlocked | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node's datapath thread is in deadlock condition. |
Restart the dataplane service by invoking the NSX CLI command restart service dataplane. |
3.1.0 |
Edge Datapath NIC Throughput Very High | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node datapath NIC throughput is very high (Baremetal Edge only). |
Examine the traffic throughput levels on the NIC and determine whether configuration changes are needed. Increased throughput support can be provided by introducing a LAG config or upgrading to a NIC with higher capacity. The 'get dataplane throughput <seconds>' command can be used to monitor throughput. |
3.2.0 |
Edge Datapath NIC Throughput High | Medium | edge, autonomous-edge, public-cloud-gateway | Edge node datapath NIC throughput is high (Baremetal Edge only). |
Examine the traffic throughput levels on the NIC and determine whether configuration changes are needed. Increased throughput support can be provided by introducing a LAG config or upgrading to a NIC with higher capacity. The 'get dataplane throughput <seconds>' command can be used to monitor throughput. |
3.2.0 |
Failure Domain Down | Critical | edge, public-cloud-gateway | All members of failure domain are down. |
1. On the Edge node identified by {transport_node_id}, check the connectivity to the management and control planes by invoking the NSX CLI command get managers and get controllers. |
3.2.0 |
Micro Flow Cache Hit Rate Low | Medium | edge, autonomous-edge, public-cloud-gateway | Micro Flow Cache hit rate decreases and Datapath CPU is high. |
The flow-cache hit rate has decreased for the last 30 minutes, which is an indication that Edge performance may be degrading. Traffic will continue to be forwarded and you may not experience any issues. Check whether the datapath CPU utilization for Edge {entity_id} core {core_id} has been high for the last 30 minutes. The Edge will have a low flow-cache hit rate when new flows are continuously being created, because the first packet of any new flow is used to set up the flow cache for fast path processing. You may want to increase your Edge appliance size or increase the number of Edge nodes used for Active/Active Gateways. |
3.2.2 |
Mega Flow Cache Hit Rate Low | Medium | edge, autonomous-edge, public-cloud-gateway | Mega Flow Cache hit rate decreases and Datapath CPU is high. |
The flow-cache hit rate has decreased for the last 30 minutes, which is an indication that Edge performance may be degrading. Traffic will continue to be forwarded and you may not experience any issues. Check whether the datapath CPU utilization for Edge {entity_id} core {core_id} has been high for the last 30 minutes. The Edge will have a low flow-cache hit rate when new flows are continuously being created, because the first packet of any new flow is used to set up the flow cache for fast path processing. You may want to increase your Edge appliance size or increase the number of Edge nodes used for Active/Active Gateways. |
3.2.2 |
Flow Cache Deactivated | Critical | edge, autonomous-edge, public-cloud-gateway | Flow cache deactivated. |
Please make sure the Flow cache for the Edge Transport Node {entity_id} and {transport_node_name} is activated. Deactivating the flow cache will cause traffic to be forwarded through the CPU. To enable flow cache, use the command set dataplane flow-cache enabled, followed by restart service dataplane. |
4.1.1 |
Bridge Port Loop Detected (deprecated) | Critical | edge, public-cloud-gateway | Edge detects possible L2 loop on bridged external physical network. |
Take action to resolve the alarm by removing possible L2 loops on the customer network. If the alarm appears again, you may have a case where a short MAC flap triggers this alarm because the MAC address provided was intentionally moved behind the external network. In this case, it is recommended to acknowledge the alarm. If the alarm is still active, check whether the MAC address provided in the alarm is expected to be behind the external network and whether this is the expected direction. See https://kb.vmware.com/s/article/95658 for additional information. |
4.2.0 |
Sub Numa Activated | Critical | edge | Edge Transport Node detected Sub-NUMA Clustering activated. |
Please take action to deactivate Sub-NUMA Clustering in the BIOS Settings of Edge Transport Node {transport_node_name} with UUID {entity_id}. |
4.2.0 |
Edge Agent Down | Medium | edge, autonomous-edge | Edge Agent liveness is down. |
1. On the Edge node, invoke the NSX CLI command get service local-controller state several times. If the CLI succeeds, it might be a transient problem where the CPU load is high. If the CLI fails, continue to the next check. |
4.2.0 |
Longrunning Packet Capture | Medium | edge | Long-running packet capture on edge interface. |
Please delete the capture session using the del capture session {packet_capture_session_id} interface {source_interface} command if the session is no longer needed. |
4.2.0 |
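Several of the datapath actions above check the flow-cache configuration and, if it is disabled, re-enable it with set dataplane flow-cache enabled followed by restart service dataplane. The following is a hedged sketch of automating that check over SSH with paramiko; the Edge address, credentials, and the substring test on the CLI output are assumptions, and restarting the dataplane service briefly disrupts traffic.
```python
# Sketch: check whether flow cache is enabled on an Edge node and re-enable it
# if not, per the Edge Datapath CPU actions above.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("edge-01.example.com",                 # placeholder Edge node
               username="admin", password="VMware1!VMware1!")

_, stdout, _ = client.exec_command("get dataplane flow-cache config")
config = stdout.read().decode()
print(config)

# Output format varies by release; inspect it before relying on this check.
if "disabled" in config.lower():
    client.exec_command("set dataplane flow-cache enabled")
    client.exec_command("restart service dataplane")  # momentary traffic disruption

client.close()
```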
Endpoint Protection Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
EAM Status Down | Critical | manager | ESX Agent Manager (EAM) service on a compute manager is down. |
Start the ESX Agent Manager (EAM) service. SSH into vCenter and invoke the command service-control --start vmware-eam. |
3.0.0 |
Partner Channel Down | Critical | esx | Host module and Partner SVM connection is down. |
Refer to the referenced KB to debug this issue and make sure that Partner SVM {entity_id} is re-connected to the host module. You can also run the 'NxgiPlatform' Runbook on this particular Transport node for help in troubleshooting. |
3.0.0 |
Federation Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Rtep BGP Down | High | edge, autonomous-edge, public-cloud-gateway | RTEP BGP neighbor down. |
1. Invoke the NSX CLI command get logical-routers on the affected edge node. |
3.0.1 |
LM To LM Synchronization Warning | Medium | manager | Synchronization between remote locations failed for more than 3 minutes. |
1. Invoke the NSX CLI command get site-replicator remote-sites to get connection state between the remote locations. If a remote location is connected but not synchronized, it is possible that the location is still in the process of leader resolution. In this case, wait for around 10 seconds and try invoking the CLI again to check for the state of the remote location. If a location is disconnected, try the next step. |
3.0.1 |
LM To LM Synchronization Error | High | manager | Synchronization between remote locations failed for more than 15 minutes. |
1. Invoke the NSX CLI command get site-replicator remote-sites to get connection state between the remote locations. If a remote location is connected but not synchronized, it is possible that the location is still in the process of leader resolution. In this case, wait for around 10 seconds and try invoking the CLI again to check for the state of the remote location. If a location is disconnected, try the next step. |
3.0.1 |
Rtep Connectivity Lost | High | manager | RTEP location connectivity lost. |
1. Invoke the NSX CLI command get logical-routers on the affected edge node (the edge which lost the connectivity) {transport_node_name}. |
3.0.2 |
GM To GM Split Brain | Critical | global-manager | Multiple Global Manager nodes are active at the same time. |
Configure only one Global Manager node as active and all other Global Manager nodes as standby. |
3.1.0 |
GM To GM Latency Warning | Medium | global-manager | Latency between Global Managers is higher than expected for more than 2 minutes |
Check the connectivity from Global Manager {from_gm_path}({site_id}) to the Global Manager {to_gm_path}({remote_site_id}) via ping. If they are not pingable, check for flakiness in WAN connectivity. |
3.2.0 |
GM To GM Synchronization Warning | Medium | global-manager | Active Global Manager to Standby Global Manager cannot synchronize |
Check the connectivity from Global Manager {from_gm_path}({site_id}) to the Global Manager {to_gm_path}({remote_site_id}) via ping. |
3.2.0 |
GM To GM Synchronization Error | High | global-manager | Active Global Manager to Standby Global Manager cannot synchronize for more than 5 minutes |
Check the connectivity from Global Manager {from_gm_path}({site_id}) to the Global Manager {to_gm_path}({remote_site_id}) via ping. |
3.2.0 |
GM To LM Synchronization Warning | Medium | global-manager, manager | Data synchronization between Global Manager (GM) and Local Manager (LM) failed. |
1. Check the network connectivity between remote site and local site via ping. |
3.2.0 |
GM To LM Synchronization Error | High | global-manager, manager | Data synchronization between Global Manager (GM) and Local Manager (LM) failed for an extended period. |
1. Check the network connectivity between remote site and local site via ping. |
3.2.0 |
Queue Occupancy Threshold Exceeded | Medium | manager, global-manager | Queue occupancy size threshold exceeded warning. |
Queue size can exceed threshold due to communication issue with remote site or an overloaded system. Check system performance and /var/log/async-replicator/ar.log to see if there are any errors reported. |
3.2.0 |
GM To LM Latency Warning | Medium | global-manager, manager | Latency between Global Manager and Local Manager is higher than expected for more than 2 minutes. |
1. Check the network connectivity between remote site and local site via ping. |
3.2.0 |
LM Restore While Config Import In Progress | High | global-manager | Local Manager is restored while config import is in progress on Global Manager. |
1. Log in to NSX Global Manager appliance CLI. |
3.2.0 |
Gateway Firewall Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
IP Flow Count High | Medium | edge, public-cloud-gateway | The gateway firewall flow table usage for IP traffic is high. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for IP flows. Check that traffic flows going through the gateway are not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
IP Flow Count Exceeded | Critical | edge, public-cloud-gateway | The gateway firewall flow table for IP traffic has exceeded the set threshold. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for IP flows. Check that traffic flows going through the gateway are not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
UDP Flow Count High | Medium | edge, public-cloud-gateway | The gateway firewall flow table usage for UDP traffic is high. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for UDP flows. Check that traffic flows going through the gateway are not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
UDP Flow Count Exceeded | Critical | edge, public-cloud-gateway | The gateway firewall flow table for UDP traffic has exceeded the set threshold. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for UDP flows. Check that traffic flows going through the gateway are not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
ICMP Flow Count High | Medium | edge, public-cloud-gateway | The gateway firewall flow table usage for ICMP traffic is high. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for ICMP flows. Check that traffic flows going through the gateway are not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
ICMP Flow Count Exceeded | Critical | edge, public-cloud-gateway | The gateway firewall flow table for ICMP traffic has exceeded the set threshold. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for ICMP flows. Check that traffic flows going through the gateway are not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
Tcp Half Open Flow Count High | Medium | edge, public-cloud-gateway | The gateway firewall flow table usage for TCP half-open traffic is high. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for TCP half-open flows. Check that traffic flows going through the gateway are not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
Tcp Half Open Flow Count Exceeded | Critical | edge, public-cloud-gateway | The gateway firewall flow table for TCP half-open traffic has exceeded the set threshold. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for TCP half-open flows. Check that traffic flows going through the gateway are not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
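For reference, a minimal NSX CLI sketch of the flow-table check described in the entries above; <LR_INT_UUID> is a placeholder that must be replaced with the UUID of the affected gateway interface, and the exact field names in the JSON output may vary by release:

```
# Log in to the Edge node as the admin user, then check flow table usage.
get firewall <LR_INT_UUID> interface stats | json
# Compare the UDP, ICMP, and TCP half-open flow counters in the output against
# the configured thresholds before raising the threshold or re-routing traffic.
```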
Service Router Limit Per Edge Approaching | Medium | edge, public-cloud-gateway | The number of T0/T1 Service routers or bridges with Gateway Firewall feature enabled per edge node is approaching the maximum limit. |
Reduce the number of gateways/bridges configured on edge node {transport_node_name}. Map additional gateways/bridges to a new edge in the cluster. |
4.2.1 |
Service Router Limit Per Edge Exceeded | Critical | edge, public-cloud-gateway | The number of T0/T1 Service routers or bridges with Gateway Firewall feature enabled per edge has exceeded the maximum limit. |
Reduce the number of gateways/bridges with Gateway Firewall feature enabled configured on edge node {transport_node_name}. Map additional gateways/bridges to a new edge in the cluster. |
4.2.1 |
Rules Limit Per Edge Approaching | Medium | edge, public-cloud-gateway | Total rules per edge is approaching the maximum limit. |
Reduce the number of gateway firewall/bridge/LB rules configured for edge node {transport_node_name}. Log in to the Edge node {transport_node_name} and invoke the NSX CLI command get firewall <interface_uuid> ruleset <rules/stats> to check the number of rules configured for each interface, then reduce the number of rules configured on those interfaces. |
4.2.1 |
Rules Limit Per Edge Exceeded | Critical | edge, public-cloud-gateway | Total rules limit per edge has exceeded the maximum limit. |
Reduce the number of gateway firewall/bridge/LB rules configured for edge node {transport_node_name}. Log in to the Edge node {transport_node_name} and invoke the NSX CLI command get firewall <interface_uuid> ruleset <rules/stats> to check the number of rules configured for each interface (a CLI sketch follows this entry), then reduce the number of rules configured on those interfaces. |
4.2.1 |
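A short CLI sketch of the rule-count check referenced above, assuming the <rules/stats> token selects either the rule listing or the rule statistics view; <interface_uuid> is a placeholder:

```
# On Edge node {transport_node_name}, as the admin user (NSX CLI):
get firewall <interface_uuid> ruleset rules   # list rules configured on the interface
get firewall <interface_uuid> ruleset stats   # per-rule statistics for the interface
```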
Groups Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Group Size Limit Exceeded | Medium | manager | The total number of translated group elements has exceeded the maximum limit. |
1. Consider adjusting group elements in oversized group {group_id}. |
4.1.0 |
Active Directory Groups Modified | Medium | manager | Active Directory Groups are modified on AD server. |
In the NSX UI, navigate to the Inventory | Groups tab to update the group definition of the applicable group with the new base distinguished name. Make sure the group has valid identity group members. |
4.1.2 |
Group Cyclic Topology Detected | High | manager | Cyclic Topologies are detected. |
1. To remove the cycle, consider using the MP API to break the relations between the groups. |
4.2.0 |
High Availability Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Tier0 Gateway Failover | High | edge, autonomous-edge, public-cloud-gateway | A tier0 gateway has failed over. |
Invoke the NSX CLI command get logical-router <service_router_id> to identify the tier0 service-router VRF ID. Switch to the VRF context by invoking vrf <vrf-id>, then invoke get high-availability status to determine which service is down. |
3.0.0 |
Tier1 Gateway Failover | High | edge, autonomous-edge, public-cloud-gateway | A tier1 gateway has failed over. |
Invoke the NSX CLI command get logical-router <service_router_id> to identify the tier1 service-router VRF ID. Switch to the VRF context by invoking vrf <vrf-id>, then invoke get high-availability status to determine which service is down (see the sketch below). |
3.0.0 |
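The failover investigation above uses the following command sequence; a minimal sketch, assuming the angle-bracket placeholders are replaced with the IDs of the failed-over gateway:

```
# On the affected Edge node (NSX CLI):
get logical-router <service_router_id>   # identify the tier0/tier1 service-router VRF ID
vrf <vrf-id>                             # switch to the service-router VRF context
get high-availability status             # determine which service is down
```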
Tier0 Service Group Failover | High | edge, public-cloud-gateway | Service-group does not have an active instance. |
Invoke the NSX CLI command get logical-router <service_router_id> service_group to check all service-groups configured under a given service-router. Examine the output for the reason a service-group left the active state. |
4.0.1 |
Tier1 Service Group Failover | High | edge, public-cloud-gateway | Service-group does not have an active instance. |
Invoke the NSX CLI command get logical-router <service_router_id> service_group to check all service-groups configured under a given service-router. Examine the output for the reason a service-group left the active state. |
4.0.1 |
Tier0 Service Group Reduced Redundancy | Medium | edge, public-cloud-gateway | A standby instance in a service-group has failed. |
Invoke the NSX CLI command get logical-router <service_router_id> service_group to check all service-groups configured under a given service-router. Examine the output for the failure reason of the previously standby service-group. |
4.0.1 |
Tier1 Service Group Reduced Redundancy | Medium | edge, public-cloud-gateway | A standby instance in a service-group has failed. |
Invoke the NSX CLI command get logical-router <service_router_id> service_group to check all service-groups configured under a given service-router. Examine the output for the failure reason of the previously standby service-group. |
4.0.1 |
Identity Firewall Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Connectivity To LDAP Server Lost | Critical | manager | Connectivity to LDAP server is lost. |
Check |
3.1.0 |
Error In Delta Sync | Critical | manager | Errors occurred while performing delta sync. |
1. Check whether there are any Connectivity To LDAP Server Lost alarms. |
3.1.0 |
IDS IPS Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
IDPS Signature Bundle Download Failure | Medium | manager | Unable to download IDPS signature bundle from NTICS. |
Check if there is internet connectivity from NSX Manager to NTICS. |
4.1.1 |
Maximum Pcaps Reached | Medium | manager | Limit reached for number of PCAPs. |
No manual intervention is required. NSX will automatically remove older PCAP files from the system in alignment with the threshold value. |
4.2.0 |
Pcap Export Purge Failure | Medium | manager | Cleanup of the exported files containing the requested PCAPs failed. |
Delete unused exported tar.gz files using the following API: DELETE https://<mgr_ip>/api/v1/infra/settings/firewall/security/intrusion-services/pcaps/<exported_tar_gz_id> (a curl sketch follows this entry). |
4.2.0 |
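A hedged curl sketch of the cleanup API named above; <mgr_ip>, the admin credentials, and <exported_tar_gz_id> are placeholders, and only archives that are confirmed unused should be deleted:

```
curl -k -u 'admin:<password>' -X DELETE \
  "https://<mgr_ip>/api/v1/infra/settings/firewall/security/intrusion-services/pcaps/<exported_tar_gz_id>"
```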
Infrastructure Communication Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Edge Tunnels Down | Critical | edge, public-cloud-gateway | An Edge node's tunnel status is down. |
Invoke the NSX CLI command get tunnel-ports to get all tunnel ports, then check each tunnel's stats by invoking the NSX CLI command get tunnel-port <UUID> stats and look for drops (see the CLI sketch below). Also check /var/log/syslog for tunnel-related errors. |
3.0.0 |
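A minimal CLI sketch of the tunnel check above; <UUID> is a placeholder for a tunnel port UUID taken from the first command's output:

```
# On the Edge node (NSX CLI):
get tunnel-ports                 # list all tunnel ports and note the UUID of the down tunnel
get tunnel-port <UUID> stats     # per-tunnel statistics; look for drops
# Then inspect /var/log/syslog for tunnel-related errors.
```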
GRE Tunnel Down | Critical | edge, autonomous-edge, public-cloud-gateway | GRE tunnel down. |
A GRE tunnel goes down when GRE keepalives are not received for the configured dead-multiplier number of intervals. Check the connectivity of the GRE endpoints. |
4.1.2 |
Infrastructure Service Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Service Status Unknown On DPU | Critical | dpu | Service's status on DPU is abnormal. |
1. Log in to the DPU. |
4.0.0 |
Service Status Unknown | Critical | esx, kvm, bms, edge, manager, public-cloud-gateway, global-manager | Service's status is abnormal. |
1. Verify whether the service is still running (see the command sketch below). ESXi/ESXio: /etc/init.d/{service_name} status. Edge, NSX Manager, and Linux BMS: systemctl status {service_name}. Windows BMS: enter PowerShell and invoke tasklist | findstr nsx-opsagent |
3.1.0 |
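The per-node-type checks above, collected into one sketch; {service_name} is the service reported by the alarm:

```
# ESXi / ESXio:
/etc/init.d/{service_name} status
# Edge, NSX Manager, Linux BMS:
systemctl status {service_name}
# Windows BMS (from PowerShell):
tasklist | findstr nsx-opsagent
```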
Metrics Delivery Failure | Critical | esx, bms, edge, manager, public-cloud-gateway, global-manager | Failed to deliver metrics to the specified target. |
Refer to KB article https://kb.vmware.com/s/article/95034 |
4.1.0 |
Edge Service Status Down (deprecated) | Critical | edge, autonomous-edge, public-cloud-gateway | Edge service is down for at least one minute. |
On the Edge node, verify the service hasn't exited due to an error by looking for core files in the /var/log/core directory. In addition, invoke the NSX CLI command get services to confirm whether the service is stopped. If so, invoke start service <service-name> to restart the service. |
3.0.0 |
Edge Service Status Changed | Medium | edge, autonomous-edge, public-cloud-gateway | Edge service status has changed. |
On the Edge node, verify the service hasn't exited due to an error by looking for core files in the /var/log/core directory. In addition, invoke the NSX CLI command get services to confirm whether the service is stopped. If so, invoke start service <service-name> to restart the service. |
3.0.0 |
Application Crashed (deprecated) | Critical | global-manager, autonomous-edge, bms, edge, esx, kvm, manager, public-cloud-gateway | Application has crashed and generated a core dump. |
Collect a Support Bundle for NSX node {node_display_or_host_name} using the NSX Manager UI or API. Note that core dumps can be set to be moved or copied into the NSX Tech Support Bundle in order to remove or preserve the local copy on the node. A copy of the Support Bundle with core dump files is essential for the VMware Support team to troubleshoot the issue, so it is recommended to save a recent copy of the Tech Support Bundle, including core dump files, before removing core dump files from the system. Refer to the KB article for more details. |
4.0.0 |
Application Crashed On DPU (deprecated) | Critical | dpu | Application has crashed and generated a core dump on dpu. |
Collect a Support Bundle for DPU {dpu_id} using the NSX Manager UI or API. Note that core dumps can be set to be moved or copied into the NSX Tech Support Bundle in order to remove or preserve the local copy on the node. A copy of the Support Bundle with core dump files is essential for the VMware Support team to troubleshoot the issue, so it is recommended to save a recent copy of the Tech Support Bundle, including core dump files, before removing core dump files from the system. Refer to the KB article for more details. |
4.1.1 |
Compute Manager Lost Connectivity | Critical | manager, global-manager | Compute Manager connection status is down. |
Check the errors present for Compute Manager {cm_name} having id {cm_id} and resolve the errors. |
4.1.2 |
Intelligence Communication Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
TN Flow Exporter Disconnected (deprecated) | High | esx, kvm, bms | A Transport node is disconnected from its NSX Messaging Broker. |
Restart the messaging service if it is not running. Resolve the network connection failure between the Transport node flow exporter and its NSX messaging broker. |
3.0.0 |
Intelligence Health Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
CPU Usage Very High (deprecated) | Critical | manager, intelligence | Intelligence node CPU usage is very high. |
Use the top command to check which processes have the highest CPU usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved. |
3.0.0 |
CPU Usage High (deprecated) | Medium | manager, intelligence | Intelligence node CPU usage is high. |
Use the top command to check which processes have the highest CPU usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved. |
3.0.0 |
Memory Usage Very High (deprecated) | Critical | manager, intelligence | Intelligence node memory usage is very high. |
Use the top command to check which processes have the highest memory usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved. |
3.0.0 |
Memory Usage High (deprecated) | Medium | manager, intelligence | Intelligence node memory usage is high. |
Use the top command to check which processes have the highest memory usage, and then check /var/log/syslog and these processes' local logs to see if there are any outstanding errors to be resolved. |
3.0.0 |
Disk Usage Very High (deprecated) | Critical | manager, intelligence | Intelligence node disk usage is very high. |
Examine disk partition {disk_partition_name} and see if there are any unexpected large files that can be removed. |
3.0.0 |
Disk Usage High (deprecated) | Medium | manager, intelligence | Intelligence node disk usage is high. |
Examine disk partition {disk_partition_name} and see if there are any unexpected large files that can be removed. |
3.0.0 |
Data Disk Partition Usage Very High (deprecated) | Critical | manager, intelligence | Intelligence node data disk partition usage is very high. |
Stop NSX Intelligence data collection until the disk usage is below the threshold. In the NSX UI, navigate to System | Appliances | NSX Intelligence Appliance, then click ACTIONS > Stop Collecting Data. |
3.0.0 |
Data Disk Partition Usage High (deprecated) | Medium | manager, intelligence | Intelligence node data disk partition usage is high. |
Stop NSX intelligence data collection until the disk usage is below the threshold. Examine disk partition /data and see if there are any unexpected large files that can be removed. |
3.0.0 |
Storage Latency High (deprecated) | Medium | manager, intelligence | Intelligence node storage latency is high. |
Transient high storage latency may happen due to a spike of I/O requests. If storage latency remains high for more than 30 minutes, consider deploying the NSX Intelligence appliance on a low-latency disk, or not sharing the same storage device with other VMs. |
3.1.0 |
Node Status Degraded (deprecated) | High | manager, intelligence | Intelligence node status is degraded. |
Invoke the NSX API GET /napp/api/v1/platform/monitor/category/health to check which specific pod is down and the reason behind it. Invoke the following CLI command to restart the degraded service: kubectl rollout restart <statefulset/deployment> <service_name> -n <namespace> |
3.0.0 |
IPAM Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
IP Block Usage Very High | Medium | manager | IP block usage is very high. |
Review IP block usage. Use a new IP block for resource creation or delete unused IP subnets from the IP block. To check which subnets use the IP block, from the NSX UI navigate to Networking | IP Address Pools | IP Address Pools tab, select the IP pools where the IP block is used, and check the Subnets and Allocated IPs columns. If no allocation has been made from the IP pool and it is not going to be used in the future, delete the subnet or IP pool. Use the following APIs to check whether the IP block is used by an IP pool and whether any IP allocation has been made: to get the configured subnets of an IP pool, invoke the NSX API GET /policy/api/v1/infra/ip-pools/<ip-pool>/ip-subnets; to get IP allocations, invoke the NSX API GET /policy/api/v1/infra/ip-pools/<ip-pool>/ip-allocations. Note: delete an IP pool/subnet only if it has no allocated IPs and is not going to be used in the future. |
3.1.2 |
IP Pool Usage Very High | Medium | manager | IP pool usage is very high. |
Review IP pool usage. Release unused IP allocations from the IP pool, or create a new IP pool and use it. From the NSX UI, navigate to Networking | IP Address Pools | IP Address Pools tab, select the IP pool and check the Allocated IPs column, which shows the IPs allocated from the IP pool. Any IPs that are not being used can be released. To release unused IP allocations, invoke the NSX API DELETE /policy/api/v1/infra/ip-pools/<ip-pool>/ip-allocations/<ip-allocation> (a curl sketch of these APIs follows this entry). |
3.1.2 |
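A hedged curl sketch of the IP block and IP pool APIs named in the two entries above; <mgr_ip>, the credentials, <ip-pool>, and <ip-allocation> are placeholders, and the DELETE should only be issued for allocations confirmed to be unused:

```
# Configured subnets of an IP pool:
curl -k -u 'admin:<password>' "https://<mgr_ip>/policy/api/v1/infra/ip-pools/<ip-pool>/ip-subnets"
# IP allocations made from the pool:
curl -k -u 'admin:<password>' "https://<mgr_ip>/policy/api/v1/infra/ip-pools/<ip-pool>/ip-allocations"
# Release an unused allocation:
curl -k -u 'admin:<password>' -X DELETE \
  "https://<mgr_ip>/policy/api/v1/infra/ip-pools/<ip-pool>/ip-allocations/<ip-allocation>"
```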
Licenses Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
SHA Metering Plugin Down (deprecated) | Critical | manager | License SHA metering plugin on ESXi host is down or unhealthy. |
To check the license SHA metering plugin status, invoke the NSX API GET /api/v1/systemhealth/plugins/status/{transport_node_id}. If there is no data in the response, the connection between NSX Manager and the ESXi host is broken or the SHA process on the ESXi host is down. If there is plugin status in the response, locate the plugin status named license_metering_monitor and check its content in detail. To restore the license SHA metering plugin on the ESXi host, log in to the ESXi host and restart the SHA process by invoking the command /etc/init.d/netopa restart (see the sketch below). |
4.1.2 |
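A minimal sketch of the plugin check and restart described above; <mgr_ip>, the credentials, and {transport_node_id} are placeholders:

```
# Check the license SHA metering plugin status via the NSX API:
curl -k -u 'admin:<password>' "https://<mgr_ip>/api/v1/systemhealth/plugins/status/{transport_node_id}"
# If the plugin is unhealthy, restart the SHA process on the ESXi host:
/etc/init.d/netopa restart
```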
License Expired | Critical | global-manager, manager | A license has expired. |
Add a new, non-expired license using the NSX UI by navigating to System | Licenses, then click ADD and specify the key of the new license. Delete the expired license by checking its checkbox, then clicking DELETE. |
3.0.0 |
License Is About To Expire | Medium | global-manager, manager | A license is about to expire. |
The license is about to expire in several days. Plan to add a new, non-expiring license using the NSX UI by navigating to System | Licenses, then click ADD and specify the key of the new license. Delete the expiring license by checking its checkbox, then clicking DELETE. |
3.0.0 |
Load Balancer Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
LB CPU Very High | Medium | edge | Load balancer CPU usage is very high. |
If the load balancer CPU utilization is higher than system usage threshold, the workload is too high for this load balancer. Rescale the load balancer service by changing the load balancer size from small to medium or from medium to large. If the CPU utilization of this load balancer is still high, consider adjusting the Edge appliance form factor size or moving load balancer services to other Edge nodes for the applicable workload. |
3.0.0 |
LB Status Degraded | Medium | manager | Load balancer service is degraded. |
For centralized load balancer: check the load balancer status on the standby Edge node, as the degraded status means the load balancer status on the standby Edge node is not ready. On the standby Edge node, invoke the NSX CLI command get load-balancer <lb-uuid> status. If the LB-State of the load balancer service is not_ready or there is no output, make the Edge node enter maintenance mode, then exit maintenance mode. For distributed load balancer: |
3.1.2 |
DLB Status Down | Critical | manager | Distributed load balancer service is down. |
On the ESXi host node, invoke the NSX CLI command `get load-balancer <lb-uuid> status`. If 'Conflict LSP' is reported, check whether this LSP is attached to another load balancer service and whether this conflict is acceptable. If 'Not Ready LSP' is reported, check the status of this LSP by invoking the NSX CLI command get logical-switch-port status. |
3.1.2 |
LB Status Down | Critical | edge | Centralized load balancer service is down. |
On active Edge node, check load balancer status by invoking the NSX CLI command get load-balancer <lb-uuid> status. If the LB-State of load balancer service is not_ready or there is no output, make the Edge node enter maintenance mode, then exit maintenance mode. |
3.0.0 |
Virtual Server Status Down | Medium | edge | Load balancer virtual service is down. |
Consult the load balancer pool to determine its status and verify its configuration. If incorrectly configured, reconfigure it, remove the load balancer pool from the virtual server, then re-add it to the virtual server. |
3.0.0 |
Pool Status Down | Medium | edge | Load balancer pool is down. |
Consult the load balancer pool to determine which members are down by invoking the NSX CLI command get load-balancer <lb-uuid> pool <pool-uuid> status or the NSX API GET /policy/api/v1/infra/lb-services/<lb-service-id>/lb-pools/<lb-pool-id>/detailed-status (see the sketch below). If DOWN or UNKNOWN is reported, verify the pool member. Check network connectivity from the load balancer to the impacted pool members. Validate the application health of each pool member, and also validate the health of each pool member using the configured monitor. When the health of the member is established, the pool member status is updated to healthy based on the 'Rise Count' configuration in the monitor. Remediate the issue by rebooting the pool member, or by making the Edge node enter maintenance mode, then exit maintenance mode. |
3.0.0 |
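A short sketch of the pool status checks named above; the CLI command runs on the Edge node, and the curl call assumes placeholder host, credentials, and IDs:

```
# On the Edge node (NSX CLI):
get load-balancer <lb-uuid> pool <pool-uuid> status
# Or via the policy API:
curl -k -u 'admin:<password>' \
  "https://<mgr_ip>/policy/api/v1/infra/lb-services/<lb-service-id>/lb-pools/<lb-pool-id>/detailed-status"
```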
LB Edge Capacity In Use High | Medium | edge | Load balancer usage is high. |
If multiple LB instances have been configured in this Edge node, deploy a new Edge node and move some LB instances to that new Edge node. If only a single LB instance (small/medium/etc.) has been configured in an Edge node of the same size (small/medium/etc.), deploy a new Edge node of a bigger size and move the LB instance to that new Edge node. |
3.1.2 |
LB Pool Member Capacity In Use Very High | Critical | edge | Load balancer pool member usage is very high. |
Deploy a new Edge node and move the load balancer service from existing Edge nodes to the newly deployed Edge node. |
3.1.2 |
Load Balancing Configuration Not Realized Due To Lack Of Memory | Medium | edge | Load balancer configuration is not realized due to high memory usage on Edge node. |
Prefer defining small and medium sized load balancers over large sized load balancers. Spread load balancer services out among the available Edge nodes. Reduce the number of virtual servers defined. |
3.2.0 |
Logging Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Log Retention Time Too Low | Info | esx, edge, manager, public-cloud-gateway, global-manager | Log files will be deleted before the set retention period. |
Follow the steps below to back up the log files before they are deleted. 1. Get the detailed report of log files on the node: {report_file_path}. 2. Review the Estimated Maximum Duration and Desired Duration in the detailed report; Estimated Maximum Duration indicates whether the log files will be deleted before the retention period indicated by Desired Duration. If needed, back up old log files. |
4.1.1 |
Remote Logging Not Configured | Medium | global-manager, manager | Remote logging not configured. |
1. To determine whether NSX Manager, Edge, and ESXi nodes have a remote logging server configured, invoke the NSX API GET /api/v1/configs/central-config/logging-servers on an NSX Manager node. To determine whether Global Manager nodes have a remote logging server configured, invoke the NSX CLI command get logging-servers on each Global Manager node (see the sketch below). |
4.1.2 |
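A minimal sketch of the checks described above; <mgr_ip> and the credentials are placeholders:

```
# On an NSX Manager node, list the centrally configured logging servers:
curl -k -u 'admin:<password>' "https://<mgr_ip>/api/v1/configs/central-config/logging-servers"
# On each Global Manager node (NSX CLI):
get logging-servers
```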
Malware Prevention Health Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Service Status Down | High | manager | Service status is down. |
On the {transport_node_type} transport node identified by {transport_node_name}, invoke the NSX CLI command get services to check the status of {mps_service_name}. Inspect /var/log/syslog to find any suspicious errors. Refer to the sections for the {transport_node_type} transport node in the KB. |
4.0.1 |
File Extraction Service Unreachable | High | manager | Service status is degraded. |
On the {transport_node_type} transport node identified by {transport_node_name}, check the status of {mps_service_name}, which is responsible for file_extraction. Inspect /var/log/syslog to find any suspicious errors. Refer to the sections for the {transport_node_type} transport node in the KB. |
4.0.1 |
Database Unreachable | High | manager | Service status is degraded. |
In the NSX UI, navigate to System | NSX Application Platform | Core Services to check which service is degraded. Invoke the NSX API GET /napp/api/v1/platform/monitor/feature/health to check which specific service is down and the reason behind it. Invoke the following CLI command to restart the degraded service: kubectl rollout restart <statefulset/deployment> <service_name> -n <namespace> (see the sketch below). Determine the status of the Malware Prevention database service. |
4.0.1 |
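A hedged sketch of the health check and restart referenced above; <mgr_ip> and the credentials are placeholders, and the restart target must be the statefulset or deployment reported for the affected service:

```
# Identify which service is down and why:
curl -k -u 'admin:<password>' "https://<mgr_ip>/napp/api/v1/platform/monitor/feature/health"
# Restart the degraded service (substitute statefulset or deployment, service name, and namespace):
kubectl rollout restart <statefulset|deployment> <service_name> -n <namespace>
```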
Analyst API Service Unreachable | High | manager | Service status is degraded. |
The Analyst API service external to the datacenter is unreachable. Check connectivity to the internet. This could be temporary and may recover on its own. If it does not recover within minutes, collect the NSX Application Platform support bundle and raise a support ticket with the VMware support team. |
4.0.1 |
NTICS Reputation Service Unreachable | High | manager | Service status is degraded. |
The NTICS service external to the datacenter is unreachable. Check connectivity to the internet. This could be temporary and may recover on its own. |
4.1.0 |
Service Disk Usage Very High | High | manager | Service disk usage is very high. |
On the {transport_node_type} transport node identified by {transport_node_name}, you may reduce the file retention period or, in the case of a host node, reduce the Malware Prevention load by moving some VMs to another host node. Refer to the sections for the {transport_node_type} transport node in the KB. |
4.1.2 |
Service Disk Usage High | Medium | manager | Service disk usage is high. |
On the {transport_node_type} transport node identified by {transport_node_name}, you may reduce the file retention period or, in the case of a host node, reduce the Malware Prevention load by moving some VMs to another host node. Refer to the sections for the {transport_node_type} transport node in the KB. |
4.1.2 |
Service VM CPU Usage High | Medium | manager | Malware Prevention Service VM CPU usage is high. |
Migrate VMs out of the ESXi host {nsx_esx_tn_name} containing SVM {entity_id}, which is reporting high usage, to reduce the load on the SVM. |
4.1.2 |
Service VM CPU Usage Very High | High | manager | Malware Prevention Service VM CPU usage is very high. |
Migrate VMs out of the ESXi host {nsx_esx_tn_name} containing SVM {entity_id}, which is reporting high usage, to reduce the load on the SVM. |
4.1.2 |
Service VM Memory Usage High | Medium | manager | Malware Prevention Service VM memory usage is high. |
Migrate VMs out of the ESXi host {nsx_esx_tn_name} containing SVM {entity_id}, which is reporting high usage, to reduce the load on the SVM. |
4.1.2 |
Service VM Memory Usage Very High | High | manager | Malware Prevention Service VM memory usage is very high. |
Migrate VMs out of the ESXi host {nsx_esx_tn_name} containing SVM {entity_id}, which is reporting high usage, to reduce the load on the SVM. |
4.1.2 |
Manager Health Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Manager CPU Usage Very High | Critical | global-manager, manager | Manager node CPU usage is very high. |
Review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
3.0.0 |
Manager CPU Usage High | Medium | global-manager, manager | Manager node CPU usage is high. |
Review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
3.0.0 |
Manager Memory Usage Very High | Critical | global-manager, manager | Manager node memory usage is very high. |
Review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
3.0.0 |
Manager Memory Usage High | Medium | global-manager, manager | Manager node memory usage is high. |
Review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
3.0.0 |
Manager Disk Usage Very High | Critical | global-manager, manager | Manager node disk usage is very high. |
Examine the partition with high usage and see if there are any unexpected large files that can be removed. |
3.0.0 |
Manager Disk Usage High | Medium | global-manager, manager | Manager node disk usage is high. |
Examine the partition with high usage and see if there are any unexpected large files that can be removed. |
3.0.0 |
Manager Config Disk Usage Very High | Critical | global-manager, manager | Manager node config disk usage is very high. |
Run the following tool and contact GSS if any issues are reported /opt/vmware/tools/support/inspect_checkpoint_issues.py |
3.0.0 |
Manager Config Disk Usage High | Medium | global-manager, manager | Manager node config disk usage is high. |
Run the following tool and contact GSS if any issues are reported /opt/vmware/tools/support/inspect_checkpoint_issues.py |
3.0.0 |
Operations Db Disk Usage Very High | Critical | manager | Manager node nonconfig disk usage is very high. |
Run the following tool and contact GSS if any issues are reported /opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig |
3.0.1 |
Operations Db Disk Usage High | Medium | manager | Manager node nonconfig disk usage is high. |
Run the following tool and contact GSS if any issues are reported /opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig |
3.0.1 |
Operations Repository Disk Usage Very High | Critical | global-manager, manager | Manager node repository disk usage is very high. |
Please contact GSS to avoid any potential issue. |
4.2.0 |
Duplicate IP Address | Medium | manager | Manager node's IP address is in use by another device. |
1. Determine which device is using the Manager's IP address and assign the device a new IP address. Note, reconfiguring the Manager to use a new IP address is not supported. |
3.0.0 |
Storage Error | Critical | global-manager, manager | Manager node disk is read-only. |
Examine the read-only partition to see if reboot resolves the issue or the disk needs to be replaced. |
3.0.2 |
Missing DNS Entry For Manager FQDN | Critical | global-manager, manager | The DNS entry for the Manager FQDN is missing. |
1. Ensure proper DNS servers are configured in the Manager node. |
4.1.0 |
Missing DNS Entry For Vip FQDN | Critical | manager | Missing FQDN entry for the Manager VIP. |
Examine the DNS entry for the VIP addresses to see if they resolve to the same FQDN. |
4.1.0 |
Monitoring Framework Unhealthy | Medium | global-manager, manager | Monitoring Framework on Manager node is unhealthy. |
1. On problematic Manager node, invoke the NSX CLI command restart service node-stats. |
4.2.0 |
Different Manager IP Configuration In Cluster | Critical | global-manager, manager | Not all NSX Managers in the cluster have the same IPv4 and/or IPv6 address families configuration. |
1. Invoke the NSX CLI command 'get cluster status' to view the status. |
4.2.0 |
MTU Check Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
MTU Mismatch Within Transport Zone | High | manager | MTU configuration mismatch between Transport Nodes attached to the same Transport Zone. |
1. Navigate to System | Fabric | Settings | MTU Configuration Check | Inconsistent on the NSX UI to check more mismatch details. |
3.2.0 |
Global Router MTU Too Big | Medium | manager | The global router MTU configuration is bigger than the MTU of overlay Transport Zone. |
1. Navigate to System | Fabric | Settings | MTU Configuration Check | Inconsistent on the NSX UI to check more mismatch details. |
3.2.0 |
NAT Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
SNAT Port Usage On Gateway Is High | Critical | edge, public-cloud-gateway | SNAT port usage on the Gateway is high. |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> connection state using the right interface UUID, and check the various SNAT mappings for the SNAT IP {snat_ip_address}. Check that the traffic flowing through the gateway is not a denial-of-service attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider adding more SNAT IP addresses to distribute the load or route new traffic to another Edge node. |
3.2.0 |
NCP Health Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
NCP Plugin Down | Critical | manager | Detect and report when the NCP plugin is down. |
To find the clusters which are having issues, use the NSX UI and navigate to the Alarms page. The Entity name value for this alarm instance identifies the cluster name. Alternatively, invoke the NSX API GET /api/v1/systemhealth/container-cluster/ncp/status to fetch all cluster statuses and determine the name of any clusters that report DOWN or UNKNOWN (see the sketch below). Then, on the NSX UI Inventory | Container | Clusters page, find the cluster by name and click the Nodes tab, which lists all Kubernetes, OpenShift, SupervisorCluster, and TKGi/TAS cluster members. For Kubernetes, OpenShift, SupervisorCluster cluster: |
3.0.0 |
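A minimal curl sketch of the cluster status check named above; <mgr_ip> and the credentials are placeholders:

```
# Fetch NCP status for all container clusters; investigate any cluster reporting DOWN or UNKNOWN.
curl -k -u 'admin:<password>' "https://<mgr_ip>/api/v1/systemhealth/container-cluster/ncp/status"
```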
Node Agents Health Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Node Agents Down On DPU | High | dpu | Detect and report when the connection between nsx-node-agent and hyperbus is down on DPU. |
For Kubernetes/OpenShift cluster: 1. On the K8s master VM, check whether the connection between the nsx-node-agent container and hyperbus is down, then invoke the kubectl command to check the connection status: kubectl exec -it <nsx-node-agent-Pod-Name> -n <nsx-node-agent-Pod-NameSpace> -c nsx-node-agent bash nsxcli get node-agent-hyperbus status For TKGi/TAS cluster: |
4.0.0 |
Node Agents Down | High | esx, kvm | Detect and report when the connection between nsx-node-agent and hyperbus is down. |
For Kubernetes/OpenShift cluster: 1. On the K8s master VM, check whether the connection between the nsx-node-agent container and hyperbus is down, then invoke the kubectl command to check the connection status (see the sketch after this entry): kubectl exec -it <nsx-node-agent-Pod-Name> -n <nsx-node-agent-Pod-NameSpace> -c nsx-node-agent bash nsxcli get node-agent-hyperbus status For TKGi/TAS cluster: |
3.0.0 |
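A sketch of the hyperbus check described in the two entries above; the pod name and namespace are environment-specific placeholders:

```
# On the Kubernetes/OpenShift master VM, open a shell in the nsx-node-agent container:
kubectl exec -it <nsx-node-agent-Pod-Name> -n <nsx-node-agent-Pod-NameSpace> -c nsx-node-agent -- bash
# Inside the container, check the connection to hyperbus:
nsxcli
get node-agent-hyperbus status
```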
NSX Application Config Agent Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Config Agent Unhealthy | Critical | manager | Agent that sends configuration updates to NAPP/SSP is unhealthy. |
Refer to KB article https://knowledge.broadcom.com/external/article?articleNumber=373834 |
4.2.1 |
NSX Application Platform Communication Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Manager Disconnected | High | manager, intelligence | The NSX Application Platform cluster is disconnected from the NSX management cluster. |
Check whether the manager cluster certificate, manager node certificates, kafka certificate and ingress certificate match on both NSX Manager and the NSX Application Platform cluster. Check expiration dates of the above mentioned certificates to make sure they are valid. Check the network connection between NSX Manager and NSX Application Platform cluster and resolve any network connection failures. |
3.2.0 |
Delay Detected In Messaging Rawflow | Critical | manager, intelligence | Slow data processing detected in messaging topic Raw Flow. |
Add nodes and then scale up the NSX Application Platform cluster. If the bottleneck can be attributed to a specific service, for example, the analytics service, then scale up the specific service when the new nodes are added. If you are unable to scale out the cluster immediately, you can try one of the other options in this KB: https://kb.vmware.com/s/article/91932. |
3.2.0 |
Delay Detected In Messaging Overflow | Critical | manager, intelligence | Slow data processing detected in messaging topic Over Flow. |
Add nodes and then scale up the NSX Application Platform cluster. If the bottleneck can be attributed to a specific service, for example, the analytics service, then scale up the specific service when the new nodes are added. If you are unable to scale out the cluster immediately, you can try one of the other options in this KB: https://kb.vmware.com/s/article/91932. |
3.2.0 |
TN Flow Exp Disconnected | High | esx, kvm, bms | A Transport node is disconnected from its NSX Messaging Broker. |
Restart the messaging service if it is not running. Resolve the network connection failure between the Transport node flow exporter and its NSX messaging broker. |
3.2.0 |
TN Flow Exp Disconnected On DPU | High | dpu | A Transport node is disconnected from its NSX messaging broker. |
Restart the messaging service if it is not running. Resolve the network connection failure between the Transport node flow exporter and its NSX messaging broker. |
4.0.0 |
NSX Application Platform Health Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Cluster CPU Usage Very High | Critical | manager, intelligence | NSX Application Platform cluster CPU usage is very high. |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the System Load field of individual services to see which service is under pressure. See if the load can be reduced. If more computing power is required, click on the Scale Out button to request more resources. |
3.2.0 |
Cluster CPU Usage High | Medium | manager, intelligence | NSX Application Platform cluster CPU usage is high. |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the System Load field of individual services to see which service is under pressure. See if the load can be reduced. If more computing power is required, click on the Scale Out button to request more resources. |
3.2.0 |
Cluster Memory Usage Very High | Critical | manager, intelligence | NSX Application Platform cluster memory usage is very high. |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Memory field of individual services to see which service is under pressure. See if the load can be reduced. If more memory is required, click on the Scale Out button to request more resources. |
3.2.0 |
Cluster Memory Usage High | Medium | manager, intelligence | NSX Application Platform cluster memory usage is high. |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Memory field of individual services to see which service is under pressure. See if the load can be reduced. If more memory is required, click on the Scale Out button to request more resources. |
3.2.0 |
Cluster Disk Usage Very High | Critical | manager, intelligence | NSX Application Platform cluster disk usage is very high. |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Storage field of individual services to see which service is under pressure. See if the load can be reduced. If more disk storage is required, click on the Scale Out button to request more resources. If data storage service is under strain, another way is to click on the Scale Up button to increase disk size. |
3.2.0 |
Cluster Disk Usage High | Medium | manager, intelligence | NSX Application Platform cluster disk usage is high. |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Storage field of individual services to see which service is under pressure. See if the load can be reduced. If more disk storage is required, click on the Scale Out button to request more resources. If data storage service is under strain, another way is to click on the Scale Up button to increase disk size. |
3.2.0 |
Napp Status Degraded | Medium | manager, intelligence | NSX Application Platform cluster overall status is degraded. |
Get more information from alarms of nodes and services. |
3.2.0 |
Napp Status Down | High | manager, intelligence | NSX Application Platform cluster overall status is down. |
Get more information from alarms of nodes and services. |
3.2.0 |
Node CPU Usage Very High | Critical | manager, intelligence | NSX Application Platform node CPU usage is very high. |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the System Load field of individual services to see which service is under pressure. See if load can be reduced. If only a small minority of the nodes have high CPU usage, by default, Kubernetes will reschedule services automatically. If most nodes have high CPU usage and load cannot be reduced, click on the Scale Out button to request more resources. |
3.2.0 |
Node CPU Usage High | Medium | manager, intelligence | NSX Application Platform node CPU usage is high. |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the System Load field of individual services to see which service is under pressure. See if load can be reduced. If only a small minority of the nodes have high CPU usage, by default, Kubernetes will reschedule services automatically. If most nodes have high CPU usage and load cannot be reduced, click on the Scale Out button to request more resources. |
3.2.0 |
Node Memory Usage Very High | Critical | manager, intelligence | NSX Application Platform node memory usage is very high. |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Memory field of individual services to see which service is under pressure. See if load can be reduced. If only a small minority of the nodes have high Memory usage, by default, Kubernetes will reschedule services automatically. If most nodes have high Memory usage and load cannot be reduced, click on the Scale Out button to request more resources. |
3.2.0 |
Node Memory Usage High | Medium | manager, intelligence | NSX Application Platform node memory usage is high. |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Memory field of individual services to see which service is under pressure. See if load can be reduced. If only a small minority of the nodes have high Memory usage, by default, Kubernetes will reschedule services automatically. If most nodes have high Memory usage and load cannot be reduced, click on the Scale Out button to request more resources. |
3.2.0 |
Node Disk Usage Very High | Critical | manager, intelligence | NSX Application Platform node disk usage is very high. |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Storage field of individual services to see which service is under pressure. Clean up unused data or log to free up disk resources and see if the load can be reduced. If more disk storage is required, Scale Out the service under pressure. If data storage service is under strain, another way is to click on the Scale Up button to increase disk size. |
3.2.0 |
Node Disk Usage High | Medium | manager, intelligence | NSX Application Platform node disk usage is high. |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Storage field of individual services to see which service is under pressure. Clean up unused data or log to free up disk resources and see if the load can be reduced. If more disk storage is required, Scale Out the service under pressure. If data storage service is under strain, another way is to click on the Scale Up button to increase disk size. |
3.2.0 |
Node Status Degraded | Medium | manager, intelligence | NSX Application Platform node status is degraded. |
In the NSX UI, navigate to System | NSX Application Platform | Resources to check which node is degraded. Check network, memory and CPU usage of the node. Reboot the node if it is a worker node. |
3.2.0 |
Node Status Down | High | manager, intelligence | NSX Application Platform node status is down. |
In the NSX UI, navigate to System | NSX Application Platform | Resources to check which node is down. Check network, memory and CPU usage of the node. Reboot the node if it is a worker node. |
3.2.0 |
Datastore CPU Usage Very High | Critical | manager, intelligence | Data Storage service CPU usage is very high. |
Scale out all services or the Data Storage service. |
3.2.0 |
Datastore CPU Usage High | Medium | manager, intelligence | Data Storage service CPU usage is high. |
Scale out all services or the Data Storage service. |
3.2.0 |
Messaging CPU Usage Very High | Critical | manager, intelligence | Messaging service CPU usage is very high. |
Scale out all services or the Messaging service. |
3.2.0 |
Messaging CPU Usage High | Medium | manager, intelligence | Messaging service CPU usage is high. |
Scale out all services or the Messaging service. |
3.2.0 |
Configuration Db CPU Usage Very High | Critical | manager, intelligence | Configuration Database service CPU usage is very high. |
Scale out all services. |
3.2.0 |
Configuration Db CPU Usage High | Medium | manager, intelligence | Configuration Database service CPU usage is high. |
Scale out all services. |
3.2.0 |
Metrics CPU Usage Very High | Critical | manager, intelligence | Metrics service CPU usage is very high. |
Scale out all services. |
3.2.0 |
Metrics CPU Usage High | Medium | manager, intelligence | Metrics service CPU usage is high. |
Scale out all services. |
3.2.0 |
Analytics CPU Usage Very High | Critical | manager, intelligence | Analytics service CPU usage is very high. |
Scale out all services or the Analytics service. |
3.2.0 |
Analytics CPU Usage High | Medium | manager, intelligence | Analytics service CPU usage is high. |
Scale out all services or the Analytics service. |
3.2.0 |
Platform CPU Usage Very High | Critical | manager, intelligence | Platform Services service CPU usage is very high. |
Scale out all services. |
3.2.0 |
Platform CPU Usage High | Medium | manager, intelligence | Platform Services service CPU usage is high. |
Scale out all services. |
3.2.0 |
Datastore Memory Usage Very High | Critical | manager, intelligence | Data Storage service memory usage is very high. |
Scale out all services or the Data Storage service. |
3.2.0 |
Datastore Memory Usage High | Medium | manager, intelligence | Data Storage service memory usage is high. |
Scale out all services or the Data Storage service. |
3.2.0 |
Messaging Memory Usage Very High | Critical | manager, intelligence | Messaging service memory usage is very high. |
Scale out all services or the Messaging service. |
3.2.0 |
Messaging Memory Usage High | Medium | manager, intelligence | Messaging service memory usage is high. |
Scale out all services or the Messaging service. |
3.2.0 |
Configuration Db Memory Usage Very High | Critical | manager, intelligence | Configuration Database service memory usage is very high. |
Scale out all services. |
3.2.0 |
Configuration Db Memory Usage High | Medium | manager, intelligence | Configuration Database service memory usage is high. |
Scale out all services. |
3.2.0 |
Metrics Memory Usage Very High | Critical | manager, intelligence | Metrics service memory usage is very high. |
Scale out all services. |
3.2.0 |
Metrics Memory Usage High | Medium | manager, intelligence | Metrics service memory usage is high. |
Scale out all services. |
3.2.0 |
Analytics Memory Usage Very High | Critical | manager, intelligence | Analytics service memory usage is very high. |
Scale out all services or the Analytics service. |
3.2.0 |
Analytics Memory Usage High | Medium | manager, intelligence | Analytics service memory usage is high. |
Scale out all services or the Analytics service. |
3.2.0 |
Platform Memory Usage Very High | Critical | manager, intelligence | Platform Services service memory usage is very high. |
Scale out all services. |
3.2.0 |
Platform Memory Usage High | Medium | manager, intelligence | Platform Services service memory usage is high. |
Scale out all services. |
3.2.0 |
Datastore Disk Usage Very High | Critical | manager, intelligence | Data Storage service disk usage is very high. |
Scale out or scale up the data storage service. |
3.2.0 |
Datastore Disk Usage High | Medium | manager, intelligence | Data Storage service disk usage is high. |
Scale out or scale up the data storage service. |
3.2.0 |
Messaging Disk Usage Very High | Critical | manager, intelligence | Messaging service disk usage is very high. |
Clean up files not needed. Scale out all services or the Messaging service. |
3.2.0 |
Messaging Disk Usage High | Medium | manager, intelligence | Messaging service disk usage is high. |
Clean up files not needed. Scale out all services or the Messaging service. |
3.2.0 |
Configuration Db Disk Usage Very High | Critical | manager, intelligence | Configuration Database service disk usage is very high. |
Clean up files not needed. Scale out all services. |
3.2.0 |
Configuration Db Disk Usage High | Medium | manager, intelligence | Configuration Database service disk usage is high. |
Clean up files not needed. Scale out all services. |
3.2.0 |
Metrics Disk Usage Very High | Critical | manager, intelligence | Metrics service disk usage is very high. |
Follow the steps at https://kb.vmware.com/s/article/93274 |
3.2.0 |
Metrics Disk Usage High | Medium | manager, intelligence | Metrics service disk usage is high. |
Follow the steps at https://kb.vmware.com/s/article/93274 |
3.2.0 |
Analytics Disk Usage Very High | Critical | manager, intelligence | Analytics service disk usage is very high. |
Clean up files not needed. Scale out all services or the Analytics service. |
3.2.0 |
Analytics Disk Usage High | Medium | manager, intelligence | Analytics service disk usage is high. |
Clean up files not needed. Scale out all services or the Analytics service. |
3.2.0 |
Platform Disk Usage Very High | Critical | manager, intelligence | Platform Services service disk usage is very high. |
Clean up files not needed. Scale out all services. |
3.2.0 |
Platform Disk Usage High | Medium | manager, intelligence | Platform Services service disk usage is high. |
Invoke the following command to get disk usage: napp-k exec -it $(napp-k get pods | grep cluster | cut -d ' ' -f 1) -c cluster-api -- sh -c 'kubectl df-pv'. Clean up files that are not needed. Scale out all services. |
3.2.0 |
Service Status Degraded | Medium | manager, intelligence | Service status is degraded. |
In the NSX UI, navigate to System | NSX Application Platform | Core Services to check which service is degraded. Invoke the NSX API GET /napp/api/v1/platform/monitor/feature/health to check which specific service is degraded and the reason behind it. Invoke the following CLI command to restart the degraded service if necessary: kubectl rollout restart <statefulset/deployment> <service_name> -n <namespace>. Degraded services can function correctly, but performance is sub-optimal. |
3.2.0 |
Service Status Down | High | manager, intelligence | Service status is down. |
In the NSX UI, navigate to System | NSX Application Platform | Core Services to check which service is down. Invoke the NSX API GET /napp/api/v1/platform/monitor/feature/health to check which specific service is down and the reason behind it. Follow the steps at https://kb.vmware.com/s/article/96890 |
3.2.0 |
Flow Storage Growth High | Medium | manager, intelligence | Analytics and Data Storage disk usage is growing faster than expected. |
Connect fewer transport nodes or set narrower private IP ranges to reduce the number of unique flows. Filter out broadcast and/or multicast flows. Scale out the Analytics and Data Storage services to get more storage. |
4.1.1 |
Password Management Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Password Expired | Critical | global-manager, manager, edge, public-cloud-gateway | User password has expired. |
The password for user {username} must be changed now to access the system. For example, on Manager nodes, to apply a new password to a user, invoke the following NSX API with a valid password in the request body: PUT /api/v1/node/users/<userid>, where <userid> is the ID of the user. For Edge nodes, invoke the following NSX API with a valid password in the request body: PUT /api/v1/transport-nodes/<edge-node-id>/node/users/<userid> (a curl sketch follows this entry). If the admin user (with <userid> 10000) password has expired, admin must log in to the system via SSH (if enabled) or console in order to change the password. Upon entering the current expired password, admin will be prompted to enter a new password. |
3.0.0 |
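A hedged curl sketch of the password-change APIs named in the entries above; the host, credentials, <userid>, <edge-node-id>, and the new password are placeholders, and the request body shown assumes a password field as described (consult the NSX API guide for the full user payload):

```
# Manager node local user:
curl -k -u 'admin:<current_password>' -X PUT -H 'Content-Type: application/json' \
  -d '{"password": "<new_password>"}' \
  "https://<mgr_ip>/api/v1/node/users/<userid>"
# Edge node local user:
curl -k -u 'admin:<current_password>' -X PUT -H 'Content-Type: application/json' \
  -d '{"password": "<new_password>"}' \
  "https://<mgr_ip>/api/v1/transport-nodes/<edge-node-id>/node/users/<userid>"
```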
Password Is About To Expire | High | global-manager, manager, edge, public-cloud-gateway | User password is about to expire. |
Ensure the password for the user {username} is changed immediately. For example, to apply a new password to a user on Manager nodes, invoke the following NSX API with a valid password in the request body: PUT /api/v1/node/users/<userid> where <userid> is the ID of the user. For Edge nodes, invoke the following NSX API with a valid password in the request body: PUT /api/v1/transport-nodes/<edge-node-id>/node/users/<userid>. |
3.0.0 |
Password Expiration Approaching | Medium | global-manager, manager, edge, public-cloud-gateway | User password is approaching expiration. |
The password for the user {username} needs to be changed soon. For example, to apply a new password to a user on Manager nodes, invoke the following NSX API with a valid password in the request body: PUT /api/v1/node/users/<userid> where <userid> is the ID of the user. For Edge nodes, invoke the following NSX API with a valid password in the request body: PUT /api/v1/transport-nodes/<edge-node-id>/node/users/<userid>. |
3.0.0 |
Physical Server Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Physical Server Install Failed | Critical | manager | Physical Server (BMS) installation failed. |
Navigate to System > Fabric > Hosts > Standalone and resolve the error on the node. |
4.0.0 |
Physical Server Upgrade Failed | Critical | manager | Physical Server (BMS) upgrade failed. |
Navigate to System > Upgrade and resolve the error, then re-trigger the upgrade. |
4.0.0 |
Physical Server Uninstall Failed | Critical | manager | Physical Server (BMS) uninstallation failed. |
Navigate to System > Fabric > Hosts > Standalone and resolve the error on the node. |
4.0.0 |
Policy Constraint Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Creation Count Limit Reached | Medium | manager | Entity count has reached the policy constraint limit. |
Review {constraint_type} usage. Update the constraint to increase the limit or delete unused {constraint_type}. Refer to https://kb.vmware.com/s/article/92351 for how to manage constraints and their limits. |
4.1.0 |
Creation Count Limit Reached For Project | Medium | manager | Entity count has reached the policy constraint limit. |
Review {constraint_type} usage. Update the constraint to increase the limit or delete unused {constraint_type}. |
4.1.1 |
Routing Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
BFD Down On External Interface | High | edge, autonomous-edge, public-cloud-gateway | BFD session on the external interface is down. |
Perform a ping test to verify connectivity. Follow the steps provided below to resolve the alarm. |
3.0.0 |
Static Routing Removed | High | edge, autonomous-edge, public-cloud-gateway | Static route removed because its configured BFD session is down. |
Perform a ping test to verify connectivity. Follow the steps provided below to resolve the alarm. |
3.0.0 |
BGP Down | High | edge, autonomous-edge, public-cloud-gateway | BGP neighborship is down. |
Perform a ping test to verify connectivity. Follow the steps provided below to resolve the alarm. |
3.0.0 |
Proxy ARP Not Configured For Service IP | Critical | manager | Proxy ARP is not configured for Service IP. |
Reconfigure the Service IP {service_ip} for the Service entity {entity_id}, or change the subnet of the lrport {lrport_id} on Router {lr_id}, so that the number of proxy ARP entries generated due to the overlap between the Service IP and the subnet of the lrport is less than the allowed threshold limit of 16384. |
3.0.3 |
Routing Down | High | edge, autonomous-edge, public-cloud-gateway | The edge is unable to communicate with the external domain. |
1. If the failure reason suggests that FRR state is down, alarm will get cleared once the FRR processes are up and routing state is good. |
3.0.0 |
OSPF Neighbor Went Down | High | edge, autonomous-edge, public-cloud-gateway | OSPF neighbor moved from full to another state. |
To resolve the alarm, perform a ping test to verify connectivity and follow the steps below: |
3.1.1 |
Maximum IPv4 Route Limit Approaching | Medium | edge, autonomous-edge, public-cloud-gateway | Maximum IPv4 Routes limit is approaching on Edge node. |
1. Check route redistribution policies and routes received from all external peers. |
4.0.0 |
Maximum IPv6 Route Limit Approaching | Medium | edge, autonomous-edge, public-cloud-gateway | Maximum IPv6 Routes limit is approaching on Edge node. |
1. Check route redistribution policies and routes received from all external peers. |
4.0.0 |
Maximum IPv4 Route Limit Exceeded | Critical | edge, autonomous-edge, public-cloud-gateway | Maximum IPv4 Routes limit has been exceeded on Edge node. |
1. Check route redistribution policies and routes received from all external peers. |
4.0.0 |
Maximum IPv6 Route Limit Exceeded | Critical | edge, autonomous-edge, public-cloud-gateway | Maximum IPv6 Routes limit has been exceeded on Edge node. |
1. Check route redistribution policies and routes received from all external peers. |
4.0.0 |
Maximum IPv4 Prefixes From BGP Neighbor Approaching | Medium | edge, autonomous-edge, public-cloud-gateway | Maximum IPv4 Prefixes received from BGP neighbor is approaching. |
1. Check the BGP routing policies in the external router. |
4.0.0 |
Maximum IPv6 Prefixes From BGP Neighbor Approaching | Medium | edge, autonomous-edge, public-cloud-gateway | Maximum IPv6 Prefixes received from BGP neighbor is approaching. |
1. Check the BGP routing policies in the external router. |
4.0.0 |
Maximum IPv4 Prefixes From BGP Neighbor Exceeded | Critical | edge, autonomous-edge, public-cloud-gateway | Maximum IPv4 Prefixes received from BGP neighbor has been exceeded. |
1. Check the BGP routing policies in the external router. |
4.0.0 |
Maximum IPv6 Prefixes From BGP Neighbor Exceeded | Critical | edge, autonomous-edge, public-cloud-gateway | Maximum IPv6 Prefixes received from BGP neighbor has been exceeded. |
1. Check the BGP routing policies in the external router. |
4.0.0 |
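Several of the routing alarms above (BFD, static routing, BGP, OSPF) recommend a ping test to verify connectivity to the affected peer before deeper troubleshooting. The sketch below shows one way to script that check; the peer addresses are placeholders, and the Linux-style ping flags are an assumption about the machine the check runs from.

```python
import subprocess

def ping(host: str, count: int = 4) -> bool:
    """Return True if the host answers ICMP echo requests.

    Uses the system ping utility with Linux-style flags; the addresses below
    are placeholders for the BFD/BGP/OSPF neighbor IPs being verified.
    """
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for peer in ["192.0.2.1", "192.0.2.2"]:  # placeholder neighbor IPs
        print(peer, "reachable" if ping(peer) else "unreachable")
```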
Security Compliance Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Trigger NDcPP Non-Compliance | Critical | manager | The NSX security status is not NDcPP compliant. |
Run the compliance report from Home > Monitoring & Dashboard > Compliance Report in the NSX UI and resolve all issues marked with the NDcPP compliance name. An API-based sketch of retrieving the report follows this table. |
4.1.0 |
Trigger EAL4 Non-Compliance | Critical | manager | The NSX security status is not EAL4+ compliant. |
Run the compliance report from Home > Monitoring & Dashboard > Compliance Report in the NSX UI and resolve all issues marked with the EAL4+ compliance name. |
4.1.0 |
Poll NDcPP Non-Compliance | Critical | manager | The NSX security configuration is not NDcPP compliant. |
Run the compliance report from Home > Monitoring & Dashboard > Compliance Report in the NSX UI and resolve all issues marked with the NDcPP compliance name. |
4.1.0 |
Poll EAL4 Non-Compliance | Critical | manager | The NSX security configuration is not EAL4+ compliant. |
Run the compliance report from Home > Monitoring & Dashboard > Compliance Report in the NSX UI and resolve all issues marked with the EAL4+ compliance name. |
4.1.0 |
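All four compliance alarms share the same remediation: run the compliance report and clear the items flagged for the named compliance suite. Besides the UI path given in the table, the report can plausibly be retrieved programmatically; the sketch below assumes a compliance status endpoint at GET /api/v1/compliance/status and placeholder connection details, so verify the path and response schema against your NSX version before using it.

```python
import requests

NSX_MANAGER = "https://nsx-mgr.example.com"  # placeholder
AUTH = ("admin", "VMware1!")                 # hypothetical credentials

def fetch_compliance_report():
    """Fetch the compliance status report so NDcPP / EAL4+ findings can be reviewed.

    Assumes the report is exposed at /api/v1/compliance/status; confirm the
    path for your NSX version.
    """
    resp = requests.get(
        f"{NSX_MANAGER}/api/v1/compliance/status",
        auth=AUTH,
        verify=False,  # lab-only; verify certificates in production
    )
    resp.raise_for_status()
    # Print the raw report; entries flagged for NDcPP or EAL4+ are the ones
    # that must be resolved to clear these alarms.
    print(resp.json())

if __name__ == "__main__":
    fetch_compliance_report()
```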
Service Insertion Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Service Deployment Succeeded | Info | manager | Service deployment succeeded. |
No action needed. |
4.0.0 |
Service Deployment Failed | Critical | manager | Service deployment failed. |
Delete the service deployment using the NSX UI or API. Perform any corrective action from the KB and retry the service deployment. |
4.0.0 |
Service Undeployment Succeeded | Info | manager | Service deployment deletion succeeded. |
No action needed. |
4.0.0 |
Service Undeployment Failed | Critical | manager | Service deployment deletion failed. |
Delete the service deployment using the NSX UI or API. Perform any corrective action from the KB and retry deleting the service deployment. Resolve the alarm manually after verifying that all VMs and related objects are deleted. |
4.0.0 |
SVM Health Status Up | Info | manager | SVM is working in service. |
No action needed. |
4.0.0 |
SVM Health Status Down | High | manager | SVM is not working in service. |
Delete the service deployment using the NSX UI or API. Perform any corrective action from the KB and retry the service deployment if necessary. |
4.0.0 |
Service Insertion Infra Status Down | Critical | esx | Service insertion infrastructure status down and not enabled on host. |
Perform any corrective action from the KB and check if the status is up. Resolve the alarm manually after checking the status. |
4.0.0 |
SVM Liveness State Down (deprecated) | Critical | manager | SVM liveness state down. |
Perform any corrective action from the KB and check if the state is up. |
4.0.0 |
Service Chain Path Down (deprecated) | Critical | manager | Service chain path down. |
Perform any corrective action from the KB and check if the status is up. |
4.0.0 |
New Host Added (deprecated) | Info | esx | New Host added in cluster. |
Check the VM deployment status and wait until it powers on. |
4.0.0 |
TEP Health Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Faulty TEP | Medium | esx | TEP is unhealthy. |
1. Check whether the TEP has a valid IP address or whether there are other underlay connectivity issues. |
4.1.0 |
TEP HA Activated | Info | esx | TEP HA activated. |
Enable AutoRecover or invoke Manual Recover for TEP:{vtep_name} on VDS:{dvs_name} at Transport node:{transport_node_id}. |
4.1.0 |
TEP Autorecover Success | Info | esx | AutoRecover is successful. |
None. |
4.1.0 |
TEP Autorecover Failure | Medium | esx | AutoRecover failed. |
Check whether the TEP has a valid IP address or whether there are other underlay connectivity issues. |
4.1.0 |
Faulty TEP On DPU | Medium | dpu | TEP is unhealthy on DPU. |
1. Check whether the TEP has a valid IP address or whether there are other underlay connectivity issues. |
4.1.0 |
TEP HA Activated On DPU | Info | dpu | TEP HA activated on DPU. |
Enable AutoRecover or invoke Manual Recover for TEP:{vtep_name} on VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id}. |
4.1.0 |
TEP Autorecover Success On DPU | Info | dpu | AutoRecover is successful on DPU. |
None. |
4.1.0 |
TEP Autorecover Failure On DPU | Medium | dpu | AutoRecover failed on DPU. |
Check whether the TEP has a valid IP address or whether there are other underlay connectivity issues. |
4.1.0 |
Topology Discovery Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
TN MTU Mismatch With TOR | Medium | manager | MTU configuration mismatch detected between uplink and TOR. |
Check the MTU configuration of the NDVS uplink and the TOR interface for consistency. |
4.2.0 |
Transport Node Health Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Monitoring Framework Unhealthy | Medium | bms, edge, esx, kvm, public-cloud-gateway | Monitoring Framework on Transport node is unhealthy. |
1. On the problematic Edge or Public Cloud Gateway node, invoke the NSX CLI command: restart service node-stats. |
4.1.1 |
Agent Memory Usage High | Medium | esx | The memory usage of an NSX agent exceeds its memory threshold. |
Refer to KB article https://kb.vmware.com/s/article/95867. |
4.2.0 |
Transport Node Uplink Down On DPU | Medium | dpu | Uplink on DPU is going down. |
Check the physical NICs' status of uplinks on DPU {dpu_id}. |
4.0.0 |
LAG Member Down On DPU | Medium | dpu | LACP on DPU reporting member down. |
Check the connection status of LAG members on DPU {dpu_id}. |
4.0.0 |
NVDS Uplink Down (deprecated) | Medium | esx, kvm, bms | Uplink is going down. |
Check the physical NICs' status of uplinks on hosts. |
3.0.0 |
Transport Node Uplink Down | Medium | esx, bms | Uplink is going down. |
Check the physical NICs' status of uplinks on hosts. |
3.2.0 |
LAG Member Down | Medium | esx, bms | LACP reporting member down. |
Check the connection status of LAG members on hosts. |
3.0.0 |
Enhanced Dp Flow Table Usage High | Medium | esx | Enhanced Datapath flow table utilization is high. |
For Enhanced Datapath (EDP) host switch mode, if performance degradation is observed, consider increasing the flow table size by invoking the following command: nsxdp-cli ens flow-table size set -s $NUM_ENTRIES. $NUM_ENTRIES must be a power of 2, and the host must be rebooted for the change to take effect. Increasing the number of flow entries does not always improve performance: if short-lived connections keep arriving, the flow table may remain full regardless of its size, so a larger flow table would not help in that case. EDP detects this condition and automatically enables and disables flow tables to handle it. Note that increasing the number of flow entries may increase the memory footprint. |
4.2.0 |
Enhanced Dp Flow Table Usage Very High | Critical | esx | Enhanced Datapath flow table utilization is very high. |
For Enhanced Datapath (EDP) host switch mode, if performance degradation is observed, consider increasing the flow table size by invoking the following command: nsxdp-cli ens flow-table size set -s $NUM_ENTRIES. $NUM_ENTRIES must be a power of 2, and the host must be rebooted for the change to take effect. Increasing the number of flow entries does not always improve performance: if short-lived connections keep arriving, the flow table may remain full regardless of its size, so a larger flow table would not help in that case. EDP detects this condition and automatically enables and disables flow tables to handle it. Note that increasing the number of flow entries may increase the memory footprint. A small helper for choosing a valid power-of-two size is sketched after this table. |
4.2.0 |
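The flow table resizing guidance above requires $NUM_ENTRIES to be a power of 2. The helper below is a small sketch with a hypothetical target size: it rounds a desired entry count up to a valid value and prints the corresponding nsxdp-cli command from the table; the command itself still has to be run on the ESXi host, followed by a reboot.

```python
def next_power_of_two(n: int) -> int:
    """Round n up to the nearest power of two, as required for $NUM_ENTRIES."""
    power = 1
    while power < n:
        power <<= 1
    return power

# Hypothetical target flow-table size; replace with the value you actually need.
desired_entries = 50_000
num_entries = next_power_of_two(desired_entries)  # -> 65536

# Command taken from the recommended action above; run it on the host, then reboot.
print(f"nsxdp-cli ens flow-table size set -s {num_entries}")
```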
Transport Node Pending Action Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Maintenance Mode | Critical | manager | The host has pending user actions i.e. PENDING_HOST_MAINTENANCE_MODE. |
Move host {host_name} - {host_uuid} to maintenance mode from vCenter (a scripted example of this vCenter action is sketched after this table). This will start realization of the high performance configuration on the host. If it is processed successfully, transportNodeState will no longer include PENDING_HOST_MAINTENANCE_MODE in the pending_user_actions field. If realization of the high performance configuration on the host fails, transportNodeState will be updated with the failure message and the host will no longer be in pending maintenance mode. |
4.1.2 |
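The documented remediation is to move the host into maintenance mode from vCenter. Where that step is automated, a pyVmomi sketch along the following lines could be used; the vCenter connection details and host name are placeholders, and this only illustrates the vCenter-side action, not an NSX API.

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def enter_maintenance_mode(vcenter: str, user: str, pwd: str, host_name: str) -> None:
    """Put an ESXi host into maintenance mode, mirroring the recommended vCenter action."""
    ctx = ssl._create_unverified_context()  # lab-only; verify certificates in production
    si = SmartConnect(host=vcenter, user=user, pwd=pwd, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.HostSystem], True
        )
        host = next(h for h in view.view if h.name == host_name)
        task = host.EnterMaintenanceMode_Task(timeout=0)  # timeout 0 = no timeout
        print(f"Maintenance mode task started for {host_name}: {task.info.key}")
    finally:
        Disconnect(si)

if __name__ == "__main__":
    # Placeholder values; substitute the vCenter and the host named in the alarm.
    enter_maintenance_mode("vcenter.example.com", "administrator@vsphere.local",
                           "password", "esxi-01.example.com")
```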
VMC App Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
Transit Connect Failure | Medium | manager | Transit Connect fails to be fully realized. |
If this alarm is not auto-resolved within 10 minutes, retry the most recent transit connect related request(s). For example, if a TGW attachment API request triggered this alarm, retry that TGW attachment API request. If the alarm does not resolve even after the retry, then try the following steps: |
4.1.0 |
Traffic Group Prefix List Deletion Failure | High | manager | Failure in deletion of Traffic Group Prefix list. |
If this alarm is not auto-resolved within 10 minutes, then execute the following steps: |
4.1.2 |
Prefix List Capacity Issue Failure | High | manager | Prefix list capacity issue failure. |
1. Run the API GET 'cloud-service/api/v1/infra/sddc/provider-resource-info?resource_type=managed_prefix_list' to get a list of all prefix lists from the SDDC (a scripted version of this check is sketched after this table). a) Check the 'state' and 'status_message' of each prefix list in the API output. b) If the state of any prefix list is 'modify-failed' and the status message contains the string 'The following VPC Route Table resources do not have sufficient capacity', the prefix list has run into a resizing failure. The 'status_message' specifies which route table IDs have to be increased in size. c) If the API output contains an 'issues' field, it specifies which routes are missing from the managed prefix list. Calculate the number of missing routes from the 'issues' field. d) File an AWS ticket to increase the size of the route table identified in (b) by at least the minimum size identified in (c). e) After AWS increases the route table limit, wait at least 1 hour and then invoke the API GET 'cloud-service/api/v1/infra/sddc/provider-resource-info?resource_type=managed_prefix_list' again. Make sure the 'state' of no prefix list is 'modify-failed'. |
4.1.2 |
Prefix List Resource Share Customer Failure | Medium | manager | Failure with prefix list resource share. |
If this alarm is not auto-resolved within 10 minutes, then execute the following steps: |
4.1.2 |
Resource Share Sanity Check Failure | High | manager | Failure in resource share check. |
If this alarm is not auto-resolved within 10 minutes, then execute the following steps: |
4.1.2 |
TGW Get Attachment Failure | High | manager | Failure in fetching TGW attachment. |
1. Log in to NSX Manager. There are three manager nodes, and the leader node must be identified. After logging in to one node, run su admin and then get cluster status verbose, and find the TGW leader node from the output. |
4.1.2 |
TGW Attachment Mismatch Failure | High | manager | Failure due to mismatch of TGW attachments. |
1. Log in to NSX Manager. There are three manager nodes, and the leader node must be identified. After logging in to one node, run su admin and then get cluster status verbose, and find the TGW leader node from the output. |
4.1.2 |
TGW Route Table Max Failure | High | manager | TGW Route table max entries failure. |
1. Log in to the NSX Manager UI and open the 'Networking & Security' tab. Then navigate to the 'Transit Connect' tab. |
4.1.2 |
TGW Route Update Failure | High | manager | TGW Route update fails due to wrong TGW attachment size. |
1. Run the API GET '/cloud-service/api/v1/infra/associated-groups'. The number of associated groups returned should be 1 or 0. a) If the above API returns more than 1 associated group, do the following: log in to the VMC UI and navigate to the 'SDDC Groups' tab; find the correct SDDC group that contains this SDDC by checking each group's members; remove stale associations by running the API DELETE '/cloud-service/api/v1/infra/associated-groups/<association-id>'. |
4.1.2 |
TGW Tagging Mismatch Failure | High | manager | Failure due to mismatch of TGW tags. |
If this alarm is not auto-resolved within 10 minutes, then execute the following steps: |
4.1.2 |
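The 'Prefix List Capacity Issue Failure' row above already names the API to call, the 'state' value to look for, and the 'status_message' that identifies the undersized route tables. The sketch below automates that first inspection step; the connection details are placeholders, and the assumption that the response wraps prefix lists in a 'results' list should be verified against the actual API output.

```python
import requests

NSX_MANAGER = "https://nsx-mgr.example.com"  # placeholder
AUTH = ("admin", "VMware1!")                 # hypothetical credentials

def find_failed_prefix_lists():
    """Flag SDDC managed prefix lists whose resize failed (state 'modify-failed')."""
    resp = requests.get(
        f"{NSX_MANAGER}/cloud-service/api/v1/infra/sddc/provider-resource-info",
        params={"resource_type": "managed_prefix_list"},
        auth=AUTH,
        verify=False,  # lab-only; verify certificates in production
    )
    resp.raise_for_status()
    # Assumed response shape: a 'results' list of prefix-list objects carrying the
    # 'state' and 'status_message' fields described in the recommended action.
    for plist in resp.json().get("results", []):
        if plist.get("state") == "modify-failed":
            print(plist.get("id"), plist.get("status_message"))

if __name__ == "__main__":
    find_failed_prefix_lists()
```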
VPN Events
Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
---|---|---|---|---|---|
IPsec Service Down | Medium | edge, autonomous-edge, public-cloud-gateway | IPsec service is down. |
1. Deactivate and reactivate the IPsec service from the NSX Manager UI. |
3.2.0 |
IPsec Policy Based Session Down | Medium | edge, autonomous-edge, public-cloud-gateway | Policy based IPsec VPN session is down. |
Check IPsec VPN session configuration and resolve errors based on the session down reason. |
3.0.0 |
IPsec Route Based Session Down | Medium | edge, autonomous-edge, public-cloud-gateway | Route based IPsec VPN session is down. |
Check IPsec VPN session configuration and resolve errors based on the session down reason. |
3.0.0 |
IPsec Policy Based Tunnel Down | Medium | edge, autonomous-edge, public-cloud-gateway | Policy based IPsec VPN tunnels are down. |
Check IPsec VPN session configuration and resolve errors based on the tunnel down reason. |
3.0.0 |
IPsec Route Based Tunnel Down | Medium | edge, autonomous-edge, public-cloud-gateway | Route based IPsec VPN tunnel is down. |
Check IPsec VPN session configuration and resolve errors based on the tunnel down reason. |
3.0.0 |
L2Vpn Session Down | Medium | edge, autonomous-edge, public-cloud-gateway | L2VPN session is down. |
Check L2VPN session status for session down reason and resolve errors based on the reason. |
3.0.0 |