NSX Event Catalog
The following tables describe events that trigger alarms in VMware NSX®, including alarm messages and recommended actions to resolve them. Any event with a severity greater than LOW triggers an alarm. Alarm information is displayed in several locations within the NSX Manager interface. Alarm and event information is also included with other notifications in the Notifications drop-down menu in the title bar. To view alarms, navigate to the Home page and click the Alarms tab. For more information on alarms and events, see "Working with Events and Alarms" in the NSX Administration Guide.
Alarm Management Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Alarm Service Overloaded | Critical | global-manager, manager | The alarm service is overloaded. <br><br>When event detected: "Due to heavy volume of alarms reported, the alarm service is temporarily overloaded. The NSX UI and GET /api/v1/alarms NSX API have stopped reporting new alarms; however, syslog entries and SNMP traps (if enabled) are still being emitted reporting the underlying event details. When the underlying issues causing the heavy volume of alarms are addressed, the alarm service will start reporting new alarms again. "<br><br>When event resolved: "The heavy volume of alarms has subsided and new alarms are being reported again. " |
Review all active alarms using the Alarms page in the NSX UI or using the GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED NSX API. For each active alarm investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service will start reporting new alarms again. |
3.0.0 |
| Heavy Volume Of Alarms | Critical | global-manager, manager | Heavy volume of a specific alarm type detected. <br><br>When event detected: "Due to heavy volume of {event_id} alarms, the alarm service has temporarily stopped reporting alarms of this type. The NSX UI and GET /api/v1/alarms NSX API are not reporting new instances of these alarms; however, syslog entries and SNMP traps (if enabled) are still being emitted reporting the underlying event details. When the underlying issues causing the heavy volume of {event_id} alarms are addressed, the alarm service will start reporting new {event_id} alarms when new issues are detected again. "<br><br>When event resolved: "The heavy volume of {event_id} alarms has subsided and new alarms of this type are being reported again. " |
Review all active alarms of type {event_id} using the Alarms page in the NSX UI or using the NSX API GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED. For each active alarm investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service will start reporting new {event_id} alarms again. |
3.0.0 |
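Both recommended actions above start from the same step: pull the list of active alarms. The following is a minimal sketch of that call with curl, assuming basic authentication is enabled; the Manager address, credentials, and jq output fields are placeholders to adapt to your environment.

```bash
# Minimal sketch: list active alarms via the API referenced in the table above.
# nsx-mgr.example.com and the admin credentials are placeholders; NSX_PASSWORD is
# assumed to be exported in the shell. The jq field names are assumptions - inspect
# the raw JSON if they differ in your release.
NSX_MANAGER="nsx-mgr.example.com"

curl -k -s -u "admin:${NSX_PASSWORD}" \
  "https://${NSX_MANAGER}/api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED" \
  | jq -r '.results[] | [.severity, .status, .event_type] | @tsv'
```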
Audit Log Health Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Audit Log File Update Error | Critical | global-manager, manager, edge, public-cloud-gateway, esx, kvm, bms | At least one of the monitored log files cannot be written to. <br><br>When event detected: "At least one of the monitored log files has read-only permissions or has incorrect user/group ownership on Manager, Global Manager, Edge, Public Cloud Gateway, KVM or Linux Physical Server nodes. Or log folder is missing in Windows Physical Server nodes. Or rsyslog.log is missing on Manager, Global Manager, Edge or Public Cloud Gateway nodes. "<br><br>When event resolved: "All monitored log files have the correct file permissions and ownership on Manager, Global Manager, Edge, Public Cloud Gateway, KVM or Linux Physical Server nodes. And log folder exists on Windows Physical Server nodes. And rsyslog.log exists on Manager, Global Manager, Edge or Public Cloud Gateway nodes. " |
1. On Manager and Global Manager nodes, Edge and Public Cloud Gateway nodes, and Ubuntu KVM Host nodes, ensure the permissions for the /var/log directory are 775 and the ownership is root:syslog. On RHEL KVM and BMS Host nodes, ensure the permissions for the /var/log directory are 755 and the ownership is root:root. |
3.1.0 |
| Remote Logging Server Error | Critical | global-manager, manager, edge, public-cloud-gateway | Log messages undeliverable due to incorrect remote logging server configuration. <br><br>When event detected: "Log messages to logging server {hostname_or_ip_address_with_port} ({entity_id}) cannot be delivered possibly due to an unresolvable FQDN, an invalid TLS certificate or missing NSX appliance iptables rule. "<br><br>When event resolved: "Configuration for logging server {hostname_or_ip_address_with_port} ({entity_id}) appear correct. " |
1. Ensure that {hostname_or_ip_address_with_port} is the correct hostname or IP address and port. |
3.1.0 |
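For the Audit Log File Update Error alarm, the check and fix reduce to the permissions and ownership called out above. A minimal sketch, assuming a root shell on the affected appliance or Ubuntu KVM host:

```bash
# Verify and, if needed, restore /var/log permissions and ownership on Manager,
# Global Manager, Edge, Public Cloud Gateway, or Ubuntu KVM nodes (run as root).
ls -ld /var/log              # expect: drwxrwxr-x ... root syslog

chmod 775 /var/log
chown root:syslog /var/log

# On RHEL KVM and BMS host nodes the expected values differ:
#   chmod 755 /var/log
#   chown root:root /var/log
```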
Capacity Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Minimum Capacity Threshold | Medium | manager | A minimum capacity threshold has been breached. <br><br>When event detected: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} which is above the minimum capacity threshold of {min_capacity_threshold}%. "<br><br>When event resolved: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} and is at or below the minimum capacity threshold of {min_capacity_threshold}%. " |
Navigate to the capacity page in the NSX UI and review current usage versus threshold limits. If the current usage is expected, consider increasing the minimum threshold values. If the current usage is unexpected, review the network policies configured to decrease usage at or below the minimum threshold. |
3.1.0 |
| Maximum Capacity Threshold | High | manager | A maximum capacity threshold has been breached. <br><br>When event detected: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} which is above the maximum capacity threshold of {max_capacity_threshold}%. "<br><br>When event resolved: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} and is at or below the maximum capacity threshold of {max_capacity_threshold}%. " |
Navigate to the capacity page in the NSX UI and review current usage versus threshold limits. If the current usage is expected, consider increasing the maximum threshold values. If the current usage is unexpected, review the network policies configured to decrease usage at or below the maximum threshold. |
3.1.0 |
| Maximum Capacity | Critical | manager | A maximum capacity has been breached. <br><br>When event detected: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} which is above the maximum supported count of {max_supported_capacity_count}. "<br><br>When event resolved: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} and is at or below the maximum supported count of {max_supported_capacity_count}. " |
Ensure that the number of NSX objects created is within the limits supported by NSX. If there are any unused objects, delete them from the system using the respective NSX UI or API. Consider increasing the form factor of all Manager nodes and/or Edge nodes. Note that the form factor of each node type should be the same. If it is not the same, the capacity limits for the lowest form factor deployed are used. |
3.1.0 |
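Besides the capacity page in the NSX UI, current usage can be compared against the supported maximums over the API. The sketch below is hedged: the /api/v1/capacity/usage endpoint and the jq field names are assumptions to verify against the API guide for your NSX release, and the Manager address and credentials are placeholders.

```bash
# Hedged sketch: inspect object counts versus supported maximums over the API.
NSX_MANAGER="nsx-mgr.example.com"

curl -k -s -u "admin:${NSX_PASSWORD}" \
  "https://${NSX_MANAGER}/api/v1/capacity/usage" \
  | jq -r '.capacity_usage[] | [.display_name, .current_usage_count, .max_supported_count] | @tsv'
```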
Certificates Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Certificate Expired | Critical | global-manager, manager | A certificate has expired. <br><br>When event detected: "Certificate {entity_id} has expired. "<br><br>When event resolved: "The expired certificate {entity_id} has been removed or is no longer expired. " |
Ensure services that are currently using the certificate are updated to use a new, non-expired certificate. Once the expired certificate is no longer in use, it should be deleted by invoking the DELETE {api_collection_path}{entity_id} NSX API. If the expired certificate is used by the NAPP Platform, the connection between NSX and the NAPP Platform is broken. Check the NAPP Platform troubleshooting document for using a self-signed NAPP CA certificate to recover the connection. |
3.0.0 |
| Certificate Is About To Expire | High | global-manager, manager | A certificate is about to expire. <br><br>When event detected: "Certificate {entity_id} is about to expire. "<br><br>When event resolved: "The expiring certificate {entity_id} has been removed or is no longer about to expire. " |
Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. Once the expiring certificate is no longer in use, it should be deleted by invoking the DELETE {api_collection_path}{entity_id} NSX API. |
3.0.0 |
| Certificate Expiration Approaching | Medium | global-manager, manager | A certificate is approaching expiration. <br><br>When event detected: "Certificate {entity_id} is approaching expiration. "<br><br>When event resolved: "The expiring certificate {entity_id} has been removed or is no longer approaching expiration. " |
Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. Once the expiring certificate is no longer in use, it should be deleted by invoking the DELETE {api_collection_path}{entity_id} NSX API. |
3.0.0 |
| CA Bundle Update Recommended | High | global-manager, manager | The update for a trusted CA bundle is recommended. <br><br>When event detected: "The trusted CA bundle {entity_id} was updated more than {ca_bundle_age_threshold} days ago. Update for the trusted CA bundle is recommended. "<br><br>When event resolved: "The trusted CA bundle {entity_id} has been removed, updated, or is no longer in use. " |
Ensure services that are currently using the trusted CA bundle are updated to use a recently updated trusted CA bundle. Unless it is a system-provided bundle, the bundle can be updated using the PUT /policy/api/v1/infra/cabundles/{entity_id} NSX API. Once the outdated bundle is no longer in use, it should be deleted (if not system-provided) by invoking the DELETE /policy/api/v1/infra/cabundles/{entity_id} NSX API. |
3.2.0 |
| CA Bundle Update Suggested | Medium | global-manager, manager | The update for a trusted CA bundle is suggested. <br><br>When event detected: "The trusted CA bundle {entity_id} was updated more than {ca_bundle_age_threshold} days ago. Update for the trusted CA bundle is suggested. "<br><br>When event resolved: "The trusted CA bundle {entity_id} has been removed, updated, or is no longer in use. " |
Ensure services that are currently using the trusted CA bundle are updated to use a recently updated trusted CA bundle. Unless it is a system-provided bundle, the bundle can be updated using the PUT /policy/api/v1/infra/cabundles/{entity_id} NSX API. Once the outdated bundle is no longer in use, it should be deleted (if not system-provided) by invoking the DELETE /policy/api/v1/infra/cabundles/{entity_id} NSX API. |
3.2.0 |
| Transport Node Certificate Expired | Critical | bms, edge, esx, kvm, public-cloud-gateway | A certificate has expired. <br><br>When event detected: "Certificate has expired for Transport node {entity_id}. "<br><br>When event resolved: "The expired certificate for Transport node {entity_id} has been replaced or is no longer expired. " |
Replace the Transport node {entity_id} certificate with a non-expired certificate. The expired certificate should be replaced by invoking the POST /api/v1/trust-management/certificates/action/replace-host-certificate/{entity_id} NSX API. While the expired certificate remains in use, the connection between the Transport node and the Manager node is broken. |
4.1.0 |
| Transport Node Certificate Is About To Expire | High | bms, edge, esx, kvm, public-cloud-gateway | A certificate is about to expire. <br><br>When event detected: "Certificate for Transport node {entity_id} is about to expire. "<br><br>When event resolved: "The expiring certificate for Transport node {entity_id} has been removed or is no longer about to expire. " |
Replace the Transport node {entity_id} certificate with a non-expired certificate. The expired certificate should be replaced by invoking the POST /api/v1/trust-management/certificates/action/replace-host-certificate/{entity_id} NSX API. If the certificate is not replaced, when the certificate expires the connection between the Transport node and the Manager node will be broken. |
4.1.0 |
| Transport Node Certificate Expiration Approaching | Medium | bms, edge, esx, kvm, public-cloud-gateway | A certificate is approaching expiration. <br><br>When event detected: "Certificate for Transport node {entity_id} is approaching expiration. "<br><br>When event resolved: "The expiring certificate for Transport node {entity_id} has been removed or is no longer approaching expiration. " |
Replace the Transport node {entity_id} certificate with a non-expired certificate. The expired certificate should be replaced by invoking the POST /api/v1/trust-management/certificates/action/replace-host-certificate/{entity_id} NSX API. If the certificate is not replaced, when the certificate expires the connection between the Transport node and the Manager node will be broken. |
4.1.0 |
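The certificate alarms above share a pattern: point the consuming service at a new certificate, then retire or replace the old one using the APIs named in the table. A minimal sketch, with the Manager address, credentials, and certificate id as placeholders:

```bash
# <cert-id> is the {entity_id} from the alarm; nsx-mgr.example.com and the admin
# credentials are placeholders.
NSX_MANAGER="nsx-mgr.example.com"
CERT_ID="<cert-id>"

# Delete an expired or expiring certificate once no service uses it any more.
# {api_collection_path} is typically /api/v1/trust-management/certificates/ for
# Manager certificates; confirm it in the alarm details.
curl -k -s -u "admin:${NSX_PASSWORD}" -X DELETE \
  "https://${NSX_MANAGER}/api/v1/trust-management/certificates/${CERT_ID}"

# Replace a Transport node certificate (Transport Node Certificate alarms); check
# the API guide for any required request body.
curl -k -s -u "admin:${NSX_PASSWORD}" -X POST \
  "https://${NSX_MANAGER}/api/v1/trust-management/certificates/action/replace-host-certificate/${CERT_ID}"
```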
Clustering Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Cluster Degraded | Medium | global-manager, manager | Group member is down. <br><br>When event detected: "Group member {manager_node_id} of service {group_type} is down. "<br><br>When event resolved: "Group member {manager_node_id} of {group_type} is up. " |
1. Invoke the NSX CLI command 'get cluster status' to view the status of group members of the cluster. |
3.2.0 |
| Cluster Unavailable | High | global-manager, manager | All the group members of the service are down. <br><br>When event detected: "All group members {manager_node_ids} of service {group_type} are down. "<br><br>When event resolved: "All group members {manager_node_ids} of service {group_type} are up. " |
1. Ensure the service for {group_type} is running on the node. Invoke the GET /api/v1/node/services/<service_name>/status NSX API or the get service <service_name> NSX CLI command to determine if the service is running. If it is not running, invoke the POST /api/v1/node/services/<service_name>?action=restart NSX API or the restart service <service_name> NSX CLI command to restart the service. |
3.2.0 |
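The clustering alarms above combine an NSX CLI check with the node-service API. A minimal sketch, assuming basic authentication and placeholder values for the Manager address and service name:

```bash
# On a Manager node, in the NSX CLI (not bash), first check group membership:
#   get cluster status

NSX_MANAGER="nsx-mgr.example.com"
SERVICE="<service_name>"      # the service backing the {group_type} that is down

# Check whether the service is running...
curl -k -s -u "admin:${NSX_PASSWORD}" \
  "https://${NSX_MANAGER}/api/v1/node/services/${SERVICE}/status"

# ...and restart it if it is not.
curl -k -s -u "admin:${NSX_PASSWORD}" -X POST \
  "https://${NSX_MANAGER}/api/v1/node/services/${SERVICE}?action=restart"
```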
CNI Health Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Hyperbus Manager Connection Down On DPU | Medium | dpu | Hyperbus on DPU cannot communicate with the Manager node. <br><br>When event detected: "Hyperbus on DPU {dpu_id} cannot communicate with the Manager node. "<br><br>When event resolved: "Hyperbus on DPU {dpu_id} can communicate with the Manager node. " |
The hyperbus vmkernel interface (vmk50) on DPU {dpu_id} may be missing. Refer to Knowledge Base article https://kb.vmware.com/s/article/67432 . |
4.0.0 |
| Hyperbus Manager Connection Down | Medium | esx, kvm | Hyperbus cannot communicate with the Manager node. <br><br>When event detected: "Hyperbus cannot communicate with the Manager node. "<br><br>When event resolved: "Hyperbus can communicate with the Manager node. " |
The hyperbus vmkernel interface (vmk50) may be missing. Refer to Knowledge Base article https://kb.vmware.com/s/article/67432 . |
3.0.0 |
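Before following the Knowledge Base article, it can help to confirm whether the hyperbus vmkernel interface is actually missing. A minimal sketch, run from an ESXi shell on the affected host:

```bash
# Check whether the hyperbus vmkernel interface (vmk50) exists on the ESXi host.
esxcfg-vmknic -l | grep vmk50 || echo "vmk50 is missing"

# Alternative: list all vmkernel interfaces.
esxcli network ip interface list
```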
Communication Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Limited Reachability On DPU | Medium | dpu | The given collector can not be reached via vmknic(s) on given DVS on DPU. <br><br>When event detected: "The {vertical_name} collector {collector_ip} can not be reached via vmknic(s)(stack {stack_alias}) on DVS {dvs_alias} on DPU {dpu_id}, but is reachable via vmknic(s)(stack {stack_alias}) on other DVS(es). "<br><br>When event resolved: "The {vertical_name} collector {collector_ip} can be reached via vmknic(s) (stack {stack_alias}) on DVS {dvs_alias} on DPU {dpu_id}, or the {vertical_name} collector {collector_ip} is unreachable completely. " |
If the warning is on, it does not mean the collector is unreachable. The exported flows generated by the vertical based on DVS {dvs_alias} can still reach the collector {collector_ip} via vmknic(s) on DVS(es) other than DVS {dvs_alias}. If this is unacceptable, create vmknic(s) with stack {stack_alias} on DVS {dvs_alias} and configure them with appropriate IPv4/IPv6 addresses, then check whether the {vertical_name} collector {collector_ip} can be reached via the newly created vmknic(s) on DPU {dpu_id} by invoking vmkping {collector_ip} -S {stack_alias} -I vmkX (with SSH to the DPU enabled via ESXi). |
4.0.1 |
| Unreachable Collector On DPU | Critical | dpu | The given collector can not be reached via existing vmknic(s) on DPU at all. <br><br>When event detected: "The {vertical_name} collector {collector_ip} can not be reached via existing vmknic(s)(stack {stack_alias}) on any DVS on DPU {dpu_id}. "<br><br>When event resolved: "The {vertical_name} collector {collector_ip} can be reached with existing vmknic(s)(stack {stack_alias}) now on DPU {dpu_id}. " |
To make the collector reachable for the given vertical on the DVS, ensure that vmknic(s) with the expected stack {stack_alias} exist and are configured with appropriate IPv4/IPv6 addresses, and that the network connection to the {vertical_name} collector {collector_ip} is healthy. Perform these checks on DPU {dpu_id} and apply any configuration required to meet these conditions. If vmkping {collector_ip} -S {stack_alias} (with SSH to the DPU enabled via ESXi) succeeds, the problem is resolved. |
4.0.1 |
| Manager Cluster Latency High | Medium | manager | The average network latency between Manager nodes is high. <br><br>When event detected: "The average network latency between Manager nodes {manager_node_id} ({appliance_address}) and {remote_manager_node_id} ({remote_appliance_address}) is more than 10ms for the last 5 minutes. "<br><br>When event resolved: "The average network latency between Manager nodes {manager_node_id} ({appliance_address}) and {remote_manager_node_id} ({remote_appliance_address}) is within 10ms. " |
Ensure there are no firewall rules blocking ping traffic between the Manager nodes. If there are other high bandwidth servers and applications sharing the local network, consider moving these to a different network. |
3.1.0 |
| Control Channel To Manager Node Down Too Long | Critical | bms, edge, esx, kvm, public-cloud-gateway | Transport node's control plane connection to the Manager node is down for long. <br><br>When event detected: "The Transport node {entity_id} control plane connection to Manager node {appliance_address} is down for at least {timeout_in_minutes} minutes from the Transport node's point of view. "<br><br>When event resolved: "The Transport node {entity_id} restores the control plane connection to Manager node {appliance_address}. " |
1. Check the connectivity from Transport node {entity_id} to Manager node {appliance_address} interface via ping. If they are not pingable, check for flakiness in network connectivity. |
3.1.0 |
| Control Channel To Manager Node Down | Medium | bms, edge, esx, kvm, public-cloud-gateway | Transport node's control plane connection to the Manager node is down. <br><br>When event detected: "The Transport node {entity_id} control plane connection to Manager node {appliance_address} is down for at least {timeout_in_minutes} minutes from the Transport node's point of view. "<br><br>When event resolved: "The Transport node {entity_id} restores the control plane connection to Manager node {appliance_address}. " |
1. Check the connectivity from Transport node {entity_id} to Manager node {appliance_address} interface via ping. If they are not pingable, check for flakiness in network connectivity. |
3.1.0 |
| Control Channel To Transport Node Down | Medium | manager | Controller service to Transport node's connection is down. <br><br>When event detected: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) to Transport node {transport_node_name} ({entity_id}) down for at least three minutes from Controller service's point of view. "<br><br>When event resolved: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) restores connection to Transport node {entity_id}. " |
1. Check the connectivity from the Controller service {central_control_plane_id} to the Transport node {entity_id} interface via ping and traceroute. This can be done from the NSX Manager node admin CLI. The ping test should show no drops and consistent latency values. VMware recommends latency values of 150ms or less. |
3.1.0 |
| Control Channel To Transport Node Down Long | Critical | manager | Controller service to Transport node's connection is down for too long. <br><br>When event detected: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) to Transport node {transport_node_name} ({entity_id}) down for at least 15 minutes from Controller service's point of view. "<br><br>When event resolved: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) restores connection to Transport node {entity_id}. " |
1. Check the connectivity from the Controller service {central_control_plane_id} to the Transport node {entity_id} interface via ping and traceroute. This can be done from the NSX Manager node admin CLI. The ping test should show no drops and consistent latency values. VMware recommends latency values of 150ms or less. |
3.1.0 |
| Control Channel To Antrea Cluster Down | Medium | manager | Controller service to Antrea cluster's connection is down. <br><br>When event detected: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) to Antrea cluster {antrea_cluster_node_name} ({entity_id}) down for at least three minutes from Controller service's point of view. "<br><br>When event resolved: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) restores connection to Antrea cluster {entity_id}. " |
1. Check if the Antrea Kubernetes cluster has been deleted. 2. Check for control plane network connectivity issues. 3. Make sure the Antrea adapter has not crashed or been deleted. 4. Check if there are issues with the client certificate used for the Antrea to NSX integration. 5. Check the adapter version and make sure it is compatible with the Antrea version. Refer to https://docs.vmware.com/en/VMware-NSX/4.1/administration/GUID-A4335451-CB6E-485B-8EF7-343CB1B5CF69.html for additional troubleshooting details. |
4.1.1 |
| Control Channel To Antrea Cluster Down Long | Critical | manager | Controller service to Antrea cluster's connection is down for too long. <br><br>When event detected: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) to Antrea cluster {antrea_cluster_node_name} ({entity_id}) down for at least 15 minutes from Controller service's point of view. "<br><br>When event resolved: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) restores connection to Antrea cluster {entity_id}. " |
1. Check if the Antrea Kubernetes cluster has been deleted. 2. Check for control plane network connectivity issues. 3. Make sure the Antrea adapter has not crashed or been deleted. 4. Check if there are issues with the client certificate used for the Antrea to NSX integration. 5. Check the adapter version and make sure it is compatible with the Antrea version. Refer to https://docs.vmware.com/en/VMware-NSX/4.1/administration/GUID-A4335451-CB6E-485B-8EF7-343CB1B5CF69.html for additional troubleshooting details. |
4.1.1 |
| Manager Control Channel Down | Critical | manager | Manager to controller channel is down. <br><br>When event detected: "The communication between the management function and the control function has failed on Manager node {manager_node_name} ({appliance_address}). "<br><br>When event resolved: "The communication between the management function and the control function has been restored on Manager node {manager_node_name} ({appliance_address}). " |
1. On Manager node {manager_node_name} ({appliance_address}), invoke the following NSX CLI command: get service applianceproxy to check the status of the service periodically for 60 minutes. |
3.0.2 |
| Management Channel To Transport Node Down | Medium | manager | Management channel to Transport node is down. <br><br>When event detected: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is down for 5 minutes. "<br><br>When event resolved: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is up. " |
Ensure there is network connectivity between the Manager nodes and Transport node {transport_node_name} ({transport_node_address}) and no firewalls are blocking traffic between the nodes. On Windows Transport nodes, ensure the nsx-proxy service is running on the Transport node by invoking the command C:\NSX\nsx-proxy\nsx-proxy.ps1 status in the Windows PowerShell. If it is not running, restart it by invoking the command C:\NSX\nsx-proxy\nsx-proxy.ps1 restart. On all other Transport nodes, ensure the nsx-proxy service is running on the Transport node by invoking the command /etc/init.d/nsx-proxy status. If it is not running, restart it by invoking the command /etc/init.d/nsx-proxy restart. |
3.0.2 |
| Management Channel To Transport Node Down Long | Critical | manager | Management channel to Transport node is down for too long. <br><br>When event detected: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is down for 15 minutes. "<br><br>When event resolved: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is up. " |
Ensure there is network connectivity between the Manager nodes and Transport node {transport_node_name} ({transport_node_address}) and no firewalls are blocking traffic between the nodes. On Windows Transport nodes, ensure the nsx-proxy service is running on the Transport node by invoking the command C:\NSX\nsx-proxy\nsx-proxy.ps1 status in the Windows PowerShell. If it is not running, restart it by invoking the command C:\NSX\nsx-proxy\nsx-proxy.ps1 restart. On all other Transport nodes, ensure the nsx-proxy service is running on the Transport node by invoking the command /etc/init.d/nsx-proxy status. If it is not running, restart it by invoking the command /etc/init.d/nsx-proxy restart. |
3.0.2 |
| Manager FQDN Lookup Failure | Critical | global-manager, bms, edge, esx, kvm, manager, public-cloud-gateway | DNS lookup failed for Manager node's FQDN. <br><br>When event detected: "DNS lookup failed for Manager node {entity_id} with FQDN {appliance_fqdn} and the publish_fqdns flag was set. "<br><br>When event resolved: "FQDN lookup succeeded for Manager node {entity_id} with FQDN {appliance_fqdn} or the publish_fqdns flag was cleared. " |
1. Assign correct FQDNs to all Manager nodes and verify the DNS configuration is correct for successful lookup of all Manager nodes' FQDNs. |
3.1.0 |
| Manager FQDN Reverse Lookup Failure | Critical | global-manager, manager | Reverse DNS lookup failed for Manager node's IP address. <br><br>When event detected: "Reverse DNS lookup failed for Manager node {entity_id} with IP address {appliance_address} and the publish_fqdns flag was set. "<br><br>When event resolved: "Reverse DNS lookup succeeded for Manager node {entity_id} with IP address {appliance_address} or the publish_fqdns flag was cleared. " |
1. Assign correct FQDNs to all Manager nodes and verify the DNS configuration is correct for successful reverse lookup of the Manager node's IP address. |
3.1.0 |
| Management Channel To Manager Node Down | Medium | bms, edge, esx, kvm, public-cloud-gateway | Management channel to Manager node is down. <br><br>When event detected: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is down for 5 minutes. "<br><br>When event resolved: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is up. " |
Ensure there is network connectivity between the Transport node {transport_node_id} and leader Manager node. Also ensure no firewalls are blocking traffic between the nodes. Ensure the messaging manager service is running on Manager nodes by invoking the command /etc/init.d/messaging-manager status. If the messaging manager is not running, restart it by invoking the command /etc/init.d/messaging-manager restart. |
3.2.0 |
| Management Channel To Manager Node Down Long | Critical | bms, edge, esx, kvm, public-cloud-gateway | Management channel to Manager node is down for too long. <br><br>When event detected: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is down for 15 minutes. "<br><br>When event resolved: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is up. " |
Ensure there is network connectivity between the Transport node {transport_node_id} and leader Manager nodes. Also ensure no firewalls are blocking traffic between the nodes. Ensure the messaging manager service is running on Manager nodes by invoking the command /etc/init.d/messaging-manager status. If the messaging manager is not running, restart it by invoking the command /etc/init.d/messaging-manager restart. |
3.2.0 |
| Network Latency High | Medium | manager | Management to Transport node network latency is high. <br><br>When event detected: "The average network latency between manager nodes and host {transport_node_name} ({transport_node_address}) is more than 150 ms for 5 minutes. "<br><br>When event resolved: "The average network latency between manager nodes and host {transport_node_name} ({transport_node_address}) is normal. " |
1. Wait for 5 minutes to see if the alarm automatically gets resolved. |
4.0.0 |
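Most of the channel-down alarms above reduce to the same checks: reachability between the Transport node and the Manager, plus the health of the nsx-proxy service (Transport node side) and the messaging-manager service (Manager side). A minimal sketch of those commands, with 10.0.0.10 standing in for the Manager appliance address:

```bash
# From the Transport node: verify reachability and latency to the Manager.
ping -c 5 10.0.0.10

# On Linux-based Transport nodes: check (and only if stopped, restart) nsx-proxy.
/etc/init.d/nsx-proxy status
/etc/init.d/nsx-proxy restart

# On Manager nodes: check (and only if stopped, restart) messaging-manager.
/etc/init.d/messaging-manager status
/etc/init.d/messaging-manager restart

# On Windows Transport nodes, the equivalent PowerShell commands are:
#   C:\NSX\nsx-proxy\nsx-proxy.ps1 status
#   C:\NSX\nsx-proxy\nsx-proxy.ps1 restart
```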
DHCP Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Pool Lease Allocation Failed | High | edge, autonomous-edge, public-cloud-gateway | IP addresses in an IP Pool have been exhausted. <br><br>When event detected: "The addresses in IP Pool {entity_id} of DHCP Server {dhcp_server_id} have been exhausted. The last DHCP request has failed and future requests will fail. "<br><br>When event resolved: "IP Pool {entity_id} of DHCP Server {dhcp_server_id} is no longer exhausted. A lease is successfully allocated to the last DHCP request. " |
Review the DHCP pool configuration in the NSX UI or on the Edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool. Also review the current active leases on the Edge node by invoking the NSX CLI command get dhcp lease. Compare the leases to the number of active VMs. Consider reducing the lease time on the DHCP server configuration if the number of VMs is low compared to the number of active leases. Also consider expanding the pool range for the DHCP server by visiting the Networking | Segments | Segment page in the NSX UI. |
3.0.0 |
| Pool Overloaded | Medium | edge, autonomous-edge, public-cloud-gateway | An IP Pool is overloaded. <br><br>When event detected: "The DHCP Server {dhcp_server_id} IP Pool {entity_id} usage is approaching exhaustion with {dhcp_pool_usage}% IPs allocated. "<br><br>When event resolved: "The DHCP Server {dhcp_server_id} IP Pool {entity_id} has fallen below the high usage threshold. " |
Review the DHCP pool configuration in the NSX UI or on the Edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool. Also review the current active leases on the Edge node by invoking the NSX CLI command get dhcp lease. Compare the leases to the number of active VMs. Consider reducing the lease time on the DHCP server configuration if the number of VMs is low compared to the number of active leases. Also consider expanding the pool range for the DHCP server by visiting the Networking | Segments | Segment page in the NSX UI. |
3.0.0 |
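Both DHCP alarms rely on the same pair of NSX CLI commands on the Edge node hosting the DHCP server. A minimal sketch; edge-01.example.com is a placeholder, and the commands are passed over SSH because the admin user's login shell on an Edge node is the NSX CLI (interactive login works just as well):

```bash
ssh admin@edge-01.example.com "get dhcp ip-pool"   # pool ranges and usage
ssh admin@edge-01.example.com "get dhcp lease"     # currently active leases
```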
Distributed Firewall Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| DFW CPU Usage Very High | Critical | esx | DFW CPU usage is very high. <br><br>When event detected: "The DFW CPU usage on Transport node {transport_node_name} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The DFW CPU usage on Transport node {transport_node_name} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. " |
Consider re-balancing the VM workloads on this host to other hosts. Review the security design for optimization. For example, use the apply-to configuration if the rules are not applicable to the entire datacenter. |
3.0.0 |
| DFW CPU Usage Very High On DPU | Critical | dpu | DFW CPU usage is very high on dpu. <br><br>When event detected: "The DFW CPU usage on Transport node {transport_node_name} has reached {system_resource_usage}% on DPU {dpu_id} which is at or above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The DFW CPU usage on Transport node {transport_node_name} has reached {system_resource_usage}% on DPU {dpu_id} which is below the very high threshold value of {system_usage_threshold}%. " |
Consider re-balancing the VM workloads on this host to other hosts. Review the security design for optimization. For example, use the apply-to configuration if the rules are not applicable to the entire datacenter. |
4.0.0 |
| DFW Memory Usage Very High | Critical | esx | DFW Memory usage is very high. <br><br>When event detected: "The DFW Memory usage {heap_type} on Transport node {transport_node_name} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The DFW Memory usage {heap_type} on Transport node {transport_node_name} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. " |
View the current DFW memory usage by invoking the NSX CLI command get firewall thresholds on the host. Consider re-balancing the workloads on this host to other hosts. |
3.0.0 |
| DFW Memory Usage Very High On DPU | Critical | dpu | DFW Memory usage is very high on DPU. <br><br>When event detected: "The DFW Memory usage {heap_type} on Transport node {transport_node_name} has reached {system_resource_usage}% on DPU {dpu_id} which is at or above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The DFW Memory usage {heap_type} on Transport node {transport_node_name} has reached {system_resource_usage}% on DPU {dpu_id} which is below the very high threshold value of {system_usage_threshold}%. " |
View the current DFW memory usage by invoking the NSX CLI command get firewall thresholds on the DPU. Consider re-balancing the workloads on this host to other hosts. |
4.0.0 |
| DFW VMotion Failure | Critical | esx | DFW vMotion failed, port disconnected. <br><br>When event detected: "The DFW vMotion for DFW filter {entity_id} on destination host {transport_node_name} has failed and the port for the entity has been disconnected. "<br><br>When event resolved: "The DFW configuration for DFW filter {entity_id} on the destination host {transport_node_name} has succeeded and error caused by DFW vMotion failure cleared. " |
Check VMs on the host in NSX Manager and manually repush the DFW configuration through the NSX Manager UI. The DFW policy to be repushed can be traced by the DFW filter {entity_id}. Also consider finding the VM to which the DFW filter is attached and restarting it. |
3.2.0 |
| DFW Flood Limit Warning | Medium | esx | DFW flood limit has reached warning level. <br><br>When event detected: "The DFW flood limit for DFW filter {entity_id} on host {transport_node_name} has reached warning level of 80% of the configured limit for protocol {protocol_name}. "<br><br>When event resolved: "The warning flood limit condition for DFW filter {entity_id} on host {transport_node_name} for protocol {protocol_name} is cleared. " |
Check VMs on the host in NSX Manager and check the configured flood warning level of DFW filter {entity_id} for protocol {protocol_name}. |
4.1.0 |
| DFW Flood Limit Critical | Critical | esx | DFW flood limit has reached critical level. <br><br>When event detected: "The DFW flood limit for DFW filter {entity_id} on host {transport_node_name} has reached critical level of 98% of the configured limit for protocol {protocol_name}. "<br><br>When event resolved: "The critical flood limit condition for DFW filter {entity_id} on host {transport_node_name} for protocol {protocol_name} is cleared. " |
Check VMs on the host in NSX Manager and check the configured flood critical level of DFW filter {entity_id} for protocol {protocol_name}. |
4.1.0 |
| DFW Session Count High | Critical | esx | DFW session count is high. <br><br>When event detected: "The DFW session count is high on Transport node {entity_id}, it has reached {system_resource_usage}% which is at or above the threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The DFW session count on Transport node {entity_id} has reached {system_resource_usage}% which is below the threshold value of {system_usage_threshold}%. " |
Review the network traffic load level of the workloads on the host. Consider re-balancing the workloads on this host to other hosts. |
3.2.0 |
| DFW Rules Limit Per Vnic Exceeded | Critical | esx | DFW rules limit per vNIC is about to exceed the maximum limit. <br><br>When event detected: "The DFW rules limit for VIF {entity_id} on destination host {transport_node_name} is about to exceed the maximum limit. "<br><br>When event resolved: "The DFW rules limit for VIF {entity_id} on the destination host {transport_node_name} dropped below maximum limit. " |
Log in to the ESX host {transport_node_name} and invoke the NSX CLI command get firewall <VIF_UUID> ruleset rules to get the rule statistics for rules configured on the corresponding VIF. Reduce the number of rules configured for VIF {entity_id}. |
4.0.0 |
| DFW Rules Limit Per Vnic Approaching | Medium | esx | DFW rules limit per vNIC is approaching the maximum limit. <br><br>When event detected: "The DFW rules limit for VIF {entity_id} on destination host {transport_node_name} is approaching the maximum limit. "<br><br>When event resolved: "The DFW rules limit for VIF {entity_id} on the destination host {transport_node_name} dropped below the threshold. " |
Log in to the ESX host {transport_node_name} and invoke the NSX CLI command get firewall <VIF_UUID> ruleset rules to get the rule statistics for rules configured on the corresponding VIF. Reduce the number of rules configured for VIF {entity_id}. |
4.0.0 |
| DFW Rules Limit Per Host Exceeded | Critical | esx | DFW rules limit per host is about to exceed the maximum limit. <br><br>When event detected: "The DFW rules limit for host {transport_node_name} is about to exceed the maximum limit. "<br><br>When event resolved: "The DFW rules limit for host {transport_node_name} dropped below maximum limit. " |
Log in to the ESX host {transport_node_name} and invoke the NSX CLI command get firewall rule-stats total to get the rule statistics for rules configured on the ESX host {transport_node_name}. Reduce the number of rules configured for host {transport_node_name}. Check the number of rules configured for various VIFs by using the NSX CLI command get firewall <VIF_UUID> ruleset rules. Reduce the number of rules configured for various VIFs. |
4.0.0 |
| DFW Rules Limit Per Host Approaching | Medium | esx | DFW rules limit per host is approaching the maximum limit. <br><br>When event detected: "The DFW rules limit for host {transport_node_name} is approaching the maximum limit. "<br><br>When event resolved: "The DFW rules limit for host {transport_node_name} dropped below the threshold. " |
Log in to the ESX host {transport_node_name} and invoke the NSX CLI command get firewall rule-stats total to get the rule statistics for rules configured on the ESX host {transport_node_name}. Reduce the number of rules configured for host {transport_node_name}. Check the number of rules configured for various VIFs by using the NSX CLI command get firewall <VIF_UUID> ruleset rules. Reduce the number of rules configured for various VIFs. |
4.0.0 |
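The DFW alarms above share a small set of host-level NSX CLI commands. A minimal sketch, run from the ESXi shell on the affected host; the one-shot nsxcli -c invocation and the VIF UUID (taken from the alarm's {entity_id}) are assumptions to adapt:

```bash
VIF_UUID="<VIF_UUID>"

nsxcli -c "get firewall thresholds"                 # current DFW CPU/memory usage
nsxcli -c "get firewall rule-stats total"           # total rules configured on the host
nsxcli -c "get firewall ${VIF_UUID} ruleset rules"  # rules configured on one VIF
```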
Distributed IDS IPS Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Max Events Reached | Medium | manager | Max number of intrusion events reached. <br><br>When event detected: "The number of intrusion events in the system is {ids_events_count} which is higher than the maximum allowed value {max_ids_events_allowed}. "<br><br>When event resolved: "The number of intrusion events in the system is {ids_events_count} which is below the maximum allowed value {max_ids_events_allowed}. " |
There is no manual intervention required. A purge job will kick in automatically every 3 minutes and delete 10% of the older records to bring the total intrusion events count in the system to below the threshold value of 1.5 million events. |
3.1.0 |
| NSX IDPS Engine Memory Usage High | Medium | esx | NSX-IDPS engine memory usage reaches 75% or above. <br><br>When event detected: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is at or above the high threshold value of 75%. "<br><br>When event resolved: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is below the high threshold value of 75%. " |
Consider re-balancing the VM workloads on this host to other hosts. |
3.1.0 |
| NSX IDPS Engine Memory Usage High On DPU | Medium | dpu | NSX-IDPS engine memory usage reaches 75% or above on DPU. <br><br>When event detected: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is at or above the high threshold value of 75% on DPU {dpu_id}. "<br><br>When event resolved: "NSX-IDPS engine memory usage has reached on DPU {dpu_id}, {system_resource_usage}%, which is below the high threshold value of 75%. " |
Consider re-balancing the VM workloads on this host to other hosts. |
4.0.0 |
| NSX IDPS Engine Memory Usage Medium High | High | esx | NSX-IDPS Engine memory usage reaches 85% or above. <br><br>When event detected: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is at or above the medium high threshold value of 85%. "<br><br>When event resolved: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is below the medium high threshold value of 85%. " |
Consider re-balancing the VM workloads on this host to other hosts. |
3.1.0 |
| NSX IDPS Engine Memory Usage Medium High On DPU | High | dpu | NSX-IDPS Engine memory usage reaches 85% or above on DPU. <br><br>When event detected: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is at or above the medium high threshold value of 85% on DPU {dpu_id}. "<br><br>When event resolved: "NSX-IDPS engine memory usage has reached on DPU {dpu_id}, {system_resource_usage}%, which is below the medium high threshold value of 85%. " |
Consider re-balancing the VM workloads on this host to other hosts. |
4.0.0 |
| NSX IDPS Engine Memory Usage Very High | Critical | esx | NSX-IDPS engine memory usage reaches 95% or above. <br><br>When event detected: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is at or above the very high threshold value of 95%. "<br><br>When event resolved: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is below the very high threshold value of 95%. " |
Consider re-balancing the VM workloads on this host to other hosts. |
3.1.0 |
| NSX IDPS Engine Memory Usage Very High On DPU | Critical | dpu | NSX-IDPS engine memory usage reaches 95% or above on DPU. <br><br>When event detected: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is at or above the very high threshold value of 95% on DPU {dpu_id}. "<br><br>When event resolved: "NSX-IDPS engine memory usage has reached on DPU {dpu_id}, {system_resource_usage}%, which is below the very high threshold value of 95%. " |
Consider re-balancing the VM workloads on this host to other hosts. |
4.0.0 |
| NSX IDPS Engine CPU Usage High (deprecated) | Medium | esx | NSX-IDPS engine CPU usage reaches 75% or above. <br><br>When event detected: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is at or above the high threshold value of 75%. "<br><br>When event resolved: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is below the high threshold value of 75%. " |
Consider re-balancing the VM workloads on this host to other hosts. |
3.1.0 |
| NSX IDPS Engine CPU Usage Medium High (deprecated) | High | esx | NSX-IDPS engine CPU usage reaches 85% or above. <br><br>When event detected: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is at or above the medium high threshold value of 85%. "<br><br>When event resolved: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is below the medium high threshold value of 85%. " |
Consider re-balancing the VM workloads on this host to other hosts. |
3.1.0 |
| NSX IDPS Engine CPU Usage Very High (deprecated) | Critical | esx | NSX-IDPS engine CPU usage reaches 95% or above. <br><br>When event detected: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is at or above the very high threshold value of 95%. "<br><br>When event resolved: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is below the very high threshold value of 95%. " |
Consider re-balancing the VM workloads on this host to other hosts. |
3.1.0 |
| NSX IDPS Engine Down | Critical | esx | NSX IDPS is activated via NSX Policy and IDPS rules are configured, but NSX-IDPS engine is down. <br><br>When event detected: "NSX IDPS is activated via NSX policy and IDPS rules are configured, but NSX-IDPS engine is down. "<br><br>When event resolved: "NSX IDPS is in one of the cases below. 1. NSX IDPS is deactivated via NSX policy. 2. NSX IDPS engine is activated, NSX-IDPS engine and vdpi are up, and NSX IDPS has been activated and IDPS rules are configured via NSX Policy. " |
1. Check /var/log/nsx-syslog.log to see if there are errors reported. |
3.1.0 |
| NSX IDPS Engine Down On DPU | Critical | dpu | NSX IDPS is activated via NSX Policy and IDPS rules are configured, but NSX-IDPS engine is down on DPU. <br><br>When event detected: "NSX IDPS is activated via NSX policy and IDPS rules are configured, but NSX-IDPS engine is down on DPU {dpu_id}. "<br><br>When event resolved: "NSX IDPS is in one of the cases below on DPU {dpu_id}. 1. NSX IDPS is deactivated via NSX policy. 2. NSX IDPS engine is activated, NSX-IDPS engine and vdpi are up, and NSX IDPS has been activated and IDPS rules are configured via NSX Policy. " |
1. Check /var/log/nsx-idps/nsx-idps.log and /var/log/nsx-syslog.log to see if there are errors reported. |
4.0.0 |
| IDPS Engine CPU Oversubscription High | Medium | esx | CPU utilization for distributed IDPS engine is high. <br><br>When event detected: "CPU utilization for the distributed IDPS engine is at or above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "CPU utilization for the distributed IDPS engine is below the high threshold value of {system_usage_threshold}%. " |
Review the reason for the oversubscription. Consider moving certain applications to a different host. |
4.0.0 |
| IDPS Engine CPU Oversubscription Very High | High | esx | CPU utilization for distributed IDPS engine is very high. <br><br>When event detected: "CPU utilization for the distributed IDPS engine is at or above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "CPU utilization for the distributed IDPS engine is below the very high threshold value of {system_usage_threshold}%. " |
Review the reason for the oversubscription. Consider moving certain applications to a different host. |
4.0.0 |
| IDPS Engine Network Oversubscription High | Medium | esx | Network utilization for distributed IDPS engine is high. <br><br>When event detected: "Network utilization for the distributed IDPS engine is at or above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "Network utilization for the distributed IDPS engine is below the high threshold value of {system_usage_threshold}%. " |
Review the reason for the oversubscription. Review the IDPS rules to reduce the amount of traffic subject to the IDPS service. |
4.0.0 |
| IDPS Engine Network Oversubscription Very High | High | esx | Network utilization for distributed IDPS engine is very high. <br><br>When event detected: "Network utilization for the distributed IDPS engine is at or above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "Network utilization for the distributed IDPS engine is below the very high threshold value of {system_usage_threshold}%. " |
Review the reason for the oversubscription. Review the IDPS rules to reduce the amount of traffic subject to the IDPS service. |
4.0.0 |
| IDPS Engine Dropped Traffic CPU Oversubscribed | Critical | esx | Distributed IDPS Engine Dropped Traffic due to CPU Oversubscription. <br><br>When event detected: "The IDPS engine has insufficient CPU resources and is unable to keep pace with the incoming traffic resulting in the excess traffic being dropped. For more details, login to the ESX host and issue the following command: vsipioctl getdpiinfo -s and look at oversubscription stats. "<br><br>When event resolved: "The distributed IDPS engine has adequate CPU resources and is not dropping any traffic. " |
Review the reason for the oversubscription. Consider moving certain applications to a different host. |
4.0.0 |
| IDPS Engine Dropped Traffic Network Oversubscribed | Critical | esx | Distributed IDPS Engine Dropped Traffic due to Network Oversubscription. <br><br>When event detected: "The IDPS engine is unable to keep pace with the rate of incoming traffic resulting in the excess traffic being dropped. For more details, login to the ESX host and issue the following command: vsipioctl getdpiinfo -s and look at oversubscription stats. "<br><br>When event resolved: "The distributed IDPS engine is not dropping any traffic. " |
Review the reason for the oversubscription. Review the IDPS rules to reduce the amount of traffic subject to the IDPS service. |
4.0.0 |
| IDPS Engine Bypassed Traffic CPU Oversubscribed | Critical | esx | Distributed IDPS Engine Bypassed Traffic due to CPU Oversubscription. <br><br>When event detected: "The IDPS engine has insufficient CPU resources and is unable to keep pace with the incoming traffic resulting in the excess traffic being bypassed. For more details, login to the ESX host and issue the following command: vsipioctl getdpiinfo -s and look at oversubscription stats. "<br><br>When event resolved: "The distributed IDPS engine has adequate CPU resources and is not bypassing any traffic. " |
Review the reason for the oversubscription. Consider moving certain applications to a different host. |
4.0.0 |
| IDPS Engine Bypassed Traffic Network Oversubscribed | Critical | esx | Distributed IDPS Engine Bypassed Traffic due to Network Oversubscription. <br><br>When event detected: "The IDPS engine is unable to keep pace with the rate of incoming traffic resulting in the excess traffic being bypassed. For more details, login to the ESX host and issue the following command: vsipioctl getdpiinfo -s and look at oversubscription stats. "<br><br>When event resolved: "The distributed IDPS engine is not bypassing any traffic. " |
Review the reason for the oversubscription. Review the IDPS rules to reduce the amount of traffic subject to the IDPS service. |
4.0.0 |
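The oversubscription alarms above all point at the same diagnostic output, and the engine-down alarms point at the logs. A minimal sketch, run from the ESXi shell on the affected host:

```bash
# Oversubscription statistics referenced by the dropped/bypassed traffic alarms.
vsipioctl getdpiinfo -s

# Engine-down alarms: look for errors in the logs called out above.
grep -i error /var/log/nsx-syslog.log | tail -n 50
# On a DPU, also check /var/log/nsx-idps/nsx-idps.log.
```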
DNS Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Forwarder Down | High | edge, autonomous-edge, public-cloud-gateway | A DNS forwarder is down. <br><br>When event detected: "DNS forwarder {entity_id} is not running. This is impacting the identified DNS Forwarder that is currently activated. "<br><br>When event resolved: "DNS forwarder {entity_id} is running again. " |
1. Invoke the NSX CLI command get dns-forwarders status to verify if the DNS forwarder is in down state. |
3.0.0 |
| Forwarder Disabled (deprecated) | Info | edge, autonomous-edge, public-cloud-gateway | A DNS forwarder is deactivated. <br><br>When event detected: "DNS forwarder {entity_id} is deactivated. "<br><br>When event resolved: "DNS forwarder {entity_id} is activated. " |
1. Invoke the NSX CLI command get dns-forwarders status to verify if the DNS forwarder is in the deactivated state. |
3.0.0 |
| Forwarder Upstream Server Timeout | High | edge, autonomous-edge, public-cloud-gateway | One DNS forwarder upstream server has timed out. <br><br>When event detected: "DNS forwarder {intent_path}({dns_id}) did not receive a timely response from upstream server {dns_upstream_ip}. Compute instance connectivity to timed out FQDNs may be impacted. "<br><br>When event resolved: "DNS forwarder {intent_path}({dns_id}) upstream server {dns_upstream_ip} is normal. " |
1. Invoke the NSX API GET /api/v1/dns/forwarders/{dns_id}/nslookup?address=<address>&server_ip={dns_upstream_ip}&source_ip=<source_ip>. This API request triggers a DNS lookup to the upstream server in the DNS forwarder's network namespace. <address> is the IP address or FQDN in the same domain as the upstream server. <source_ip> is an IP address in the upstream server's zone. If the API returns a connection timed out response, there is likely a network error or upstream server problem. Check why DNS lookups are not reaching the upstream server or why the upstream server is not returning a response. If the API response indicates the upstream server is answering, proceed to step 2. |
3.1.3 |
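The DNS alarms above use the forwarder status NSX CLI command and the nslookup API. A minimal sketch; the Edge and Manager addresses, credentials, and lookup parameters are placeholders:

```bash
# Forwarder status via the NSX CLI on the Edge node (admin's login shell is the NSX CLI).
ssh admin@edge-01.example.com "get dns-forwarders status"

# Trigger a lookup against the upstream server from inside the forwarder's network
# namespace, per the Forwarder Upstream Server Timeout action above.
NSX_MANAGER="nsx-mgr.example.com"
curl -k -s -u "admin:${NSX_PASSWORD}" \
  "https://${NSX_MANAGER}/api/v1/dns/forwarders/<dns_id>/nslookup?address=www.example.com&server_ip=<dns_upstream_ip>&source_ip=<source_ip>"
```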
Edge Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Edge Node Settings Mismatch | Critical | manager | Edge node settings mismatch. <br><br>When event detected: "The Edge node {entity_id} settings configuration does not match the policy intent configuration. The Edge node configuration visible to user on UI or API is not same as what is realized. The realized Edge node changes made by user outside of NSX Manager are shown in the details of this alarm and any edits in UI or API will overwrite the realized configuration. Fields that differ for the Edge node are listed in runtime data {edge_node_setting_mismatch_reason} "<br><br>When event resolved: "Edge node {entity_id} node settings are consistent with policy intent now. " |
Review the node settings of this Edge Transport Node {entity_id}. Follow one of the following actions to resolve the alarm - |
3.2.0 |
| Edge VM VSphere Settings Mismatch | Critical | manager | Edge VM vSphere settings mismatch. <br><br>When event detected: "The Edge node {entity_id} configuration on vSphere does not match the policy intent configuration. The Edge node configuration visible to user on UI or API is not same as what is realized. The realized Edge node changes made by user outside of NSX Manager are shown in the details of this alarm and any edits in UI or API will overwrite the realized configuration. Fields that differ for the Edge node are listed in runtime data {edge_vm_vsphere_settings_mismatch_reason} "<br><br>When event resolved: "Edge node {entity_id} VM vSphere settings are consistent with policy intent now. " |
Review the vSphere configuration of this Edge Transport Node {entity_id}. Follow one of the following actions to resolve the alarm - |
3.2.0 |
| Edge Node Settings And VSphere Settings Are Changed | Critical | manager | Edge node settings and vSphere settings are changed. <br><br>When event detected: "The Edge node {entity_id} settings and vSphere configuration are changed and does not match the policy intent configuration. The Edge node configuration visible to user on UI or API is not same as what is realized. The realized Edge node changes made by user outside of NSX Manager are shown in the details of this alarm and any edits in UI or API will overwrite the realized configuration. Fields that differ for Edge node settings and vSphere configuration are listed in runtime data {edge_node_settings_and_vsphere_settings_mismatch_reason} "<br><br>When event resolved: "Edge node {entity_id} node settings and vSphere settings are consistent with policy intent now. " |
Review the node settings and vSphere configuration of this Edge Transport Node {entity_id}. Follow one of the following actions to resolve the alarm - |
3.2.0 |
| Edge VSphere Location Mismatch | High | manager | Edge vSphere Location Mismatch. <br><br>When event detected: "The Edge node {entity_id} has been moved using vMotion. The Edge node {entity_id}, the configuration on vSphere does not match the policy intent configuration. The Edge node configuration visible to user on UI or API is not same as what is realized. The realized Edge node changes made by user outside of NSX Manager are shown in the details of this alarm. Fields that differ for the Edge node are listed in runtime data {edge_vsphere_location_mismatch_reason} "<br><br>When event resolved: "Edge node {entity_id} node vSphere settings are consistent with policy intent now. " |
Review the vSphere configuration of this Edge Transport Node {entity_id}. Follow one of the following actions to resolve the alarm - |
3.2.0 |
| Edge VM Present In NSX Inventory Not Present In VCenter | Critical | manager | Auto Edge VM is present in NSX inventory but not present in vCenter. <br><br>When event detected: "The VM {policy_edge_vm_name} with moref id {vm_moref_id} corresponding to the Edge Transport node {entity_id} vSphere placement parameters is found in NSX inventory but is not present in vCenter. Check if the VM has been removed in vCenter or is present with a different VM moref id. "<br><br>When event resolved: "Edge node {entity_id} with VM moref id {vm_moref_id} is present in both NSX inventory and vCenter. " |
The managed object reference (moref) id of a VM has the form vm-number and is visible in the URL when you select the Edge VM in the vCenter UI, for example vm-12011 in https://<vc-url>/ui/app/vm;nav=h/urn:vmomi:VirtualMachine:vm-12011:164ff798-c4f1-495b-a0be-adfba337e5d2/summary. Find the VM {policy_edge_vm_name} with moref id {vm_moref_id} in vCenter for this Edge Transport Node {entity_id}. If the Edge VM is present in vCenter with a different moref id, use the NSX add or update placement API with the JSON request payload properties vm_id and vm_deployment_config to update the new VM moref id and vSphere deployment parameters: POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=addOrUpdatePlacementReferences. If the Edge VM with name {policy_edge_vm_name} is not present in vCenter, use the NSX Redeploy API to deploy a new VM for the Edge node: POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=redeploy (see the sketch after this table). |
3.2.1 |
| Edge VM Not Present In Both NSX Inventory And VCenter (deprecated) | Critical | manager | Auto Edge VM is not present in both NSX inventory and in vCenter. <br><br>When event detected: "The VM {policy_edge_vm_name} with moref id {vm_moref_id} corresponding to the Edge Transport node {entity_id} vSphere placement parameters is not found in both NSX inventory and vCenter. The placement parameters in the vSphere configuration of this Edge Transport node {entity_id} refer to the VM with moref {vm_moref_id}. "<br><br>When event resolved: "Edge node {entity_id} with VM moref id {vm_moref_id} is present in both NSX inventory and vCenter. " |
The managed object reference (moref) id of a VM has the form vm-number and is visible in the URL when you select the Edge VM in the vCenter UI, for example vm-12011 in https://<vc-url>/ui/app/vm;nav=h/urn:vmomi:VirtualMachine:vm-12011:164ff798-c4f1-495b-a0be-adfba337e5d2/summary. Find the VM {policy_edge_vm_name} with moref id {vm_moref_id} in vCenter for this Edge Transport Node {entity_id}. To resolve the alarm, check whether the VM has been deleted in vSphere or is present with a different moref id. |
3.2.1 |
| Failed To Delete The Old VM In VCenter During Redeploy | Critical | manager | Power off and delete operation failed for old Edge VM in vCenter during Redeploy. <br><br>When event detected: "Failed to power off and delete the Edge node {entity_id} VM with moref id {vm_moref_id} in vCenter during Redeploy operation. A new Edge VM with moref id {new_vm_moref_id} has been deployed. Both old and new VMs for this Edge are functional at the same time and may result in IP conflicts and networking issues. "<br><br>When event resolved: "Edge node {entity_id} with stale VM moref id {vm_moref_id} is not found anymore in both NSX inventory and vCenter. New deployed VM with moref id {new_vm_moref_id} is present in both NSX inventory and vCenter. " |
The managed object reference (moref) id of a VM has the form vm-number and is visible in the URL when you select the Edge VM in the vCenter UI, for example vm-12011 in https://<vc-url>/ui/app/vm;nav=h/urn:vmomi:VirtualMachine:vm-12011:164ff798-c4f1-495b-a0be-adfba337e5d2/summary. Find the VM {policy_edge_vm_name} with moref id {vm_moref_id} in vCenter for this Edge Transport Node {entity_id}, then power off and delete the old Edge VM {policy_edge_vm_name} with moref id {vm_moref_id} in vCenter. |
3.2.1 |
| Edge Hardware Version Mismatch | Medium | manager | Edge node has hardware version mismatch. <br><br>When event detected: "The Edge node {transport_node_name} in Edge cluster {edge_cluster_name} has a hardware version {edge_tn_hw_version}, which is less than the highest hardware version {edge_cluster_highest_hw_version} in the Edge cluster. "<br><br>When event resolved: "The Edge node {transport_node_name} hardware version mismatch is resolved now. " |
Please follow the KB article to resolve the hardware version mismatch alarm for Edge node {transport_node_name}. |
4.0.1 |
| Stale Edge Node Entry Found | Critical | manager | Stale entries found for Edge Node. <br><br>When event detected: "The delete operation for Edge Node {transport_node_name} with UUID {entity_id} could not be completed successfully. A few stale entries may be present in the system. If the stale entry of this Edge Node is not deleted, it could lead to duplicate IPs getting assigned to newly deployed Edge Nodes and can impact the datapath. "<br><br>When event resolved: "All stale entries for the Edge node {entity_id} is cleared now. " |
Please follow the KB article to clear the stale entries for the Edge Node {transport_node_name} with UUID {entity_id}. |
4.1.1 |
| Uplink Fpeth Interface Mismatch During Replacement | Critical | manager | Uplinks to fp-eth interfaces mismatch. <br><br>When event detected: "The mapping of uplinks to fp-eth interfaces {old_fp_eth_list} for the Edge node {transport_node_name} with UUID {entity_id} is not present in the new bare metal Edge fp-eth interfaces {new_fp_eth_list}. "<br><br>When event resolved: "The mismatch between uplinks to fp-eth interfaces is resolved now. " |
Update the mapping of uplinks to fp-eth interfaces {old_fp_eth_list} through the UI or the API (PUT https://<manager-ip>/api/v1/transport-nodes/<tn-id>) to match the new fp-eth interfaces {new_fp_eth_list}. |
4.1.1 |
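The placement-update and redeploy actions referenced above are plain REST calls. Below is a minimal sketch, assuming Python with `requests` and hypothetical manager address, transport node ID, and credentials; whether the redeploy POST must echo back the transport node configuration is an assumption here, so confirm the exact payload in the NSX API guide for your release.

```python
import requests

# Hypothetical placeholders; substitute the manager address, node ID, and credentials.
MANAGER = "https://nsx-manager.example.com"
TN_ID = "edge-transport-node-uuid"

session = requests.Session()
session.auth = ("admin", "password")   # replace with your NSX credentials
session.verify = False                 # or the manager's CA bundle

# Fetch the current transport node configuration.
node = session.get(f"{MANAGER}/api/v1/transport-nodes/{TN_ID}")
node.raise_for_status()

# Redeploy a new Edge VM for this transport node; this sketch assumes the
# current configuration is sent back as the request body.
redeploy = session.post(
    f"{MANAGER}/api/v1/transport-nodes/{TN_ID}",
    params={"action": "redeploy"},
    json=node.json(),
)
redeploy.raise_for_status()
print("Redeploy request accepted:", redeploy.status_code)
```

The addOrUpdatePlacementReferences action follows the same POST pattern, with a request body carrying the vm_id and vm_deployment_config properties described in the recommended action above.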
Edge Cluster Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Edge Cluster Member Relocate Failure | Critical | manager | Edge cluster member relocate failure alarm <br><br>When event detected: "The operation on Edge cluster {edge_cluster_id} to relocate all service context failed for Edge cluster member index {member_index_id} with Transport node ID {transport_node_id} "<br><br>When event resolved: "Edge node with {transport_node_id} relocation failure has been resolved now. " |
Review the available capacity for the Edge cluster. If more capacity is required, scale your Edge cluster. Retry the relocate Edge cluster member operation. |
4.0.0 |
Edge Health Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Edge CPU Usage Very High | Critical | edge, public-cloud-gateway | Edge node CPU usage is very high. <br><br>When event detected: "The CPU usage on Edge node {transport_node_name} ({transport_node_address}) has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage on Edge node {transport_node_name} ({transport_node_address}) has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. " |
Review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload. |
3.0.0 |
| Edge CPU Usage High | Medium | edge, public-cloud-gateway | Edge node CPU usage is high. <br><br>When event detected: "The CPU usage on Edge node {transport_node_name} ({transport_node_address}) has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage on Edge node {transport_node_name} ({transport_node_address}) has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. " |
Review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload. |
3.0.0 |
| Edge Memory Usage Very High | Critical | edge, public-cloud-gateway | Edge node memory usage is very high. <br><br>When event detected: "The memory usage on Edge node {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage on Edge node {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. " |
Review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload. |
3.0.0 |
| Edge Memory Usage High | Medium | edge, public-cloud-gateway | Edge node memory usage is high. <br><br>When event detected: "The memory usage on Edge node {entity_id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage on Edge node {entity_id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. " |
Review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload. |
3.0.0 |
| Edge Disk Usage Very High | Critical | edge, public-cloud-gateway | Edge node disk usage is very high. <br><br>When event detected: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. " |
Examine the partition with high usage and see if there are any unexpected large files that can be removed. |
3.0.0 |
| Edge Disk Usage High | Medium | edge, public-cloud-gateway | Edge node disk usage is high. <br><br>When event detected: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. " |
Examine the partition with high usage and see if there are any unexpected large files that can be removed. |
3.0.0 |
| Edge Datapath CPU Very High | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node datapath CPU usage is very high. <br><br>When event detected: "The datapath CPU usage on Edge node {entity_id} has reached {datapath_resource_usage}% which is at or above the very high threshold for at least two minutes. "<br><br>When event resolved: "The CPU usage on Edge node {entity_id} has reached below the very high threshold. " |
Review the CPU statistics on the Edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core. Higher CPU usage is expected with higher packet rates. Consider increasing the Edge appliance form factor size and rebalancing services on this Edge node to other Edge nodes in the same cluster or other Edge clusters. |
3.0.0 |
| Edge Datapath CPU High | Medium | edge, autonomous-edge, public-cloud-gateway | Edge node datapath CPU usage is high. <br><br>When event detected: "The datapath CPU usage on Edge node {entity_id} has reached {datapath_resource_usage}% which is at or above the high threshold for at least two minutes. "<br><br>When event resolved: "The CPU usage on Edge node {entity_id} has reached below the high threshold. " |
Review the CPU statistics on the Edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core. Higher CPU usage is expected with higher packet rates. Consider increasing the Edge appliance form factor size and rebalancing services on this Edge node to other Edge nodes in the same cluster or other Edge clusters. |
3.0.0 |
| Edge Datapath Configuration Failure | High | edge, autonomous-edge, public-cloud-gateway | Edge node datapath configuration failed. <br><br>When event detected: "Failed to enable the datapath on the Edge node after three attempts. "<br><br>When event resolved: "Datapath on the Edge node has been successfully enabled. " |
Ensure the Edge node's connectivity to the Manager node is healthy. From the Edge node's NSX CLI, invoke the command get services to check the health of services. If the dataplane service is stopped, invoke the command start service dataplane to start it. |
3.0.0 |
| Edge Datapath Cryptodrv Down | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node crypto driver is down. <br><br>When event detected: "Edge node crypto driver {edge_crypto_drv_name} is down. "<br><br>When event resolved: "Edge node crypto driver {edge_crypto_drv_name} is up. " |
Upgrade the Edge node as needed. |
3.0.0 |
| Edge Datapath Mempool High | Medium | edge, autonomous-edge, public-cloud-gateway | Edge node datapath mempool is high. <br><br>When event detected: "The datapath mempool usage for {mempool_name} on Edge node {entity_id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The datapath mempool usage for {mempool_name} on Edge node {entity_id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. " |
Log in as the root user and invoke the commands edge-appctl -t /var/run/vmware/edge/dpd.ctl mempool/show and edge-appctl -t /var/run/vmware/edge/dpd.ctl memory/show malloc_heap to check DPDK memory usage. |
3.0.0 |
| Edge Global ARP Table Usage High | Medium | edge, autonomous-edge, public-cloud-gateway | Edge node global ARP table usage is high. <br><br>When event detected: "Global ARP table usage on Edge node {entity_id} has reached {datapath_resource_usage}% which is above the high threshold for over two minutes. "<br><br>When event resolved: "Global ARP table usage on Edge node {entity_id} has reached below the high threshold. " |
Log in as the root user and invoke the command edge-appctl -t /var/run/vmware/edge/dpd.ctl neigh/show and check if neigh cache usage is normal. If it is normal, invoke the command edge-appctl -t /var/run/vmware/edge/dpd.ctl neigh/set_param max_entries to increase the ARP table size. |
3.0.0 |
| Edge NIC Out Of Receive Buffer | Medium | edge, autonomous-edge, public-cloud-gateway | Edge node NIC is out of RX ring buffers temporarily. <br><br>When event detected: "Edge NIC {edge_nic_name} receive ring buffer has overflowed by {rx_ring_buffer_overflow_percentage}% on Edge node {entity_id}. The missed packet count is {rx_misses} and processed packet count is {rx_processed}. "<br><br>When event resolved: "Edge NIC {edge_nic_name} receive ring buffer usage on Edge node {entity_id} is no longer overflowing. " |
Run the NSX CLI command get dataplane cpu stats on the Edge node (see the sketch after this table) and check: |
3.0.0 |
| Edge NIC Out Of Transmit Buffer | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node NIC is out of TX ring buffers temporarily. <br><br>When event detected: "Edge NIC {edge_nic_name} transmit ring buffer has overflowed by {tx_ring_buffer_overflow_percentage}% on Edge node {entity_id}. The missed packet count is {tx_misses} and processed packet count is {tx_processed}. "<br><br>When event resolved: "Edge NIC {edge_nic_name} transmit ring buffer usage on Edge node {entity_id} is no longer overflowing. " |
1. If the hypervisor hosts many VMs alongside the Edge VM, the Edge VM might not get enough scheduling time to run, so packets might not be retrieved by the hypervisor in time. In that case, consider migrating the Edge VM to a host with fewer VMs. |
3.0.0 |
| Edge NIC Transmit Queue Overflow | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node NIC transmit queue has overflowed temporarily. <br><br>When event detected: "Edge NIC {edge_nic_name} transmit queue {tx_queue_id} has overflowed by {tx_queue_overflow_percentage}% on Edge node {entity_id}. The missed packet count is {tx_misses} and processed packet count is {tx_processed}. "<br><br>When event resolved: "Edge NIC {edge_nic_name} transmit queue {tx_queue_id} on Edge node {entity_id} is no longer overflowing. " |
1. If the hypervisor hosts many VMs alongside the Edge VM, the Edge VM might not get enough scheduling time to run, so packets might not be retrieved by the hypervisor in time. In that case, consider migrating the Edge VM to a host with fewer VMs. |
4.1.1 |
| Edge NIC Link Status Down | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node NIC link is down. <br><br>When event detected: "Edge node NIC {edge_nic_name} link is down. "<br><br>When event resolved: "Edge node NIC {edge_nic_name} link is up. " |
On the Edge node confirm if the NIC link is physically down by invoking the NSX CLI command get interfaces. If it is down, verify the cable connection. |
3.0.0 |
| Storage Error | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node disk is read-only. <br><br>When event detected: "The following disk partitions on the Edge node are in read-only mode: {disk_partition_name} "<br><br>When event resolved: "The following disk partitions on the Edge node have recovered from read-only mode: {disk_partition_name} " |
Examine the read-only partition to see whether a reboot resolves the issue or the disk needs to be replaced. Contact GSS for more information. |
3.0.1 |
| Datapath Thread Deadlocked | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node's datapath thread is in deadlock condition. <br><br>When event detected: "Edge node datapath thread {edge_thread_name} is deadlocked. "<br><br>When event resolved: "Edge node datapath thread {edge_thread_name} is free from deadlock. " |
Restart the dataplane service by invoking the NSX CLI command restart service dataplane. |
3.1.0 |
| Edge Datapath NIC Throughput Very High | Critical | edge, autonomous-edge, public-cloud-gateway | Edge node datapath NIC throughput is very high. <br><br>When event detected: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is at or above the very high threshold value of {nic_throughput_threshold}%. "<br><br>When event resolved: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is below the very high threshold value of {nic_throughput_threshold}%. " |
Examine the traffic throughput levels on the NIC and determine whether configuration changes are needed. The 'get dataplane throughput <seconds>' command can be used to monitor throughput. |
3.2.0 |
| Edge Datapath NIC Throughput High | Medium | edge, autonomous-edge, public-cloud-gateway | Edge node datapath NIC throughput is high. <br><br>When event detected: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is at or above the high threshold value of {nic_throughput_threshold}%. "<br><br>When event resolved: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is below the high threshold value of {nic_throughput_threshold}%. " |
Examine the traffic throughput levels on the NIC and determine whether configuration changes are needed. The 'get dataplane throughput <seconds>' command can be used to monitor throughput. |
3.2.0 |
| Failure Domain Down | Critical | edge, public-cloud-gateway | All members of failure domain are down. <br><br>When event detected: "All members of failure domain {transport_node_id} are down. "<br><br>When event resolved: "All members of failure domain {transport_node_id} are reachable. " |
1. On the Edge node identified by {transport_node_id}, check the connectivity to the management and control planes by invoking the NSX CLI commands get managers and get controllers. |
3.2.0 |
| Micro Flow Cache Hit Rate Low | Medium | edge, autonomous-edge, public-cloud-gateway | Micro Flow Cache hit rate decreases and Datapath CPU is high. <br><br>When event detected: "Micro Flow Cache hit rate on Edge node {entity_id} has decreased below the specified threshold of {flow_cache_threshold}% for core {core_id}, and the Datapath CPU usage has increased for the last 30 minutes. "<br><br>When event resolved: "Flow Cache hit rate is in the normal range. " |
The flow cache hit rate has decreased over the last 30 minutes, which is an indication that Edge performance may be degraded. Traffic will continue to be forwarded, and you may not experience any issues. Check whether the datapath CPU utilization for Edge {entity_id} core {core_id} has been high for the last 30 minutes. The Edge will have a low flow-cache hit rate when new flows are continuously being created, because the first packet of any new flow is used to set up the flow cache for fast-path processing. You may want to increase your Edge appliance size or increase the number of Edge nodes used for Active/Active Gateways. |
3.2.2 |
| Mega Flow Cache Hit Rate Low | Medium | edge, autonomous-edge, public-cloud-gateway | Mega Flow Cache hit rate decreases and Datapath CPU is high. <br><br>When event detected: "Mega Flow Cache hit rate on Edge node {entity_id} has decreased below the specified threshold of {flow_cache_threshold}% for core {core_id}, and the Datapath CPU usage has increased for the last 30 minutes. "<br><br>When event resolved: "Flow Cache hit rate is in the normal range. " |
The flow cache hit rate has decreased over the last 30 minutes, which is an indication that Edge performance may be degraded. Traffic will continue to be forwarded, and you may not experience any issues. Check whether the datapath CPU utilization for Edge {entity_id} core {core_id} has been high for the last 30 minutes. The Edge will have a low flow-cache hit rate when new flows are continuously being created, because the first packet of any new flow is used to set up the flow cache for fast-path processing. You may want to increase your Edge appliance size or increase the number of Edge nodes used for Active/Active Gateways. |
3.2.2 |
| Flow Cache Deactivated | Critical | edge, autonomous-edge, public-cloud-gateway | Flow cache deactivated. <br><br>When event detected: "Flow cache on Edge Transport Node {transport_node_name} with UUID {entity_id} is deactivated. "<br><br>When event resolved: "Flow cache has now been activated on Edge node {transport_node_name}. " |
Make sure the flow cache for the Edge Transport Node {transport_node_name} with UUID {entity_id} is activated. Deactivating the flow cache causes traffic to be forwarded through the CPU. |
4.1.1 |
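Several recommended actions in this table come down to running NSX CLI commands on the Edge node. Below is a minimal sketch, assuming Python with the `paramiko` library and the Edge admin account, whose login shell is the NSX CLI; the hostname and credentials are hypothetical placeholders, and the command output is printed raw rather than parsed.

```python
import paramiko

# Hypothetical placeholders; substitute the Edge node address and admin credentials.
EDGE_HOST = "edge-node.example.com"
USERNAME = "admin"
PASSWORD = "password"

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # lab convenience only
client.connect(EDGE_HOST, username=USERNAME, password=PASSWORD)

# Run the NSX CLI commands referenced in the recommended actions above and
# print the raw output for inspection.
for command in ("get dataplane cpu stats", "get interfaces"):
    _, stdout, _ = client.exec_command(command)
    print(f"=== {command} ===")
    print(stdout.read().decode())

client.close()
```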
Endpoint Protection Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| EAM Status Down | Critical | manager | ESX Agent Manager (EAM) service on a compute manager is down. <br><br>When event detected: "ESX Agent Manager (EAM) service on compute manager {entity_id} is down. "<br><br>When event resolved: "ESX Agent Manager (EAM) service on compute manager {entity_id} is either up or compute manager {entity_id} has been removed. " |
Start the ESX Agent Manager (EAM) service. SSH into vCenter and invoke the command service vmware-eam start. |
3.0.0 |
| Partner Channel Down | Critical | esx | Host module and Partner SVM connection is down. <br><br>When event detected: "The connection between host module and Partner SVM {entity_id} is down. "<br><br>When event resolved: "The connection between host module and Partner SVM {entity_id} is up. " |
Refer to https://kb.vmware.com/s/article/85844 and make sure that Partner SVM {entity_id} is re-connected to the host module. You can also run the 'NxgiPlatform' Runbook on this particular Transport node for help in troubleshooting. |
3.0.0 |
Federation Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| RTEP BGP Down | High | edge, autonomous-edge, public-cloud-gateway | RTEP BGP neighbor down. <br><br>When event detected: "RTEP (Remote Tunnel Endpoint) BGP session from source IP {bgp_source_ip} to remote location {remote_site_name} neighbor IP {bgp_neighbor_ip} is down. Reason: {failure_reason}. "<br><br>When event resolved: "RTEP (Remote Tunnel Endpoint) BGP session from source IP {bgp_source_ip} to remote location {remote_site_name} neighbor IP {bgp_neighbor_ip} is established. " |
1. Invoke the NSX CLI command get logical-routers on the affected edge node. |
3.0.1 |
| LM To LM Synchronization Warning | Medium | manager | Synchronization between remote locations failed for more than 3 minutes. <br><br>When event detected: "The synchronization between {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for more than 3 minutes. "<br><br>When event resolved: "Remote locations {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized. " |
1. Invoke the NSX CLI command get site-replicator remote-sites to get connection state between the remote locations. If a remote location is connected but not synchronized, it is possible that the location is still in the process of leader resolution. In this case, wait for around 10 seconds and try invoking the CLI again to check for the state of the remote location. If a location is disconnected, try the next step. |
3.0.1 |
| LM To LM Synchronization Error | High | manager | Synchronization between remote locations failed for more than 15 minutes. <br><br>When event detected: "The synchronization between {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for more than 15 minutes. "<br><br>When event resolved: "Remote sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized. " |
1. Invoke the NSX CLI command get site-replicator remote-sites to get connection state between the remote locations. If a remote location is connected but not synchronized, it is possible that the location is still in the process of leader resolution. In this case, wait for around 10 seconds and try invoking the CLI again to check for the state of the remote location. If a location is disconnected, try the next step. |
3.0.1 |
| RTEP Connectivity Lost | High | manager | RTEP location connectivity lost. <br><br>When event detected: "Edge node {transport_node_name} lost RTEP (Remote Tunnel Endpoint) connectivity with remote location {remote_site_name}. "<br><br>When event resolved: "Edge node {transport_node_name} has restored RTEP (Remote Tunnel Endpoint) connectivity with remote location {remote_site_name}. " |
1. Invoke the NSX CLI command get logical-routers on the affected edge node {transport_node_name}. |
3.0.2 |
| GM To GM Split Brain | Critical | global-manager | Multiple Global Manager nodes are active at the same time. <br><br>When event detected: "Multiple Global Manager nodes are active: {active_global_managers}. Only one Global Manager node must be active at any time. "<br><br>When event resolved: "Global Manager node {active_global_manager} is the only active Global Manager node now. " |
Configure only one Global Manager node as active and all other Global Manager nodes as standby. |
3.1.0 |
| GM To GM Latency Warning | Medium | global-manager | Latency between Global Managers is higher than expected for more than 2 minutes <br><br>When event detected: "Latency is higher than expected between Global Managers {from_gm_path} and {to_gm_path}. "<br><br>When event resolved: "Latency is below expected levels between Global Managers {from_gm_path} and {to_gm_path}. " |
Check the connectivity from Global Manager {from_gm_path}({site_id}) to the Global Manager {to_gm_path}({remote_site_id}) via ping (see the sketch after this table). If they cannot be pinged, check the WAN connectivity for intermittent failures. |
3.2.0 |
| GM To GM Synchronization Warning | Medium | global-manager | Active Global Manager to Standby Global Manager cannot synchronize <br><br>When event detected: "Active Global Manager {from_gm_path} to Standby Global Manager {to_gm_path} can not synchronize. "<br><br>When event resolved: "Synchronization from active Global Manager {from_gm_path} to standby {to_gm_path} is healthy. " |
Check the connectivity from Global Manager {from_gm_path}({site_id}) to the Global Manager {to_gm_path}({remote_site_id}) via ping. |
3.2.0 |
| GM To GM Synchronization Error | High | global-manager | Active Global Manager to Standby Global Manager cannot synchronize for more than 5 minutes <br><br>When event detected: "Active Global Manager {from_gm_path} to Standby Global Manager {to_gm_path} cannot synchronize for more than 5 minutes. "<br><br>When event resolved: "Synchronization from active Global Manager {from_gm_path} to standby {to_gm_path} is healthy. " |
Check the connectivity from Global Manager {from_gm_path}({site_id}) to the Global Manager {to_gm_path}({remote_site_id}) via ping. |
3.2.0 |
| GM To LM Synchronization Warning | Medium | global-manager, manager | Data synchronization between Global Manager (GM) and Local Manager (LM) failed. <br><br>When event detected: "Data synchronization between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for the {flow_identifier}. Reason: {sync_issue_reason} "<br><br>When event resolved: "Sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized for {flow_identifier}. " |
1. Check the network connectivity between remote site and local site via ping. |
3.2.0 |
| GM To LM Synchronization Error | High | global-manager, manager | Data synchronization between Global Manager (GM) and Local Manager (LM) failed for an extended period. <br><br>When event detected: "Data synchronization between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for the {flow_identifier} for an extended period. Reason: {sync_issue_reason}. "<br><br>When event resolved: "Sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized for {flow_identifier}. " |
1. Check the network connectivity between remote site and local site via ping. |
3.2.0 |
| Queue Occupancy Threshold Exceeded | Medium | manager, global-manager | Queue occupancy size threshold exceeded warning. <br><br>When event detected: "Queue ({queue_name}) used for syncing data between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached size {queue_size} which is at or above the maximum threshold of {queue_size_threshold}%. "<br><br>When event resolved: "Queue ({queue_name}) used for syncing data between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached size {queue_size} which is below the maximum threshold of {queue_size_threshold}%. " |
The queue size can exceed the threshold due to a communication issue with the remote site or an overloaded system. Check system performance and /var/log/async-replicator/ar.log to see whether any errors are reported. |
3.2.0 |
| GM To LM Latency Warning | Medium | global-manager, manager | Latency between Global Manager and Local Manager is higher than expected for more than 2 minutes. <br><br>When event detected: "Latency between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached {latency_value} which is above the threshold value of {latency_threshold}. "<br><br>When event resolved: "Latency between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached {latency_value} which is below the threshold value of {latency_threshold}. " |
1. Check the network connectivity between remote site and local site via ping. |
3.2.0 |
| LM Restore While Config Import In Progress | High | global-manager | Local Manager is restored while config import is in progress on Global Manager. <br><br>When event detected: "Config import from site {site_name}({site_id}) is in progress. However site {site_name}({site_id}) is restored from backup by the administrator leaving it in an inconsistent state. "<br><br>When event resolved: "Config inconsistency at site {site_name}({site_id}) is resolved. " |
1. Log in to NSX Global Manager appliance CLI. |
3.2.0 |
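Several recommended actions in this table start with a ping check between sites. Below is a minimal sketch, assuming Python on a Linux host with ICMP reachability to both sites; the site names and addresses are hypothetical placeholders, and only basic reachability is reported.

```python
import subprocess

# Hypothetical placeholders; substitute the Global Manager / Local Manager or
# remote site addresses involved in the failing synchronization.
SITES = {
    "local-site": "192.0.2.10",
    "remote-site": "198.51.100.20",
}

# Send a few ICMP echo requests to each site, mirroring the "check connectivity
# via ping" step; repeated failures point at WAN connectivity issues.
for name, address in SITES.items():
    result = subprocess.run(["ping", "-c", "4", address], capture_output=True, text=True)
    state = "reachable" if result.returncode == 0 else "NOT reachable"
    print(f"{name} ({address}): {state}")
```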
Gateway Firewall Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| IP Flow Count High | Medium | edge, public-cloud-gateway | The gateway firewall flow table usage for IP traffic is high. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. <br><br>When event detected: "Gateway firewall flow table usage for IP on logical router {entity_id} has reached {firewall_ip_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "<br><br>When event resolved: "Gateway firewall flow table usage for non IP flows on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%. " |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for IP flows (see the sketch after this table). Check that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
| IP Flow Count Exceeded | Critical | edge, public-cloud-gateway | The gateway firewall flow table for IP traffic has exceeded the set threshold. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. <br><br>When event detected: "Gateway firewall flow table usage for IP traffic on logical router {entity_id} has reached {firewall_ip_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "<br><br>When event resolved: "Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%. " |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for IP flows. Check that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
| UDP Flow Count High | Medium | edge, public-cloud-gateway | The gateway firewall flow table usage for UDP traffic is high. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. <br><br>When event detected: "Gateway firewall flow table usage for UDP on logical router {entity_id} has reached {firewall_udp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "<br><br>When event resolved: "Gateway firewall flow table usage for UDP on logical router {entity_id} has reached below the high threshold. " |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for UDP flows. Check that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
| UDP Flow Count Exceeded | Critical | edge, public-cloud-gateway | The gateway firewall flow table for UDP traffic has exceeded the set threshold. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. <br><br>When event detected: "Gateway firewall flow table usage for UDP traffic on logical router {entity_id} has reached {firewall_udp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "<br><br>When event resolved: "Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold. " |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for UDP flows. Check that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
| ICMP Flow Count High | Medium | edge, public-cloud-gateway | The gateway firewall flow table usage for ICMP traffic is high. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. <br><br>When event detected: "Gateway firewall flow table usage for ICMP on logical router {entity_id} has reached {firewall_icmp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "<br><br>When event resolved: "Gateway firewall flow table usage for ICMP on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%. " |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for ICMP flows. Check that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
| ICMP Flow Count Exceeded | Critical | edge, public-cloud-gateway | The gateway firewall flow table for ICMP traffic has exceeded the set threshold. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. <br><br>When event detected: "Gateway firewall flow table usage for ICMP traffic on logical router {entity_id} has reached {firewall_icmp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "<br><br>When event resolved: "Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%. " |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for ICMP flows. Check that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
| TCP Half Open Flow Count High | Medium | edge, public-cloud-gateway | The gateway firewall flow table usage for TCP half-open traffic is high. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. <br><br>When event detected: "Gateway firewall flow table usage for TCP on logical router {entity_id} has reached {firewall_halfopen_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "<br><br>When event resolved: "Gateway firewall flow table usage for TCP half-open on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%. " |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for TCP half-open flows. Check that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
| TCP Half Open Flow Count Exceeded | Critical | edge, public-cloud-gateway | The gateway firewall flow table for TCP half-open traffic has exceeded the set threshold. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. <br><br>When event detected: "Gateway firewall flow table usage for TCP half-open traffic on logical router {entity_id} has reached {firewall_halfopen_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "<br><br>When event resolved: "Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%. " |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for TCP half-open flows. Check that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node. |
3.1.3 |
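The recommended actions in this table all rely on the same NSX CLI command with JSON output. Below is a minimal sketch, assuming Python with `paramiko`, the Edge admin account whose login shell is the NSX CLI, and hypothetical hostname, credentials, and interface UUID; no particular JSON field names are assumed, so the result is pretty-printed for inspection of the flow type named in the alarm.

```python
import json
import paramiko

# Hypothetical placeholders; substitute the Edge node address, credentials, and
# the logical router interface UUID relevant to the alarm.
EDGE_HOST = "edge-node.example.com"
LR_INT_UUID = "interface-uuid"

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # lab convenience only
client.connect(EDGE_HOST, username="admin", password="password")

# Run the documented command with JSON output.
_, stdout, _ = client.exec_command(f"get firewall {LR_INT_UUID} interface stats | json")
raw = stdout.read().decode()

try:
    # Pretty-print the flow table statistics; check the usage counters for the
    # flow type named in the alarm (IP, UDP, ICMP, or TCP half-open).
    print(json.dumps(json.loads(raw), indent=2))
except ValueError:
    # Fall back to the raw output if the CLI wraps the JSON in extra text.
    print(raw)

client.close()
```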
Groups Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Group Size Limit Exceeded | Medium | manager | The total number of translated group elements has exceeded the maximum limit. <br><br>When event detected: "Group {group_id} has at least {group_size} translated elements which is at or greater than the maximum numbers limit of {group_max_number_limit}. This can result in long processing times and can lead to timeouts and outages. The current count for each element type is as follows. IP sets:{ip_count}, MAC sets:{mac_count}, VIFS:{vif_count}, Logical switch ports:{lsp_count}, Logical router ports:{lrp_count}, AdGroups:{sid_count}. "<br><br>When event resolved: "Total number of elements in group {group_id} is below the maximum limit of {group_max_number_limit}. " |
1. Consider adjusting group elements in oversized group {group_id}. |
4.1.0 |
| Active Directory Groups Modified | Medium | manager | Active Directory Groups are modified on AD server. <br><br>When event detected: "Group {policy_group_name} contains an Active Directory Group member {old_base_distinguished_name} that is renamed on the Active Directory server with {new_base_distinguished_name}. Make sure the group has a valid Identity Group Member. "<br><br>When event resolved: "Group {policy_group_name} is updated with valid Active Directory Group member. " |
In the NSX UI, navigate to the Inventory | Groups tab to update the group definition of the applicable group with the new base distinguished name. Make sure the group has valid identity group members. |
4.1.2 |
High Availability Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Tier0 Gateway Failover | High | edge, autonomous-edge, public-cloud-gateway | A tier0 gateway has failed over. <br><br>When event detected: "The tier0 gateway {entity_id} failover from {previous_gateway_state} to {current_gateway_state}, service-router {service_router_id}. Reason: {failover_reason}. "<br><br>When event resolved: "The tier0 gateway {entity_id} is now up. " |
Invoke the NSX CLI command get logical-router <service_router_id> to identify the tier0 service-router vrf ID. Switch to the vrf context by invoking vrf <vrf-id> then invoke get high-availability status to determine the service that is down. |
3.0.0 |
| Tier1 Gateway Failover | High | edge, autonomous-edge, public-cloud-gateway | A tier1 gateway has failed over. <br><br>When event detected: "The tier1 gateway {entity_id} failover from {previous_gateway_state} to {current_gateway_state}, service-router {service_router_id}. Reason: {failover_reason} "<br><br>When event resolved: "The tier1 gateway {entity_id} is now up. " |
Invoke the NSX CLI command get logical-router <service_router_id> to identify the tier1 service-router vrf ID. Switch to the vrf context by invoking vrf <vrf-id> then invoke get high-availability status to determine the service that is down. |
3.0.0 |
| Tier0 Service Group Failover | High | edge, public-cloud-gateway | Service-group does not have an active instance. <br><br>When event detected: "Service-group cluster {entity_id} currently does not have an active instance. It is in state {ha_state} (where 0 is down, 1 is standby and 2 is active) on Edge node {transport_node_id} and in state {ha_state2} on Edge node {transport_node_id2}. Reason: {failover_reason}. "<br><br>When event resolved: "Tier0 service-group cluster {entity_id} now has one active instance on Edge node {transport_node_id}. " |
Invoke the NSX CLI command get logical-router <service_router_id> service_group to check all service-groups configured under a given service-router. Examine the output for the reason the service-group left the active state. |
4.0.1 |
| Tier1 Service Group Failover | High | edge, public-cloud-gateway | Service-group does not have an active instance. <br><br>When event detected: "Service-group cluster {entity_id} currently does not have an active instance. It is in state {ha_state} (where 0 is down, 1 is standby and 2 is active) on Edge node {transport_node_id} and in state {ha_state2} on Edge node {transport_node_id2}. Reason: {failover_reason}. "<br><br>When event resolved: "Tier1 service-group cluster {entity_id} now has one active instance on Edge node {transport_node_id}. " |
Invoke the NSX CLI command get logical-router <service_router_id> service_group to check all service-groups configured under a given service-router. Examine the output for the reason the service-group left the active state. |
4.0.1 |
| Tier0 Service Group Reduced Redundancy | Medium | edge, public-cloud-gateway | A standby instance in a service-group has failed. <br><br>When event detected: "Service-group cluster {entity_id} attached to Tier0 service-router {service_router_id} on Edge node {transport_node_id} has failed. As a result, the service-group cluster currently does not have a standby instance. Reason: {failover_reason} "<br><br>When event resolved: "Service-group cluster {entity_id} is in state {ha_state} (where 0 is down, 1 is standby and 2 is active) on Edge node {transport_node_id} and state {ha_state2} on Edge node {transport_node_id2}. " |
Invoke the NSX CLI command get logical-router <service_router_id> service_group to check all service-groups configured under a given service-router. Examine the output for the failure reason of the previously standby service-group. |
4.0.1 |
| Tier1 Service Group Reduced Redundancy | Medium | edge, public-cloud-gateway | A standby instance in a service-group has failed. <br><br>When event detected: "Service-group cluster {entity_id} attached to Tier1 service-router {service_router_id} on Edge node {transport_node_id} has failed. As a result, the service-group cluster currently does not have a standby instance. Reason: {failover_reason} "<br><br>When event resolved: "Service-group cluster {entity_id} is in state {ha_state} (where 0 is down, 1 is standby and 2 is active) on Edge node {transport_node_id} and state {ha_state2} on Edge node {transport_node_id2}. " |
Invoke the NSX CLI command get logical-router <service_router_id> service_group to check all service-groups configured under a given service-router. Examine the output for the failure reason of the previously standby service-group. |
4.0.1 |
Identity Firewall Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Connectivity To LDAP Server Lost | Critical | manager | Connectivity to LDAP server is lost. <br><br>When event detected: "The connectivity to LDAP server {ldap_server} is lost. "<br><br>When event resolved: "The connectivity to LDAP server {ldap_server} is restored. " |
Check |
3.1.0 |
| Error In Delta Sync | Critical | manager | Errors occurred while performing delta sync. <br><br>When event detected: "Errors occurred while performing delta sync with {directory_domain}. "<br><br>When event resolved: "No errors occurred while performing delta sync with {directory_domain}. " |
1. Check whether there are any 'Connectivity To LDAP Server Lost' alarms. |
3.1.0 |
IDS IPS Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| IDPS Signature Bundle Download Failure | Medium | manager | Unable to download IDPS signature bundle from NTICS. <br><br>When event detected: "Unable to download IDPS signature bundle from NTICS. "<br><br>When event resolved: "IDPS signature bundle download from NTICS was successful. " |
Check if there is internet connectivity from NSX Manager to NTICS. |
4.1.1 |
Infrastructure Communication Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Edge Tunnels Down | Critical | edge, public-cloud-gateway | An Edge node's tunnel status is down. <br><br>When event detected: "The overall tunnel status of Edge node {entity_id} is down. "<br><br>When event resolved: "The tunnels of Edge node {entity_id} have been restored. " |
Invoke the NSX CLI command get tunnel-ports to get all tunnel ports, then check each tunnel's statistics by invoking the NSX CLI command get tunnel-port <UUID> stats to see whether there are any drops. Also check /var/log/syslog for tunnel-related errors. |
3.0.0 |
| Gre Tunnel Down | Critical | edge, autonomous-edge, public-cloud-gateway | GRE tunnel down. <br><br>When event detected: "GRE tunnel on Edge Transport Node {transport_node_name} with tunnel UUID {tunnel_uuid} is down. The traffic that is to be sent through the tunnel will be impacted. "<br><br>When event resolved: "GRE tunnel on Edge Transport Node {transport_node_name} with UUID {tunnel_uuid} is up. " |
A GRE tunnel goes down when GRE keepalives are not received for the configured number of dead-multiplier intervals. Check the connectivity of the GRE endpoints. |
4.1.2 |
Infrastructure Service Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Service Status Unknown On DPU | Critical | dpu | Service's status on DPU is abnormal. <br><br>When event detected: "The service {service_name} on DPU {dpu_id} has been unresponsive for 10 seconds. "<br><br>When event resolved: "The service {service_name} on DPU {dpu_id} is responsive again. " |
Verify that the {service_name} service on DPU {dpu_id} is still running by invoking /etc/init.d/{service_name} status. If the service is reported as running, it may need to be restarted, which can be done with /etc/init.d/{service_name} restart. Rerun the status command to verify that the service is now running. If restarting the service does not resolve the issue, or if the issue recurs after a successful restart, contact VMware Support. |
4.0.0 |
| Service Status Unknown | Critical | esx, kvm, bms, edge, manager, public-cloud-gateway, global-manager | Service's status is abnormal. <br><br>When event detected: "The service {service_name} has been unresponsive for {heartbeat_threshold} seconds. "<br><br>When event resolved: "The service {service_name} is responsive again. " |
Verify that the {service_name} service is still running by invoking /etc/init.d/{service_name} status. If the service is reported as running, it may need to be restarted, which can be done with /etc/init.d/{service_name} restart. Rerun the status command to verify that the service is now running. If the script /etc/init.d/{service_name} is unavailable, invoke systemctl status {service_name} and restart with systemctl restart {service_name} with root privileges (see the sketch after this table). If restarting the service does not resolve the issue, or if the issue recurs after a successful restart, contact VMware Support. |
3.1.0 |
| Metrics Delivery Failure | Critical | esx, bms, edge, manager, public-cloud-gateway, global-manager | Failed to deliver metrics to the specified target. <br><br>When event detected: "Failed to deliver metrics from SHA to target {metrics_target_alias}({metrics_target_address}:{metrics_target_port}). "<br><br>When event resolved: "Metrics delivery to target {metrics_target_alias}({metrics_target_address}:{metrics_target_port}) recovered. " |
Perform the following checks to rule out the problem causing the failure. <br>For NAPP: <br>1. Check if the target address {metrics_target_address} and port {metrics_target_port} passed down to connect are the expected target. <br>2. Check if the certificate for the secure connection is correct by grep 'nsx-sha' {log_file} | grep 'NAPP Profile' (the private key is shielded). <br>3. Check if target {metrics_target_address} is reachable. <br>4. Check if an obvious transmission failure can be observed in SHA by grep 'Failed to send one msg' {log_file}. <br>For metrics mux (note: {metrics_target_address} is actually the metrics mux, which bridges the metrics to the metrics instance, the real target behind the VDP ingress): <br>1. Check if the picked manager {metrics_target_address} is onboarded. <br>2. Check if the certificate for the secure connection is correct by grep 'nsx-sha' {log_file} | grep 'Metrics Mux Profile' (the private key is shielded). <br>3. Check the metrics agent status on the manager {metrics_target_address} with /etc/init.d/nsx-metrics-agents status. <br>4. Check if an obvious transmission failure can be observed in SHA by grep 'Failed to send one msg' {log_file}. <br>Common checks: <br>1. Check if an ALLOW firewall rule is installed on the node by iptables -S OUTPUT | grep {metrics_target_port} (Edge/Manager) or localcli network firewall ruleset list | grep nsx-sha-tsdb (ESX). <br>2. Restart the SHA daemon to see if that resolves the issue, by /etc/init.d/netopa restart (ESX) or /etc/init.d/nsx-sha restart (Edge/Manager). |
4.1.0 |
| Edge Service Status Down (deprecated) | Critical | edge, autonomous-edge, public-cloud-gateway | Edge service is down for at least one minute. <br><br>When event detected: "The service {edge_service_name} is down for at least one minute. {service_down_reason} "<br><br>When event resolved: "The service {edge_service_name} is up. " |
On the Edge node, verify the service hasn't exited due to an error by looking for core files in the /var/log/core directory. In addition, invoke the NSX CLI command get services to confirm whether the service is stopped. If so, invoke start service <service-name> to restart the service. |
3.0.0 |
| Edge Service Status Changed | Medium | edge, autonomous-edge, public-cloud-gateway | Edge service status has changed. <br><br>When event detected: "The service {edge_service_name} changed from {previous_service_state} to {current_service_state}. {service_down_reason} "<br><br>When event resolved: "The service {edge_service_name} changed from {previous_service_state} to {current_service_state}. " |
On the Edge node, verify the service hasn't exited due to an error by looking for core files in the /var/log/core directory. In addition, invoke the NSX CLI command get services to confirm whether the service is stopped. If so, invoke start service <service-name> to restart the service. |
3.0.0 |
| Application Crashed | Critical | global-manager, autonomous-edge, bms, edge, esx, kvm, manager, public-cloud-gateway | Application has crashed and generated a core dump. <br><br>When event detected: "Application on NSX node {node_display_or_host_name} has crashed. The number of core files found is {core_dump_count}. Collect the Support Bundle including core dump files and contact VMware Support team. "<br><br>When event resolved: "All core dump files are withdrawn from system. " |
Collect a Support Bundle for NSX node {node_display_or_host_name} using the NSX Manager UI or API. Note that core dumps can be set to move or copy into the NSX Tech Support Bundle, which respectively removes or preserves the local copy on the node. A copy of the Support Bundle that includes the core dump files is essential for the VMware Support team to troubleshoot the issue, so save a recent copy of the Tech Support Bundle including the core dump files before removing them from the system. Refer to the KB article for more details. |
4.0.0 |
| Application Crashed On DPU | Critical | dpu | Application has crashed and generated a core dump on dpu. <br><br>When event detected: "Application on DPU {dpu_id} has crashed. The number of core files found is {core_dump_count}. Collect the Support Bundle including core dump files and contact VMware Support team. "<br><br>When event resolved: "All core dump files are withdrawn from system. " |
Collect a Support Bundle for DPU {dpu_id} using the NSX Manager UI or API. Note that core dumps can be set to move or copy into the NSX Tech Support Bundle, which respectively removes or preserves the local copy on the node. A copy of the Support Bundle that includes the core dump files is essential for the VMware Support team to troubleshoot the issue, so save a recent copy of the Tech Support Bundle including the core dump files before removing them from the system. Refer to the KB article for more details. |
4.1.1 |
| Compute Manager Lost Connectivity | Critical | manager, global-manager | Compute Manager connection status is down. <br><br>When event detected: "Connection status of Compute Manager {cm_name} having id {cm_id} is DOWN. "<br><br>When event resolved: "Connection status of Compute Manager {cm_name} having id {cm_id} is UP again. " |
Check the errors reported for Compute Manager {cm_name} with id {cm_id} and resolve them. |
4.1.2 |
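The service status and restart flow described in the actions above can be scripted roughly as follows. This is a minimal sketch only, assuming the init script returns a non-zero exit code when the service is not running and that the commands are run as root; the service name is a placeholder argument.

```bash
#!/bin/bash
# Minimal sketch: check a node service and restart it if needed (run as root).
SERVICE="${1:?usage: $0 <service_name>}"

if [ -x "/etc/init.d/${SERVICE}" ]; then
    # SysV-style init script is available on this node.
    /etc/init.d/"${SERVICE}" status || /etc/init.d/"${SERVICE}" restart
    /etc/init.d/"${SERVICE}" status   # re-check that the service is now running
else
    # Fall back to systemd when no init script is present.
    systemctl status "${SERVICE}" --no-pager || systemctl restart "${SERVICE}"
    systemctl status "${SERVICE}" --no-pager
fi
```

If the service still fails to start after the restart, collect logs and contact VMware Support as noted above.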
Intelligence Communication Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| TN Flow Exporter Disconnected (deprecated) | High | esx, kvm, bms | A Transport node is disconnected from its NSX Messaging Broker. <br><br>When event detected: "The flow exporter on Transport node {entity_id} is disconnected from its messaging broker {messaging_broker_info}. Data collection is affected. "<br><br>When event resolved: "The flow exporter on Transport node {entity_id} has reconnected to its messaging broker {messaging_broker_info}. " |
Restart the messaging service if it is not running. Resolve the network connection failure between the Transport node flow exporter and its NSX messaging broker. |
3.0.0 |
Intelligence Health Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| CPU Usage Very High (deprecated) | Critical | manager, intelligence | Intelligence node CPU usage is very high. <br><br>When event detected: "The CPU usage on Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage on Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%. " |
Use the top command to check which processes consume the most CPU, then check /var/log/syslog and those processes' local logs for any outstanding errors to be resolved. |
3.0.0 |
| CPU Usage High (deprecated) | Medium | manager, intelligence | Intelligence node CPU usage is high. <br><br>When event detected: "The CPU usage on Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage on Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%. " |
Use the top command to check which processes consume the most CPU, then check /var/log/syslog and those processes' local logs for any outstanding errors to be resolved. |
3.0.0 |
| Memory Usage Very High (deprecated) | Critical | manager, intelligence | Intelligence node memory usage is very high. <br><br>When event detected: "The memory usage on Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage on Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%. " |
Use the top command to check which processes consume the most memory, then check /var/log/syslog and those processes' local logs for any outstanding errors to be resolved. |
3.0.0 |
| Memory Usage High (deprecated) | Medium | manager, intelligence | Intelligence node memory usage is high. <br><br>When event detected: "The memory usage on Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage on Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%. " |
Use the top command to check which processes consume the most memory, then check /var/log/syslog and those processes' local logs for any outstanding errors to be resolved. |
3.0.0 |
| Disk Usage Very High (deprecated) | Critical | manager, intelligence | Intelligence node disk usage is very high. <br><br>When event detected: "The disk usage of disk partition {disk_partition_name} on Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of disk partition {disk_partition_name} on Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%. " |
Examine disk partition {disk_partition_name} and see if there are any unexpected large files that can be removed. |
3.0.0 |
| Disk Usage High (deprecated) | Medium | manager, intelligence | Intelligence node disk usage is high. <br><br>When event detected: "The disk usage of disk partition {disk_partition_name} on Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of disk partition {disk_partition_name} on Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%. " |
Examine disk partition {disk_partition_name} and see if there are any unexpected large files that can be removed. |
3.0.0 |
| Data Disk Partition Usage Very High (deprecated) | Critical | manager, intelligence | Intelligence node data disk partition usage is very high. <br><br>When event detected: "The disk usage of disk partition /data on Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of disk partition /data on Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%. " |
Stop NSX Intelligence data collection until the disk usage is below the threshold. In the NSX UI, navigate to System | Appliances | NSX Intelligence Appliance, then click ACTIONS | Stop Collecting Data. |
3.0.0 |
| Data Disk Partition Usage High (deprecated) | Medium | manager, intelligence | Intelligence node data disk partition usage is high. <br><br>When event detected: "The disk usage of disk partition /data on Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of disk partition /data on Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%. " |
Stop NSX Intelligence data collection until the disk usage is below the threshold. Examine disk partition /data and see if there are any unexpected large files that can be removed. |
3.0.0 |
| Storage Latency High (deprecated) | Medium | manager, intelligence | Intelligence node storage latency is high. <br><br>When event detected: "The storage latency of disk partition {disk_partition_name} on Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold} milliseconds. "<br><br>When event resolved: "The storage latency of disk partition {disk_partition_name} on Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold} milliseconds. " |
Transient high storage latency may happen due to a spike in I/O requests. If storage latency remains high for more than 30 minutes, consider deploying the NSX Intelligence appliance on a low-latency disk, or not sharing the same storage device with other VMs. |
3.1.0 |
| Node Status Degraded (deprecated) | High | manager, intelligence | Intelligence node status is degraded. <br><br>When event detected: "Intelligence node {intelligence_node_id} is degraded. "<br><br>When event resolved: "Intelligence node {intelligence_node_id} is running properly. " |
Invoke the NSX API GET /napp/api/v1/platform/monitor/category/health to check which specific pod is down and the reason behind it. Invoke the following CLI command to restart the degraded service: kubectl rollout restart <statefulset/deployment> <service_name> -n <namespace> (see the sketch after this table). |
3.0.0 |
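The degraded-node remediation above can be illustrated with a short sketch. It assumes basic authentication against the NSX API is enabled; the manager address, credentials, service name, and namespace shown are placeholders to substitute.

```bash
#!/bin/bash
# Minimal sketch: find the degraded pod, then restart the corresponding service.
NSX_MGR="nsx-mgr.example.com"   # placeholder NSX Manager address

# Check which specific pod is down and why.
curl -k -u admin "https://${NSX_MGR}/napp/api/v1/platform/monitor/category/health"

# Restart the degraded service (use statefulset instead of deployment where applicable).
kubectl rollout restart deployment <service_name> -n <namespace>
kubectl rollout status deployment <service_name> -n <namespace>
```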
IPAM Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| IP Block Usage Very High | Medium | manager | IP block usage is very high. <br><br>When event detected: "IP block usage of {intent_path} is very high. IP block nearing its total capacity, creation of subnet using IP block might fail. "<br><br>When event resolved: "IP block usage of {intent_path} is below threshold level. " |
Review the IP block usage. Use a new IP block for resource creation or delete unused IP subnets from the IP block. To check which subnets an IP block is used for, in the NSX UI navigate to Networking | IP Address Pools | IP Address Pools tab, select the IP pools where the IP block is used, and check the Subnets and Allocated IPs columns. If no allocation has been made from the IP pool and it will not be used in the future, delete the subnet or the IP pool. Use the following APIs to check whether the IP block is used by an IP pool and whether any IP allocation has been made (see the sketch after this table): to get the configured subnets of an IP pool, invoke the NSX API GET /policy/api/v1/infra/ip-pools/<ip-pool>/ip-subnets; to get IP allocations, invoke the NSX API GET /policy/api/v1/infra/ip-pools/<ip-pool>/ip-allocations. Note: delete an IP pool or subnet only if it has no allocated IPs and will not be used in the future. |
3.1.2 |
| IP Pool Usage Very High | Medium | manager | IP pool usage is very high. <br><br>When event detected: "IP pool usage of {intent_path} is very high. IP pool nearing its total capacity. Creation of entity/service depends on IP being allocated from IP pool might fail. "<br><br>When event resolved: "IP pool usage of {intent_path} is normal now. " |
Review the IP pool usage. Release unused IP allocations from the IP pool, or create a new IP pool and use it. In the NSX UI, navigate to Networking | IP Address Pools | IP Address Pools tab, select the IP pool and check the Allocated IPs column, which shows the IPs allocated from the IP pool. If any IPs are not being used, they can be released. To release unused IP allocations, invoke the NSX API DELETE /policy/api/v1/infra/ip-pools/<ip-pool>/ip-allocations/<ip-allocation>. |
3.1.2 |
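As a worked example of the API checks above, the following sketch lists the subnets and allocations of an IP pool and then releases a single unused allocation. The manager address, credentials, pool ID, and allocation ID are placeholders, and basic authentication against the NSX API is assumed.

```bash
#!/bin/bash
# Minimal sketch: review an IP pool before deleting subnets or releasing allocations.
NSX_MGR="nsx-mgr.example.com"   # placeholder NSX Manager address
POOL="my-ip-pool"               # placeholder IP pool ID

# List configured subnets and current allocations for the pool.
curl -k -u admin "https://${NSX_MGR}/policy/api/v1/infra/ip-pools/${POOL}/ip-subnets"
curl -k -u admin "https://${NSX_MGR}/policy/api/v1/infra/ip-pools/${POOL}/ip-allocations"

# Release one allocation, only after confirming the IP is no longer in use.
curl -k -u admin -X DELETE \
  "https://${NSX_MGR}/policy/api/v1/infra/ip-pools/${POOL}/ip-allocations/<ip-allocation>"
```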
Licenses Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Sha Metering Plugin Down | Critical | manager | License SHA metering plugin on ESXi host is down or unhealthy. <br><br>When event detected: "The license SHA metering plugin on ESXi host {transport_node_name} ({transport_node_id}) is down or unhealthy for three consecutive days. Due to this, metering data from ESXi host is impacted. "<br><br>When event resolved: "The license SHA metering plugin on ESXi host {transport_node_name} ({transport_node_id}) is up or healthy again. " |
To check the license SHA metering plugin status, invoke the NSX API GET /api/v1/systemhealth/plugins/status/{transport_node_id} (see the sketch after this table). If there is no data in the response, the connection between NSX Manager and the ESXi host is broken or the SHA process on the ESXi host is down. If the response contains plugin statuses, locate the plugin status named license_metering_monitor and check its content in detail. To restore the license SHA metering plugin on the ESXi host, log in to the ESXi host and restart the SHA process by invoking the command /etc/init.d/netopa restart. |
4.1.2 |
| License Expired | Critical | global-manager, manager | A license has expired. <br><br>When event detected: "The {license_edition_type} license key ending with {displayed_license_key}, has expired. "<br><br>When event resolved: "The expired {license_edition_type} license key ending with {displayed_license_key}, has been removed, updated or is no longer about to expire. " |
Add a new, non-expired license using the NSX UI by navigating to System | Licenses, then click ADD and specify the key of the new license. Delete the expired license by selecting its checkbox and clicking DELETE. |
3.0.0 |
| License Is About To Expire | Medium | global-manager, manager | A license is about to expire. <br><br>When event detected: "The {license_edition_type} license key ending with {displayed_license_key}, is about to expire. "<br><br>When event resolved: "The expiring {license_edition_type} license key ending with {displayed_license_key}, has been removed, updated or is no longer about to expire. " |
The license is about to expire in several days. Plan to add a new, non-expiring license using the NSX UI by navigating to System | Licenses, then click ADD and specify the key of the new license. The expired license should then be deleted by selecting its checkbox and clicking DELETE. |
3.0.0 |
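The SHA metering plugin check above can be sketched as follows. The manager address, credentials, and transport node ID are placeholders, basic authentication against the NSX API is assumed, and the restart command is run on the ESXi host itself.

```bash
#!/bin/bash
# Minimal sketch: check the license SHA metering plugin status for a transport node.
NSX_MGR="nsx-mgr.example.com"   # placeholder NSX Manager address
TN_ID="<transport_node_id>"     # placeholder transport node ID

# An empty response suggests the Manager-to-host channel or the SHA process is down;
# otherwise look for the plugin named license_metering_monitor in the output.
curl -k -u admin "https://${NSX_MGR}/api/v1/systemhealth/plugins/status/${TN_ID}"

# On the ESXi host, restart the SHA process to restore the plugin:
#   /etc/init.d/netopa restart
```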
Load Balancer Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| LB CPU Very High | Medium | edge | Load balancer CPU usage is very high. <br><br>When event detected: "The CPU usage of load balancer {entity_id} is very high. The threshold is {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of load balancer {entity_id} is low enough. The threshold is {system_usage_threshold}%. " |
If the load balancer CPU utilization is higher than the system usage threshold, the workload is too high for this load balancer. Rescale the load balancer service by changing the load balancer size from small to medium or from medium to large. If the CPU utilization of this load balancer is still high, consider adjusting the Edge appliance form factor size or moving load balancer services to other Edge nodes for the applicable workload. |
3.0.0 |
| LB Status Degraded | Medium | manager | Load balancer service is degraded. <br><br>When event detected: "The load balancer service {entity_id} is degraded. "<br><br>When event resolved: "The load balancer service {entity_id} is not degraded. " |
For centralized load balancer: Check the load balancer status on standby Edge node as the degraded status means the load balancer status on standby Edge node is not ready. On standby Edge node, invoke the NSX CLI command get load-balancer <lb-uuid> status. If the LB-State of load balancer service is not_ready or there is no output, make the Edge node enter maintenance mode, then exit maintenance mode. For distributed load balancer: |
3.1.2 |
| DLB Status Down | Critical | manager | Distributed load balancer service is down. <br><br>When event detected: "The distributed load balancer service {entity_id} is down. "<br><br>When event resolved: "The distributed load balancer service {entity_id} is up. " |
On the ESXi host node, invoke the NSX CLI command get load-balancer <lb-uuid> status. If 'Conflict LSP' is reported, check whether this LSP is attached to another load balancer service and whether this conflict is acceptable. If 'Not Ready LSP' is reported, check the status of this LSP by invoking the NSX CLI command get logical-switch-port status. |
3.1.2 |
| LB Status Down | Critical | edge | Centralized load balancer service is down. <br><br>When event detected: "The centralized load balancer service {entity_id} is down. "<br><br>When event resolved: "The centralized load balancer service {entity_id} is up. " |
On active Edge node, check load balancer status by invoking the NSX CLI command get load-balancer <lb-uuid> status. If the LB-State of load balancer service is not_ready or there is no output, make the Edge node enter maintenance mode, then exit maintenance mode. |
3.0.0 |
| Virtual Server Status Down | Medium | edge | Load balancer virtual service is down. <br><br>When event detected: "The load balancer virtual server {entity_id} is down. "<br><br>When event resolved: "The load balancer virtual server {entity_id} is up. " |
Consult the load balancer pool to determine its status and verify its configuration. If it is incorrectly configured, reconfigure it, then remove the load balancer pool from the virtual server and re-add it. |
3.0.0 |
| Pool Status Down | Medium | edge | Load balancer pool is down. <br><br>When event detected: "The load balancer pool {entity_id} status is down. "<br><br>When event resolved: "The load balancer pool {entity_id} status is up " |
Consult the load balancer pool to determine which members are down by invoking the NSX CLI command get load-balancer <lb-uuid> pool <pool-uuid> status or the NSX API GET /policy/api/v1/infra/lb-services/<lb-service-id>/lb-pools/<lb-pool-id>/detailed-status (see the sketch after this table). If DOWN or UNKNOWN is reported, verify the pool member. Check network connectivity from the load balancer to the impacted pool members. Validate the application health of each pool member, and validate the health of each pool member using the configured monitor. When the health of the member is established, the pool member status is updated to healthy based on the 'Rise Count' configuration in the monitor. Remediate the issue by rebooting the pool member, or make the Edge node enter maintenance mode and then exit maintenance mode. |
3.0.0 |
| LB Edge Capacity In Use High | Medium | edge | Load balancer usage is high. <br><br>When event detected: "The usage of load balancer service in Edge node {entity_id} is high. The threshold is {system_usage_threshold}%. "<br><br>When event resolved: "The usage of load balancer service in Edge node {entity_id} is low enough. The threshold is {system_usage_threshold}%. " |
If multiple LB instances have been configured in this Edge node, deploy a new Edge node and move some LB instances to it. If only a single LB instance (small/medium/etc.) has been configured in an Edge node of the same size (small/medium/etc.), deploy a new Edge node of a bigger size and move the LB instance to that new Edge node. |
3.1.2 |
| LB Pool Member Capacity In Use Very High | Critical | edge | Load balancer pool member usage is very high. <br><br>When event detected: "The usage of pool members in Edge node {entity_id} is very high. The threshold is {system_usage_threshold}%. "<br><br>When event resolved: "The usage of pool members in Edge node {entity_id} is low enough. The threshold is {system_usage_threshold}%. " |
Deploy a new Edge node and move the load balancer service from existing Edge nodes to the newly deployed Edge node. |
3.1.2 |
| Load Balancing Configuration Not Realized Due To Lack Of Memory | Medium | edge | Load balancer configuration is not realized due to high memory usage on Edge node. <br><br>When event detected: "The load balancer configuration {entity_id} is not realized, due to high memory usage on Edge node {transport_node_id}. "<br><br>When event resolved: "The load balancer configuration {entity_id} is realized on {transport_node_id}. " |
Prefer defining small and medium sized load balancers over large sized load balancers. Spread out load balancer services among the available Edge nodes. Reduce the number of virtual servers defined. |
3.2.0 |
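As a sketch of the pool status check above: the NSX CLI commands are run on the Edge node, while the equivalent detailed-status API can be queried from any host that can reach the NSX Manager. The manager address, credentials, and IDs are placeholders, and basic authentication is assumed.

```bash
#!/bin/bash
# Minimal sketch: fetch the detailed status of a load balancer pool via the policy API.
# (On the Edge node NSX CLI, the equivalent commands are:
#    get load-balancer <lb-uuid> status
#    get load-balancer <lb-uuid> pool <pool-uuid> status)
NSX_MGR="nsx-mgr.example.com"   # placeholder NSX Manager address

curl -k -u admin \
  "https://${NSX_MGR}/policy/api/v1/infra/lb-services/<lb-service-id>/lb-pools/<lb-pool-id>/detailed-status"
```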
Logging Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Log Retention Time Too Low | Info | esx, edge, manager, public-cloud-gateway, global-manager | Log files will be deleted before the set retention period. <br><br>When event detected: "One or more log files on the node will be deleted before the set retention period due to excessive logging. {log_context} "<br><br>When event resolved: "Estimated log maximum duration is now equal to or larger than the expected duration. " |
Follow the steps below to back up the log files before they are deleted. 1. Get the detailed report of log files on the node: {report_file_path}. 2. Review the Estimated Maximum Duration and Desired Duration in the detailed report; the Estimated Maximum Duration indicates whether the log files will be deleted before the retention period indicated by the Desired Duration. If needed, back up old log files. |
4.1.1 |
| Remote Logging Not Configured | Medium | global-manager, manager | Remote logging not configured. <br><br>When event detected: "One or more {node_type_name} nodes are not currently configured to forward log messages to a remote logging server. "<br><br>When event resolved: "All {node_type_name} nodes are configured to forward log messages to at least one remote logging server currently. " |
1. Invoke the NSX API GET /api/v1/configs/central-config/logging-servers to see the nodes on which remote logging is not configured (see the sketch after this table). |
4.1.2 |
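A minimal sketch of the check in the Remote Logging Not Configured action above; the manager address and credentials are placeholders, and basic authentication against the NSX API is assumed.

```bash
#!/bin/bash
# Minimal sketch: list the central logging configuration to spot nodes that have no
# remote logging server configured.
NSX_MGR="nsx-mgr.example.com"   # placeholder NSX Manager address

curl -k -u admin "https://${NSX_MGR}/api/v1/configs/central-config/logging-servers"
```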
Malware Prevention Health Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Service Status Down | High | manager | Service status is down. <br><br>When event detected: "Service {mps_service_name} is not running on {transport_node_name}. "<br><br>When event resolved: "Service {mps_service_name} is running properly on {transport_node_name}. " |
On the {transport_node_type} transport node identified by {transport_node_name}, invoke the NSX CLI command get services to check the status of {mps_service_name} (see the sketch after this table). Inspect /var/log/syslog for any suspicious errors. Refer to the sections for the {transport_node_type} transport node in the KB. |
4.0.1 |
| File Extraction Service Unreachable | High | manager | Service status is degraded. <br><br>When event detected: "Service {mps_service_name} is degraded on {transport_node_name}. Unable to communicate with file extraction functionality. All file extraction abilities on the {transport_node_name} are paused. "<br><br>When event resolved: "Service {mps_service_name} is running properly on {transport_node_name}. " |
On the {transport_node_type} transport node identified by {transport_node_name}, check the status of the {mps_service_name} service that is responsible for file_extraction. Inspect /var/log/syslog for any suspicious errors. Refer to the sections for the {transport_node_type} transport node in the KB. |
4.0.1 |
| Database Unreachable | High | manager | Service status is degraded. <br><br>When event detected: "Service {mps_service_name} is degraded on NSX Application Platform. It is unable to communicate with Malware Prevention database. "<br><br>When event resolved: "Service {mps_service_name} is running properly on NSX Application Platform. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services to check which service is degraded. Invoke the NSX API GET /napp/api/v1/platform/monitor/feature/health to check which specific service is down and the reason behind it. Invoke the following CLI command to restart the degraded service: kubectl rollout restart <statefulset/deployment> <service_name> -n <namespace>. Determine the status of the Malware Prevention database service. |
4.0.1 |
| Analyst API Service Unreachable | High | manager | Service status is degraded. <br><br>When event detected: "Service {mps_service_name} is degraded on NSX Application Platform. It is unable to communicate with analyst_api service. Inspected file verdicts may not be up to date. "<br><br>When event resolved: "Service {mps_service_name} is running properly on NSX Application Platform. " |
The Analyst API service, which is external to the datacenter, is unreachable. Check connectivity to the internet. This could be temporary and may recover on its own. If it does not recover within a few minutes, collect the NSX Application Platform support bundle and raise a support ticket with the VMware support team. |
4.0.1 |
| NTICS Reputation Service Unreachable | High | manager | Service status is degraded. <br><br>When event detected: "Service {mps_service_name} is degraded on NSX Application Platform. It is unable to communicate with NTICS reputation service. Inspected file reputations may not be up to date. "<br><br>When event resolved: "Service {mps_service_name} is running properly on NSX Application Platform. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services to check which service is degraded. Invoke the NSX API GET /napp/api/v1/platform/monitor/feature/health to check which specific service is down and the reason behind it. Invoke the following CLI command to restart the degraded service: kubectl rollout restart <statefulset/deployment> <service_name> -n <namespace>. Determine whether access to the NTICS service is down. |
4.1.0 |
| Service Disk Usage Very High | High | manager | Service disk usage is very high. <br><br>When event detected: "The {disk_purpose} disk usage for service {mps_service_name} on {transport_node_name} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The {disk_purpose} disk usage for service {mps_service_name} on {transport_node_name} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. " |
On the {transport_node_type} transport node identified by {transport_node_name}, reduce the file retention period or, in the case of a host node, reduce the Malware Prevention load by moving some VMs to another host node. Refer to the sections for the {transport_node_type} transport node in the KB. |
4.1.2 |
| Service Disk Usage High | Medium | manager | Service disk usage is high. <br><br>When event detected: "The {disk_purpose} disk usage for service {mps_service_name} on {transport_node_name} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The {disk_purpose} disk usage for service {mps_service_name} on {transport_node_name} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. " |
On the {transport_node_type} transport node identified by {transport_node_name}, reduce the file retention period or, in the case of a host node, reduce the Malware Prevention load by moving some VMs to another host node. Refer to the sections for the {transport_node_type} transport node in the KB. |
4.1.2 |
| Service VM CPU Usage High | Medium | manager | Malware Prevention Service VM CPU usage is high. <br><br>When event detected: "The CPU usage on Malware Prevention Service VM {entity_id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage on Malware prevention Service VM {entity_id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. " |
Migrate VMs off the ESXi host {nsx_esx_tn_name} containing SVM {entity_id}, which is reporting high usage, to reduce the load on the SVM. |
4.1.2 |
| Service VM CPU Usage Very High | High | manager | Malware Prevention Service VM CPU usage is very high. <br><br>When event detected: "The CPU usage on Malware Prevention Service VM {entity_id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage on Malware prevention Service VM {entity_id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. " |
Migrate VMs off the ESXi host {nsx_esx_tn_name} containing SVM {entity_id}, which is reporting high usage, to reduce the load on the SVM. |
4.1.2 |
| Service VM Memory Usage High | Medium | manager | Malware Prevention Service VM memory usage is high. <br><br>When event detected: "The memory usage on Malware Prevention Service VM {entity_id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage on Malware prevention Service VM {entity_id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. " |
Migrate VMs off the ESXi host {nsx_esx_tn_name} containing SVM {entity_id}, which is reporting high usage, to reduce the load on the SVM. |
4.1.2 |
| Service VM Memory Usage Very High | High | manager | Malware Prevention Service VM memory usage is very high. <br><br>When event detected: "The memory usage on Malware Prevention Service VM {entity_id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage on Malware prevention Service VM {entity_id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. " |
Migrate VMs off the ESXi host {nsx_esx_tn_name} containing SVM {entity_id}, which is reporting high usage, to reduce the load on the SVM. |
4.1.2 |
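The service status check above can be followed by a quick log search on the transport node. This is a minimal sketch run from a root shell on the node; the service name is a placeholder, and the NSX CLI command get services mentioned above is run separately from the NSX CLI.

```bash
#!/bin/bash
# Minimal sketch: search the node syslog for recent errors mentioning the service.
SERVICE="<mps_service_name>"    # placeholder Malware Prevention service name

grep -i "${SERVICE}" /var/log/syslog | grep -i 'error' | tail -n 50
```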
Manager Health Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Manager CPU Usage Very High | Critical | global-manager, manager | Manager node CPU usage is very high. <br><br>When event detected: "The CPU usage on Manager node {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage on Manager node {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. " |
Review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
3.0.0 |
| Manager CPU Usage High | Medium | global-manager, manager | Manager node CPU usage is high. <br><br>When event detected: "The CPU usage on Manager node {entity_id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage on Manager node {entity_id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. " |
Review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
3.0.0 |
| Manager Memory Usage Very High | Critical | global-manager, manager | Manager node memory usage is very high. <br><br>When event detected: "The memory usage on Manager node {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage on Manager node {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. " |
Review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
3.0.0 |
| Manager Memory Usage High | Medium | global-manager, manager | Manager node memory usage is high. <br><br>When event detected: "The memory usage on Manager node {entity_id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage on Manager node {entity_id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. " |
Review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size. |
3.0.0 |
| Manager Disk Usage Very High | Critical | global-manager, manager | Manager node disk usage is very high. <br><br>When event detected: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. " |
Examine the partition with high usage and see if there are any unexpected large files that can be removed (see the sketch after this table). |
3.0.0 |
| Manager Disk Usage High | Medium | global-manager, manager | Manager node disk usage is high. <br><br>When event detected: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. " |
Examine the partition with high usage and see if there are any unexpected large files that can be removed. |
3.0.0 |
| Manager Config Disk Usage Very High | Critical | global-manager, manager | Manager node config disk usage is very high. <br><br>When event detected: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. This can be an indication of high disk usage by the NSX Datastore service under the /config/corfu directory. "<br><br>When event resolved: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. " |
Run the following tool and contact GSS if any issues are reported: /opt/vmware/tools/support/inspect_checkpoint_issues.py |
3.0.0 |
| Manager Config Disk Usage High | Medium | global-manager, manager | Manager node config disk usage is high. <br><br>When event detected: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. This can be an indication of rising disk usage by the NSX Datastore service under the /config/corfu directory. "<br><br>When event resolved: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. " |
Run the following tool and contact GSS if any issues are reported: /opt/vmware/tools/support/inspect_checkpoint_issues.py |
3.0.0 |
| Operations DB Disk Usage Very High | Critical | manager | Manager node nonconfig disk usage is very high. <br><br>When event detected: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. This can be an indication of high disk usage by the NSX Datastore service under the /nonconfig/corfu directory. "<br><br>When event resolved: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. " |
Run the following tool and contact GSS if any issues are reported: /opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig |
3.0.1 |
| Operations DB Disk Usage High | Medium | manager | Manager node nonconfig disk usage is high. <br><br>When event detected: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. This can be an indication of rising disk usage by the NSX Datastore service under the /nonconfig/corfu directory. "<br><br>When event resolved: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. " |
Run the following tool and contact GSS if any issues are reported: /opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig |
3.0.1 |
| Duplicate IP Address | Medium | manager | Manager node's IP address is in use by another device. <br><br>When event detected: "Manager node {entity_id} IP address {duplicate_ip_address} is currently being used by another device in the network. "<br><br>When event resolved: "The device using the IP address assigned to Manager node {entity_id} appears to no longer be using {duplicate_ip_address}. " |
1. Determine which device is using the Manager's IP address and assign the device a new IP address. Note, reconfiguring the Manager to use a new IP address is not supported. |
3.0.0 |
| Storage Error | Critical | global-manager, manager | Manager node disk is read-only. <br><br>When event detected: "The following disk partition on the Manager node {entity_id} is in read-only mode: {disk_partition_name} "<br><br>When event resolved: "The following disk partition on the Manager node {entity_id} has recovered from read-only mode: {disk_partition_name} " |
Examine the read-only partition to see whether a reboot resolves the issue or the disk needs to be replaced. Contact GSS for more information. |
3.0.2 |
| Missing DNS Entry For Manager FQDN | Critical | global-manager, manager | The DNS entry for the Manager FQDN is missing. <br><br>When event detected: "The DNS configuration for Manager node {manager_node_name} ({entity_id}) is incorrect. The Manager node is dual-stack and/or CA-signed API certificate is used, but the IP address(es) of the Manager node do not resolve to an FQDN or resolve to different FQDNs. "<br><br>When event resolved: "The DNS configuration for Manager node {manager_node_name} ({entity_id}) is correct. Either the Manager node is not dual-stack and CA-signed API certificate is no longer used, or the IP address(es) of the Manager node resolve to the same FQDN. " |
1. Ensure proper DNS servers are configured in the Manager node. |
4.1.0 |
| Missing DNS Entry For Vip FQDN | Critical | manager | Missing FQDN entry for the Manager VIP. <br><br>When event detected: "In case of dual stack or CA-signed API certificate for a NSX Manager, virtual IPv4 address {ipv4_address} and virtual IPv6 address {ipv6_address} for Manager node {entity_id} should resolve to the same FQDN. "<br><br>When event resolved: "Dual stack VIP addresses for Manager node {entity_id} resolved to same FQDN. " |
Examine the DNS entry for the VIP addresses to see if they resolve to the same FQDN. |
4.1.0 |
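For the disk usage alarms above, the following sketch shows one way to find unexpectedly large files on the reported partition from a shell on the Manager appliance; the partition path is a placeholder taken from the alarm.

```bash
#!/bin/bash
# Minimal sketch: show partition usage and its largest files/directories.
PARTITION="/var/log"   # placeholder: substitute the {disk_partition_name} from the alarm

df -h "${PARTITION}"
du -ah "${PARTITION}" 2>/dev/null | sort -rh | head -n 20
```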
MTU Check Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| MTU Mismatch Within Transport Zone | High | manager | MTU configuration mismatch between Transport Nodes attached to the same Transport Zone. <br><br>When event detected: "MTU configuration mismatch between Transport Nodes (ESXi, KVM and Edge) attached to the same Transport Zone. MTU values on all switches attached to the same Transport Zone not being consistent will cause connectivity issues. "<br><br>When event resolved: "All MTU values between Transport Nodes attached to the same Transport Zone are consistent now. " |
1. Navigate to System | Fabric | Settings | MTU Configuration Check | Inconsistent in the NSX UI to see more details about the mismatch. |
3.2.0 |
| Global Router MTU Too Big | Medium | manager | The global router MTU configuration is bigger than the MTU of overlay Transport Zone. <br><br>When event detected: "The global router MTU configuration is bigger than MTU of switches in overlay Transport Zone which connects to Tier0 or Tier1. Global router MTU value should be less than all switches MTU value by at least a 100 as we require 100 quota for Geneve encapsulation. "<br><br>When event resolved: "The global router MTU is less than the MTU of overlay Transport Zone now. " |
1. Navigate to System | Fabric | Settings | MTU Configuration Check | Inconsistent in the NSX UI to see more details about the mismatch. |
3.2.0 |
NAT Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| SNAT Port Usage On Gateway Is High | Critical | edge, public-cloud-gateway | SNAT port usage on the Gateway is high. <br><br>When event detected: "SNAT ports usage on logical router {entity_id} for SNAT IP {snat_ip_address} has reached the high threshold value of {system_usage_threshold}%. New flows will not be SNATed when usage reaches the maximum limit. "<br><br>When event resolved: "SNAT ports usage on logical router {entity_id} for SNAT IP {snat_ip_address} has reached below the high threshold value of {system_usage_threshold}%. " |
Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> connection state, using the correct interface UUID, and check the various SNAT mappings for the SNAT IP {snat_ip_address}. Check that the traffic flows going through the gateway are not a denial-of-service attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider adding more SNAT IP addresses to distribute the load or routing new traffic to another Edge node. |
3.2.0 |
NCP Health Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| NCP Plugin Down | Critical | manager | Manager Node has detected the NCP is down or unhealthy. <br><br>When event detected: "Manager Node has detected the NCP is down or unhealthy. "<br><br>When event resolved: "Manager Node has detected the NCP is up or healthy again. " |
To find the clusters that are having issues, use the NSX UI and navigate to the Alarms page. The Entity name value for this alarm instance identifies the cluster name. Alternatively, invoke the NSX API GET /api/v1/systemhealth/container-cluster/ncp/status (see the sketch after this table) to fetch all cluster statuses and determine the name of any clusters that report DOWN or UNKNOWN. Then, on the NSX UI Inventory | Container | Clusters page, find the cluster by name and click the Nodes tab, which lists all Kubernetes and PAS cluster members. For Kubernetes cluster: |
3.0.0 |
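A minimal sketch of the cluster status query referenced above; the manager address and credentials are placeholders, and basic authentication against the NSX API is assumed.

```bash
#!/bin/bash
# Minimal sketch: fetch NCP status for all container clusters and look for entries
# reporting DOWN or UNKNOWN.
NSX_MGR="nsx-mgr.example.com"   # placeholder NSX Manager address

curl -k -u admin "https://${NSX_MGR}/api/v1/systemhealth/container-cluster/ncp/status"
```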
Node Agents Health Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Node Agents Down On DPU | High | dpu | The agents running inside the Node VM appear to be down on DPU. <br><br>When event detected: "The agents running inside the Node VM appear to be down on DPU {dpu_id}. "<br><br>When event resolved: "The agents inside the Node VM are running on DPU {dpu_id}. " |
1. If Vmk50 on DPU {dpu_id} is missing, refer to the Knowledge Base article https://kb.vmware.com/s/article/67432. |
4.0.0 |
| Node Agents Down | High | esx, kvm | The agents running inside the Node VM appear to be down. <br><br>When event detected: "The agents running inside the Node VM appear to be down. "<br><br>When event resolved: "The agents inside the Node VM are running. " |
For ESX: |
3.0.0 |
NSX Application Platform Communication Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Manager Disconnected | High | manager, intelligence | The NSX Application Platform cluster is disconnected from the NSX management cluster. <br><br>When event detected: "The NSX Application Platform cluster {napp_cluster_id} is disconnected from the NSX management cluster. "<br><br>When event resolved: "The NSX Application Platform cluster {napp_cluster_id} is reconnected to the NSX management cluster. " |
Check whether the manager cluster certificate, manager node certificates, Kafka certificate and ingress certificate match on both NSX Manager and the NSX Application Platform cluster. Check the expiration dates of these certificates to make sure they are valid (see the sketch after this table). Check the network connection between NSX Manager and the NSX Application Platform cluster and resolve any network connection failures. |
3.2.0 |
| Delay Detected In Messaging Rawflow | Critical | manager, intelligence | Slow data processing detected in messaging topic Raw Flow. <br><br>When event detected: "The number of pending messages in the messaging topic Raw Flow is above the pending message threshold of {napp_messaging_lag_threshold}. "<br><br>When event resolved: "The number of pending messages in the messaging topic Raw Flow is below the pending message threshold of {napp_messaging_lag_threshold}. " |
Add nodes and then scale up the NSX Application Platform cluster. If the bottleneck can be attributed to a specific service, for example, the analytics service, then scale up the specific service when the new nodes are added. |
3.2.0 |
| Delay Detected In Messaging Overflow | Critical | manager, intelligence | Slow data processing detected in messaging topic Over Flow. <br><br>When event detected: "The number of pending messages in the messaging topic Over Flow is above the pending message threshold of {napp_messaging_lag_threshold}. "<br><br>When event resolved: "The number of pending messages in the messaging topic Over Flow is below the pending message threshold of {napp_messaging_lag_threshold}. " |
Add nodes and then scale up the NSX Application Platform cluster. If the bottleneck can be attributed to a specific service, for example, the analytics service, then scale up the specific service when the new nodes are added. |
3.2.0 |
| TN Flow Exp Disconnected | High | esx, kvm, bms | A Transport node is disconnected from its NSX Messaging Broker. <br><br>When event detected: "The flow exporter on Transport node {entity_id} is disconnected from its messaging broker {messaging_broker_info}. Data collection is affected. "<br><br>When event resolved: "The flow exporter on Transport node {entity_id} has reconnected to its messaging broker {messaging_broker_info}. " |
Restart the messaging service if it is not running. Resolve the network connection failure between the Transport node flow exporter and its NSX messaging broker. |
3.2.0 |
| TN Flow Exp Disconnected On DPU | High | dpu | A Transport node is disconnected from its NSX messaging broker. <br><br>When event detected: "The flow exporter on Transport node {entity_id} DPU {dpu_id} is disconnected from its messaging broker {messaging_broker_info}. Data collection is affected. "<br><br>When event resolved: "The flow exporter on Transport node {entity_id} DPU {dpu_id} has reconnected to its messaging broker {messaging_broker_info}. " |
Restart the messaging service if it is not running. Resolve the network connection failure between the Transport node flow exporter and its NSX messaging broker. |
4.0.0 |
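One way to spot an expired certificate during the checks above is to inspect the validity dates of the certificates presented by the NSX Manager and the NSX Application Platform ingress endpoints. This is a minimal sketch using openssl with placeholder addresses; certificates used only inside the platform (for example the Kafka certificate) may need to be checked from within the cluster instead.

```bash
#!/bin/bash
# Minimal sketch: print the validity period of the certificate served by each endpoint.
for endpoint in nsx-mgr.example.com:443 napp-ingress.example.com:443; do   # placeholders
    echo "== ${endpoint}"
    echo | openssl s_client -connect "${endpoint}" -servername "${endpoint%%:*}" 2>/dev/null \
        | openssl x509 -noout -dates
done
```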
NSX Application Platform Health Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Cluster CPU Usage Very High | Critical | manager, intelligence | NSX Application Platform cluster CPU usage is very high. <br><br>When event detected: "The CPU usage of NSX Application Platform cluster {napp_cluster_id} is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of NSX Application Platform cluster {napp_cluster_id} is below the very high threshold value of {system_usage_threshold}%. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the System Load field of individual services to see which service is under pressure. See if the load can be reduced. If more computing power is required, click on the Scale Out button to request more resources. |
3.2.0 |
| Cluster CPU Usage High | Medium | manager, intelligence | NSX Application Platform cluster CPU usage is high. <br><br>When event detected: "The CPU usage of NSX Application Platform cluster {napp_cluster_id} is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of NSX Application Platform cluster {napp_cluster_id} is below the high threshold value of {system_usage_threshold}%. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the System Load field of individual services to see which service is under pressure. See if the load can be reduced. If more computing power is required, click on the Scale Out button to request more resources. |
3.2.0 |
| Cluster Memory Usage Very High | Critical | manager, intelligence | NSX Application Platform cluster memory usage is very high. <br><br>When event detected: "The memory usage of NSX Application Platform cluster {napp_cluster_id} is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of NSX Application Platform cluster {napp_cluster_id} is below the very high threshold value of {system_usage_threshold}%. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Memory field of individual services to see which service is under pressure. See if the load can be reduced. If more memory is required, click on the Scale Out button to request more resources. |
3.2.0 |
| Cluster Memory Usage High | Medium | manager, intelligence | NSX Application Platform cluster memory usage is high. <br><br>When event detected: "The memory usage of NSX Application Platform cluster {napp_cluster_id} is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of NSX Application Platform cluster {napp_cluster_id} is below the high threshold value of {system_usage_threshold}%. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Memory field of individual services to see which service is under pressure. See if the load can be reduced. If more memory is required, click on the Scale Out button to request more resources. |
3.2.0 |
| Cluster Disk Usage Very High | Critical | manager, intelligence | NSX Application Platform cluster disk usage is very high. <br><br>When event detected: "The disk usage of NSX Application Platform cluster {napp_cluster_id} is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of NSX Application Platform cluster {napp_cluster_id} is below the very high threshold value of {system_usage_threshold}%. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Storage field of individual services to see which service is under pressure. See if the load can be reduced. If more disk storage is required, click on the Scale Out button to request more resources. If the data storage service is under strain, you can alternatively click the Scale Up button to increase the disk size. |
3.2.0 |
| Cluster Disk Usage High | Medium | manager, intelligence | NSX Application Platform cluster disk usage is high. <br><br>When event detected: "The disk usage of NSX Application Platform cluster {napp_cluster_id} is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of NSX Application Platform cluster {napp_cluster_id} is below the high threshold value of {system_usage_threshold}%. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Storage field of individual services to see which service is under pressure. See if the load can be reduced. If more disk storage is required, click on the Scale Out button to request more resources. If the data storage service is under strain, you can alternatively click the Scale Up button to increase the disk size. |
3.2.0 |
| NAPP Status Degraded | Medium | manager, intelligence | NSX Application Platform cluster overall status is degraded. <br><br>When event detected: "NSX Application Platform cluster {napp_cluster_id} overall status is degraded. "<br><br>When event resolved: "NSX Application Platform cluster {napp_cluster_id} is running properly. " |
Get more information from alarms of nodes and services. |
3.2.0 |
| NAPP Status Down | High | manager, intelligence | NSX Application Platform cluster overall status is down. <br><br>When event detected: "NSX Application Platform cluster {napp_cluster_id} overall status is down. "<br><br>When event resolved: "NSX Application Platform cluster {napp_cluster_id} is running properly. " |
Get more information from alarms of nodes and services. |
3.2.0 |
| Node CPU Usage Very High | Critical | manager, intelligence | NSX Application Platform node CPU usage is very high. <br><br>When event detected: "The CPU usage of NSX Application Platform node {napp_node_name} is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of NSX Application Platform node {napp_node_name} is below the very high threshold value of {system_usage_threshold}%. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the System Load field of individual services to see which service is under pressure. See if load can be reduced. If only a small minority of the nodes have high CPU usage, by default, Kubernetes will reschedule services automatically. If most nodes have high CPU usage and load cannot be reduced, click on the Scale Out button to request more resources. |
3.2.0 |
| Node CPU Usage High | Medium | manager, intelligence | NSX Application Platform node CPU usage is high. <br><br>When event detected: "The CPU usage of NSX Application Platform node {napp_node_name} is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of NSX Application Platform node {napp_node_name} is below the high threshold value of {system_usage_threshold}%. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the System Load field of individual services to see which service is under pressure. See if load can be reduced. If only a small minority of the nodes have high CPU usage, by default, Kubernetes will reschedule services automatically. If most nodes have high CPU usage and load cannot be reduced, click on the Scale Out button to request more resources. |
3.2.0 |
| Node Memory Usage Very High | Critical | manager, intelligence | NSX Application Platform node memory usage is very high. <br><br>When event detected: "The memory usage of NSX Application Platform node {napp_node_name} is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of NSX Application Platform node {napp_node_name} is below the very high threshold value of {system_usage_threshold}%. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Memory field of individual services to see which service is under pressure. See if load can be reduced. If only a small minority of the nodes have high Memory usage, by default, Kubernetes will reschedule services automatically. If most nodes have high Memory usage and load cannot be reduced, click on the Scale Out button to request more resources. |
3.2.0 |
| Node Memory Usage High | Medium | manager, intelligence | NSX Application Platform node memory usage is high. <br><br>When event detected: "The memory usage of NSX Application Platform node {napp_node_name} is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of NSX Application Platform node {napp_node_name} is below the high threshold value of {system_usage_threshold}%. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Memory field of individual services to see which service is under pressure. See if load can be reduced. If only a small minority of the nodes have high Memory usage, by default, Kubernetes will reschedule services automatically. If most nodes have high Memory usage and load cannot be reduced, click on the Scale Out button to request more resources. |
3.2.0 |
| Node Disk Usage Very High | Critical | manager, intelligence | NSX Application Platform node disk usage is very high. <br><br>When event detected: "The disk usage of NSX Application Platform node {napp_node_name} is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of NSX Application Platform node {napp_node_name} is below the very high threshold value of {system_usage_threshold}%. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Storage field of individual services to see which service is under pressure. Clean up unused data or logs to free up disk resources and determine whether the load can be reduced. If more disk storage is required, scale out the service under pressure. If the Data Storage service is under strain, you can also click the Scale Up button to increase the disk size. |
3.2.0 |
| Node Disk Usage High | Medium | manager, intelligence | NSX Application Platform node disk usage is high. <br><br>When event detected: "The disk usage of NSX Application Platform node {napp_node_name} is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of NSX Application Platform node {napp_node_name} is below the high threshold value of {system_usage_threshold}%. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Storage field of individual services to see which service is under pressure. Clean up unused data or logs to free up disk resources and determine whether the load can be reduced. If more disk storage is required, scale out the service under pressure. If the Data Storage service is under strain, you can also click the Scale Up button to increase the disk size. |
3.2.0 |
| Node Status Degraded | Medium | manager, intelligence | NSX Application Platform node status is degraded. <br><br>When event detected: "NSX Application Platform node {napp_node_name} is degraded. "<br><br>When event resolved: "NSX Application Platform node {napp_node_name} is running properly. " |
In the NSX UI, navigate to System | NSX Application Platform | Resources to check which node is degraded. Check network, memory and CPU usage of the node. Reboot the node if it is a worker node. |
3.2.0 |
| Node Status Down | High | manager, intelligence | NSX Application Platform node status is down. <br><br>When event detected: "NSX Application Platform node {napp_node_name} is not running. "<br><br>When event resolved: "NSX Application Platform node {napp_node_name} is running properly. " |
In the NSX UI, navigate to System | NSX Application Platform | Resources to check which node is down. Check network, memory and CPU usage of the node. Reboot the node if it is a worker node. |
3.2.0 |
| Datastore CPU Usage Very High | Critical | manager, intelligence | Data Storage service CPU usage is very high. <br><br>When event detected: "The CPU usage of Data Storage service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of Data Storage service is below the very high threshold value of {system_usage_threshold}%. " |
Scale out all services or the Data Storage service. |
3.2.0 |
| Datastore CPU Usage High | Medium | manager, intelligence | Data Storage service CPU usage is high. <br><br>When event detected: "The CPU usage of Data Storage service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of Data Storage service is below the high threshold value of {system_usage_threshold}%. " |
Scale out all services or the Data Storage service. |
3.2.0 |
| Messaging CPU Usage Very High | Critical | manager, intelligence | Messaging service CPU usage is very high. <br><br>When event detected: "The CPU usage of Messaging service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of Messaging service is below the very high threshold value of {system_usage_threshold}%. " |
Scale out all services or the Messaging service. |
3.2.0 |
| Messaging CPU Usage High | Medium | manager, intelligence | Messaging service CPU usage is high. <br><br>When event detected: "The CPU usage of Messaging service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of Messaging service is below the high threshold value of {system_usage_threshold}%. " |
Scale out all services or the Messaging service. |
3.2.0 |
| Configuration DB CPU Usage Very High | Critical | manager, intelligence | Configuration Database service CPU usage is very high. <br><br>When event detected: "The CPU usage of Configuration Database service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of Configuration Database service is below the very high threshold value of {system_usage_threshold}%. " |
Scale out all services. |
3.2.0 |
| Configuration DB CPU Usage High | Medium | manager, intelligence | Configuration Database service CPU usage is high. <br><br>When event detected: "The CPU usage of Configuration Database service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of Configuration Database service is below the high threshold value of {system_usage_threshold}%. " |
Scale out all services. |
3.2.0 |
| Metrics CPU Usage Very High | Critical | manager, intelligence | Metrics service CPU usage is very high. <br><br>When event detected: "The CPU usage of Metrics service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of Metrics service is below the very high threshold value of {system_usage_threshold}%. " |
Scale out all services. |
3.2.0 |
| Metrics CPU Usage High | Medium | manager, intelligence | Metrics service CPU usage is high. <br><br>When event detected: "The CPU usage of Metrics service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of Metrics service is below the high threshold value of {system_usage_threshold}%. " |
Scale out all services. |
3.2.0 |
| Analytics CPU Usage Very High | Critical | manager, intelligence | Analytics service CPU usage is very high. <br><br>When event detected: "The CPU usage of Analytics service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of Analytics service is below the very high threshold value of {system_usage_threshold}%. " |
Scale out all services or the Analytics service. |
3.2.0 |
| Analytics CPU Usage High | Medium | manager, intelligence | Analytics service CPU usage is high. <br><br>When event detected: "The CPU usage of Analytics service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of Analytics service is below the high threshold value of {system_usage_threshold}%. " |
Scale out all services or the Analytics service. |
3.2.0 |
| Platform CPU Usage Very High | Critical | manager, intelligence | Platform Services service CPU usage is very high. <br><br>When event detected: "The CPU usage of Platform Services service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of Platform Services service is below the very high threshold value of {system_usage_threshold}%. " |
Scale out all services. |
3.2.0 |
| Platform CPU Usage High | Medium | manager, intelligence | Platform Services service CPU usage is high. <br><br>When event detected: "The CPU usage of Platform Services service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The CPU usage of Platform Services service is below the high threshold value of {system_usage_threshold}%. " |
Scale out all services. |
3.2.0 |
| Datastore Memory Usage Very High | Critical | manager, intelligence | Data Storage service memory usage is very high. <br><br>When event detected: "The memory usage of Data Storage service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of Data Storage service is below the very high threshold value of {system_usage_threshold}%. " |
Scale out all services or the Data Storage service. |
3.2.0 |
| Datastore Memory Usage High | Medium | manager, intelligence | Data Storage service memory usage is high. <br><br>When event detected: "The memory usage of Data Storage service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of Data Storage service is below the high threshold value of {system_usage_threshold}%. " |
Scale out all services or the Data Storage service. |
3.2.0 |
| Messaging Memory Usage Very High | Critical | manager, intelligence | Messaging service memory usage is very high. <br><br>When event detected: "The memory usage of Messaging service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of Messaging service is below the very high threshold value of {system_usage_threshold}%. " |
Scale out all services or the Messaging service. |
3.2.0 |
| Messaging Memory Usage High | Medium | manager, intelligence | Messaging service memory usage is high. <br><br>When event detected: "The memory usage of Messaging service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of Messaging service is below the high threshold value of {system_usage_threshold}%. " |
Scale out all services or the Messaging service. |
3.2.0 |
| Configuration DB Memory Usage Very High | Critical | manager, intelligence | Configuration Database service memory usage is very high. <br><br>When event detected: "The memory usage of Configuration Database service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of Configuration Database service is below the very high threshold value of {system_usage_threshold}%. " |
Scale out all services. |
3.2.0 |
| Configuration DB Memory Usage High | Medium | manager, intelligence | Configuration Database service memory usage is high. <br><br>When event detected: "The memory usage of Configuration Database service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of Configuration Database service is below the high threshold value of {system_usage_threshold}%. " |
Scale out all services. |
3.2.0 |
| Metrics Memory Usage Very High | Critical | manager, intelligence | Metrics service memory usage is very high. <br><br>When event detected: "The memory usage of Metrics service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of Metrics service is below the very high threshold value of {system_usage_threshold}%. " |
Scale out all services. |
3.2.0 |
| Metrics Memory Usage High | Medium | manager, intelligence | Metrics service memory usage is high. <br><br>When event detected: "The memory usage of Metrics service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of Metrics service is below the high threshold value of {system_usage_threshold}%. " |
Scale out all services. |
3.2.0 |
| Analytics Memory Usage Very High | Critical | manager, intelligence | Analytics service memory usage is very high. <br><br>When event detected: "The memory usage of Analytics service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of Analytics service is below the very high threshold value of {system_usage_threshold}%. " |
Scale out all services or the Analytics service. |
3.2.0 |
| Analytics Memory Usage High | Medium | manager, intelligence | Analytics service memory usage is high. <br><br>When event detected: "The memory usage of Analytics service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of Analytics service is below the high threshold value of {system_usage_threshold}%. " |
Scale out all services or the Analytics service. |
3.2.0 |
| Platform Memory Usage Very High | Critical | manager, intelligence | Platform Services service memory usage is very high. <br><br>When event detected: "The memory usage of Platform Services service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of Platform Services service is below the very high threshold value of {system_usage_threshold}%. " |
Scale out all services. |
3.2.0 |
| Platform Memory Usage High | Medium | manager, intelligence | Platform Services service memory usage is high. <br><br>When event detected: "The memory usage of Platform Services service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The memory usage of Platform Services service is below the high threshold value of {system_usage_threshold}%. " |
Scale out all services. |
3.2.0 |
| Datastore Disk Usage Very High | Critical | manager, intelligence | Data Storage service disk usage is very high. <br><br>When event detected: "The disk usage of Data Storage service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of Data Storage service is below the very high threshold value of {system_usage_threshold}%. " |
Scale out or scale up the Data Storage service. |
3.2.0 |
| Datastore Disk Usage High | Medium | manager, intelligence | Data Storage service disk usage is high. <br><br>When event detected: "The disk usage of Data Storage service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of Data Storage service is below the high threshold value of {system_usage_threshold}%. " |
Scale out or scale up the Data Storage service. |
3.2.0 |
| Messaging Disk Usage Very High | Critical | manager, intelligence | Messaging service disk usage is very high. <br><br>When event detected: "The disk usage of Messaging service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of Messaging service is below the very high threshold value of {system_usage_threshold}%. " |
Clean up unneeded files. Scale out all services or the Messaging service. |
3.2.0 |
| Messaging Disk Usage High | Medium | manager, intelligence | Messaging service disk usage is high. <br><br>When event detected: "The disk usage of Messaging service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of Messaging service is below the high threshold value of {system_usage_threshold}%. " |
Clean up unneeded files. Scale out all services or the Messaging service. |
3.2.0 |
| Configuration DB Disk Usage Very High | Critical | manager, intelligence | Configuration Database service disk usage is very high. <br><br>When event detected: "The disk usage of Configuration Database service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of Configuration Database service is below the very high threshold value of {system_usage_threshold}%. " |
Clean up unneeded files. Scale out all services. |
3.2.0 |
| Configuration DB Disk Usage High | Medium | manager, intelligence | Configuration Database service disk usage is high. <br><br>When event detected: "The disk usage of Configuration Database service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of Configuration Database service is below the high threshold value of {system_usage_threshold}%. " |
Clean up unneeded files. Scale out all services. |
3.2.0 |
| Metrics Disk Usage Very High | Critical | manager, intelligence | Metrics service disk usage is very high. <br><br>When event detected: "The disk usage of Metrics service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of Metrics service is below the very high threshold value of {system_usage_threshold}%. " |
Contact VMware support to review storage usage. |
3.2.0 |
| Metrics Disk Usage High | Medium | manager, intelligence | Metrics service disk usage is high. <br><br>When event detected: "The disk usage of Metrics service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of Metrics service is below the high threshold value of {system_usage_threshold}%. " |
Contact VMware support to review storage usage. |
3.2.0 |
| Analytics Disk Usage Very High | Critical | manager, intelligence | Analytics service disk usage is very high. <br><br>When event detected: "The disk usage of Analytics service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of Analytics service is below the very high threshold value of {system_usage_threshold}%. " |
Clean up unneeded files. Scale out all services or the Analytics service. |
3.2.0 |
| Analytics Disk Usage High | Medium | manager, intelligence | Analytics service disk usage is high. <br><br>When event detected: "The disk usage of Analytics service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of Analytics service is below the high threshold value of {system_usage_threshold}%. " |
Clean up unneeded files. Scale out all services or the Analytics service. |
3.2.0 |
| Platform Disk Usage Very High | Critical | manager, intelligence | Platform Services service disk usage is very high. <br><br>When event detected: "The disk usage of Platform Services service is above the very high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of Platform Services service is below the very high threshold value of {system_usage_threshold}%. " |
Clean up unneeded files. Scale out all services. |
3.2.0 |
| Platform Disk Usage High | Medium | manager, intelligence | Platform Services service disk usage is high. <br><br>When event detected: "The disk usage of Platform Services service is above the high threshold value of {system_usage_threshold}%. "<br><br>When event resolved: "The disk usage of Platform Services service is below the high threshold value of {system_usage_threshold}%. " |
Clean up unneeded files. Scale out all services. |
3.2.0 |
| Service Status Degraded | Medium | manager, intelligence | Service status is degraded. <br><br>When event detected: "Service {napp_service_name} is degraded. The service may still be able to reach a quorum while pods associated with {napp_service_name} are not all stable. Resources consumed by these unstable pods may be released. "<br><br>When event resolved: "Service {napp_service_name} is running properly. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services to check which service is degraded. Invoke the NSX API GET /napp/api/v1/platform/monitor/feature/health to check which specific service is degraded and the reason behind it. If necessary, invoke the following CLI command to restart the degraded service: kubectl rollout restart <statefulset/deployment> <service_name> -n <namespace>. Degraded services can still function correctly, but performance is sub-optimal. A scripted sketch of this health check appears after this table. |
3.2.0 |
| Service Status Down | High | manager, intelligence | Service status is down. <br><br>When event detected: "Service {napp_service_name} is not running. "<br><br>When event resolved: "Service {napp_service_name} is running properly. " |
In the NSX UI, navigate to System | NSX Application Platform | Core Services to check which service is down. Invoke the NSX API GET /napp/api/v1/platform/monitor/feature/health to check which specific service is down and the reason behind it. Invoke the following CLI command to restart the affected service: kubectl rollout restart <statefulset/deployment> <service_name> -n <namespace>. |
3.2.0 |
| Flow Storage Growth High | Medium | manager, intelligence | Analytics and Data Storage disk usage is growing faster than expected. <br><br>When event detected: "Analytics and Data Storage disks are expected to be full in {predicted_full_period} days, less than current data retention period {current_retention_period} days. "<br><br>When event resolved: "Analytics and Data Storage disk usage growth is normal. " |
Connect fewer transport nodes or configure narrower private IP ranges to reduce the number of unique flows. Filter out broadcast and/or multicast flows. Scale out the Analytics and Data Storage services to obtain more storage. |
4.1.1 |
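The service-status rows above follow the same pattern: query the platform health API and, if needed, restart the affected workload with the documented kubectl command. The minimal Python sketch below illustrates that pattern only; the manager address, credentials, response handling, and the example workload names are illustrative assumptions, not part of the product.

```python
# Minimal sketch, not an official tool: poll the NAPP feature-health API cited above and,
# if a service is reported as degraded or down, restart its workload with the documented
# `kubectl rollout restart` command. Manager address, credentials, and the example
# statefulset/deployment names and namespace are placeholders (assumptions).
import subprocess

import requests

NSX_MANAGER = "https://nsx-mgr.example.com"   # placeholder
AUTH = ("admin", "REPLACE_ME")                 # placeholder credentials


def get_feature_health() -> dict:
    """Call the documented NAPP health API and return its JSON body."""
    resp = requests.get(
        f"{NSX_MANAGER}/napp/api/v1/platform/monitor/feature/health",
        auth=AUTH,
        verify=False,  # lab only; use proper CA verification in production
    )
    resp.raise_for_status()
    return resp.json()


def restart_service(kind: str, name: str, namespace: str) -> None:
    """Mirror the documented command: kubectl rollout restart <kind>/<name> -n <namespace>."""
    subprocess.run(
        ["kubectl", "rollout", "restart", f"{kind}/{name}", "-n", namespace],
        check=True,
    )


if __name__ == "__main__":
    # Inspect the health payload to see which feature/service is degraded and why.
    print(get_feature_health())
    # Example restart with hypothetical names:
    # restart_service("deployment", "analytics", "nsxi-platform")
```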
Password Management Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Password Expired | Critical | global-manager, manager, edge, public-cloud-gateway | User password has expired. <br><br>When event detected: "The password for user {username} has expired. "<br><br>When event resolved: "The password for user {username} has been changed successfully or is no longer expired or the user is no longer active. " |
The password for user {username} must be changed now to access the system. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body: PUT /api/v1/node/users/<userid>, where <userid> is the ID of the user (a scripted sketch appears after this table). If the password of the admin user (with <userid> 10000) has expired, admin must log in to the system via SSH (if enabled) or the console to change the password. After entering the current, expired password, admin is prompted to enter a new password. |
3.0.0 |
| Password Is About To Expire | High | global-manager, manager, edge, public-cloud-gateway | User password is about to expire. <br><br>When event detected: "The password for user {username} is about to expire in {password_expiration_days} days. "<br><br>When event resolved: "The password for the user {username} has been changed successfully or is no longer expired or the user is no longer active. " |
Ensure the password for the user {username} is changed immediately. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body: PUT /api/v1/node/users/<userid> where <userid> is the ID of the user. |
3.0.0 |
| Password Expiration Approaching | Medium | global-manager, manager, edge, public-cloud-gateway | User password is approaching expiration. <br><br>When event detected: "The password for user {username} is approaching expiration in {password_expiration_days} days. "<br><br>When event resolved: "The password for the user {username} has been changed successfully or is no longer expired or the user is no longer active. " |
The password for the user {username} needs to be changed soon. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body: PUT /api/v1/node/users/<userid> where <userid> is the ID of the user. |
3.0.0 |
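As the rows above note, a local user's password can be rotated through the node users API. The sketch below shows one hedged way to script that call; the manager address, credentials, and the request-body field name ("password") are assumptions that should be verified against the NSX API guide for your release.

```python
# Minimal sketch, assuming basic-auth access to the NSX node API: apply a new password
# for a local user via the documented PUT /api/v1/node/users/<userid> call.
# The exact request-body field names may vary by release; "password" is assumed here,
# and some releases may also expect the current password.
import requests

NSX_MANAGER = "https://nsx-mgr.example.com"    # placeholder
AUTH = ("admin", "CURRENT_ADMIN_PASSWORD")      # placeholder credentials


def set_user_password(userid: int, new_password: str) -> None:
    url = f"{NSX_MANAGER}/api/v1/node/users/{userid}"
    # Read the current user record first so unchanged properties are preserved.
    user = requests.get(url, auth=AUTH, verify=False).json()
    user["password"] = new_password              # assumed field name
    resp = requests.put(url, json=user, auth=AUTH, verify=False)
    resp.raise_for_status()


if __name__ == "__main__":
    # 10000 is the admin user ID cited in the recommended action above.
    set_user_password(10000, "N3w-Str0ng-Passw0rd!")
```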
Physical Server Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Physical Server Install Failed | Critical | manager | Physical Server (BMS) installation failed. <br><br>When event detected: "Physical Server {transport_node_name} ({entity_id}) installation failed. "<br><br>When event resolved: "Physical Server {transport_node_name} ({entity_id}) installation completed. " |
Navigate to System > Fabric > Nodes > Host Transport Nodes and resolve the error on the node. |
4.0.0 |
| Physical Server Upgrade Failed | Critical | manager | Physical Server (BMS) upgrade failed. <br><br>When event detected: "Physical Server {transport_node_name} ({entity_id}) upgrade failed. "<br><br>When event resolved: "Physical Server {transport_node_name} ({entity_id}) upgrade completed. " |
Navigate to System > Upgrade and resolve the error, then re-trigger the upgrade. |
4.0.0 |
| Physical Server Uninstall Failed | Critical | manager | Physical Server (BMS) uninstallation failed. <br><br>When event detected: "Physical Server {transport_node_name} ({entity_id}) uninstallation failed. "<br><br>When event resolved: "Physical Server {transport_node_name} ({entity_id}) uninstallation completed. " |
Navigate to System > Fabric > Nodes > Host Transport Nodes and resolve the error on the node. |
4.0.0 |
Policy Constraint Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Creation Count Limit Reached | Medium | manager | Entity count has reached the policy constraint limit. <br><br>When event detected: "Entity count for type {constraint_type} in {constraint_type_path} is currently at {current_count} which reached the maximum limit of {constraint_limit}. "<br><br>When event resolved: "{constraint_type} Count is below threshold. " |
Review {constraint_type} usage. Update the constraint to increase the limit or delete unused {constraint_type}. |
4.1.0 |
| Creation Count Limit Reached For Project | Medium | manager | Entity count has reached the policy constraint limit. <br><br>When event detected: "Entity count for type {constraint_type} in {project_path} is currently at {current_count} which reached the maximum limit of {constraint_limit}. "<br><br>When event resolved: "{constraint_type} Count is below threshold. " |
Review {constraint_type} usage. Update the constraint to increase the limit or delete unused {constraint_type}. |
4.1.1 |
Routing Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| BFD Down On External Interface | High | edge, autonomous-edge, public-cloud-gateway | BFD session is down. <br><br>When event detected: "In router {lr_id}, BFD session for peer {peer_address} is down. "<br><br>When event resolved: "In router {lr_id}, BFD session for peer {peer_address} is up. " |
1. Invoke the NSX CLI command get logical-routers (see the sketch after this table for one way to run this command remotely). |
3.0.0 |
| Static Routing Removed | High | edge, autonomous-edge, public-cloud-gateway | Static route removed. <br><br>When event detected: "In router {lr_id}, static route {entity_id} ({static_address}) was removed because BFD was down. "<br><br>When event resolved: "In router {lr_id}, static route {entity_id} ({static_address}) was re-added as BFD recovered. " |
The static routing entry was removed because the BFD session was down. |
3.0.0 |
| BGP Down | High | edge, autonomous-edge, public-cloud-gateway | BGP neighbor down. <br><br>When event detected: "In Router {lr_id}, BGP neighbor {entity_id} ({bgp_neighbor_ip}) is down. Reason: {failure_reason}. "<br><br>When event resolved: "In Router {lr_id}, BGP neighbor {entity_id} ({bgp_neighbor_ip}) is up. " |
1. Invoke the NSX CLI command get logical-routers. |
3.0.0 |
| Proxy ARP Not Configured For Service IP | Critical | manager | Proxy ARP is not configured for Service IP. <br><br>When event detected: "Proxy ARP for Service IP {service_ip} and Service entity {entity_id} is not configured as the number of ARP proxy entries generated due to overlap of the Service IP with subnet of lrport {lrport_id} on Router {lr_id} has exceeded the allowed threshold limit of 16384. "<br><br>When event resolved: "Proxy ARP for Service entity {entity_id} is generated successfully as the overlap of service IP with subnet of lrport {lrport_id} on Router {lr_id} is within the allowed limit of 16384 entries. " |
Reconfigure the Service IP {service_ip} for the Service entity {entity_id}, or change the subnet of the lrport {lrport_id} on Router {lr_id}, so that the number of proxy ARP entries generated due to the overlap between the Service IP and the lrport subnet is less than the allowed threshold limit of 16384. |
3.0.3 |
| Routing Down | High | edge, autonomous-edge, public-cloud-gateway | All BGP/BFD sessions are down. <br><br>When event detected: "All BGP/BFD sessions are down. "<br><br>When event resolved: "At least one BGP/BFD session up. " |
Invoke the NSX CLI command get logical-routers to get the Tier0 service router, switch to its VRF, and then invoke the following NSX CLI commands. |
3.0.0 |
| OSPF Neighbor Went Down | High | edge, autonomous-edge, public-cloud-gateway | OSPF neighbor moved from full to another state. <br><br>When event detected: "OSPF neighbor {peer_address} moved from full to another state. "<br><br>When event resolved: "OSPF neighbor {peer_address} moved to full state. " |
1. Invoke the NSX CLI command get logical-routers to get the vrf id and switch to TIER0 service router. |
3.1.1 |
| Maximum IPv4 Route Limit Approaching | Medium | edge, autonomous-edge, public-cloud-gateway | Maximum IPv4 Routes limit is approaching on Edge node. <br><br>When event detected: "IPv4 routes limit has reached {route_limit_threshold} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. "<br><br>When event resolved: "IPv4 routes are within the limit of {route_limit_threshold} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. " |
1. Check route redistribution policies and routes received from all external peers. |
4.0.0 |
| Maximum IPv6 Route Limit Approaching | Medium | edge, autonomous-edge, public-cloud-gateway | Maximum IPv6 Routes limit is approaching on Edge node. <br><br>When event detected: "IPv6 routes limit has reached {route_limit_threshold} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. "<br><br>When event resolved: "IPv6 routes are within the limit of {route_limit_threshold} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. " |
1. Check route redistribution policies and routes received from all external peers. |
4.0.0 |
| Maximum IPv4 Route Limit Exceeded | Critical | edge, autonomous-edge, public-cloud-gateway | Maximum IPv4 Routes limit has exceeded on Edge node. <br><br>When event detected: "IPv4 routes has exceeded limit of {route_limit_maximum} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. "<br><br>When event resolved: "IPv4 routes are within the limit of {route_limit_maximum} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. " |
1. Check route redistribution policies and routes received from all external peers. |
4.0.0 |
| Maximum IPv6 Route Limit Exceeded | Critical | edge, autonomous-edge, public-cloud-gateway | Maximum IPv6 Routes limit has exceeded on Edge node. <br><br>When event detected: "IPv6 routes has exceeded limit of {route_limit_maximum} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. "<br><br>When event resolved: "IPv6 routes are within the limit of {route_limit_maximum} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. " |
1. Check route redistribution policies and routes received from all external peers. |
4.0.0 |
| Maximum IPv4 Prefixes From BGP Neighbor Approaching | Medium | edge, autonomous-edge, public-cloud-gateway | Maximum IPv4 Prefixes received from BGP neighbor is approaching. <br><br>When event detected: "Number of IPv4 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} reaches {prefixes_count_threshold}. Limit defined for this peer is {prefixes_count_max}. "<br><br>When event resolved: "Number of IPv4 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} is within the limit {prefixes_count_threshold}. " |
1. Check the BGP routing policies in the external router. |
4.0.0 |
| Maximum IPv6 Prefixes From BGP Neighbor Approaching | Medium | edge, autonomous-edge, public-cloud-gateway | Maximum IPv6 Prefixes received from BGP neighbor is approaching. <br><br>When event detected: "Number of IPv6 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} reaches {prefixes_count_threshold}. Limit defined for this peer is {prefixes_count_max}. "<br><br>When event resolved: "Number of IPv6 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} is within the limit {prefixes_count_threshold}. " |
1. Check the BGP routing policies in the external router. |
4.0.0 |
| Maximum IPv4 Prefixes From BGP Neighbor Exceeded | Critical | edge, autonomous-edge, public-cloud-gateway | Maximum IPv4 Prefixes received from BGP neighbor has exceeded. <br><br>When event detected: "Number of IPv4 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} exceeded the limit defined for this peer of {prefixes_count_max}. "<br><br>When event resolved: "Number of IPv4 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} is within the limit {prefixes_count_max}. " |
1. Check the BGP routing policies in the external router. |
4.0.0 |
| Maximum IPv6 Prefixes From BGP Neighbor Exceeded | Critical | edge, autonomous-edge, public-cloud-gateway | Maximum IPv6 Prefixes received from BGP neighbor has exceeded. <br><br>When event detected: "Number of IPv6 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} exceeded the limit defined for this peer of {prefixes_count_max}. "<br><br>When event resolved: "Number of IPv6 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} is within the limit {prefixes_count_max}. " |
1. Check the BGP routing policies in the external router. |
4.0.0 |
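Several of the routing actions above start by invoking the NSX CLI command get logical-routers on the Edge node. The sketch below shows one possible way to run that command remotely over SSH; the Edge address, credentials, and the assumption that the admin account lands directly in the NSX CLI shell are placeholders and assumptions rather than documented behavior.

```python
# Minimal sketch, not an official tool: run the NSX CLI command cited above
# (`get logical-routers`) over SSH from an admin workstation using paramiko.
# Host, user, and password are placeholders; adapt them to your environment.
import paramiko

EDGE_HOST = "edge-01.example.com"   # placeholder
EDGE_USER = "admin"                  # NSX CLI user (assumed to land in nsxcli)
EDGE_PASS = "REPLACE_ME"             # placeholder credential


def run_nsx_cli(command: str) -> str:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # lab only
    client.connect(EDGE_HOST, username=EDGE_USER, password=EDGE_PASS)
    try:
        _, stdout, _ = client.exec_command(command)
        return stdout.read().decode()
    finally:
        client.close()


if __name__ == "__main__":
    # List logical routers to find the Tier-0 service router and its VRF,
    # as the recommended actions in this table describe.
    print(run_nsx_cli("get logical-routers"))
```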
Security Compliance Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Trigger NDcPP Non-Compliance | Critical | manager | The NSX security status is not NDcPP compliant. <br><br>When event detected: "One of the NDcPP compliance requirements is being violated. That means the NSX status is currently non-compliant with regards to NDcPP. "<br><br>When event resolved: "The NDcPP compliance issues have all been resolved. " |
Run the compliance report from the UI Home - Monitoring & Dashboard - Compliance Report menu and resolve all the issues that are marked with the NDcPP compliance name. |
4.1.0 |
| Trigger EAL4+ Non-Compliance | Critical | manager | The NSX security status is not EAL4+ compliant. <br><br>When event detected: "One of the EAL4+ compliance requirements is being violated. That means the NSX status is currently non-compliant with regards to EAL4+. "<br><br>When event resolved: "The EAL4+ compliance issues have all been resolved. " |
Run the compliance report from the UI Home - Monitoring & Dashboard - Compliance Report menu and resolve all the issues that are marked with the EAL4+ compliance name. |
4.1.0 |
| Poll NDcPP Non-Compliance | Critical | manager | The NSX security configuration is not NDcPP compliant. <br><br>When event detected: "One of the NDcPP compliance requirements is being violated. That means the NSX configuration is currently non-compliant with regards to NDcPP. "<br><br>When event resolved: "The NDcPP compliance issues have all been resolved. " |
Run the compliance report from the UI Home - Monitoring & Dashboard - Compliance Report menu and resolve all the issues that are marked with the NDcPP compliance name. |
4.1.0 |
| Poll EAL4+ Non-Compliance | Critical | manager | The NSX security configuration is not EAL4+ compliant. <br><br>When event detected: "One of the EAL4+ compliance requirements is being violated. That means the NSX configuration is currently non-compliant with regards to EAL4+. "<br><br>When event resolved: "The EAL4+ compliance issues have all been resolved. " |
Run the compliance report from the UI Home - Monitoring & Dashboard - Compliance Report menu and resolve all the issues that are marked with the EAL4+ compliance name. |
4.1.0 |
Service Insertion Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Service Deployment Succeeded | Info | manager | Service deployment succeeded. <br><br>When event detected: "The service deployment {entity_id} for service {service_name} on cluster {vcenter_cluster_id} has succeeded. "<br><br>When event resolved: "The service deployment {entity_id} on cluster {vcenter_cluster_id} has succeeded, no action needed. " |
No action needed. |
4.0.0 |
| Service Deployment Failed | Critical | manager | Service deployment failed. <br><br>When event detected: "The service deployment {entity_id} for service {service_name} on cluster {vcenter_cluster_id} has failed. Reason : {failure_reason} "<br><br>When event resolved: "The failed service deployment {entity_id} has been removed. " |
Delete the service deployment using the NSX UI or API. Perform any corrective action from the KB and retry the service deployment. |
4.0.0 |
| Service Undeployment Succeeded | Info | manager | Service deployment deletion succeeded. <br><br>When event detected: "The deletion of service deployment {entity_id} for service {service_name} on cluster {vcenter_cluster_id} has succeeded. "<br><br>When event resolved: "The deletion of service deployment {entity_id} on cluster {vcenter_cluster_id} has succeeded, no action needed. " |
No action needed. |
4.0.0 |
| Service Undeployment Failed | Critical | manager | Service deployment deletion failed. <br><br>When event detected: "The deletion of service deployment {entity_id} for service {service_name} on cluster {vcenter_cluster_id} has failed. Reason : {failure_reason} "<br><br>When event resolved: "The failed service deployment name {entity_id} has been removed. " |
Delete the service deployment using the NSX UI or API. Perform any corrective action from the KB and retry deleting the service deployment. Resolve the alarm manually after verifying that all VMs and related objects are deleted. |
4.0.0 |
| SVM Health Status Up | Info | manager | SVM is working in service. <br><br>When event detected: "The health check for SVM {entity_id} for service {service_name} is working correctly on {hostname_or_ip_address_with_port}. "<br><br>When event resolved: "The SVM {entity_id} is working correctly, no action needed. " |
No action needed. |
4.0.0 |
| SVM Health Status Down | High | manager | SVM is not working in service. <br><br>When event detected: "The health check for SVM {entity_id} for service {service_name} is not working correctly on {hostname_or_ip_address_with_port}. Reason : {failure_reason}. "<br><br>When event resolved: "The SVM {entity_id} with wrong state has been removed. " |
Delete the service deployment using the NSX UI or API. Perform any corrective action from the KB and retry the service deployment if necessary. |
4.0.0 |
| Service Insertion Infra Status Down | Critical | esx | Service insertion infrastructure status down and not enabled on host. <br><br>When event detected: "SPF not enabled at port level on host {transport_node_id} and the status is down. Reason : {failure_reason}. "<br><br>When event resolved: "Service insertion infrastructure status is up and has been correctly enabled on host. " |
Perform any corrective action from the KB and check if the status is up. Resolve the alarm manually after checking the status. |
4.0.0 |
| SVM Liveness State Down | Critical | manager | SVM liveness state down. <br><br>When event detected: "SVM liveness state is down on {entity_id} and traffic flow is impacted. "<br><br>When event resolved: "SVM liveness state is up and configured as expected. " |
Perform any corrective action from the KB and check if the state is up. |
4.0.0 |
| Service Chain Path Down | Critical | manager | Service chain path down. <br><br>When event detected: "Service chain path is down on {entity_id} and traffic flow is impacted. "<br><br>When event resolved: "Service chain path is up and configured as expected. " |
Perform any corrective action from the KB and check if the status is up. |
4.0.0 |
| New Host Added | Info | esx | New Host added in cluster. <br><br>When event detected: "New host added in cluster {vcenter_cluster_id} and SVM will be deployed. "<br><br>When event resolved: "New host added successfully. " |
Check the VM deployment status and wait until the VM powers on. |
4.0.0 |
Tep Health Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Faulty Tep | Medium | esx | TEP is unhealthy. <br><br>When event detected: "TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id}. Overlay workloads using this TEP will face network outage. Reason: {vtep_fault_reason}. "<br><br>When event resolved: "TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} is healthy. " |
1. Check whether the TEP has a valid IP address or any other underlay connectivity issues. |
4.1.0 |
| Tep Ha Activated | Info | esx | TEP HA activated. <br><br>When event detected: "TEP HA activated for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id}. "<br><br>When event resolved: "TEP HA cleared for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id}. " |
Enable AutoRecover or invoke Manual Recover for TEP:{vtep_name} on VDS:{dvs_name} at Transport node:{transport_node_id}. |
4.1.0 |
| Tep Autorecover Success | Info | esx | AutoRecover is successful. <br><br>When event detected: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} is successful. "<br><br>When event resolved: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} is cleared. " |
None. |
4.1.0 |
| Tep Autorecover Failure | Medium | esx | AutoRecover failed. <br><br>When event detected: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} failed. Overlay workloads using this TEP will failover to other healthy TEPs. If no other healthy TEPs, overlay workloads will face network outage. "<br><br>When event resolved: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} is cleared. " |
Check whether the TEP has a valid IP address or any other underlay connectivity issues. |
4.1.0 |
| Faulty Tep On DPU | Medium | dpu | TEP is unhealthy on DPU. <br><br>When event detected: "TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id}. Overlay workloads using this TEP will face network outage. Reason: {vtep_fault_reason}. "<br><br>When event resolved: "TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id} is healthy. " |
1. Check whether the TEP has a valid IP address or any other underlay connectivity issues. |
4.1.0 |
| Tep Ha Activated On DPU | Info | dpu | TEP HA activated on DPU. <br><br>When event detected: "TEP HA activated for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id}. "<br><br>When event resolved: "TEP HA cleared for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id}. " |
Enable AutoRecover or invoke Manual Recover for TEP:{vtep_name} on VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id}. |
4.1.0 |
| Tep Autorecover Success On DPU | Info | dpu | AutoRecover is successful on DPU. <br><br>When event detected: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id}. on DPU {dpu_id} is successful. "<br><br>When event resolved: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id}. on DPU {dpu_id} is cleared. " |
None. |
4.1.0 |
| Tep Autorecover Failure On DPU | Medium | dpu | AutoRecover failed on DPU. <br><br>When event detected: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id} failed. Overlay workloads using this TEP will failover to other healthy TEPs. If no other healthy TEPs, overlay workloads will face network outage. "<br><br>When event resolved: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id} is cleared. " |
Check whether the TEP has a valid IP address or any other underlay connectivity issues. |
4.1.0 |
Transport Node Health Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Monitoring Framework Unhealthy | Medium | global-manager, bms, edge, esx, kvm, manager, public-cloud-gateway | Monitoring Service framework on transport node is unhealthy. <br><br>When event detected: "Monitoring Service framework on the host with UUID {entity_id} is unhealthy for more than 5 minutes. Stats and status will not be collected from this host. "<br><br>When event resolved: "Monitoring Service framework on the host with UUID {entity_id} is healthy. " |
1. On the problematic nsx-edge or nsx-public-gateway node, invoke systemctl restart nsx-edge-exporter. |
4.1.1 |
| Transport Node Uplink Down On DPU | Medium | dpu | Uplink on DPU is going down. <br><br>When event detected: "Uplink on DPU {dpu_id} is going down. "<br><br>When event resolved: "Uplink on DPU {dpu_id} is going up. " |
Check the status of the physical NICs backing the uplinks on DPU {dpu_id}. Find the mapped name of this physical NIC on the host, then verify it in the UI. |
4.0.0 |
| LAG Member Down On DPU | Medium | dpu | LACP on DPU reporting member down. <br><br>When event detected: "LACP on DPU {dpu_id} reporting member down. "<br><br>When event resolved: "LACP on DPU {dpu_id} reporting member up. " |
Check the connection status of the LAG members on DPU {dpu_id}. Find the mapped name of the related physical NIC on the host, then verify it in the UI. |
4.0.0 |
| NVDS Uplink Down (deprecated) | Medium | esx, kvm, bms | Uplink is going down. <br><br>When event detected: "Uplink is going down. "<br><br>When event resolved: "Uplink is going up. " |
Check the physical NICs' status of uplinks on hosts. |
3.0.0 |
| Transport Node Uplink Down | Medium | esx, kvm, bms | Uplink is going down. <br><br>When event detected: "Uplink is going down. "<br><br>When event resolved: "Uplink is going up. " |
Check the physical NICs' status of uplinks on hosts. |
3.2.0 |
| LAG Member Down | Medium | esx, kvm, bms | LACP reporting member down. <br><br>When event detected: "LACP reporting member down. "<br><br>When event resolved: "LACP reporting member up. " |
Check the connection status of LAG members on hosts. |
3.0.0 |
Transport Node Pending Action Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Maintenance Mode | Critical | manager | The host has pending user actions i.e. PENDING_HOST_MAINTENANCE_MODE. <br><br>When event detected: "Host {host_name} - {host_uuid} has PENDING_HOST_MAINTENANCE_MODE user action. It means high performance configuration is not yet realized on the host. "<br><br>When event resolved: "The host {host_name} - {host_uuid} no longer has PENDING_HOST_MAINTENANCE_MODE user action. " |
Move host {host_name} - {host_uuid} to maintenance mode from vCenter. This starts realization of the high performance configuration on the host. If processed successfully, transportNodeState will no longer have PENDING_HOST_MAINTENANCE_MODE inside the pending_user_actions field (a scripted check appears after this table). If realization of the high performance configuration on the host fails, transportNodeState is updated with the failure message and the host is no longer in pending maintenance mode. |
4.1.2 |
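The recommended action above checks whether PENDING_HOST_MAINTENANCE_MODE is still listed in the transport node state. The sketch below automates that check under stated assumptions: the manager address and credentials are placeholders, and the state endpoint and the layout of the pending_user_actions field should be confirmed against the API guide for your release.

```python
# Minimal sketch: check whether a host transport node still reports the
# PENDING_HOST_MAINTENANCE_MODE user action in its state, using the
# pending_user_actions field mentioned above. Endpoint path and response
# layout (a plain list of action strings) are assumptions to verify.
import requests

NSX_MANAGER = "https://nsx-mgr.example.com"   # placeholder
AUTH = ("admin", "REPLACE_ME")                 # placeholder credentials


def pending_maintenance_mode(node_id: str) -> bool:
    resp = requests.get(
        f"{NSX_MANAGER}/api/v1/transport-nodes/{node_id}/state",
        auth=AUTH,
        verify=False,  # lab only
    )
    resp.raise_for_status()
    actions = resp.json().get("pending_user_actions", [])
    return "PENDING_HOST_MAINTENANCE_MODE" in actions


if __name__ == "__main__":
    node_id = "REPLACE-WITH-HOST-UUID"  # hypothetical host transport node UUID
    if pending_maintenance_mode(node_id):
        print("Host still needs to enter maintenance mode in vCenter.")
    else:
        print("No pending maintenance-mode action; check the state details for success or failure.")
```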
Vmc App Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| Transit Connect Failure | Medium | manager | Transit Connect fails to be fully realized. <br><br>When event detected: "Transit Connect related configuration is not fully correctly realized. Possible issues could be failing to retrieve provider information or some transient provider communication error. "<br><br>When event resolved: "Transit Connect failure is remediated. " |
If this alarm is not auto-resolved within 10 minutes, retry the most recent Transit Connect related request(s). For example, if a TGW attachment API request triggered this alarm, retry that TGW attachment API request. If the alarm does not resolve even after the retry, try the following steps: |
4.1.0 |
| Traffic Group Prefix List Deletion Failure | High | manager | Failure in deletion of Traffic Group Prefix list. <br><br>When event detected: "When prefix list mode for connected VPC is activated, customer can choose to program all connected VPC prefix lists (this includes multi-edge prefix list as well) in non-main route tables of their choice. By policy, SDDC software would not add/delete prefix lists into non-main routing table. Lets say customer has programmed a traffic group prefix list in connected VPC non-main routing tables. Then after some time, customer chooses to delete a traffic group. When a traffic group is deleted, SDDC software could not delete the TG prefix list because TG prefix list is programmed into non-main routing table. In some cases, the TG prefix list can be programmed in multiple non-main route tables at a time. So an alarm would be raised to notify SRE team. SRE would notify customer to delete prefix list from non-main routing table. Once deleted from routing table, prefix list can be deleted by SDDC software. "<br><br>When event resolved: "Traffic Group Prefix list deletion failure is remediated. " |
If this alarm is not auto-resolved within 10 minutes, then execute the following steps: |
4.1.2 |
| Prefix List Capacity Issue Failure | High | manager | Prefix list capacity issue failure. <br><br>When event detected: "VMC App cannot program AWS managed prefix list with route/prefix because number of entries in AWS managed prefix list has reached size of the prefix list. "<br><br>When event resolved: "Capacity issue with Prefix list is remediated. " |
1. Run the API GET 'cloud-service/api/v1/infra/sddc/provider-resource-info?resource_type=managed_prefix_list' to get a list of all prefix lists from the SDDC (a scripted sketch of this check appears after this table). a) Check the 'state' and 'status_message' of each prefix list in the API output. b) If the state of any prefix list is 'modify-failed' and the status message contains the string 'The following VPC Route Table resources do not have sufficient capacity', the prefix list has run into a resizing failure. The 'status_message' specifies which route table IDs must be increased in size. c) If the API output contains an 'issues' field, it specifies which routes are missing from the managed prefix list. Calculate the number of missing routes from the 'issues' field. d) File an AWS ticket to increase the size of the routing table identified in (b) by at least the minimum size identified in (c). e) After AWS has increased the route table limit, wait for at least 1 hour and then invoke the API GET 'cloud-service/api/v1/infra/sddc/provider-resource-info?resource_type=managed_prefix_list'. Make sure the 'state' of none of the prefix lists is 'modify-failed'. |
4.1.2 |
| Prefix List Resource Share Customer Failure | Medium | manager | Failure with prefix list resource share. <br><br>When event detected: "This issue would occur when customer accidentally or intentionally clicks on 'Leave resource share' in the customer account. "<br><br>When event resolved: "Failure with prefix list resource share is remediated. " |
If this alarm is not auto-resolved within 10 minutes, then execute the following steps: |
4.1.2 |
| Resource Share Sanity Check Failure | High | manager | Failure in resource share check. <br><br>When event detected: "Customer activates prefix list mode for connected VPC from connected VPC page in networking tab of NSX manager UI. After this step, customer would be prompted to accept resource share if prefix list mode needs to be activated. If the customer does not want to activate prefix list mode, the customer could deactivate prefix list mode from connected VPC account. If the customer does not accept or reject the resource share after more than 24 hours, then this alarm would be raised. "<br><br>When event resolved: "Resource share check failure is remediated. " |
If this alarm is not auto-resolved within 10 minutes, then execute the following steps: |
4.1.2 |
| Tgw Get Attachment Failure | High | manager | Failure in fetching TGW attachment. <br><br>When event detected: "Background TGW routes update task failed to get TGW attachment related info. Possibilities of hitting this alert is very low, except regression in service or AWS side. We have added this alarm to identify the issue before customer will notice any connectivity issue. "<br><br>When event resolved: "Failure with getting TGW attachment is remediated. " |
1. Log in to NSX Manager. There are three Manager nodes; the leader node must be identified. After logging in to one node, run su admin, then get cluster status verbose, and find the TGW leader node. |
4.1.2 |
| Tgw Attachment Mismatch Failure | High | manager | Failure due to mismatch of TGW attachments. <br><br>When event detected: "Background TGW routes update task failed to get TGW attachment related info. Possibilities of hitting this alert is very low, except regression in service or AWS side. We have added this alarm to identify the issue before customer will notice any connectivity issue. "<br><br>When event resolved: "Failure due to mismatch of TGW attachments is remediated. " |
1. Log in to NSX Manager. There are three Manager nodes; the leader node must be identified. After logging in to one node, run su admin, then get cluster status verbose, and find the TGW leader node. |
4.1.2 |
| Tgw Route Table Max Failure | High | manager | TGW Route table max entries failure. <br><br>When event detected: "TGW route capacity limit is reached which results in failure. "<br><br>When event resolved: "TGW Route table max entries failure is remediated. " |
1. Log in to the NSX Manager UI, open the 'Networking & Security' tab, then navigate to the 'Transit Connect' tab. |
4.1.2 |
| Tgw Route Update Failure | High | manager | TGW Route update fails due to wrong TGW attachment size. <br><br>When event detected: "Transit Connect related configuration is not fully correctly realized. Possible issues could be: -Background TGW routes update task ran into issue. -Stale associated-groups object will cause failure in TGW route task. SDDC route advertisement and learning will be stuck which will have connectivity issue. "<br><br>When event resolved: "TGW Route update failure is remediated. " |
1. Run the following API: 'GET /cloud-service/api/v1/infra/associated-groups'. The API should return only 0 or 1 associated groups. a) If the API returns more than one associated group, do the following: log in to the VMC UI and navigate to the 'SDDC Groups' tab; find the SDDC group that contains this SDDC by checking the members of each group; remove the stale associations by running the API 'DELETE /cloud-service/api/v1/infra/associated-groups/<association-id>'. |
4.1.2 |
| Tgw Tagging Mismatch Failure | High | manager | Failure due to mismatch of TGW tags. <br><br>When event detected: "For prefix list managed routes, source and region info need to be retrieved from prefix list tags. If tagging mismatch happens, check the tagging format below and create a Jira ticket to Skynet with detailed error messages and missing tags. "<br><br>When event resolved: "Failure due to mismatch of TGW tags is remediated. " |
If this alarm is not auto-resolved within 10 minutes, then execute the following steps: |
4.1.2 |
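Step 1 of the Prefix List Capacity Issue Failure action above polls the provider-resource-info API and looks for prefix lists whose state is 'modify-failed'. The sketch below is a minimal, hedged version of that check; the manager address, credentials, and the assumed 'results' list layout need to be verified for your SDDC release.

```python
# Minimal sketch of step 1 above: query the documented
# cloud-service/api/v1/infra/sddc/provider-resource-info API for managed prefix
# lists and report any entry whose 'state' is 'modify-failed'. A 'results' list
# containing 'state' and 'status_message' fields is an assumed response layout.
import requests

NSX_MANAGER = "https://nsx-mgr.example.com"   # placeholder
AUTH = ("admin", "REPLACE_ME")                 # placeholder credentials


def failed_prefix_lists() -> list:
    resp = requests.get(
        f"{NSX_MANAGER}/cloud-service/api/v1/infra/sddc/provider-resource-info",
        params={"resource_type": "managed_prefix_list"},
        auth=AUTH,
        verify=False,  # lab only
    )
    resp.raise_for_status()
    data = resp.json()
    return [
        item for item in data.get("results", [])
        if item.get("state") == "modify-failed"
    ]


if __name__ == "__main__":
    for item in failed_prefix_lists():
        # The status_message identifies which VPC route tables need a larger size.
        print(item.get("status_message"))
```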
VPN Events
| Event Name | Severity | Node Type | Alert Message | Recommended Action | Release Introduced |
|---|---|---|---|---|---|
| IPsec Service Down | Medium | edge, autonomous-edge, public-cloud-gateway | IPsec service is down. <br><br>When event detected: "The IPsec service {entity_id} is down. Reason: {service_down_reason}. "<br><br>When event resolved: "The IPsec service {entity_id} is up. " |
1. Deactivate and reactivate the IPsec service from the NSX Manager UI. |
3.2.0 |
| IPsec Policy Based Session Down | Medium | edge, autonomous-edge, public-cloud-gateway | Policy based IPsec VPN session is down. <br><br>When event detected: "The policy based IPsec VPN session {entity_id} is down. Reason: {session_down_reason}. "<br><br>When event resolved: "The policy based IPsec VPN session {entity_id} is up. " |
Check IPsec VPN session configuration and resolve errors based on the session down reason. |
3.0.0 |
| IPsec Route Based Session Down | Medium | edge, autonomous-edge, public-cloud-gateway | Route based IPsec VPN session is down. <br><br>When event detected: "The route based IPsec VPN session {entity_id} is down. Reason: {session_down_reason}. "<br><br>When event resolved: "The route based IPsec VPN session {entity_id} is up. " |
Check IPsec VPN session configuration and resolve errors based on the session down reason. |
3.0.0 |
| IPsec Policy Based Tunnel Down | Medium | edge, autonomous-edge, public-cloud-gateway | Policy based IPsec VPN tunnels are down. <br><br>When event detected: "One or more policy based IPsec VPN tunnels in session {entity_id} are down. "<br><br>When event resolved: "All policy based IPsec VPN tunnels in session {entity_id} are up. " |
Check IPsec VPN session configuration and resolve errors based on the tunnel down reason. |
3.0.0 |
| IPsec Route Based Tunnel Down | Medium | edge, autonomous-edge, public-cloud-gateway | Route based IPsec VPN tunnel is down. <br><br>When event detected: "The route based IPsec VPN tunnel in session {entity_id} is down. Reason: {tunnel_down_reason}. "<br><br>When event resolved: "The route based IPsec VPN tunnel in session {entity_id} is up. " |
Check IPsec VPN session configuration and resolve errors based on the tunnel down reason. |
3.0.0 |
| L2VPN Session Down | Medium | edge, autonomous-edge, public-cloud-gateway | L2VPN session is down. <br><br>When event detected: "The L2VPN session {entity_id} is down. "<br><br>When event resolved: "The L2VPN session {entity_id} is up. " |
Check L2VPN session status for session down reason and resolve errors based on the reason. |
3.0.0 |