NSX Event Catalog

The following tables describe events that trigger alarms in VMware NSX®, including alarm messages and recommended actions to resolve them. Any event with a severity greater than LOW triggers an alarm. Alarm information is displayed in several locations within the NSX Manager interface. Alarm and event information is also included with other notifications in the Notifications drop-down menu in the title bar. To view alarms, navigate to the Home page and click the Alarms tab. For more information on alarms and events, see "Working with Events and Alarms" in the NSX Administration Guide.

Alarm Management Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Alarm Service Overloaded Critical global-manager, manager

The alarm service is overloaded.
When event detected: "Due to heavy volume of alarms reported, the alarm service is temporarily overloaded. The NSX UI and GET /api/v1/alarms NSX API have stopped reporting new alarms; however, syslog entries and SNMP traps (if enabled) are still being emitted reporting the underlying event details. When the underlying issues causing the heavy volume of alarms are addressed, the alarm service will start reporting new alarms again. "
When event resolved: "The heavy volume of alarms has subsided and new alarms are being reported again. "

Review all active alarms using the Alarms page in the NSX UI or using the GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED NSX API. For each active alarm investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service will start reporting new alarms again.
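The alarm review described above can be scripted against the GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED response. A minimal sketch, assuming the "results", "status", and "event_type" field names of the NSX alarm schema (verify these against your NSX version); the function name is illustrative:

```python
# Tally active alarms per event type from a GET /api/v1/alarms response body.
from collections import Counter

ACTIVE_STATUSES = {"OPEN", "ACKNOWLEDGED", "SUPPRESSED"}

def active_alarm_counts(alarms_response):
    """Return a Counter mapping event_type -> number of active alarms."""
    counts = Counter()
    for alarm in alarms_response.get("results", []):
        if alarm.get("status") in ACTIVE_STATUSES:
            counts[alarm.get("event_type", "unknown")] += 1
    return counts

# Hypothetical sample payload for illustration only.
sample = {"results": [
    {"status": "OPEN", "event_type": "certificate_expired"},
    {"status": "RESOLVED", "event_type": "certificate_expired"},
    {"status": "ACKNOWLEDGED", "event_type": "manager_cluster_latency_high"},
]}
print(active_alarm_counts(sample))
```

A sorted view of these counts shows which event types to triage first so that the alarm service resumes reporting.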

3.0.0
Heavy Volume Of Alarms Critical global-manager, manager

Heavy volume of a specific alarm type detected.
When event detected: "Due to heavy volume of {event_id} alarms, the alarm service has temporarily stopped reporting alarms of this type. The NSX UI and GET /api/v1/alarms NSX API are not reporting new instances of these alarms; however, syslog entries and SNMP traps (if enabled) are still being emitted reporting the underlying event details. When the underlying issues causing the heavy volume of {event_id} alarms are addressed, the alarm service will start reporting new {event_id} alarms when new issues are detected again. "
When event resolved: "The heavy volume of {event_id} alarms has subsided and new alarms of this type are being reported again. "

Review all active alarms of type {event_id} using the Alarms page in the NSX UI or using the NSX API GET /api/v1/alarms?status=OPEN,ACKNOWLEDGED,SUPPRESSED. For each active alarm investigate the root cause by following the recommended action for the alarm. When sufficient alarms are resolved, the alarm service will start reporting new {event_id} alarms again.

3.0.0

Audit Log Health Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Audit Log File Update Error Critical global-manager, manager, edge, public-cloud-gateway, esx, kvm, bms

At least one of the monitored log files cannot be written to.
When event detected: "At least one of the monitored log files has read-only permissions or incorrect user/group ownership on Manager, Global Manager, Edge, Public Cloud Gateway, KVM, or Linux Physical Server nodes; the log folder is missing on Windows Physical Server nodes; or rsyslog.log is missing on Manager, Global Manager, Edge, or Public Cloud Gateway nodes. "
When event resolved: "All monitored log files have the correct file permissions and ownership on Manager, Global Manager, Edge, Public Cloud Gateway, KVM, and Linux Physical Server nodes; the log folder exists on Windows Physical Server nodes; and rsyslog.log exists on Manager, Global Manager, Edge, and Public Cloud Gateway nodes. "

1. On Manager, Global Manager, Edge, Public Cloud Gateway, and Ubuntu KVM Host nodes, ensure the permissions for the /var/log directory are 775 and the ownership is root:syslog. On RHEL KVM and BMS Host nodes, ensure the permissions for the /var/log directory are 755 and the ownership is root:root.
2. On Manager and Global Manager nodes, ensure the file permissions for auth.log, nsx-audit.log, nsx-audit-write.log, rsyslog.log, and syslog under /var/log are 640 and the ownership is syslog:admin.
3. On Edge and Public Cloud Gateway nodes, ensure the file permissions for rsyslog.log and syslog under /var/log are 640 and the ownership is syslog:admin.
4. On Ubuntu KVM Host and Ubuntu Physical Server nodes, ensure the file permissions of auth.log and vmware/nsx-syslog under /var/log are 640 and the ownership is syslog:admin.
5. On RHEL KVM Host nodes and CentOS/RHEL/SLES Physical Server nodes, ensure the file permissions of vmware/nsx-syslog under /var/log are 640 and the ownership is root:root.
6. If any of these files have incorrect permissions or ownership, invoke the commands chmod <mode> <path> and chown <user>:<group> <path>.
7. If rsyslog.log is missing on Manager, Global Manager, Edge or Public Cloud Gateway nodes, invoke the NSX CLI command restart service syslog which restarts the logging service and regenerates /var/log/rsyslog.log.
8. On Windows Physical Server nodes, ensure the log folder: C:\ProgramData\VMware\NSX\Logs exists. If not, re-install NSX on the Windows Physical Server nodes.
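The permission checks in steps 1 through 5 can be automated before falling back to chmod/chown. A minimal sketch, demonstrated against a temporary file rather than a real /var/log path; an ownership check would compare st_uid/st_gid similarly:

```python
# Verify a log file's permission bits match an expected mode (e.g. 0o640).
import os
import stat
import tempfile

def mode_ok(path, expected_mode):
    """Return True if the file's permission bits equal expected_mode."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected_mode

# Demonstrate with a temporary file instead of touching /var/log.
with tempfile.NamedTemporaryFile(delete=False) as f:
    tmp = f.name
os.chmod(tmp, 0o640)
print(mode_ok(tmp, 0o640))   # True
os.chmod(tmp, 0o644)
print(mode_ok(tmp, 0o640))   # False: needs `chmod 640 <path>`
os.unlink(tmp)
```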

3.1.0
Remote Logging Server Error Critical global-manager, manager, edge, public-cloud-gateway

Log messages undeliverable due to incorrect remote logging server configuration.
When event detected: "Log messages to logging server {hostname_or_ip_address_with_port} ({entity_id}) cannot be delivered possibly due to an unresolvable FQDN, an invalid TLS certificate or missing NSX appliance iptables rule. "
When event resolved: "Configuration for logging server {hostname_or_ip_address_with_port} ({entity_id}) appears correct. "

1. Ensure that {hostname_or_ip_address_with_port} is the correct hostname or IP address and port.
2. If the logging server is specified using an FQDN, ensure the FQDN is resolvable from the NSX appliance using the NSX CLI command nslookup <fqdn>. If not resolvable, verify the correct FQDN is specified and the network DNS server has the required entry for the FQDN.
3. If the logging server is configured to use TLS, verify the specified certificate is valid. For example, ensure the logging server is actually using the certificate or verify the certificate has not expired using the openssl command openssl x509 -in <cert-file-path> -noout -dates.
4. NSX appliances use iptables rules to explicitly allow outgoing traffic. Verify the iptables rule for the logging server is configured properly by invoking the NSX CLI command verify logging-servers which re-configures logging server iptables rules as needed.
5. If for any reason the logging server is misconfigured, delete it using the NSX CLI `del logging-server <hostname-or-ip-address[:port]> proto <proto> level <level>` command and re-add it with the correct configuration.
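The `<hostname-or-ip-address[:port]>` form used in step 5 can be parsed before driving the checks in steps 1 through 4. A minimal sketch assuming syslog's standard port 514 as the default; the function name is illustrative and bracketed IPv6 literals are not handled:

```python
# Split a logging-server spec into (host, port), defaulting to port 514.
def parse_logging_server(spec, default_port=514):
    """Parse '<hostname-or-ip-address[:port]>' into a (host, port) tuple."""
    host, sep, port = spec.rpartition(":")
    if not sep:                       # no colon: bare hostname or IPv4 address
        return spec, default_port
    return host, int(port)

print(parse_logging_server("log.example.com"))    # ('log.example.com', 514)
print(parse_logging_server("10.1.1.1:6514"))      # ('10.1.1.1', 6514)
```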

3.1.0

Capacity Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Minimum Capacity Threshold Medium manager

A minimum capacity threshold has been breached.
When event detected: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} which is above the minimum capacity threshold of {min_capacity_threshold}%. "
When event resolved: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} and is at or below the minimum capacity threshold of {min_capacity_threshold}%. "

Navigate to the capacity page in the NSX UI and review current usage versus threshold limits. If the current usage is expected, consider increasing the minimum threshold values. If the current usage is unexpected, review the network policies configured to decrease usage at or below the minimum threshold.
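The threshold comparison behind the minimum and maximum capacity alarms can be expressed as a small helper; the function and argument names are illustrative, not part of the NSX API:

```python
# Classify object usage against the minimum/maximum capacity thresholds
# (percentages of the maximum supported count, as in the alarm messages).
def threshold_state(usage_count, max_supported, min_pct, max_pct):
    """Return 'above_max', 'above_min', or 'ok' for the given usage."""
    pct = usage_count / max_supported * 100
    if pct > max_pct:
        return "above_max"   # Maximum Capacity Threshold alarm territory
    if pct > min_pct:
        return "above_min"   # Minimum Capacity Threshold alarm territory
    return "ok"

print(threshold_state(85, 100, 70, 80))   # above_max
print(threshold_state(50, 100, 70, 80))   # ok
```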

3.1.0
Maximum Capacity Threshold High manager

A maximum capacity threshold has been breached.
When event detected: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} which is above the maximum capacity threshold of {max_capacity_threshold}%. "
When event resolved: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} and is at or below the maximum capacity threshold of {max_capacity_threshold}%. "

Navigate to the capacity page in the NSX UI and review current usage versus threshold limits. If the current usage is expected, consider increasing the maximum threshold values. If the current usage is unexpected, review the network policies configured to decrease usage at or below the maximum threshold.

3.1.0
Maximum Capacity Critical manager

A maximum capacity has been breached.
When event detected: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} which is above the maximum supported count of {max_supported_capacity_count}. "
When event resolved: "The number of objects defined in the system for {capacity_display_name} has reached {capacity_usage_count} and is at or below the maximum supported count of {max_supported_capacity_count}. "

Ensure that the number of NSX objects created is within the limits supported by NSX. If there are any unused objects, delete them using the respective NSX UI or API from the system. Consider increasing the form factor of all Manager nodes and/or Edge nodes. Note that the form factor of each node type should be the same. If not the same, the capacity limits for the lowest form factor deployed are used.

3.1.0

Certificates Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Certificate Expired Critical global-manager, manager

A certificate has expired.
When event detected: "Certificate {entity_id} has expired. "
When event resolved: "The expired certificate {entity_id} has been removed or is no longer expired. "

Ensure services that are currently using the certificate are updated to use a new, non-expired certificate. Once the expired certificate is no longer in use, it should be deleted by invoking the DELETE {api_collection_path}{entity_id} NSX API. If the expired certificate is used by the NAPP Platform, the connection between NSX and the NAPP Platform is broken. Check the NAPP Platform troubleshooting document for using a self-signed NAPP CA certificate to recover the connection.
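Which certificates are expired or approaching expiration can be computed from the notAfter date, for example in the text form printed by `openssl x509 -in <cert-file-path> -noout -enddate`. A minimal sketch; the function name is illustrative:

```python
# Compute days remaining until a certificate's notAfter timestamp,
# given in openssl's enddate text form, e.g. 'Jun 1 12:00:00 2030 GMT'.
from datetime import datetime

def days_until_expiry(not_after, now=None):
    """Days from `now` (default: current UTC time) until the cert expires."""
    exp = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    now = now or datetime.utcnow()
    return (exp - now).days

# Negative result means the certificate has already expired.
print(days_until_expiry("Jun 1 12:00:00 2030 GMT"))
```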

3.0.0
Certificate Is About To Expire High global-manager, manager

A certificate is about to expire.
When event detected: "Certificate {entity_id} is about to expire. "
When event resolved: "The expiring certificate {entity_id} has been removed or is no longer about to expire. "

Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. Once the expiring certificate is no longer in use, it should be deleted by invoking the DELETE {api_collection_path}{entity_id} NSX API.

3.0.0
Certificate Expiration Approaching Medium global-manager, manager

A certificate is approaching expiration.
When event detected: "Certificate {entity_id} is approaching expiration. "
When event resolved: "The expiring certificate {entity_id} has been removed or is no longer approaching expiration. "

Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. Once the expiring certificate is no longer in use, it should be deleted by invoking the DELETE {api_collection_path}{entity_id} NSX API.

3.0.0
CA Bundle Update Recommended High global-manager, manager

The update for a trusted CA bundle is recommended.
When event detected: "The trusted CA bundle {entity_id} was updated more than {ca_bundle_age_threshold} days ago. Update for the trusted CA bundle is recommended. "
When event resolved: "The trusted CA bundle {entity_id} has been removed, updated, or is no longer in use. "

Ensure services that are currently using the trusted CA bundle are updated to use a recently updated trusted CA bundle. Unless it is a system-provided bundle, the bundle can be updated using the PUT /policy/api/v1/infra/cabundles/{entity_id} NSX API. Once the outdated bundle is no longer in use, it should be deleted (if not system-provided) by invoking the DELETE /policy/api/v1/infra/cabundles/{entity_id} NSX API.

3.2.0
CA Bundle Update Suggested Medium global-manager, manager

The update for a trusted CA bundle is suggested.
When event detected: "The trusted CA bundle {entity_id} was updated more than {ca_bundle_age_threshold} days ago. Update for the trusted CA bundle is suggested. "
When event resolved: "The trusted CA bundle {entity_id} has been removed, updated, or is no longer in use. "

Ensure services that are currently using the trusted CA bundle are updated to use a recently updated trusted CA bundle. Unless it is a system-provided bundle, the bundle can be updated using the PUT /policy/api/v1/infra/cabundles/{entity_id} NSX API. Once the outdated bundle is no longer in use, it should be deleted (if not system-provided) by invoking the DELETE /policy/api/v1/infra/cabundles/{entity_id} NSX API.

3.2.0
Transport Node Certificate Expired Critical bms, edge, esx, kvm, public-cloud-gateway

A certificate has expired.
When event detected: "Certificate has expired for Transport node {entity_id}. "
When event resolved: "The expired certificate for Transport node {entity_id} has been replaced or is no longer expired. "

Replace the Transport node {entity_id} certificate with a non-expired certificate by invoking the POST /api/v1/trust-management/certificates/action/replace-host-certificate/{entity_id} NSX API. While the expired certificate remains in use, the connection between the Transport node and the Manager node is broken.

4.1.0
Transport Node Certificate Is About To Expire High bms, edge, esx, kvm, public-cloud-gateway

A certificate is about to expire.
When event detected: "Certificate for Transport node {entity_id} is about to expire. "
When event resolved: "The expiring certificate for Transport node {entity_id} has been removed or is no longer about to expire. "

Replace the Transport node {entity_id} certificate before it expires by invoking the POST /api/v1/trust-management/certificates/action/replace-host-certificate/{entity_id} NSX API. If the certificate is not replaced, the connection between the Transport node and the Manager node will break when the certificate expires.

4.1.0
Transport Node Certificate Expiration Approaching Medium bms, edge, esx, kvm, public-cloud-gateway

A certificate is approaching expiration.
When event detected: "Certificate for Transport node {entity_id} is approaching expiration. "
When event resolved: "The expiring certificate for Transport node {entity_id} has been removed or is no longer approaching expiration. "

Replace the Transport node {entity_id} certificate before it expires by invoking the POST /api/v1/trust-management/certificates/action/replace-host-certificate/{entity_id} NSX API. If the certificate is not replaced, the connection between the Transport node and the Manager node will break when the certificate expires.

4.1.0

Clustering Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Cluster Degraded Medium global-manager, manager

Group member is down.
When event detected: "Group member {manager_node_id} of service {group_type} is down. "
When event resolved: "Group member {manager_node_id} of {group_type} is up. "

1. Invoke the NSX CLI command 'get cluster status' to view the status of group members of the cluster.
2. Ensure the service for {group_type} is running on the node. Invoke the GET /api/v1/node/services/<service_name>/status NSX API or the get service <service_name> NSX CLI command to determine if the service is running. If it is not running, invoke the POST /api/v1/node/services/<service_name>?action=restart NSX API or the restart service <service_name> NSX CLI command to restart the service.
3. Check the logs of the {group_type} service under /var/log/ to see if any errors are reported.
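The status check and restart decision above can be sketched as a small helper around the GET /api/v1/node/services/<service_name>/status response. The "runtime_state" field name and "running" value follow the NSX node-services schema; verify them against your version, and the function name is illustrative:

```python
# Decide the next remediation step from a node-service status response.
def remediation(status_response):
    """Return the next action for a service given its status payload."""
    if status_response.get("runtime_state") == "running":
        # Service is up: the problem is elsewhere, look at its logs.
        return "check /var/log for errors"
    # Service is stopped: restart via API or NSX CLI.
    return "restart service"

print(remediation({"runtime_state": "running"}))
print(remediation({"runtime_state": "stopped"}))
```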

3.2.0
Cluster Unavailable High global-manager, manager

All the group members of the service are down.
When event detected: "All group members {manager_node_ids} of service {group_type} are down. "
When event resolved: "All group members {manager_node_ids} of service {group_type} are up. "

1. Ensure the service for {group_type} is running on the node. Invoke the GET /api/v1/node/services/<service_name>/status NSX API or the get service <service_name> NSX CLI command to determine if the service is running. If it is not running, invoke the POST /api/v1/node/services/<service_name>?action=restart NSX API or the restart service <service_name> NSX CLI command to restart the service.
2. Check the logs of the {group_type} service under /var/log/ to see if any errors are reported.

3.2.0

Cni Health Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Hyperbus Manager Connection Down On DPU Medium dpu

Hyperbus on DPU cannot communicate with the Manager node.
When event detected: "Hyperbus on DPU {dpu_id} cannot communicate with the Manager node. "
When event resolved: "Hyperbus on DPU {dpu_id} can communicate with the Manager node. "

The hyperbus vmkernel interface (vmk50) on DPU {dpu_id} may be missing. Refer to Knowledge Base article https://kb.vmware.com/s/article/67432 .

4.0.0
Hyperbus Manager Connection Down Medium esx, kvm

Hyperbus cannot communicate with the Manager node.
When event detected: "Hyperbus cannot communicate with the Manager node. "
When event resolved: "Hyperbus can communicate with the Manager node. "

The hyperbus vmkernel interface (vmk50) may be missing. Refer to Knowledge Base article https://kb.vmware.com/s/article/67432 .

3.0.0

Communication Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Limited Reachability On DPU Medium dpu

The given collector cannot be reached via vmknic(s) on the given DVS on the DPU.
When event detected: "The {vertical_name} collector {collector_ip} can not be reached via vmknic(s)(stack {stack_alias}) on DVS {dvs_alias} on DPU {dpu_id}, but is reachable via vmknic(s)(stack {stack_alias}) on other DVS(es). "
When event resolved: "The {vertical_name} collector {collector_ip} can be reached via vmknic(s) (stack {stack_alias}) on DVS {dvs_alias} on DPU {dpu_id}, or the {vertical_name} collector {collector_ip} is unreachable completely. "

This warning does not mean the collector is unreachable: exported flows generated by the vertical based on DVS {dvs_alias} can still reach the collector {collector_ip} via vmknic(s) on DVS(es) other than DVS {dvs_alias}. If this is unacceptable, create vmknic(s) with stack {stack_alias} on DVS {dvs_alias}, configure them with appropriate IPv4/IPv6 addresses, and then verify that the {vertical_name} collector {collector_ip} can be reached via the newly created vmknic(s) on DPU {dpu_id} by invoking vmkping {collector_ip} -S {stack_alias} -I vmkX (with SSH to the DPU via ESXi enabled).

4.0.1
Unreachable Collector On DPU Critical dpu

The given collector cannot be reached via any existing vmknic(s) on the DPU.
When event detected: "The {vertical_name} collector {collector_ip} can not be reached via existing vmknic(s)(stack {stack_alias}) on any DVS on DPU {dpu_id}. "
When event resolved: "The {vertical_name} collector {collector_ip} can be reached with existing vmknic(s)(stack {stack_alias}) now on DPU {dpu_id}. "

To make the collector reachable for the given vertical on the DVS, ensure that vmknic(s) with the expected stack {stack_alias} exist on DPU {dpu_id} and are configured with appropriate IPv4/IPv6 addresses, and that network connectivity to the {vertical_name} collector {collector_ip} is working. Perform the required configuration on DPU {dpu_id} until these conditions are met. If vmkping {collector_ip} -S {stack_alias} succeeds (with SSH to the DPU via ESXi enabled), the problem is resolved.

4.0.1
Manager Cluster Latency High Medium manager

The average network latency between Manager nodes is high.
When event detected: "The average network latency between Manager nodes {manager_node_id} ({appliance_address}) and {remote_manager_node_id} ({remote_appliance_address}) is more than 10ms for the last 5 minutes. "
When event resolved: "The average network latency between Manager nodes {manager_node_id} ({appliance_address}) and {remote_manager_node_id} ({remote_appliance_address}) is within 10ms. "

Ensure there are no firewall rules blocking ping traffic between the Manager nodes. If there are other high bandwidth servers and applications sharing the local network, consider moving these to a different network.
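The 10 ms threshold from the alarm message can be checked against a window of ping samples between the Manager nodes. A minimal sketch; the function name is illustrative:

```python
# Compare average inter-Manager latency (ms) over a sample window
# against the 10 ms threshold used by this alarm.
def latency_alarm(samples_ms, threshold_ms=10.0):
    """True if the average latency over the window breaches the threshold."""
    return sum(samples_ms) / len(samples_ms) > threshold_ms

print(latency_alarm([12.0, 11.5, 14.2]))   # True: average above 10 ms
print(latency_alarm([2.1, 3.4, 4.0]))      # False: within 10 ms
```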

3.1.0
Control Channel To Manager Node Down Too Long Critical bms, edge, esx, kvm, public-cloud-gateway

The Transport node's control plane connection to the Manager node has been down for too long.
When event detected: "The Transport node {entity_id} control plane connection to Manager node {appliance_address} is down for at least {timeout_in_minutes} minutes from the Transport node's point of view. "
When event resolved: "The Transport node {entity_id} restores the control plane connection to Manager node {appliance_address}. "

1. Check the connectivity from Transport node {entity_id} to the Manager node {appliance_address} interface via ping. If it is not pingable, check for flakiness in network connectivity.
2. Use the netstat output to check whether the TCP connection is established and the Controller service on Manager node {appliance_address} is listening for connections on port 1235. If not, check firewall or iptables rules to see whether port 1235 connection requests from Transport node {entity_id} are being blocked. Ensure that no host firewalls or network firewalls in the underlay are blocking the required IP ports between Manager nodes and Transport nodes; these are documented in the ports and protocols tool at https://ports.vmware.com/ .
3. The Transport node {entity_id} may still be in maintenance mode. You can check whether the Transport node is in maintenance mode via the GET https://<nsx-mgr>/api/v1/transport-nodes/<tn-uuid> API. When maintenance mode is set, the Transport node is not connected to the Controller service. This is usually the case while a host upgrade is in progress. Wait a few minutes and check connectivity again.
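The port 1235 reachability check in step 2 can be scripted with a plain TCP connect, run from a host that should be able to reach the Manager node. A minimal sketch; the demonstration below uses a local listener since no Manager node is assumed:

```python
# Test whether a host accepts TCP connections on a given port,
# e.g. port 1235 for the Controller service on a Manager node.
import socket

def port_open(host, port, timeout=3.0):
    """True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demonstrate against a local listener rather than a real Manager node.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))     # OS picks a free port
srv.listen(1)
port = srv.getsockname()[1]
print(port_open("127.0.0.1", port))   # True: listener accepts
srv.close()
print(port_open("127.0.0.1", port))   # False: connection refused
```

A False result against a Manager node's port 1235 points at firewall or iptables rules rather than the Controller service itself.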

3.1.0
Control Channel To Manager Node Down Medium bms, edge, esx, kvm, public-cloud-gateway

Transport node's control plane connection to the Manager node is down.
When event detected: "The Transport node {entity_id} control plane connection to Manager node {appliance_address} is down for at least {timeout_in_minutes} minutes from the Transport node's point of view. "
When event resolved: "The Transport node {entity_id} restores the control plane connection to Manager node {appliance_address}. "

1. Check the connectivity from Transport node {entity_id} to the Manager node {appliance_address} interface via ping. If it is not pingable, check for flakiness in network connectivity.
2. Use the netstat output to check whether the TCP connection is established and the Controller service on Manager node {appliance_address} is listening for connections on port 1235. If not, check firewall or iptables rules to see whether port 1235 connection requests from Transport node {entity_id} are being blocked. Ensure that no host firewalls or network firewalls in the underlay are blocking the required IP ports between Manager nodes and Transport nodes; these are documented in the ports and protocols tool at https://ports.vmware.com/ .
3. The Transport node {entity_id} may still be in maintenance mode. You can check whether the Transport node is in maintenance mode via the GET https://<nsx-mgr>/api/v1/transport-nodes/<tn-uuid> API. When maintenance mode is set, the Transport node is not connected to the Controller service. This is usually the case while a host upgrade is in progress. Wait a few minutes and check connectivity again. Note: This alarm is not critical and should be resolved; GSS need not be contacted for this alarm unless it remains unresolved over an extended period of time.

3.1.0
Control Channel To Transport Node Down Medium manager

The Controller service's connection to the Transport node is down.
When event detected: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) to Transport node {transport_node_name} ({entity_id}) down for at least three minutes from Controller service's point of view. "
When event resolved: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) restores connection to Transport node {entity_id}. "

1. Check the connectivity between the Controller service {central_control_plane_id} and the Transport node {entity_id} interface via ping and traceroute. This can be done from the NSX Manager node admin CLI. The ping test should show no drops and consistent latency values; VMware recommends latency values of 150 ms or less.
2. Navigate to System | Fabric | Nodes | Transport node {entity_id} in the NSX UI to check whether the TCP connection between the Controller service on Manager node {appliance_address} ({central_control_plane_id}) and Transport node {entity_id} is established. If not, check firewall rules on the network and the hosts to see whether port 1235 connection requests from Transport node {entity_id} are being blocked. Ensure that no host firewalls or network firewalls in the underlay are blocking the required IP ports between Manager nodes and Transport nodes; these are documented in the ports and protocols tool at https://ports.vmware.com/ .

3.1.0
Control Channel To Transport Node Down Long Critical manager

The Controller service's connection to the Transport node has been down for too long.
When event detected: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) to Transport node {transport_node_name} ({entity_id}) down for at least 15 minutes from Controller service's point of view. "
When event resolved: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) restores connection to Transport node {entity_id}. "

1. Check the connectivity between the Controller service {central_control_plane_id} and the Transport node {entity_id} interface via ping and traceroute. This can be done from the NSX Manager node admin CLI. The ping test should show no drops and consistent latency values; VMware recommends latency values of 150 ms or less.
2. Navigate to System | Fabric | Nodes | Transport node {entity_id} in the NSX UI to check whether the TCP connection between the Controller service on Manager node {appliance_address} ({central_control_plane_id}) and Transport node {entity_id} is established. If not, check firewall rules on the network and the hosts to see whether port 1235 connection requests from Transport node {entity_id} are being blocked. Ensure that no host firewalls or network firewalls in the underlay are blocking the required IP ports between Manager nodes and Transport nodes; these are documented in the ports and protocols tool at https://ports.vmware.com/ .

3.1.0
Control Channel To Antrea Cluster Down Medium manager

The Controller service's connection to the Antrea cluster is down.
When event detected: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) to Antrea cluster {antrea_cluster_node_name} ({entity_id}) down for at least three minutes from Controller service's point of view. "
When event resolved: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) restores connection to Antrea cluster {entity_id}. "

1. Check whether the Antrea Kubernetes cluster has been deleted.
2. Check for control plane network connectivity issues.
3. Make sure the Antrea adapter has not crashed or been deleted.
4. Check whether there are issues with the client certificate used for the Antrea to NSX integration.
5. Check the adapter version and make sure it is compatible with the Antrea version. Refer to the Administration Guide for additional troubleshooting details.

4.1.1
Control Channel To Antrea Cluster Down Long Critical manager

The Controller service's connection to the Antrea cluster has been down for too long.
When event detected: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) to Antrea cluster {antrea_cluster_node_name} ({entity_id}) down for at least 15 minutes from Controller service's point of view. "
When event resolved: "Controller service on Manager node {appliance_address} ({central_control_plane_id}) restores connection to Antrea cluster {entity_id}. "

1. Check whether the Antrea Kubernetes cluster has been deleted.
2. Check for control plane network connectivity issues.
3. Make sure the Antrea adapter has not crashed or been deleted.
4. Check whether there are issues with the client certificate used for the Antrea to NSX integration.
5. Check the adapter version and make sure it is compatible with the Antrea version. Refer to the Administration Guide for additional troubleshooting details.

4.1.1
Manager Control Channel Down Critical manager

Manager to controller channel is down.
When event detected: "The communication between the management function and the control function has failed on Manager node {manager_node_name} ({appliance_address}). "
When event resolved: "The communication between the management function and the control function has been restored on Manager node {manager_node_name} ({appliance_address}). "

1. On Manager node {manager_node_name} ({appliance_address}), invoke the following NSX CLI command: get service applianceproxy to check the status of the service periodically for 60 minutes.
2. If the service is not running for more than 60 minutes, invoke the following NSX CLI command: restart service applianceproxy and recheck the status. If the service is still down, contact VMware Support.

3.0.2
Management Channel To Transport Node Down Medium manager

Management channel to Transport node is down.
When event detected: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is down for 5 minutes. "
When event resolved: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is up. "

Ensure there is network connectivity between the Manager nodes and Transport node {transport_node_name} ({transport_node_address}) and no firewalls are blocking traffic between the nodes. On Windows Transport nodes, ensure the nsx-proxy service is running on the Transport node by invoking the command C:\NSX\nsx-proxy\nsx-proxy.ps1 status in the Windows PowerShell. If it is not running, restart it by invoking the command C:\NSX\nsx-proxy\nsx-proxy.ps1 restart. On all other Transport nodes, ensure the nsx-proxy service is running on the Transport node by invoking the command /etc/init.d/nsx-proxy status. If it is not running, restart it by invoking the command /etc/init.d/nsx-proxy restart.
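The platform-specific nsx-proxy commands above can be selected programmatically when remediating many Transport nodes. A minimal sketch; the function name and the platform keys are illustrative, the command strings are the ones documented here:

```python
# Pick the nsx-proxy status/restart commands for a Transport node platform.
COMMANDS = {
    "windows": (r"C:\NSX\nsx-proxy\nsx-proxy.ps1 status",
                r"C:\NSX\nsx-proxy\nsx-proxy.ps1 restart"),
    # All other Transport node platforms use the init script.
    "linux":   ("/etc/init.d/nsx-proxy status",
                "/etc/init.d/nsx-proxy restart"),
}

def nsx_proxy_commands(platform):
    """Return (status_command, restart_command) for the given platform."""
    return COMMANDS["windows" if platform == "windows" else "linux"]

print(nsx_proxy_commands("windows")[0])
print(nsx_proxy_commands("esx")[1])
```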

3.0.2
Management Channel To Transport Node Down Long Critical manager

Management channel to Transport node is down for too long.
When event detected: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is down for 15 minutes. "
When event resolved: "Management channel to Transport Node {transport_node_name} ({transport_node_address}) is up. "

Ensure there is network connectivity between the Manager nodes and Transport node {transport_node_name} ({transport_node_address}) and no firewalls are blocking traffic between the nodes. On Windows Transport nodes, ensure the nsx-proxy service is running on the Transport node by invoking the command C:\NSX\nsx-proxy\nsx-proxy.ps1 status in the Windows PowerShell. If it is not running, restart it by invoking the command C:\NSX\nsx-proxy\nsx-proxy.ps1 restart. On all other Transport nodes, ensure the nsx-proxy service is running on the Transport node by invoking the command /etc/init.d/nsx-proxy status. If it is not running, restart it by invoking the command /etc/init.d/nsx-proxy restart.

3.0.2
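The check-and-restart sequence in the recommended action can be sketched as a small POSIX shell helper. This is a sketch only: the init.d paths come from the text above, and Windows Transport nodes use the nsx-proxy.ps1 script instead.

```shell
# Check a service's status command and restart it only if it is down.
# The helper takes both commands as arguments so it is easy to exercise
# outside an NSX appliance.
ensure_running() {
  status_cmd="$1"
  restart_cmd="$2"
  if $status_cmd >/dev/null 2>&1; then
    echo "already running"
  else
    $restart_cmd && echo "restarted"
  fi
}

# On a Linux transport node (Windows nodes use C:\NSX\nsx-proxy\nsx-proxy.ps1):
# ensure_running "/etc/init.d/nsx-proxy status" "/etc/init.d/nsx-proxy restart"
```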
Manager FQDN Lookup Failure Critical global-manager, bms, edge, esx, kvm, manager, public-cloud-gateway

DNS lookup failed for Manager node's FQDN.
When event detected: "DNS lookup failed for Manager node {entity_id} with FQDN {appliance_fqdn} and the publish_fqdns flag was set. "
When event resolved: "FQDN lookup succeeded for Manager node {entity_id} with FQDN {appliance_fqdn} or the publish_fqdns flag was cleared. "

1. Assign correct FQDNs to all Manager nodes and verify the DNS configuration is correct for successful lookup of all Manager nodes' FQDNs.
2. Alternatively, stop using FQDNs by invoking the NSX API PUT /api/v1/configs/management with publish_fqdns set to false in the request body. After that, calls from Transport nodes and from Federation to Manager nodes in this cluster will use only IP addresses.

3.1.0
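The alternative in step 2 can be sketched as follows. The manager address and revision value are placeholders; fetch the current configuration first so the PUT body can echo back the current _revision.

```shell
# Placeholders: substitute your own manager address and the _revision value
# returned by the initial GET.
MGR="nsx-mgr.example.com"
URL="https://$MGR/api/v1/configs/management"
BODY='{"publish_fqdns": false, "_revision": 0}'

# Fetch the current config (and its _revision):
# curl -k -u admin "$URL"
# Then disable FQDN publishing:
# curl -k -u admin -X PUT "$URL" -H 'Content-Type: application/json' -d "$BODY"
```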
Manager FQDN Reverse Lookup Failure Critical global-manager, manager

Reverse DNS lookup failed for Manager node's IP address.
When event detected: "Reverse DNS lookup failed for Manager node {entity_id} with IP address {appliance_address} and the publish_fqdns flag was set. "
When event resolved: "Reverse DNS lookup succeeded for Manager node {entity_id} with IP address {appliance_address} or the publish_fqdns flag was cleared. "

1. Assign correct FQDNs to all Manager nodes and verify the DNS configuration is correct for successful reverse lookup of the Manager node's IP address.
2. Alternatively, stop using FQDNs by invoking the NSX API PUT /api/v1/configs/management with publish_fqdns set to false in the request body. After that, calls from Transport nodes and from Federation to Manager nodes in this cluster will use only IP addresses.

3.1.0
Management Channel To Manager Node Down Medium bms, edge, esx, kvm, public-cloud-gateway

Management channel to Manager node is down.
When event detected: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is down for 5 minutes. "
When event resolved: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is up. "

Ensure there is network connectivity between Transport node {transport_node_id} and the leader Manager node, and that no firewalls are blocking traffic between the nodes. Ensure the messaging manager service is running on the Manager nodes by invoking the command /etc/init.d/messaging-manager status. If the messaging manager is not running, restart it by invoking the command /etc/init.d/messaging-manager restart.

3.2.0
Management Channel To Manager Node Down Long Critical bms, edge, esx, kvm, public-cloud-gateway

Management channel to Manager node is down for too long.
When event detected: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is down for 15 minutes. "
When event resolved: "Management channel to Manager Node {manager_node_id} ({appliance_address}) is up. "

Ensure there is network connectivity between Transport node {transport_node_id} and the leader Manager node, and that no firewalls are blocking traffic between the nodes. Ensure the messaging manager service is running on the Manager nodes by invoking the command /etc/init.d/messaging-manager status. If the messaging manager is not running, restart it by invoking the command /etc/init.d/messaging-manager restart.

3.2.0
Network Latency High Medium manager

Management to Transport node network latency is high.
When event detected: "The average network latency between manager nodes and host {transport_node_name} ({transport_node_address}) is more than 150 ms for 5 minutes. "
When event resolved: "The average network latency between manager nodes and host {transport_node_name} ({transport_node_address}) is normal. "

1. Wait for 5 minutes to see if the alarm resolves automatically.
2. Ping the NSX Transport node from the Manager node. The ping test should show no drops and consistent latency values. VMware recommends latency values of 150 ms or less.
3. Inspect for any other physical network layer issues. If the problem persists, contact VMware Support.

4.0.0
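Step 2 can be scripted: ping the transport node from a Manager node and compare the average round-trip time against the 150 ms threshold this alarm uses. The parser below assumes the Linux ping summary format (rtt min/avg/max/mdev = a/b/c/d ms); the sample host name is a placeholder.

```shell
# Extract the average RTT from a Linux `ping` summary line.
parse_avg() {
  echo "$1" | awk -F'/' '/min\/avg\/max/ { print $5 }'
}

# Exit 0 when the average latency exceeds the 150 ms alarm threshold.
latency_high() {
  avg=$(parse_avg "$1")
  awk -v a="$avg" 'BEGIN { exit !(a > 150) }'
}

# Usage on a Manager node (host address is a placeholder):
# summary=$(ping -c 20 transport-node.example.com | tail -1)
# latency_high "$summary" && echo "average latency above 150 ms"
```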

DHCP Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Pool Lease Allocation Failed High edge, autonomous-edge, public-cloud-gateway

IP addresses in an IP Pool have been exhausted.
When event detected: "The addresses in IP Pool {entity_id} of DHCP Server {dhcp_server_id} have been exhausted. The last DHCP request has failed and future requests will fail. "
When event resolved: "IP Pool {entity_id} of DHCP Server {dhcp_server_id} is no longer exhausted. A lease is successfully allocated to the last DHCP request. "

Review the DHCP pool configuration in the NSX UI or on the Edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool. Also review the current active leases on the Edge node by invoking the NSX CLI command get dhcp lease. Compare the leases to the number of active VMs. Consider reducing the lease time in the DHCP server configuration if the number of VMs is low compared to the number of active leases. Also consider expanding the pool range for the DHCP server by visiting the Networking | Segments | Segment page in the NSX UI.

3.0.0
Pool Overloaded Medium edge, autonomous-edge, public-cloud-gateway

An IP Pool is overloaded.
When event detected: "The DHCP Server {dhcp_server_id} IP Pool {entity_id} usage is approaching exhaustion with {dhcp_pool_usage}% IPs allocated. "
When event resolved: "The DHCP Server {dhcp_server_id} IP Pool {entity_id} has fallen below the high usage threshold. "

Review the DHCP pool configuration in the NSX UI or on the Edge node where the DHCP server is running by invoking the NSX CLI command get dhcp ip-pool. Also review the current active leases on the Edge node by invoking the NSX CLI command get dhcp lease. Compare the leases to the number of active VMs. Consider reducing the lease time in the DHCP server configuration if the number of VMs is low compared to the number of active leases. Also consider expanding the pool range for the DHCP server by visiting the Networking | Segments | Segment page in the NSX UI.

3.0.0
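The lease-versus-pool comparison in the recommended actions amounts to a simple usage calculation. The figures below are hypothetical stand-ins for the counts returned by get dhcp lease and get dhcp ip-pool:

```shell
# Integer pool usage percentage from a lease count and a pool size.
pool_usage() {
  leases="$1"
  pool_size="$2"
  echo $(( leases * 100 / pool_size ))
}

# Hypothetical figures: 230 active leases in a 254-address pool.
usage=$(pool_usage 230 254)
echo "pool usage: ${usage}%"
if [ "$usage" -ge 80 ]; then
  echo "consider shortening the lease time or expanding the pool range"
fi
```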

Distributed Firewall Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
DFW CPU Usage Very High Critical esx

DFW CPU usage is very high.
When event detected: "The DFW CPU usage on Transport node {transport_node_name} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The DFW CPU usage on Transport node {transport_node_name} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. "

Consider re-balancing the VM workloads on this host to other hosts. Review the security design for optimization. For example, use the apply-to configuration if the rules are not applicable to the entire datacenter.

3.0.0
DFW CPU Usage Very High On DPU Critical dpu

DFW CPU usage is very high on dpu.
When event detected: "The DFW CPU usage on Transport node {transport_node_name} has reached {system_resource_usage}% on DPU {dpu_id} which is at or above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The DFW CPU usage on Transport node {transport_node_name} has reached {system_resource_usage}% on DPU {dpu_id} which is below the very high threshold value of {system_usage_threshold}%. "

Consider re-balancing the VM workloads on this host to other hosts. Review the security design for optimization. For example, use the apply-to configuration if the rules are not applicable to the entire datacenter.

4.0.0
DFW Memory Usage Very High Critical esx

DFW Memory usage is very high.
When event detected: "The DFW Memory usage {heap_type} on Transport node {transport_node_name} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The DFW Memory usage {heap_type} on Transport node {transport_node_name} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. "

View the current DFW memory usage by invoking the NSX CLI command get firewall thresholds on the host. Consider re-balancing the workloads on this host to other hosts.

3.0.0
DFW Memory Usage Very High On DPU Critical dpu

DFW Memory usage is very high on DPU.
When event detected: "The DFW Memory usage {heap_type} on Transport node {transport_node_name} has reached {system_resource_usage}% on DPU {dpu_id} which is at or above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The DFW Memory usage {heap_type} on Transport node {transport_node_name} has reached {system_resource_usage}% on DPU {dpu_id} which is below the very high threshold value of {system_usage_threshold}%. "

View the current DFW memory usage by invoking the NSX CLI command get firewall thresholds on the DPU. Consider re-balancing the workloads on this host to other hosts.

4.0.0
DFW Vmotion Failure Critical esx

DFW vMotion failed, port disconnected.
When event detected: "The DFW vMotion for DFW filter {entity_id} on destination host {transport_node_name} has failed and the port for the entity has been disconnected. "
When event resolved: "The DFW configuration for DFW filter {entity_id} on the destination host {transport_node_name} has succeeded and the error caused by the DFW vMotion failure has been cleared. "

Check the VMs on the host in NSX Manager and manually repush the DFW configuration through the NSX Manager UI. The DFW policy to repush can be traced by the DFW filter {entity_id}. Also consider finding the VM to which the DFW filter is attached and restarting it.

3.2.0
DFW Flood Limit Warning Medium esx

DFW flood limit has reached warning level.
When event detected: "The DFW flood limit for DFW filter {entity_id} on host {transport_node_name} has reached warning level of 80% of the configured limit for protocol {protocol_name}. "
When event resolved: "The warning flood limit condition for DFW filter {entity_id} on host {transport_node_name} for protocol {protocol_name} is cleared. "

Check the VMs on the host in NSX Manager and check the configured flood warning level of the DFW filter {entity_id} for protocol {protocol_name}.

4.1.0
DFW Flood Limit Critical Critical esx

DFW flood limit has reached critical level.
When event detected: "The DFW flood limit for DFW filter {entity_id} on host {transport_node_name} has reached critical level of 98% of the configured limit for protocol {protocol_name}. "
When event resolved: "The critical flood limit condition for DFW filter {entity_id} on host {transport_node_name} for protocol {protocol_name} is cleared. "

Check the VMs on the host in NSX Manager and check the configured flood critical level of the DFW filter {entity_id} for protocol {protocol_name}.

4.1.0
DFW Session Count High Critical esx

DFW session count is high.
When event detected: "The DFW session count on Transport node {entity_id} is high; it has reached {system_resource_usage}% which is at or above the threshold value of {system_usage_threshold}%. "
When event resolved: "The DFW session count on Transport node {entity_id} has reached {system_resource_usage}% which is below the threshold value of {system_usage_threshold}%. "

Review the network traffic load level of the workloads on the host. Consider re-balancing the workloads on this host to other hosts.

3.2.0
DFW Rules Limit Per vNIC Exceeded Critical esx

DFW rules limit per vNIC is about to exceed the maximum limit.
When event detected: "The DFW rules limit for VIF {entity_id} on destination host {transport_node_name} is about to exceed the maximum limit. "
When event resolved: "The DFW rules limit for VIF {entity_id} on the destination host {transport_node_name} dropped below maximum limit. "

Log in to the ESX host {transport_node_name} and invoke the NSX CLI command get firewall <VIF_UUID> ruleset rules to get the statistics for rules configured on the corresponding VIF. Reduce the number of rules configured for VIF {entity_id}.

4.0.0
DFW Rules Limit Per vNIC Approaching Medium esx

DFW rules limit per vNIC is approaching the maximum limit.
When event detected: "The DFW rules limit for VIF {entity_id} on destination host {transport_node_name} is approaching the maximum limit. "
When event resolved: "The DFW rules limit for VIF {entity_id} on the destination host {transport_node_name} dropped below the threshold. "

Log in to the ESX host {transport_node_name} and invoke the NSX CLI command get firewall <VIF_UUID> ruleset rules to get the statistics for rules configured on the corresponding VIF. Reduce the number of rules configured for VIF {entity_id}.

4.0.0
DFW Rules Limit Per Host Exceeded Critical esx

DFW rules limit per host is about to exceed the maximum limit.
When event detected: "The DFW rules limit for host {transport_node_name} is about to exceed the maximum limit. "
When event resolved: "The DFW rules limit for host {transport_node_name} dropped below maximum limit. "

Log in to the ESX host {transport_node_name} and invoke the NSX CLI command get firewall rule-stats total to get the statistics for rules configured on the ESX host {transport_node_name}. Reduce the number of rules configured for host {transport_node_name}. Check the number of rules configured for individual VIFs by using the NSX CLI command get firewall <VIF_UUID> ruleset rules, and reduce those rule counts as well.

4.0.0
DFW Rules Limit Per Host Approaching Medium esx

DFW rules limit per host is approaching the maximum limit.
When event detected: "The DFW rules limit for host {transport_node_name} is approaching the maximum limit. "
When event resolved: "The DFW rules limit for host {transport_node_name} dropped below the threshold. "

Log in to the ESX host {transport_node_name} and invoke the NSX CLI command get firewall rule-stats total to get the statistics for rules configured on the ESX host {transport_node_name}. Reduce the number of rules configured for host {transport_node_name}. Check the number of rules configured for individual VIFs by using the NSX CLI command get firewall <VIF_UUID> ruleset rules, and reduce those rule counts as well.

4.0.0
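Comparing per-VIF rule counts, as the recommended actions above describe, can be scripted around the CLI output. Because the exact format of get firewall <VIF_UUID> ruleset rules varies by release, the parser below counts rule lines in a hypothetical dump format and flags a VIF that is at or over a given limit; it is an illustration, not a reproduction of the real CLI output.

```shell
# Count lines that look like rule entries ("Rule <id> ...") in a ruleset dump.
count_rules() {
  grep -c '^Rule '
}

# Warn when the rule count for a VIF reaches the supplied limit.
check_vif() {
  vif="$1"; limit="$2"
  n=$(count_rules)
  if [ "$n" -ge "$limit" ]; then
    echo "$vif: $n rules (limit $limit) - reduce rules for this VIF"
  fi
}

# Hypothetical usage with a two-line dump:
# printf 'Rule 1012 allow\nRule 1013 drop\n' | check_vif vif-01 2
```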

Distributed IDS IPS Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Max Events Reached Medium manager

Max number of intrusion events reached.
When event detected: "The number of intrusion events in the system is {ids_events_count} which is higher than the maximum allowed value {max_ids_events_allowed}. "
When event resolved: "The number of intrusion events in the system is {ids_events_count} which is below the maximum allowed value {max_ids_events_allowed}. "

No manual intervention is required. A purge job kicks in automatically every 3 minutes and deletes 10% of the older records to bring the total intrusion event count in the system below the threshold value of 1.5 million events.

3.1.0
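As a rough illustration of the purge behavior described above (10% of the older records removed every 3 minutes), the number of cycles needed to fall below the 1.5 million threshold can be estimated; the starting backlog below is hypothetical.

```shell
# Estimate how many purge cycles (each removing 10% of records) it takes for
# the event count to drop below the 1.5 million threshold.
purge_cycles() {
  awk -v n="$1" -v limit=1500000 'BEGIN {
    c = 0
    while (n >= limit) { n *= 0.9; c++ }
    print c
  }'
}

# purge_cycles 2000000   # hypothetical 2 M backlog: 3 cycles, about 9 minutes
```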
NSX IDPS Engine Memory Usage High Medium esx

NSX-IDPS engine memory usage reaches 75% or above.
When event detected: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is at or above the high threshold value of 75%. "
When event resolved: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is below the high threshold value of 75%. "

Consider re-balancing the VM workloads on this host to other hosts.

3.1.0
NSX IDPS Engine Memory Usage High On DPU Medium dpu

NSX-IDPS engine memory usage reaches 75% or above on DPU.
When event detected: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is at or above the high threshold value of 75% on DPU {dpu_id}. "
When event resolved: "NSX-IDPS engine memory usage on DPU {dpu_id} has reached {system_resource_usage}%, which is below the high threshold value of 75%. "

Consider re-balancing the VM workloads on this host to other hosts.

4.0.0
NSX IDPS Engine Memory Usage Medium High High esx

NSX-IDPS Engine memory usage reaches 85% or above.
When event detected: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is at or above the medium high threshold value of 85%. "
When event resolved: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is below the medium high threshold value of 85%. "

Consider re-balancing the VM workloads on this host to other hosts.

3.1.0
NSX IDPS Engine Memory Usage Medium High On DPU High dpu

NSX-IDPS Engine memory usage reaches 85% or above on DPU.
When event detected: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is at or above the medium high threshold value of 85% on DPU {dpu_id}. "
When event resolved: "NSX-IDPS engine memory usage on DPU {dpu_id} has reached {system_resource_usage}%, which is below the medium high threshold value of 85%. "

Consider re-balancing the VM workloads on this host to other hosts.

4.0.0
NSX IDPS Engine Memory Usage Very High Critical esx

NSX-IDPS engine memory usage reaches 95% or above.
When event detected: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is at or above the very high threshold value of 95%. "
When event resolved: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is below the very high threshold value of 95%. "

Consider re-balancing the VM workloads on this host to other hosts.

3.1.0
NSX IDPS Engine Memory Usage Very High On DPU Critical dpu

NSX-IDPS engine memory usage reaches 95% or above on DPU.
When event detected: "NSX-IDPS engine memory usage has reached {system_resource_usage}%, which is at or above the very high threshold value of 95% on DPU {dpu_id}. "
When event resolved: "NSX-IDPS engine memory usage on DPU {dpu_id} has reached {system_resource_usage}%, which is below the very high threshold value of 95%. "

Consider re-balancing the VM workloads on this host to other hosts.

4.0.0
NSX IDPS Engine CPU Usage High (deprecated) Medium esx

NSX-IDPS engine CPU usage reaches 75% or above.
When event detected: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is at or above the high threshold value of 75%. "
When event resolved: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is below the high threshold value of 75%. "

Consider re-balancing the VM workloads on this host to other hosts.

3.1.0
NSX IDPS Engine CPU Usage Medium High (deprecated) High esx

NSX-IDPS engine CPU usage reaches 85% or above.
When event detected: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is at or above the medium high threshold value of 85%. "
When event resolved: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is below the medium high threshold value of 85%. "

Consider re-balancing the VM workloads on this host to other hosts.

3.1.0
NSX IDPS Engine CPU Usage Very High (deprecated) Critical esx

NSX-IDPS engine CPU usage reaches 95% or above.
When event detected: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is at or above the very high threshold value of 95%. "
When event resolved: "NSX-IDPS engine CPU usage has reached {system_resource_usage}%, which is below the very high threshold value of 95%. "

Consider re-balancing the VM workloads on this host to other hosts.

3.1.0
NSX IDPS Engine Down Critical esx

NSX IDPS is activated via NSX Policy and IDPS rules are configured, but NSX-IDPS engine is down.
When event detected: "NSX IDPS is activated via NSX policy and IDPS rules are configured, but NSX-IDPS engine is down. "
When event resolved: "NSX IDPS is in one of the cases below. 1. NSX IDPS is deactivated via NSX policy. 2. NSX IDPS engine is activated, NSX-IDPS engine and vdpi are up, and NSX IDPS has been activated and IDPS rules are configured via NSX Policy. "

1. Check /var/log/nsx-syslog.log to see if there are errors reported.
2. Invoke the NSX CLI command get ids engine status to check if NSX Distributed IDPS is in the deactivated state. If so, invoke /etc/init.d/nsx-idps start to start the service.
3. Invoke /etc/init.d/nsx-vdpi status to check if nsx-vdpi is running. If not, invoke /etc/init.d/nsx-vdpi start to start the service.

3.1.0
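Steps 2 and 3 can be combined into one loop over the two services. INIT_DIR is parameterized here only so the sketch can be exercised outside an ESX host; on a real host it is /etc/init.d.

```shell
# Check nsx-idps and nsx-vdpi, starting whichever is not running, per steps
# 2 and 3 of the recommended action.
INIT_DIR="${INIT_DIR:-/etc/init.d}"

ensure_idps_services() {
  for svc in nsx-idps nsx-vdpi; do
    if "$INIT_DIR/$svc" status >/dev/null 2>&1; then
      echo "$svc running"
    else
      "$INIT_DIR/$svc" start
    fi
  done
}
```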
NSX IDPS Engine Down On DPU Critical dpu

NSX IDPS is activated via NSX Policy and IDPS rules are configured, but NSX-IDPS engine is down on DPU.
When event detected: "NSX IDPS is activated via NSX policy and IDPS rules are configured, but NSX-IDPS engine is down on DPU {dpu_id}. "
When event resolved: "NSX IDPS is in one of the cases below on DPU {dpu_id}. 1. NSX IDPS is deactivated via NSX policy. 2. NSX IDPS engine is activated, NSX-IDPS engine and vdpi are up, and NSX IDPS has been activated and IDPS rules are configured via NSX Policy. "

1. Check /var/log/nsx-idps/nsx-idps.log and /var/log/nsx-syslog.log to see if there are errors reported.
2. Invoke the NSX CLI command get ids engine status to check if NSX Distributed IDPS is in the deactivated state. If so, invoke /etc/init.d/nsx-idps start to start the service.
3. Invoke /etc/init.d/nsx-vdpi status to check if nsx-vdpi is running. If not, invoke /etc/init.d/nsx-vdpi start to start the service.

4.0.0
IDPS Engine CPU Oversubscription High Medium esx

CPU utilization for distributed IDPS engine is high.
When event detected: "CPU utilization for the distributed IDPS engine is at or above the high threshold value of {system_usage_threshold}%. "
When event resolved: "CPU utilization for the distributed IDPS engine is below the high threshold value of {system_usage_threshold}%. "

Review the reason for oversubscription. Consider moving certain applications to a different host.

4.0.0
IDPS Engine CPU Oversubscription Very High High esx

CPU utilization for distributed IDPS engine is very high.
When event detected: "CPU utilization for the distributed IDPS engine is at or above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "CPU utilization for the distributed IDPS engine is below the very high threshold value of {system_usage_threshold}%. "

Review the reason for oversubscription. Consider moving certain applications to a different host.

4.0.0
IDPS Engine Network Oversubscription High Medium esx

Network utilization for distributed IDPS engine is high.
When event detected: "Network utilization for the distributed IDPS engine is at or above the high threshold value of {system_usage_threshold}%. "
When event resolved: "Network utilization for the distributed IDPS engine is below the high threshold value of {system_usage_threshold}%. "

Review the reason for oversubscription. Review the IDPS rules to reduce the amount of traffic subject to the IDPS service.

4.0.0
IDPS Engine Network Oversubscription Very High High esx

Network utilization for distributed IDPS engine is very high.
When event detected: "Network utilization for the distributed IDPS engine is at or above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "Network utilization for the distributed IDPS engine is below the very high threshold value of {system_usage_threshold}%. "

Review the reason for oversubscription. Review the IDPS rules to reduce the amount of traffic subject to the IDPS service.

4.0.0
IDPS Engine Dropped Traffic CPU Oversubscribed Critical esx

Distributed IDPS Engine Dropped Traffic due to CPU Oversubscription.
When event detected: "The IDPS engine has insufficient CPU resources and cannot keep pace with the incoming traffic, resulting in the excess traffic being dropped. For more details, log in to the ESX host, issue the command nsxcli -c get dpi stats, and look at the oversubscription stats. "
When event resolved: "The distributed IDPS engine has adequate CPU resources and is not dropping any traffic. "

Review the reason for oversubscription. Consider moving certain applications to a different host.

4.0.0
IDPS Engine Dropped Traffic Network Oversubscribed Critical esx

Distributed IDPS Engine Dropped Traffic due to Network Oversubscription.
When event detected: "The IDPS engine cannot keep pace with the rate of incoming traffic, resulting in the excess traffic being dropped. For more details, log in to the ESX host, issue the command nsxcli -c get dpi stats, and look at the oversubscription stats. "
When event resolved: "The distributed IDPS engine is not dropping any traffic. "

Review the reason for oversubscription. Review the IDPS rules to reduce the amount of traffic subject to the IDPS service.

4.0.0
IDPS Engine Bypassed Traffic CPU Oversubscribed Critical esx

Distributed IDPS Engine Bypassed Traffic due to CPU Oversubscription.
When event detected: "The IDPS engine has insufficient CPU resources and cannot keep pace with the incoming traffic, resulting in the excess traffic being bypassed. For more details, log in to the ESX host, issue the command nsxcli -c get dpi stats, and look at the oversubscription stats. "
When event resolved: "The distributed IDPS engine has adequate CPU resources and is not bypassing any traffic. "

Review the reason for oversubscription. Consider moving certain applications to a different host.

4.0.0
IDPS Engine Bypassed Traffic Network Oversubscribed Critical esx

Distributed IDPS Engine Bypassed Traffic due to Network Oversubscription.
When event detected: "The IDPS engine cannot keep pace with the rate of incoming traffic, resulting in the excess traffic being bypassed. For more details, log in to the ESX host, issue the command nsxcli -c get dpi stats, and look at the oversubscription stats. "
When event resolved: "The distributed IDPS engine is not bypassing any traffic. "

Review the reason for oversubscription. Review the IDPS rules to reduce the amount of traffic subject to the IDPS service.

4.0.0
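All four oversubscription alarms point at the same diagnostic, nsxcli -c get dpi stats. Since the exact output format varies by release, a release-agnostic way to pull out the relevant counters is to filter for the oversubscription lines; the sample text below is hypothetical.

```shell
# Filter a DPI stats dump down to its oversubscription counters.
filter_oversub() {
  grep -i 'oversub'
}

# On the ESX host:
# nsxcli -c get dpi stats | filter_oversub
```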
Site Connection Loss Medium manager

Connection between IDPS Reporting Service and NSX+ failed.
When event detected: "Unable to establish connection between IDPS Reporting Service and NSX+. "
When event resolved: "Connection between IDPS Reporting Service and NSX+ has been restored. "

Restart the IDPS Reporting Service on all Manager nodes.

4.1.1

DNS Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Forwarder Down High edge, autonomous-edge, public-cloud-gateway

A DNS forwarder is down.
When event detected: "DNS forwarder {entity_id} is not running. This impacts the identified DNS forwarder, which is currently activated. "
When event resolved: "DNS forwarder {entity_id} is running again. "

1. Invoke the NSX CLI command get dns-forwarders status to verify if the DNS forwarder is in down state.
2. Check /var/log/syslog to see if there are errors reported.
3. Collect a support bundle and contact the NSX support team.

3.0.0
Forwarder Disabled (deprecated) Info edge, autonomous-edge, public-cloud-gateway

A DNS forwarder is deactivated.
When event detected: "DNS forwarder {entity_id} is deactivated. "
When event resolved: "DNS forwarder {entity_id} is activated. "

1. Invoke the NSX CLI command get dns-forwarders status to verify if the DNS forwarder is in the deactivated state.
2. Use the NSX Policy API or Manager API to activate the DNS forwarder, as it should not be in the deactivated state.

3.0.0
Forwarder Upstream Server Timeout High edge, autonomous-edge, public-cloud-gateway

One DNS forwarder upstream server has timed out.
When event detected: "DNS forwarder {intent_path}({dns_id}) did not receive a timely response from upstream server {dns_upstream_ip}. Compute instance connectivity to timed out FQDNs may be impacted. "
When event resolved: "DNS forwarder {intent_path}({dns_id}) upstream server {dns_upstream_ip} is normal. "

1. Invoke the NSX API GET /api/v1/dns/forwarders/{dns_id}/nslookup?address=<address>&server_ip={dns_upstream_ip}&source_ip=<source_ip>. This API request triggers a DNS lookup to the upstream server in the DNS forwarder's network namespace. <address> is the IP address or FQDN in the same domain as the upstream server. <source_ip> is an IP address in the upstream server's zone. If the API returns a connection timed out response, there is likely a network error or upstream server problem. Check why DNS lookups are not reaching the upstream server or why the upstream server is not returning a response. If the API response indicates the upstream server is answering, proceed to step 2.
2. Invoke the NSX API GET /api/v1/dns/forwarders/{dns_id}/nslookup?address=<address>. This API request triggers a DNS lookup to the DNS forwarder. If the API returns a valid response, the upstream server may have recovered and this alarm should get resolved within a few minutes. If the API returns a connection timed out response, proceed to step 3.
3. Invoke the NSX CLI command `get dns-forwarder {dns_id} live-debug server-ip {dns_upstream_ip}`. This command triggers live debugging on the upstream server and logs details and statistics showing why the DNS forwarder is not getting a response.

3.1.3
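The API calls in steps 1 and 2 differ only in their query parameters. The sketch below composes both URLs from the alarm's fields; the manager address, forwarder ID, and IP addresses are all placeholders to be replaced with the values reported by the alarm.

```shell
# Placeholders: substitute the {dns_id} and {dns_upstream_ip} from the alarm,
# plus your manager address and a source IP in the upstream server's zone.
MGR="nsx-mgr.example.com"
DNS_ID="dns-forwarder-01"
UPSTREAM="10.1.1.53"
SRC="10.1.1.10"

# Step 1: lookup through the upstream server, from the forwarder's namespace.
step1="https://$MGR/api/v1/dns/forwarders/$DNS_ID/nslookup?address=www.example.com&server_ip=$UPSTREAM&source_ip=$SRC"
# Step 2: lookup against the forwarder itself.
step2="https://$MGR/api/v1/dns/forwarders/$DNS_ID/nslookup?address=www.example.com"

# curl -k -u admin "$step1"
# curl -k -u admin "$step2"
```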

Edge Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Edge Node Settings Mismatch Critical manager

Edge node settings mismatch.
When event detected: "The Edge node {entity_id} settings configuration does not match the policy intent configuration. The Edge node configuration visible to the user in the UI or API is not the same as what is realized. The realized Edge node changes made by the user outside of NSX Manager are shown in the details of this alarm, and any edits in the UI or API will overwrite the realized configuration. Fields that differ for the Edge node are listed in runtime data {edge_node_setting_mismatch_reason} "
When event resolved: "Edge node {entity_id} node settings are consistent with policy intent now. "

Review the node settings of this Edge transport node {entity_id}. Follow one of the following actions to resolve the alarm:
1. Manually update the Edge transport node settings policy intent using the API: PUT https://<manager-ip>/api/v1/transport-nodes/<tn-id>.
2. Accept either the intent or the realized Edge node settings for this Edge transport node through the Edge transport node resolver.
3. Accept the Edge node settings configuration using the refresh API: POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=refresh_node_configuration&resource_type=EdgeNode.

3.2.0
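The refresh call in option 3, which the other Edge mismatch alarms in this table share, can be sketched as follows. The manager address and transport node ID are placeholders; substitute your own values.

```shell
# Accept the realized Edge node configuration via the refresh API.
# MGR and TN_ID are placeholders.
MGR="nsx-mgr.example.com"
TN_ID="edge-tn-01"
url="https://$MGR/api/v1/transport-nodes/$TN_ID?action=refresh_node_configuration&resource_type=EdgeNode"
# curl -k -u admin -X POST "$url"
```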
Edge VM vSphere Settings Mismatch Critical manager

Edge VM vSphere settings mismatch.
When event detected: "The Edge node {entity_id} configuration on vSphere does not match the policy intent configuration. The Edge node configuration visible to the user in the UI or API is not the same as what is realized. The realized Edge node changes made by the user outside of NSX Manager are shown in the details of this alarm, and any edits in the UI or API will overwrite the realized configuration. Fields that differ for the Edge node are listed in runtime data {edge_vm_vsphere_settings_mismatch_reason} "
When event resolved: "Edge node {entity_id} VM vSphere settings are consistent with policy intent now. "

Review the vSphere configuration of this Edge transport node {entity_id}. Follow one of the following actions to resolve the alarm:
1. Accept either the intent or the vSphere realized Edge node configuration for this Edge transport node through the Edge transport node resolver.
2. Accept the vSphere realized Edge node configuration using the refresh API: POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=refresh_node_configuration&resource_type=EdgeNode.

3.2.0
Edge Node Settings And vSphere Settings Are Changed Critical manager

Edge node settings and vSphere settings are changed.
When event detected: "The Edge node {entity_id} settings and vSphere configuration have changed and do not match the policy intent configuration. The Edge node configuration visible to the user in the UI or API is not the same as what is realized. The realized Edge node changes made by the user outside of NSX Manager are shown in the details of this alarm, and any edits in the UI or API will overwrite the realized configuration. Fields that differ for the Edge node settings and vSphere configuration are listed in runtime data {edge_node_settings_and_vsphere_settings_mismatch_reason} "
When event resolved: "Edge node {entity_id} node settings and vSphere settings are consistent with policy intent now. "

Review the node settings and vSphere configuration of this Edge Transport Node {entity_id}. Follow one of the following actions to resolve the alarm:
1. Manually update the Edge Transport Node settings policy intent using the API: PUT https://<manager-ip>/api/v1/transport-nodes/<tn-id>.
2. Accept the intent, the vSphere-realized Edge node configuration, or the realized Edge node settings for this Edge Transport Node through the Edge Transport Node resolver.
3. Resolve the alarm by accepting the Edge node settings and vSphere-realized configuration using the refresh API: POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=refresh_node_configuration&resource_type=EdgeNode.

3.2.0
Edge vSphere Location Mismatch High manager

Edge vSphere Location Mismatch.
When event detected: "The Edge node {entity_id} has been moved using vMotion, and its configuration on vSphere no longer matches the policy intent configuration. The Edge node configuration visible to the user in the UI or API is not the same as what is realized. The realized Edge node changes made outside of NSX Manager are shown in the details of this alarm. Fields that differ for the Edge node are listed in runtime data {edge_vsphere_location_mismatch_reason} "
When event resolved: "Edge node {entity_id} node vSphere settings are consistent with policy intent now. "

Review the vSphere configuration of this Edge Transport Node {entity_id}. Follow one of the following actions to resolve the alarm:
1. Resolve the alarm by accepting the vSphere-realized Edge node configuration using the refresh API: POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=refresh_node_configuration&resource_type=EdgeNode.
2. If you want to return to the previous location, use the NSX Redeploy API: POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=redeploy. vMotion back to the original host is not supported.

3.2.0
Edge VM Present In NSX Inventory Not Present In vCenter Critical manager

Auto Edge VM is present in NSX inventory but not present in vCenter.
When event detected: "The VM {policy_edge_vm_name} with moref id {vm_moref_id} corresponding to the Edge Transport node {entity_id} vSphere placement parameters is found in NSX inventory but is not present in vCenter. Check if the VM has been removed in vCenter or is present with a different VM moref id. "
When event resolved: "Edge node {entity_id} with VM moref id {vm_moref_id} is present in both NSX inventory and vCenter. "

The managed object reference (moref) id of a VM has the form vm-number and is visible in the URL when you select the Edge VM in the vCenter UI, for example vm-12011 in https://<vc-url>/ui/app/vm;nav=h/urn:vmomi:VirtualMachine:vm-12011:164ff798-c4f1-495b-a0be-adfba337e5d2/summary. Find the VM {policy_edge_vm_name} with moref id {vm_moref_id} in vCenter for this Edge Transport Node {entity_id}.
1. If the Edge VM is present in vCenter with a different moref id, use the NSX add or update placement API with the JSON request payload properties vm_id and vm_deployment_config to update the new VM moref id and vSphere deployment parameters: POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=addOrUpdatePlacementReferences.
2. If the Edge VM with name {policy_edge_vm_name} is not present in vCenter, use the NSX Redeploy API to deploy a new VM for the Edge node: POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=redeploy.
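The add or update placement call can be sketched with curl. The manager address, node UUID, and moref id below are placeholders; the request body is a bare skeleton showing only the vm_id and vm_deployment_config properties named above — a real call must fill in the full vSphere deployment parameters (vCenter, compute, storage, and network ids) for your environment:

```shell
# Placeholder values -- substitute your NSX Manager address, node UUID, and the new VM moref id.
MANAGER="nsx-mgr.example.com"
TN_ID="11111111-2222-3333-4444-555555555555"
NEW_MOREF="vm-12011"

# Skeleton request body: vm_id carries the new moref id. The vm_deployment_config
# shown here is incomplete and must be fleshed out for your environment.
cat > placement.json <<EOF
{
  "vm_id": "${NEW_MOREF}",
  "vm_deployment_config": {
    "placement_type": "VsphereDeploymentConfig"
  }
}
EOF

PLACEMENT_URL="https://${MANAGER}/api/v1/transport-nodes/${TN_ID}?action=addOrUpdatePlacementReferences"
echo "POST ${PLACEMENT_URL}"
# curl -k -u admin -X POST -H 'Content-Type: application/json' -d @placement.json "${PLACEMENT_URL}"
```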

3.2.1
Edge VM Not Present In Both NSX Inventory And vCenter (deprecated) Critical manager

Auto Edge VM is not present in either NSX inventory or vCenter.
When event detected: "The VM {policy_edge_vm_name} with moref id {vm_moref_id} corresponding to the Edge Transport node {entity_id} vSphere placement parameters is not found in either NSX inventory or vCenter. The placement parameters in the vSphere configuration of this Edge Transport node {entity_id} refer to the VM with moref {vm_moref_id}. "
When event resolved: "Edge node {entity_id} with VM moref id {vm_moref_id} is present in both NSX inventory and vCenter. "

The managed object reference (moref) id of a VM has the form vm-number and is visible in the URL when you select the Edge VM in the vCenter UI, for example vm-12011 in https://<vc-url>/ui/app/vm;nav=h/urn:vmomi:VirtualMachine:vm-12011:164ff798-c4f1-495b-a0be-adfba337e5d2/summary. Find the VM {policy_edge_vm_name} with moref id {vm_moref_id} in vCenter for this Edge Transport Node {entity_id}, and check whether the VM has been deleted in vSphere or is present with a different moref id. Then follow one of the actions below to resolve the alarm:
1. If the VM is still present in vCenter, put the Edge Transport node in maintenance mode, then power off and delete the Edge VM in vCenter. Use the NSX Redeploy API to deploy a new VM for the Edge node. Data traffic for the Edge Transport node will be disrupted in the interim if the Edge VM is forwarding traffic.
2. If the VM is not present in vCenter, use the Redeploy API to deploy a new VM for the Edge node: POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>?action=redeploy.
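The redeploy call can be sketched with curl; the manager address and transport node UUID below are placeholders for your environment's values:

```shell
# Placeholder values -- substitute your NSX Manager address and Edge transport node UUID.
MANAGER="nsx-mgr.example.com"
TN_ID="11111111-2222-3333-4444-555555555555"

# Redeploy provisions a fresh Edge VM from the deployment configuration stored in NSX.
REDEPLOY_URL="https://${MANAGER}/api/v1/transport-nodes/${TN_ID}?action=redeploy"
echo "POST ${REDEPLOY_URL}"
# curl -k -u admin -X POST "${REDEPLOY_URL}"
```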

3.2.1
Failed To Delete The Old VM In vCenter During Redeploy Critical manager

Power off and delete operation failed for old Edge VM in vCenter during Redeploy.
When event detected: "Failed to power off and delete the Edge node {entity_id} VM with moref id {vm_moref_id} in vCenter during Redeploy operation. A new Edge VM with moref id {new_vm_moref_id} has been deployed. Both old and new VMs for this Edge are functional at the same time and may result in IP conflicts and networking issues. "
When event resolved: "Edge node {entity_id} with stale VM moref id {vm_moref_id} is no longer found in either NSX inventory or vCenter. The newly deployed VM with moref id {new_vm_moref_id} is present in both NSX inventory and vCenter. "

The managed object reference (moref) id of a VM has the form vm-number and is visible in the URL when you select the Edge VM in the vCenter UI, for example vm-12011 in https://<vc-url>/ui/app/vm;nav=h/urn:vmomi:VirtualMachine:vm-12011:164ff798-c4f1-495b-a0be-adfba337e5d2/summary. Find the old Edge VM {policy_edge_vm_name} with moref id {vm_moref_id} in vCenter for this Edge Transport Node {entity_id}, then power it off and delete it.

3.2.1
Edge Hardware Version Mismatch Medium manager

Edge node has hardware version mismatch.
When event detected: "The Edge node {transport_node_name} in Edge cluster {edge_cluster_name} has a hardware version {edge_tn_hw_version}, which is less than the highest hardware version {edge_cluster_highest_hw_version} in the Edge cluster. "
When event resolved: "The Edge node {transport_node_name} hardware version mismatch is resolved now. "

Follow the KB article to resolve the hardware version mismatch alarm for Edge node {transport_node_name}.

4.0.1
Stale Edge Node Entry Found Critical manager

Stale entries found for Edge Node.
When event detected: "The delete operation for Edge Node {transport_node_name} with UUID {entity_id} could not be completed successfully. A few stale entries may be present in the system. If the stale entry of this Edge Node is not deleted, it could lead to duplicate IPs getting assigned to newly deployed Edge Nodes and can impact the datapath. "
When event resolved: "All stale entries for the Edge node {entity_id} are cleared now. "

Follow the KB article to clear the stale entries for the Edge Node {transport_node_name} with UUID {entity_id}.

4.1.1
Uplink fp-eth Interface Mismatch During Replacement Critical manager

Uplinks to fp-eth interfaces mismatch.
When event detected: "The mapping of uplinks to fp-eth interfaces {old_fp_eth_list} for the Edge node {transport_node_name} with UUID {entity_id} is not present in the new bare metal Edge fp-eth interfaces {new_fp_eth_list}. "
When event resolved: "The mismatch between uplinks to fp-eth interfaces is resolved now. "

Update the mapping of uplinks to fp-eth interfaces {old_fp_eth_list} through the UI or the API (PUT https://<manager-ip>/api/v1/transport-nodes/<tn-id>) to match the new fp-eth interfaces {new_fp_eth_list}.
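When updating through the API, one approach is a GET-edit-PUT cycle against the transport node. The manager address and UUID below are placeholders; note that NSX object updates generally require the _revision field returned by the GET to be sent back unchanged in the PUT:

```shell
# Placeholder values -- substitute your NSX Manager address and Edge transport node UUID.
MANAGER="nsx-mgr.example.com"
TN_ID="11111111-2222-3333-4444-555555555555"
TN_URL="https://${MANAGER}/api/v1/transport-nodes/${TN_ID}"

echo "GET ${TN_URL}    # save the current config to tn.json"
echo "PUT ${TN_URL}    # push back the edited config"
# curl -k -u admin "${TN_URL}" -o tn.json
# ...edit tn.json so each uplink maps to one of the new fp-eth interfaces,
#    keeping the _revision field returned by the GET...
# curl -k -u admin -X PUT -H 'Content-Type: application/json' -d @tn.json "${TN_URL}"
```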

4.1.1

Edge Cluster Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Edge Cluster Member Relocate Failure Critical manager

Edge cluster member relocate failure alarm
When event detected: "The operation on Edge cluster {edge_cluster_id} to relocate all service contexts failed for Edge cluster member index {member_index_id} with Transport node ID {transport_node_id}. "
When event resolved: "The relocation failure for Edge node {transport_node_id} has been resolved now. "

Review the available capacity for the Edge cluster. If more capacity is required, scale your Edge cluster. Retry the relocate Edge cluster member operation.

4.0.0

Edge Health Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Edge CPU Usage Very High Critical edge, public-cloud-gateway

Edge node CPU usage is very high.
When event detected: "The CPU usage on Edge node {transport_node_name} ({transport_node_address}) has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage on Edge node {transport_node_name} ({transport_node_address}) has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. "

Review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload.

3.0.0
Edge CPU Usage High Medium edge, public-cloud-gateway

Edge node CPU usage is high.
When event detected: "The CPU usage on Edge node {transport_node_name} ({transport_node_address}) has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage on Edge node {transport_node_name} ({transport_node_address}) has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. "

Review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload.

3.0.0
Edge Memory Usage Very High Critical edge, public-cloud-gateway

Edge node memory usage is very high.
When event detected: "The memory usage on Edge node {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage on Edge node {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. "

Review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload.

3.0.0
Edge Memory Usage High Medium edge, public-cloud-gateway

Edge node memory usage is high.
When event detected: "The memory usage on Edge node {entity_id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage on Edge node {entity_id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. "

Review the configuration, running services and sizing of this Edge node. Consider adjusting the Edge appliance form factor size or rebalancing services to other Edge nodes for the applicable workload.

3.0.0
Edge Disk Usage Very High Critical edge, public-cloud-gateway

Edge node disk usage is very high.
When event detected: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. "

Examine the partition with high usage and see if there are any unexpected large files that can be removed.

3.0.0
Edge Disk Usage High Medium edge, public-cloud-gateway

Edge node disk usage is high.
When event detected: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage for the Edge node disk partition {disk_partition_name} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. "

Examine the partition with high usage and see if there are any unexpected large files that can be removed.

3.0.0
Edge Datapath CPU Very High Critical edge, autonomous-edge, public-cloud-gateway

Edge node datapath CPU usage is very high.
When event detected: "The datapath CPU usage on Edge node {entity_id} has reached {datapath_resource_usage}% which is at or above the very high threshold for at least two minutes. "
When event resolved: "The CPU usage on Edge node {entity_id} has reached below the very high threshold. "

Review the CPU statistics on the Edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core. Higher CPU usage is expected with higher packet rates. Consider increasing the Edge appliance form factor size and rebalancing services on this Edge node to other Edge nodes in the same cluster or other Edge clusters.

3.0.0
Edge Datapath CPU High Medium edge, autonomous-edge, public-cloud-gateway

Edge node datapath CPU usage is high.
When event detected: "The datapath CPU usage on Edge node {entity_id} has reached {datapath_resource_usage}% which is at or above the high threshold for at least two minutes. "
When event resolved: "The CPU usage on Edge node {entity_id} has reached below the high threshold. "

Review the CPU statistics on the Edge node by invoking the NSX CLI command get dataplane cpu stats to show packet rates per CPU core. Higher CPU usage is expected with higher packet rates. Consider increasing the Edge appliance form factor size and rebalancing services on this Edge node to other Edge nodes in the same cluster or other Edge clusters.

3.0.0
Edge Datapath Configuration Failure High edge, autonomous-edge, public-cloud-gateway

Edge node datapath configuration failed.
When event detected: "Failed to enable the datapath on the Edge node after three attempts. "
When event resolved: "Datapath on the Edge node has been successfully enabled. "

Ensure the Edge node's connectivity to the Manager node is healthy. From the Edge node's NSX CLI, invoke the command get services to check the health of services. If the dataplane service is stopped, invoke the command start service dataplane to start it.

3.0.0
Edge Datapath Cryptodrv Down Critical edge, autonomous-edge, public-cloud-gateway

Edge node crypto driver is down.
When event detected: "Edge node crypto driver {edge_crypto_drv_name} is down. "
When event resolved: "Edge node crypto driver {edge_crypto_drv_name} is up. "

Upgrade the Edge node as needed.

3.0.0
Edge Datapath Mempool High Medium edge, autonomous-edge, public-cloud-gateway

Edge node datapath mempool is high.
When event detected: "The datapath mempool usage for {mempool_name} on Edge node {entity_id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The datapath mempool usage for {mempool_name} on Edge node {entity_id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. "

Log in as the root user and invoke the command edge-appctl -t /var/run/vmware/edge/dpd.ctl mempool/show and edge-appctl -t /var/run/vmware/edge/dpd.ctl memory/show malloc_heap to check DPDK memory usage.

3.0.0
Edge Global ARP Table Usage High Medium edge, autonomous-edge, public-cloud-gateway

Edge node global ARP table usage is high.
When event detected: "Global ARP table usage on Edge node {entity_id} has reached {datapath_resource_usage}% which is above the high threshold for over two minutes. "
When event resolved: "Global ARP table usage on Edge node {entity_id} has reached below the high threshold. "

Log in as the root user and invoke the command edge-appctl -t /var/run/vmware/edge/dpd.ctl neigh/show and check if neigh cache usage is normal. If it is normal, invoke the command edge-appctl -t /var/run/vmware/edge/dpd.ctl neigh/set_param max_entries to increase the ARP table size.

3.0.0
Edge NIC Out Of Receive Buffer Medium edge, autonomous-edge, public-cloud-gateway

Edge node NIC is out of RX ring buffers temporarily.
When event detected: "Edge NIC {edge_nic_name} receive ring buffer has overflowed by {rx_ring_buffer_overflow_percentage}% on Edge node {entity_id}. The missed packet count is {rx_misses} and processed packet count is {rx_processed}. "
When event resolved: "Edge NIC {edge_nic_name} receive ring buffer usage on Edge node {entity_id} is no longer overflowing. "

Run the NSX CLI command get dataplane cpu stats on the edge node and check:
1. If CPU usage is high (> 90%), take multiple samples of the logical router interface stats using the command `get logical-router interface stats` and, if an IPSec tunnel is enabled in the topology, check the IPsec tunnel stats using the command `get ipsecvpn tunnel stats`. Analyze the stats to see whether the majority of the traffic consists of fragmented packets or IPsec packets. If so, this is expected behavior. If not, the datapath is probably busy with other operations. If this alarm lasts more than 2-3 minutes, contact VMware Support.
2. If CPU usage is not high (< 90%), check whether the RX PPS is high using the command get dataplane cpu stats (to confirm the traffic rate is increasing). Then increase the ring size by 1024 using the command set dataplane ring-size rx <ring-size>. NOTE - Repeatedly increasing the ring size in increments of 1024 can lead to performance issues. If the issue persists even after increasing the ring size, the Edge needs a larger form factor deployment to accommodate the traffic.
3. If the alarm keeps flapping (triggers and resolves very quickly), it is due to bursty traffic. In this case, check the RX PPS as described above; if it is not high while the alarm is active, contact VMware Support. If the PPS is high, bursty traffic is confirmed; consider suppressing the alarm. NOTE - There is no specific benchmark for what counts as a high PPS value; it depends on the infrastructure and the type of traffic. Compare values recorded while the alarm is inactive and while it is active.

3.0.0
Edge NIC Out Of Transmit Buffer Critical edge, autonomous-edge, public-cloud-gateway

Edge node NIC is out of TX ring buffers temporarily.
When event detected: "Edge NIC {edge_nic_name} transmit ring buffer has overflowed by {tx_ring_buffer_overflow_percentage}% on Edge node {entity_id}. The missed packet count is {tx_misses} and processed packet count is {tx_processed}. "
When event resolved: "Edge NIC {edge_nic_name} transmit ring buffer usage on Edge node {entity_id} is no longer overflowing. "

1. If the hypervisor is running many VMs alongside the Edge, the Edge VM might not get enough time to run, so the hypervisor might not retrieve the packets. In that case, consider migrating the Edge VM to a host with fewer VMs.
2. Increase the ring size by 1024 using the command `set dataplane ring-size tx <ring-size>`. If the issue persists even after increasing the ring size, contact VMware Support, as the ESX-side transmit ring buffer might be set to a lower value. If there is no issue on the ESX side, it indicates the Edge needs to be scaled to a larger form factor deployment to accommodate the traffic.
3. If the alarm keeps flapping (triggers and resolves very quickly), it is due to bursty traffic. In this case, check the TX PPS using the command get dataplane cpu stats; if it is not high while the alarm is active, contact VMware Support. If the PPS is high, bursty traffic is confirmed; consider suppressing the alarm. NOTE - There is no specific benchmark for what counts as a high PPS value; it depends on the infrastructure and the type of traffic. Compare values recorded while the alarm is inactive and while it is active.

3.0.0
Edge NIC Transmit Queue Overflow Critical edge, autonomous-edge, public-cloud-gateway

Edge node NIC transmit queue has overflowed temporarily.
When event detected: "Edge NIC {edge_nic_name} transmit queue {tx_queue_id} has overflowed by {tx_queue_overflow_percentage}% on Edge node {entity_id}. The missed packet count is {tx_misses} and processed packet count is {tx_processed}. "
When event resolved: "Edge NIC {edge_nic_name} transmit queue {tx_queue_id} on Edge node {entity_id} is no longer overflowing. "

1. If the hypervisor is running many VMs alongside the Edge, the Edge VM might not get enough time to run, so the hypervisor might not retrieve the packets. In that case, consider migrating the Edge VM to a host with fewer VMs.
2. Increase the ring size by 1024 using the command `set dataplane ring-size tx <ring-size>`. If the issue persists even after increasing the ring size, contact VMware Support, as the ESX-side transmit ring buffer might be set to a lower value. If there is no issue on the ESX side, it indicates the Edge needs to be scaled to a larger form factor deployment to accommodate the traffic.
3. If the alarm keeps flapping (triggers and resolves very quickly), it is due to bursty traffic. In this case, check the TX PPS using the command get dataplane cpu stats; if it is not high while the alarm is active, contact VMware Support. If the PPS is high, bursty traffic is confirmed; consider suppressing the alarm. NOTE - There is no specific benchmark for what counts as a high PPS value; it depends on the infrastructure and the type of traffic. Compare values recorded while the alarm is inactive and while it is active.

4.1.1
Edge NIC Link Status Down Critical edge, autonomous-edge, public-cloud-gateway

Edge node NIC link is down.
When event detected: "Edge node NIC {edge_nic_name} link is down. "
When event resolved: "Edge node NIC {edge_nic_name} link is up. "

On the Edge node confirm if the NIC link is physically down by invoking the NSX CLI command get interfaces. If it is down, verify the cable connection.

3.0.0
Storage Error Critical edge, autonomous-edge, public-cloud-gateway

Edge node disk is read-only.
When event detected: "The following disk partitions on the Edge node are in read-only mode: {disk_partition_name} "
When event resolved: "The following disk partitions on the Edge node have recovered from read-only mode: {disk_partition_name} "

Examine the read-only partition to see if reboot resolves the issue or the disk needs to be replaced. Contact GSS for more information.

3.0.1
Datapath Thread Deadlocked Critical edge, autonomous-edge, public-cloud-gateway

Edge node's datapath thread is in deadlock condition.
When event detected: "Edge node datapath thread {edge_thread_name} is deadlocked. "
When event resolved: "Edge node datapath thread {edge_thread_name} is free from deadlock. "

Restart the dataplane service by invoking the NSX CLI command restart service dataplane.

3.1.0
Edge Datapath NIC Throughput Very High Critical edge, autonomous-edge, public-cloud-gateway

Edge node datapath NIC throughput is very high.
When event detected: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is at or above the very high threshold value of {nic_throughput_threshold}%. "
When event resolved: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is below the very high threshold value of {nic_throughput_threshold}%. "

Examine the traffic throughput levels on the NIC and determine whether configuration changes are needed. The 'get dataplane throughput <seconds>' command can be used to monitor throughput.

3.2.0
Edge Datapath NIC Throughput High Medium edge, autonomous-edge, public-cloud-gateway

Edge node datapath NIC throughput is high.
When event detected: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is at or above the high threshold value of {nic_throughput_threshold}%. "
When event resolved: "The datapath NIC throughput for {edge_nic_name} on Edge node {entity_id} has reached {nic_throughput}% which is below the high threshold value of {nic_throughput_threshold}%. "

Examine the traffic throughput levels on the NIC and determine whether configuration changes are needed. The 'get dataplane throughput <seconds>' command can be used to monitor throughput.

3.2.0
Failure Domain Down Critical edge, public-cloud-gateway

All members of failure domain are down.
When event detected: "All members of failure domain {transport_node_id} are down. "
When event resolved: "All members of failure domain {transport_node_id} are reachable. "

1. On the Edge node identified by {transport_node_id}, check the connectivity to the management and control planes by invoking the NSX CLI commands get managers and get controllers.
2. Invoke the NSX CLI command get interface eth0 to check the management interface status.
3. Invoke the NSX CLI command get services to check the status of core services such as dataplane, local-controller, nestdb, and router.
4. Inspect /var/log/syslog to find the suspected error.
5. Reboot the Edge node.

3.2.0
Micro Flow Cache Hit Rate Low Medium edge, autonomous-edge, public-cloud-gateway

Micro Flow Cache hit rate decreases and Datapath CPU is high.
When event detected: "Micro Flow Cache hit rate on Edge node {entity_id} has decreased below the specified threshold of {flow_cache_threshold}% for core {core_id}, and the Datapath CPU usage has increased for the last 30 minutes. "
When event resolved: "Flow Cache hit rate is in the normal range. "

The flow cache hit rate has decreased for the last 30 minutes, which is an indication that Edge performance may be degraded. Traffic will continue to be forwarded, and you may not experience any issues. Check whether the datapath CPU utilization for Edge {entity_id} core {core_id} has been high for the last 30 minutes. The Edge will have a low flow cache hit rate when new flows are continuously being created, because the first packet of any new flow is used to set up the flow cache for fast-path processing. You may want to increase your Edge appliance size or increase the number of Edge nodes used for Active/Active Gateways.

3.2.2
Mega Flow Cache Hit Rate Low Medium edge, autonomous-edge, public-cloud-gateway

Mega Flow Cache hit rate decreases and Datapath CPU is high.
When event detected: "Mega Flow Cache hit rate on Edge node {entity_id} has decreased below the specified threshold of {flow_cache_threshold}% for core {core_id}, and the Datapath CPU usage has increased for the last 30 minutes. "
When event resolved: "Flow Cache hit rate is in the normal range. "

The flow cache hit rate has decreased for the last 30 minutes, which is an indication that Edge performance may be degraded. Traffic will continue to be forwarded, and you may not experience any issues. Check whether the datapath CPU utilization for Edge {entity_id} core {core_id} has been high for the last 30 minutes. The Edge will have a low flow cache hit rate when new flows are continuously being created, because the first packet of any new flow is used to set up the flow cache for fast-path processing. You may want to increase your Edge appliance size or increase the number of Edge nodes used for Active/Active Gateways.

3.2.2
Flow Cache Deactivated Critical edge, autonomous-edge, public-cloud-gateway

Flow cache deactivated.
When event detected: "Flow cache on Edge Transport Node {transport_node_name} with UUID {entity_id} is deactivated. "
When event resolved: "Flow cache has now been activated on Edge node {transport_node_name}. "

Make sure the flow cache for the Edge Transport Node {transport_node_name} with UUID {entity_id} is activated. Deactivating the flow cache causes traffic to be forwarded through the CPU.

4.1.1

Endpoint Protection Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
EAM Status Down Critical manager

ESX Agent Manager (EAM) service on a compute manager is down.
When event detected: "ESX Agent Manager (EAM) service on compute manager {entity_id} is down. "
When event resolved: "ESX Agent Manager (EAM) service on compute manager {entity_id} is either up or compute manager {entity_id} has been removed. "

Start the ESX Agent Manager (EAM) service. SSH into vCenter and invoke the command service vmware-eam start.

3.0.0
Partner Channel Down Critical esx

Host module and Partner SVM connection is down.
When event detected: "The connection between host module and Partner SVM {entity_id} is down. "
When event resolved: "The connection between host module and Partner SVM {entity_id} is up. "

Refer to https://kb.vmware.com/s/article/85844 and make sure that Partner SVM {entity_id} is re-connected to the host module. You can also run the 'NxgiPlatform' Runbook on this particular Transport node for help in troubleshooting.

3.0.0

Federation Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Rtep BGP Down High edge, autonomous-edge, public-cloud-gateway

RTEP BGP neighbor down.
When event detected: "RTEP (Remote Tunnel Endpoint) BGP session from source IP {bgp_source_ip} to remote location {remote_site_name} neighbor IP {bgp_neighbor_ip} is down. Reason: {failure_reason}. "
When event resolved: "RTEP (Remote Tunnel Endpoint) BGP session from source IP {bgp_source_ip} to remote location {remote_site_name} neighbor IP {bgp_neighbor_ip} is established. "

1. Invoke the NSX CLI command get logical-routers on the affected edge node.
2. Switch to REMOTE_TUNNEL_VRF context.
3. Invoke the NSX CLI command get bgp neighbor summary to check the BGP neighbor status.
4. Alternatively, invoke the NSX API GET /api/v1/transport-nodes/<transport-node-id>/inter-site/bgp/summary to get the BGP neighbor status.
5. Invoke the NSX CLI command get interfaces and check if the correct RTEP IP address is assigned to the interface with name remote-tunnel-endpoint.
6. Check if the ping is working successfully between assigned RTEP IP address ({bgp_source_ip}) and the remote location {remote_site_name} neighbor IP {bgp_neighbor_ip}.
7. Check /var/log/syslog for any errors related to BGP.
8. Invoke the NSX API GET or PUT /api/v1/transport-nodes/<transport-node-id> to get or update the remote_tunnel_endpoint configuration on the Edge node. This updates the RTEP IP assigned to the affected Edge node. If the failure reason indicates the Edge is not ready, check why the Edge node is not in a good state.
1. Invoke the NSX CLI command get edge-cluster status to check the reason why the Edge node might be down.
2. Invoke the NSX CLI commands get bfd-config and get bfd-sessions to check if BFD is running well.
3. Check any Edge health related alarms to get more information.
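The API check in step 4 can be sketched with curl; the manager address and transport node UUID below are placeholders for your environment's values:

```shell
# Placeholder values -- substitute your NSX Manager address and Edge transport node UUID.
MANAGER="nsx-mgr.example.com"
TN_ID="11111111-2222-3333-4444-555555555555"

# Per-neighbor RTEP BGP status for the Edge transport node.
BGP_URL="https://${MANAGER}/api/v1/transport-nodes/${TN_ID}/inter-site/bgp/summary"
echo "GET ${BGP_URL}"
# curl -k -u admin "${BGP_URL}"
```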

3.0.1
LM To LM Synchronization Warning Medium manager

Synchronization between remote locations failed for more than 3 minutes.
When event detected: "The synchronization between {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for more than 3 minutes. "
When event resolved: "Remote locations {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized. "

1. Invoke the NSX CLI command get site-replicator remote-sites to get connection state between the remote locations. If a remote location is connected but not synchronized, it is possible that the location is still in the process of leader resolution. In this case, wait for around 10 seconds and try invoking the CLI again to check for the state of the remote location. If a location is disconnected, try the next step.
2. Check the connectivity from Local Manager (LM) in location {site_name}({site_id}) to the LMs in location {remote_site_name}({remote_site_id}) via ping. If they are not pingable, check for flakiness in WAN connectivity. If there are no physical network connectivity issues, try the next step.
3. Check the /var/log/cloudnet/nsx-ccp.log file on the Manager nodes in the local cluster in location {site_name}({site_id}) that triggered the alarm to see if there are any cross-site communication errors. In addition, also look for errors being logged by the nsx-appl-proxy subcomponent within /var/log/syslog.

3.0.1
LM To LM Synchronization Error High manager

Synchronization between remote locations failed for more than 15 minutes.
When event detected: "The synchronization between {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for more than 15 minutes. "
When event resolved: "Remote sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized. "

1. Invoke the NSX CLI command get site-replicator remote-sites to get the connection state between the remote locations. If a remote location is connected but not synchronized, it may still be in the process of leader resolution. In this case, wait for around 10 seconds and invoke the CLI again to check the state of the remote location. If a location is disconnected, try the next step.
2. Check the connectivity from Local Manager (LM) in location {site_name}({site_id}) to the LMs in location {remote_site_name}({remote_site_id}) via ping. If they are not pingable, check for flakiness in WAN connectivity. If there are no physical network connectivity issues, try the next step.
3. Check the /var/log/cloudnet/nsx-ccp.log file on the Manager nodes in the local cluster in location {site_name}({site_id}) that triggered the alarm to see if there are any cross-site communication errors. In addition, also look for errors being logged by the nsx-appl-proxy subcomponent within /var/log/syslog.

3.0.1
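The log scan in step 3 can be sketched as a small filter. The default keyword pair here is illustrative, not the exact strings nsx-ccp logs; adjust it to what the log actually emits:

```python
def cross_site_errors(log_lines, keywords=("cross-site", "error")):
    """Return the log lines that mention all given keywords, case-insensitively.

    keywords defaults are illustrative placeholders, not nsx-ccp's schema.
    """
    hits = []
    for line in log_lines:
        low = line.lower()
        if all(k in low for k in keywords):
            hits.append(line)
    return hits
```

Feed it the lines of /var/log/cloudnet/nsx-ccp.log (or /var/log/syslog for the nsx-appl-proxy subcomponent) and inspect the matches around the alarm trigger time.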
Rtep Connectivity Lost High manager

RTEP location connectivity lost.
When event detected: "Edge node {transport_node_name} lost RTEP (Remote Tunnel Endpoint) connectivity with remote location {remote_site_name}. "
When event resolved: "Edge node {transport_node_name} has restored RTEP (Remote Tunnel Endpoint) connectivity with remote location {remote_site_name}. "

1. Invoke the NSX CLI command get logical-routers on the affected edge node {transport_node_name}.
2. Switch to REMOTE_TUNNEL_VRF context.
3. Invoke the NSX CLI command get bgp neighbor summary to check the BGP neighbor status.
4. Alternatively, invoke the NSX API GET /api/v1/transport-nodes/<transport-node-id>/inter-site/bgp/summary to get the BGP neighbor status.
5. Invoke the NSX CLI command get interfaces and check if the correct RTEP IP address is assigned to the interface with name remote-tunnel-endpoint.
6. Check that ping works between the assigned RTEP IP address and the RTEP IP addresses on the remote location {remote_site_name}.
7. Check /var/log/syslog for any errors related to BGP.
8. Invoke the NSX API GET or PUT /api/v1/transport-nodes/<transport-node-id> to get or update the remote_tunnel_endpoint configuration on the edge node. This updates the RTEP IP assigned to the affected edge node {transport_node_name}.

3.0.2
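The API check in step 4 can be scripted. A minimal sketch, assuming an NSX Manager reachable as nsx_host with a pre-built Authorization header value (both hypothetical placeholders; the endpoint path is as quoted above):

```python
import json
import urllib.request

def bgp_summary_url(nsx_host: str, transport_node_id: str) -> str:
    """Build the inter-site BGP summary URL for an edge transport node."""
    return (f"https://{nsx_host}/api/v1/transport-nodes/"
            f"{transport_node_id}/inter-site/bgp/summary")

def fetch_bgp_summary(nsx_host: str, transport_node_id: str, auth_header: str):
    """Fetch the BGP neighbor summary.

    auth_header is a hypothetical pre-built 'Basic ...' Authorization value.
    """
    req = urllib.request.Request(
        bgp_summary_url(nsx_host, transport_node_id),
        headers={"Authorization": auth_header})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

In the returned summary, any neighbor whose connection state is not established points at the same RTEP connectivity problem the CLI steps diagnose.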
GM To GM Split Brain Critical global-manager

Multiple Global Manager nodes are active at the same time.
When event detected: "Multiple Global Manager nodes are active: {active_global_managers}. Only one Global Manager node must be active at any time. "
When event resolved: "Global Manager node {active_global_manager} is the only active Global Manager node now. "

Configure only one Global Manager node as active and all other Global Manager nodes as standby.

3.1.0
GM To GM Latency Warning Medium global-manager

Latency between Global Managers is higher than expected for more than 2 minutes.
When event detected: "Latency is higher than expected between Global Managers {from_gm_path} and {to_gm_path}. "
When event resolved: "Latency is below expected levels between Global Managers {from_gm_path} and {to_gm_path}. "

Check the connectivity from Global Manager {from_gm_path}({site_id}) to the Global Manager {to_gm_path}({remote_site_id}) via ping. If they are not pingable, check for flakiness in WAN connectivity.

3.2.0
GM To GM Synchronization Warning Medium global-manager

Active Global Manager to Standby Global Manager cannot synchronize.
When event detected: "Active Global Manager {from_gm_path} to Standby Global Manager {to_gm_path} cannot synchronize. "
When event resolved: "Synchronization from active Global Manager {from_gm_path} to standby {to_gm_path} is healthy. "

Check the connectivity from Global Manager {from_gm_path}({site_id}) to the Global Manager {to_gm_path}({remote_site_id}) via ping.

3.2.0
GM To GM Synchronization Error High global-manager

Active Global Manager to Standby Global Manager cannot synchronize for more than 5 minutes.
When event detected: "Active Global Manager {from_gm_path} to Standby Global Manager {to_gm_path} cannot synchronize for more than 5 minutes. "
When event resolved: "Synchronization from active Global Manager {from_gm_path} to standby {to_gm_path} is healthy. "

Check the connectivity from Global Manager {from_gm_path}({site_id}) to the Global Manager {to_gm_path}({remote_site_id}) via ping.

3.2.0
GM To LM Synchronization Warning Medium global-manager, manager

Data synchronization between Global Manager (GM) and Local Manager (LM) failed.
When event detected: "Data synchronization between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for the {flow_identifier}. Reason: {sync_issue_reason} "
When event resolved: "Sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized for {flow_identifier}. "

1. Check the network connectivity between remote site and local site via ping.
2. Ensure port TCP/1236 traffic is allowed between the local and remote sites.
3. Ensure the async-replicator service is running on both the local and remote sites. Invoke the GET /api/v1/node/services/async_replicator/status NSX API or the get service async_replicator NSX CLI command to determine whether the service is running. If it is not running, invoke the POST /api/v1/node/services/async_replicator?action=restart NSX API or the restart service async_replicator NSX CLI command to restart the service.
4. Check /var/log/async-replicator/ar.log to see if there are errors reported.

3.2.0
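Step 2 (verifying that TCP/1236 traffic is allowed between the sites) reduces to a TCP reachability test; a minimal sketch:

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run port_reachable(remote_lm_ip, 1236) from each side, where remote_lm_ip is a placeholder for the peer Local Manager's address; a False result with working ping suggests a firewall is blocking the port.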
GM To LM Synchronization Error High global-manager, manager

Data synchronization between Global Manager (GM) and Local Manager (LM) failed for an extended period.
When event detected: "Data synchronization between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) failed for the {flow_identifier} for an extended period. Reason: {sync_issue_reason}. "
When event resolved: "Sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) are now synchronized for {flow_identifier}. "

1. Check the network connectivity between remote site and local site via ping.
2. Ensure port TCP/1236 traffic is allowed between the local and remote sites.
3. Ensure the async-replicator service is running on both the local and remote sites. Invoke the GET /api/v1/node/services/async_replicator/status NSX API or the get service async_replicator NSX CLI command to determine whether the service is running. If it is not running, invoke the POST /api/v1/node/services/async_replicator?action=restart NSX API or the restart service async_replicator NSX CLI command to restart the service.
4. Check /var/log/async-replicator/ar.log to see if there are errors reported.
5. Collect a support bundle and contact the NSX support team.

3.2.0
Queue Occupancy Threshold Exceeded Medium manager, global-manager

Queue occupancy size threshold exceeded warning.
When event detected: "Queue ({queue_name}) used for syncing data between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached size {queue_size} which is at or above the maximum threshold of {queue_size_threshold}%. "
When event resolved: "Queue ({queue_name}) used for syncing data between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached size {queue_size} which is below the maximum threshold of {queue_size_threshold}%. "

Queue size can exceed the threshold due to a communication issue with the remote site or an overloaded system. Check system performance and /var/log/async-replicator/ar.log for any reported errors.

3.2.0
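The alarm condition described above is a percentage comparison of queue occupancy against its threshold; a sketch of that evaluation (parameter names are illustrative):

```python
def queue_over_threshold(queue_size: int, max_size: int,
                         threshold_pct: float) -> bool:
    """True when queue occupancy is at or above threshold_pct percent
    of the queue's maximum size."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return (queue_size / max_size) * 100 >= threshold_pct
```

This mirrors the alarm's "at or above {queue_size_threshold}%" wording: the alarm raises while the condition holds and resolves once occupancy drops below the threshold.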
GM To LM Latency Warning Medium global-manager, manager

Latency between Global Manager and Local Manager is higher than expected for more than 2 minutes.
When event detected: "Latency between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached {latency_value} which is above the threshold value of {latency_threshold}. "
When event resolved: "Latency between sites {site_name}({site_id}) and {remote_site_name}({remote_site_id}) has reached {latency_value} which is below the threshold value of {latency_threshold}. "

1. Check the network connectivity between remote site and local site via ping.
2. Ensure port TCP/1236 traffic is allowed between the local and remote sites.
3. Check /var/log/async-replicator/ar.log to see if there are errors reported.

3.2.0
LM Restore While Config Import In Progress High global-manager

Local Manager is restored while config import is in progress on Global Manager.
When event detected: "Config import from site {site_name}({site_id}) is in progress. However, site {site_name}({site_id}) was restored from backup by the administrator, leaving it in an inconsistent state. "
When event resolved: "Config inconsistency at site {site_name}({site_id}) is resolved. "

1. Log in to NSX Global Manager appliance CLI.
2. Switch to root.
3. Invoke the NSX API DELETE http://localhost:64440/gm/api/v1/infra/sites/<site-name>/onboarding/status in local mode. This deletes the site on-boarding status on the Global Manager.
4. Re-initiate config on-boarding.

3.2.0
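Step 3 can be issued from a script on the appliance rather than typed by hand. A minimal sketch of the local-mode call, using the URL quoted above (the site name in the example is a placeholder, and the call must run on the Global Manager appliance as root):

```python
import urllib.request

def onboarding_status_url(site_name: str) -> str:
    """Build the local-mode URL for a site's on-boarding status on the
    Global Manager (port 64440 as given in the recommended action)."""
    return (f"http://localhost:64440/gm/api/v1/infra/sites/"
            f"{site_name}/onboarding/status")

def delete_onboarding_status(site_name: str) -> int:
    """Issue the DELETE from step 3; returns the HTTP status code."""
    req = urllib.request.Request(onboarding_status_url(site_name),
                                 method="DELETE")
    with urllib.request.urlopen(req) as resp:
        return resp.status
```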

Gateway Firewall Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
IP Flow Count High Medium edge, public-cloud-gateway

The gateway firewall flow table usage for IP traffic is high. New flows will be dropped by Gateway firewall when usage reaches the maximum limit.
When event detected: "Gateway firewall flow table usage for IP on logical router {entity_id} has reached {firewall_ip_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "
When event resolved: "Gateway firewall flow table usage for IP flows on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%. "

Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for IP flows. Verify that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node.

3.1.3
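The flow-count alarms in this table share the same threshold logic; a sketch of the evaluation against per-flow-type usage percentages (the dictionary keys are illustrative, not the exact CLI schema of the stats command):

```python
def flow_usage_alarms(stats: dict, threshold_pct: float) -> dict:
    """Given per-flow-type usage percentages (keys such as 'ip', 'udp',
    'icmp', 'tcp_half_open' are illustrative names), return the flow
    types whose usage is at or above the threshold."""
    return {k: v for k, v in stats.items() if v >= threshold_pct}
```

Any flow type returned here corresponds to an alarm of this family; an empty result corresponds to the resolved state.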
IP Flow Count Exceeded Critical edge, public-cloud-gateway

The gateway firewall flow table for IP traffic has exceeded the set threshold. New flows will be dropped by Gateway firewall when usage reaches the maximum limit.
When event detected: "Gateway firewall flow table usage for IP traffic on logical router {entity_id} has reached {firewall_ip_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "
When event resolved: "Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%. "

Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for IP flows. Verify that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node.

3.1.3
UDP Flow Count High Medium edge, public-cloud-gateway

The gateway firewall flow table usage for UDP traffic is high. New flows will be dropped by Gateway firewall when usage reaches the maximum limit.
When event detected: "Gateway firewall flow table usage for UDP on logical router {entity_id} has reached {firewall_udp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "
When event resolved: "Gateway firewall flow table usage for UDP on logical router {entity_id} has reached below the high threshold. "

Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for UDP flows. Verify that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node.

3.1.3
UDP Flow Count Exceeded Critical edge, public-cloud-gateway

The gateway firewall flow table for UDP traffic has exceeded the set threshold. New flows will be dropped by Gateway firewall when usage reaches the maximum limit.
When event detected: "Gateway firewall flow table usage for UDP traffic on logical router {entity_id} has reached {firewall_udp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "
When event resolved: "Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold. "

Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for UDP flows. Verify that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node.

3.1.3
ICMP Flow Count High Medium edge, public-cloud-gateway

The gateway firewall flow table usage for ICMP traffic is high. New flows will be dropped by Gateway firewall when usage reaches the maximum limit.
When event detected: "Gateway firewall flow table usage for ICMP on logical router {entity_id} has reached {firewall_icmp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "
When event resolved: "Gateway firewall flow table usage for ICMP on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%. "

Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for ICMP flows. Verify that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node.

3.1.3
ICMP Flow Count Exceeded Critical edge, public-cloud-gateway

The gateway firewall flow table for ICMP traffic has exceeded the set threshold. New flows will be dropped by Gateway firewall when usage reaches the maximum limit.
When event detected: "Gateway firewall flow table usage for ICMP traffic on logical router {entity_id} has reached {firewall_icmp_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "
When event resolved: "Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%. "

Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for ICMP flows. Verify that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node.

3.1.3
Tcp Half Open Flow Count High Medium edge, public-cloud-gateway

The gateway firewall flow table usage for TCP half-open traffic is high. New flows will be dropped by Gateway firewall when usage reaches the maximum limit.
When event detected: "Gateway firewall flow table usage for TCP on logical router {entity_id} has reached {firewall_halfopen_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "
When event resolved: "Gateway firewall flow table usage for TCP half-open on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%. "

Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for TCP half-open flows. Verify that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node.

3.1.3
Tcp Half Open Flow Count Exceeded Critical edge, public-cloud-gateway

The gateway firewall flow table for TCP half-open traffic has exceeded the set threshold. New flows will be dropped by Gateway firewall when usage reaches the maximum limit.
When event detected: "Gateway firewall flow table usage for TCP half-open traffic on logical router {entity_id} has reached {firewall_halfopen_flow_usage}% which is at or above the high threshold value of {system_usage_threshold}%. New flows will be dropped by Gateway firewall when usage reaches the maximum limit. "
When event resolved: "Gateway firewall flow table usage on logical router {entity_id} has reached below the high threshold value of {system_usage_threshold}%. "

Log in as the admin user on the Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> interface stats | json using the right interface UUID, and check the flow table usage for TCP half-open flows. Verify that the traffic flowing through the gateway is not a DoS attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider increasing the alarm threshold or routing new traffic to another Edge node.

3.1.3

Groups Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Group Size Limit Exceeded Medium manager

The total number of translated group elements has exceeded the maximum limit.
When event detected: "Group {group_id} has at least {group_size} translated elements, which is at or above the maximum limit of {group_max_number_limit}. This can result in long processing times and can lead to timeouts and outages. The current count for each element type is as follows. IP sets:{ip_count}, MAC sets:{mac_count}, VIFS:{vif_count}, Logical switch ports:{lsp_count}, Logical router ports:{lrp_count}, AdGroups:{sid_count}. "
When event resolved: "Total number of elements in group {group_id} is below the maximum limit of {group_max_number_limit}. "

1. Consider adjusting the group elements in oversized group {group_id}.
2. Consider splitting oversized group {group_id} into multiple smaller groups and distributing its members among them.

4.1.0
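The translated-element total that trips this alarm is the sum of the per-type counts listed in the alarm message; a sketch (the type keys are illustrative labels for the counts in the message):

```python
def group_over_limit(counts: dict, max_limit: int):
    """counts maps an element type (e.g. 'ip_sets', 'mac_sets', 'vifs',
    'lsp', 'lrp', 'ad_groups') to its translated-element count.
    Returns (total, over_limit)."""
    total = sum(counts.values())
    return total, total >= max_limit
```

Comparing the total against {group_max_number_limit} shows how far the group is over the limit and how much splitting is needed.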
Active Directory Groups Modified Medium manager

Active Directory Groups are modified on AD server.
When event detected: "Group {policy_group_name} contains an Active Directory Group member {old_base_distinguished_name} that is renamed on the Active Directory server with {new_base_distinguished_name}. Make sure the group has a valid Identity Group Member. "
When event resolved: "Group {policy_group_name} is updated with valid Active Directory Group member. "

In the NSX UI, navigate to the Inventory | Groups tab to update the group definition of the applicable group with the new base distinguished name. Make sure the group has valid identity group members.

4.1.2

High Availability Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Tier0 Gateway Failover High edge, autonomous-edge, public-cloud-gateway

A tier0 gateway has failed over.
When event detected: "The tier0 gateway {entity_id} failed over from {previous_gateway_state} to {current_gateway_state}, service-router {service_router_id}. Reason: {failover_reason}. "
When event resolved: "The tier0 gateway {entity_id} is now up. "

Invoke the NSX CLI command get logical-router <service_router_id> to identify the tier0 service-router VRF ID. Switch to the VRF context by invoking vrf <vrf-id>, then invoke get high-availability status to determine which service is down.

3.0.0
Tier1 Gateway Failover High edge, autonomous-edge, public-cloud-gateway

A tier1 gateway has failed over.
When event detected: "The tier1 gateway {entity_id} failed over from {previous_gateway_state} to {current_gateway_state}, service-router {service_router_id}. Reason: {failover_reason} "
When event resolved: "The tier1 gateway {entity_id} is now up. "

Invoke the NSX CLI command get logical-router <service_router_id> to identify the tier1 service-router VRF ID. Switch to the VRF context by invoking vrf <vrf-id>, then invoke get high-availability status to determine which service is down.

3.0.0
Tier0 Service Group Failover High edge, public-cloud-gateway

Service-group does not have an active instance.
When event detected: "Service-group cluster {entity_id} currently does not have an active instance. It is in state {ha_state} (where 0 is down, 1 is standby and 2 is active) on Edge node {transport_node_id} and in state {ha_state2} on Edge node {transport_node_id2}. Reason: {failover_reason}. "
When event resolved: "Tier0 service-group cluster {entity_id} now has one active instance on Edge node {transport_node_id}. "

Invoke the NSX CLI command get logical-router <service_router_id> service_group to check all service-groups configured under the given service-router. Examine the output for the reason the service-group left the active state.

4.0.1
Tier1 Service Group Failover High edge, public-cloud-gateway

Service-group does not have an active instance.
When event detected: "Service-group cluster {entity_id} currently does not have an active instance. It is in state {ha_state} (where 0 is down, 1 is standby and 2 is active) on Edge node {transport_node_id} and in state {ha_state2} on Edge node {transport_node_id2}. Reason: {failover_reason}. "
When event resolved: "Tier1 service-group cluster {entity_id} now has one active instance on Edge node {transport_node_id}. "

Invoke the NSX CLI command get logical-router <service_router_id> service_group to check all service-groups configured under the given service-router. Examine the output for the reason the service-group left the active state.

4.0.1
Tier0 Service Group Reduced Redundancy Medium edge, public-cloud-gateway

A standby instance in a service-group has failed.
When event detected: "Service-group cluster {entity_id} attached to Tier0 service-router {service_router_id} on Edge node {transport_node_id} has failed. As a result, the service-group cluster currently does not have a standby instance. Reason: {failover_reason} "
When event resolved: "Service-group cluster {entity_id} is in state {ha_state} (where 0 is down, 1 is standby and 2 is active) on Edge node {transport_node_id} and state {ha_state2} on Edge node {transport_node_id2}. "

Invoke the NSX CLI command get logical-router <service_router_id> service_group to check all service-groups configured under the given service-router. Examine the output for the failure reason of the previously standby service-group.

4.0.1
Tier1 Service Group Reduced Redundancy Medium edge, public-cloud-gateway

A standby instance in a service-group has failed.
When event detected: "Service-group cluster {entity_id} attached to Tier1 service-router {service_router_id} on Edge node {transport_node_id} has failed. As a result, the service-group cluster currently does not have a standby instance. Reason: {failover_reason} "
When event resolved: "Service-group cluster {entity_id} is in state {ha_state} (where 0 is down, 1 is standby and 2 is active) on Edge node {transport_node_id} and state {ha_state2} on Edge node {transport_node_id2}. "

Invoke the NSX CLI command get logical-router <service_router_id> service_group to check all service-groups configured under the given service-router. Examine the output for the failure reason of the previously standby service-group.

4.0.1

Identity Firewall Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Connectivity To LDAP Server Lost Critical manager

Connectivity to LDAP server is lost.
When event detected: "The connectivity to LDAP server {ldap_server} is lost. "
When event resolved: "The connectivity to LDAP server {ldap_server} is restored. "

Check the following:
1. The LDAP server is reachable from NSX nodes.
2. The LDAP server details are configured correctly in NSX.
3. The LDAP server is running correctly.
4. There are no firewalls blocking access between the LDAP server and NSX nodes. After the issue is fixed, use TEST CONNECTION in the NSX UI under Identity Firewall AD to test the connection.

3.1.0
Error In Delta Sync Critical manager

Errors occurred while performing delta sync.
When event detected: "Errors occurred while performing delta sync with {directory_domain}. "
When event resolved: "No errors occurred while performing delta sync with {directory_domain}. "

1. Check if there are any Connectivity To LDAP Server Lost alarms.
2. Find the error details in /var/log/syslog. Around the alarm trigger time, search for the text: Error happened when synchronize LDAP objects.
3. Check with the AD administrator whether any recent AD changes may have caused the errors.
4. If the errors persist, collect the technical support bundle and contact VMware technical support.

3.1.0

IDS IPS Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
IDPS Signature Bundle Download Failure Medium manager

Unable to download IDPS signature bundle from NTICS.
When event detected: "Unable to download IDPS signature bundle from NTICS. "
When event resolved: "IDPS signature bundle download from NTICS was successful. "

Check if there is internet connectivity from NSX Manager to NTICS.

4.1.1

Infrastructure Communication Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Edge Tunnels Down Critical edge, public-cloud-gateway

An Edge node's tunnel status is down.
When event detected: "The overall tunnel status of Edge node {entity_id} is down. "
When event resolved: "The tunnels of Edge node {entity_id} have been restored. "

Invoke the NSX CLI command get tunnel-ports to list all tunnel ports, then check each tunnel's stats by invoking the NSX CLI command get tunnel-port <UUID> stats to see whether there are any drops. Also check /var/log/syslog for tunnel-related errors.

3.0.0
GRE Tunnel Down Critical edge, autonomous-edge, public-cloud-gateway

GRE tunnel down.
When event detected: "GRE tunnel on Edge Transport Node {transport_node_name} with tunnel UUID {tunnel_uuid} is down. The traffic that is to be sent through the tunnel will be impacted. "
When event resolved: "GRE tunnel on Edge Transport Node {transport_node_name} with UUID {tunnel_uuid} is up. "

The GRE tunnel goes down when GRE keepalives are not received for the dead-multiplier number of intervals. Check the connectivity of the GRE endpoints.

4.1.2

Infrastructure Service Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Service Status Unknown On DPU Critical dpu

Service's status on DPU is abnormal.
When event detected: "The service {service_name} on DPU {dpu_id} has been unresponsive for 10 seconds. "
When event resolved: "The service {service_name} on DPU {dpu_id} is responsive again. "

Verify the {service_name} service on DPU {dpu_id} is still running by invoking /etc/init.d/{service_name} status. If the service is reported as running, it may need to be restarted, which can be done with /etc/init.d/{service_name} restart. Rerun the status command to verify the service is now running. If restarting the service does not resolve the issue, or if the issue reoccurs after a successful restart, contact VMware Support.

4.0.0
Service Status Unknown Critical esx, kvm, bms, edge, manager, public-cloud-gateway, global-manager

Service's status is abnormal.
When event detected: "The service {service_name} has been unresponsive for {heartbeat_threshold} seconds. "
When event resolved: "The service {service_name} is responsive again. "

Verify the {service_name} service is still running by invoking /etc/init.d/{service_name} status. If the service is reported as running, it may need to be restarted, which can be done with /etc/init.d/{service_name} restart. Rerun the status command to verify the service is now running. If the script /etc/init.d/{service_name} is unavailable, invoke systemctl status {service_name} and restart it with systemctl restart {service_name} with root privileges. If restarting the service does not resolve the issue, or if the issue reoccurs after a successful restart, contact VMware Support.

3.1.0
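The status-then-restart sequence above can be scripted. This sketch shells out to the init script named in the text and is illustrative only; it does not cover the systemctl fallback:

```python
import subprocess

def service_action(cmd: list) -> bool:
    """Run a status/restart command list; True when it exits 0."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

def ensure_running(service_name: str) -> bool:
    """Check the service and restart it once if the status check fails,
    mirroring the recommended action: status, restart, status again."""
    script = f"/etc/init.d/{service_name}"
    if service_action([script, "status"]):
        return True
    service_action([script, "restart"])
    return service_action([script, "status"])
```

If ensure_running still returns False, or the alarm reoccurs after a successful restart, that is the point at which the catalog tells you to contact VMware Support.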
Metrics Delivery Failure Critical esx, bms, edge, manager, public-cloud-gateway, global-manager

Failed to deliver metrics to the specified target.
When event detected: "Failed to deliver metrics from SHA to target {metrics_target_alias}({metrics_target_address}:{metrics_target_port}). "
When event resolved: "Metrics delivery to target {metrics_target_alias}({metrics_target_address}:{metrics_target_port}) recovered. "

Perform the following checks to rule out the possible causes of the failure.
For NAPP:
1. Check that the target address {metrics_target_address} and port {metrics_target_port} passed down to the connection are the expected target.
2. Check that the certificate for the secure connection is correct with grep 'nsx-sha' {log_file} | grep 'NAPP Profile' (the private key is shielded).
3. Check that target {metrics_target_address} is reachable.
4. Check whether an obvious transmission failure can be observed in SHA with grep 'Failed to send one msg' {log_file}.
For metrics mux (note: {metrics_target_address} is actually the metrics mux, which bridges the metrics to the metrics instance that is the real target behind the VDP ingress):
1. Check that the picked manager {metrics_target_address} is onboarded.
2. Check that the certificate for the secure connection is correct with grep 'nsx-sha' {log_file} | grep 'Metrics Mux Profile' (the private key is shielded).
3. Check the metrics agent status on the manager {metrics_target_address} with /etc/init.d/nsx-metrics-agents status.
4. Check whether an obvious transmission failure can be observed in SHA with grep 'Failed to send one msg' {log_file}.
Common checks:
1. Check that the ALLOW firewall rule is installed on the node with iptables -S OUTPUT | grep {metrics_target_port} (Edge/Manager) or localcli network firewall ruleset list | grep nsx-sha-tsdb (ESX).
2. Restart the SHA daemon with /etc/init.d/netopa restart (ESX) or /etc/init.d/nsx-sha restart (Edge/Manager) to see if that resolves the issue.

4.1.0
Edge Service Status Down (deprecated) Critical edge, autonomous-edge, public-cloud-gateway

Edge service is down for at least one minute.
When event detected: "The service {edge_service_name} is down for at least one minute. {service_down_reason} "
When event resolved: "The service {edge_service_name} is up. "

On the Edge node, verify the service hasn't exited due to an error by looking for core files in the /var/log/core directory. In addition, invoke the NSX CLI command get services to confirm whether the service is stopped. If so, invoke start service <service-name> to restart the service.

3.0.0
Edge Service Status Changed Medium edge, autonomous-edge, public-cloud-gateway

Edge service status has changed.
When event detected: "The service {edge_service_name} changed from {previous_service_state} to {current_service_state}. {service_down_reason} "
When event resolved: "The service {edge_service_name} changed from {previous_service_state} to {current_service_state}. "

On the Edge node, verify the service hasn't exited due to an error by looking for core files in the /var/log/core directory. In addition, invoke the NSX CLI command get services to confirm whether the service is stopped. If so, invoke start service <service-name> to restart the service.

3.0.0
Application Crashed Critical global-manager, autonomous-edge, bms, edge, esx, kvm, manager, public-cloud-gateway

Application has crashed and generated a core dump.
When event detected: "Application on NSX node {node_display_or_host_name} has crashed. The number of core files found is {core_dump_count}. Collect the Support Bundle including core dump files and contact VMware Support team. "
When event resolved: "All core dump files are withdrawn from system. "

Collect a Support Bundle for NSX node {node_display_or_host_name} using the NSX Manager UI or API. Note that core dumps can be set to either move or copy into the NSX Tech Support Bundle, which respectively removes or preserves the local copy on the node. A copy of the Support Bundle including the core dump files is essential for the VMware Support team to troubleshoot the issue, so it is recommended to save a recent copy of the Tech Support Bundle including core dump files before removing them from the system. Refer to the KB article for more details.
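The {core_dump_count} value reported by this alarm is simply the number of core files found on the node. A sketch of that count, assuming the /var/log/core directory named above and a 'core*' filename pattern (the pattern is an assumption for illustration):

```python
import glob
import os

def core_dump_count(core_dir="/var/log/core"):
    """Count core files under core_dir; the 'core*' pattern is an
    illustrative assumption, not a documented NSX naming rule."""
    return len(glob.glob(os.path.join(core_dir, "core*")))
```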

4.0.0
Application Crashed On DPU Critical dpu

Application has crashed and generated a core dump on dpu.
When event detected: "Application on DPU {dpu_id} has crashed. The number of core files found is {core_dump_count}. Collect the Support Bundle including core dump files and contact VMware Support team. "
When event resolved: "All core dump files are withdrawn from system. "

Collect a Support Bundle for DPU {dpu_id} using the NSX Manager UI or API. Note that core dumps can be set to either move or copy into the NSX Tech Support Bundle, which respectively removes or preserves the local copy on the node. A copy of the Support Bundle including the core dump files is essential for the VMware Support team to troubleshoot the issue, so it is recommended to save a recent copy of the Tech Support Bundle including core dump files before removing them from the system. Refer to the KB article for more details.

4.1.1
Compute Manager Lost Connectivity Critical manager, global-manager

Compute Manager connection status is down.
When event detected: "Connection status of Compute Manager {cm_name} having id {cm_id} is DOWN. "
When event resolved: "Connection status of Compute Manager {cm_name} having id {cm_id} is UP again. "

Check the errors present for Compute Manager {cm_name} having id {cm_id} and resolve the errors.

4.1.2

Intelligence Communication Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
TN Flow Exporter Disconnected (deprecated) High esx, kvm, bms

A Transport node is disconnected from its NSX Messaging Broker.
When event detected: "The flow exporter on Transport node {entity_id} is disconnected from its messaging broker {messaging_broker_info}. Data collection is affected. "
When event resolved: "The flow exporter on Transport node {entity_id} has reconnected to its messaging broker {messaging_broker_info}. "

Restart the messaging service if it is not running. Resolve the network connection failure between the Transport node flow exporter and its NSX messaging broker.

3.0.0

Intelligence Health Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
CPU Usage Very High (deprecated) Critical manager, intelligence

Intelligence node CPU usage is very high.
When event detected: "The CPU usage on Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage on Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%. "

Use the top command to check which processes consume the most CPU, and then check /var/log/syslog and those processes' local logs to see if there are any outstanding errors to be resolved.

3.0.0
CPU Usage High (deprecated) Medium manager, intelligence

Intelligence node CPU usage is high.
When event detected: "The CPU usage on Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage on Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%. "

Use the top command to check which processes consume the most CPU, and then check /var/log/syslog and those processes' local logs to see if there are any outstanding errors to be resolved.

3.0.0
Memory Usage Very High (deprecated) Critical manager, intelligence

Intelligence node memory usage is very high.
When event detected: "The memory usage on Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage on Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%. "

Use the top command to check which processes consume the most memory, and then check /var/log/syslog and those processes' local logs to see if there are any outstanding errors to be resolved.

3.0.0
Memory Usage High (deprecated) Medium manager, intelligence

Intelligence node memory usage is high.
When event detected: "The memory usage on Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage on Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%. "

Use the top command to check which processes consume the most memory, and then check /var/log/syslog and those processes' local logs to see if there are any outstanding errors to be resolved.

3.0.0
Disk Usage Very High (deprecated) Critical manager, intelligence

Intelligence node disk usage is very high.
When event detected: "The disk usage of disk partition {disk_partition_name} on Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of disk partition {disk_partition_name} on Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%. "

Examine disk partition {disk_partition_name} and see if there are any unexpected large files that can be removed.

3.0.0
Disk Usage High (deprecated) Medium manager, intelligence

Intelligence node disk usage is high.
When event detected: "The disk usage of disk partition {disk_partition_name} on Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of disk partition {disk_partition_name} on Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%. "

Examine disk partition {disk_partition_name} and see if there are any unexpected large files that can be removed.

3.0.0
Data Disk Partition Usage Very High (deprecated) Critical manager, intelligence

Intelligence node data disk partition usage is very high.
When event detected: "The disk usage of disk partition /data on Intelligence node {intelligence_node_id} is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of disk partition /data on Intelligence node {intelligence_node_id} is below the very high threshold value of {system_usage_threshold}%. "

Stop NSX Intelligence data collection until the disk usage is below the threshold. In the NSX UI, navigate to System | Appliances | NSX Intelligence Appliance. Then click ACTIONS, Stop Collecting Data.

3.0.0
Data Disk Partition Usage High (deprecated) Medium manager, intelligence

Intelligence node data disk partition usage is high.
When event detected: "The disk usage of disk partition /data on Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of disk partition /data on Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold}%. "

Stop NSX intelligence data collection until the disk usage is below the threshold. Examine disk partition /data and see if there are any unexpected large files that can be removed.

3.0.0
Storage Latency High (deprecated) Medium manager, intelligence

Intelligence node storage latency is high.
When event detected: "The storage latency of disk partition {disk_partition_name} on Intelligence node {intelligence_node_id} is above the high threshold value of {system_usage_threshold} milliseconds. "
When event resolved: "The storage latency of disk partition {disk_partition_name} on Intelligence node {intelligence_node_id} is below the high threshold value of {system_usage_threshold} milliseconds. "

Transient high storage latency may occur due to a spike of I/O requests. If storage latency remains high for more than 30 minutes, consider deploying the NSX Intelligence appliance on a low-latency disk, or not sharing the same storage device with other VMs.

3.1.0
Node Status Degraded (deprecated) High manager, intelligence

Intelligence node status is degraded.
When event detected: "Intelligence node {intelligence_node_id} is degraded. "
When event resolved: "Intelligence node {intelligence_node_id} is running properly. "

Invoke the NSX API GET /napp/api/v1/platform/monitor/category/health to check which specific pod is down and the reason behind it. Invoke the following CLI command to restart the degraded service: kubectl rollout restart <statefulset/deployment> <service_name> -n <namespace>
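The restart step above can be scripted once the health response has been parsed. The sketch below assumes a simplified response shape (a mapping of service name to its status and workload kind) purely for illustration; the real health API schema may differ:

```python
def restart_commands(health, namespace="nsx"):
    """Build a `kubectl rollout restart` command for every service whose
    reported status is not UP. The `health` shape and the 'nsx'
    namespace are illustrative assumptions."""
    return [
        f"kubectl rollout restart {info['kind']}/{name} -n {namespace}"
        for name, info in sorted(health.items())
        if info.get("status") != "UP"
    ]
```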

3.0.0

IPAM Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
IP Block Usage Very High Medium manager

IP block usage is very high.
When event detected: "IP block usage of {intent_path} is very high. IP block nearing its total capacity, creation of subnet using IP block might fail. "
When event resolved: "IP block usage of {intent_path} is below threshold level. "

Review the IP block usage. Use a new IP block for resource creation, or delete unused IP subnets from the IP block. To check which subnets use the IP block, in the NSX UI navigate to Networking | IP Address Pools | IP Address Pools tab, select the IP pools where the IP block is used, and check the Subnets and Allocated IPs columns. If no allocation has been made from an IP pool and it will not be used in the future, delete the subnet or IP pool. Use the following APIs to check whether the IP block is used by an IP pool and whether any IP allocations exist:
To get the configured subnets of an IP pool, invoke the NSX API GET /policy/api/v1/infra/ip-pools/<ip-pool>/ip-subnets.
To get IP allocations, invoke the NSX API GET /policy/api/v1/infra/ip-pools/<ip-pool>/ip-allocations.
Note: Delete an IP pool or subnet only if it has no allocated IPs and will not be used in the future.
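Both the IP block and IP pool alarms in this section compare a usage percentage against a configured threshold. A minimal sketch of that comparison; the 90% default threshold is an assumption, the actual value is configured in NSX:

```python
def pool_usage_alarm(allocated, capacity, threshold_pct=90.0):
    """Return (usage_percent, alarm_active). The threshold default is
    an illustrative assumption; real allocation counts come from the
    ip-allocations API, not from this helper."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    usage = 100.0 * allocated / capacity
    return usage, usage >= threshold_pct
```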

3.1.2
IP Pool Usage Very High Medium manager

IP pool usage is very high.
When event detected: "IP pool usage of {intent_path} is very high. IP pool nearing its total capacity. Creation of entity/service depends on IP being allocated from IP pool might fail. "
When event resolved: "IP pool usage of {intent_path} is normal now. "

Review the IP pool usage. Release unused IP allocations from the IP pool, or create a new IP pool and use it. In the NSX UI, navigate to Networking | IP Address Pools | IP Address Pools tab, select the IP pool, and check the Allocated IPs column, which shows the IPs allocated from the pool. If any IPs are not being used, they can be released. To release unused IP allocations, invoke the NSX API DELETE /policy/api/v1/infra/ip-pools/<ip-pool>/ip-allocations/<ip-allocation>.

3.1.2

Licenses Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
SHA Metering Plugin Down Critical manager

License SHA metering plugin on ESXi host is down or unhealthy.
When event detected: "The license SHA metering plugin on ESXi host {transport_node_name} ({transport_node_id}) is down or unhealthy for three consecutive days. Due to this, metering data from ESXi host is impacted. "
When event resolved: "The license SHA metering plugin on ESXi host {transport_node_name} ({transport_node_id}) is up or healthy again. "

To check the license SHA metering plugin status, invoke the NSX API GET /api/v1/systemhealth/plugins/status/{transport_node_id}. If there is no data in the response, the connection between NSX Manager and the ESXi host is broken or the SHA process on the ESXi host is down. If the response contains plugin status, locate the plugin status named license_metering_monitor and check its content in detail. To restore the license SHA metering plugin on the ESXi host, log in to the ESXi host and restart the SHA process by invoking the command /etc/init.d/netopa restart.
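The interpretation rules above (empty response means the channel or SHA is down; otherwise the license_metering_monitor entry decides) can be sketched against a parsed response. The list-of-dicts shape is an assumed simplification of the API schema:

```python
def metering_plugin_healthy(plugin_statuses):
    """Empty response => manager-to-host channel or SHA process down;
    otherwise the 'license_metering_monitor' entry's status decides.
    The response shape here is an illustrative assumption."""
    if not plugin_statuses:
        return False
    for entry in plugin_statuses:
        if entry.get("name") == "license_metering_monitor":
            return entry.get("status") == "UP"
    return False
```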

4.1.2
License Expired Critical global-manager, manager

A license has expired.
When event detected: "The {license_edition_type} license key ending with {displayed_license_key}, has expired. "
When event resolved: "The expired {license_edition_type} license key ending with {displayed_license_key}, has been removed, updated or is no longer about to expire. "

Add a new, non-expired license using the NSX UI by navigating to System | Licenses, then click ADD and specify the key of the new license. Delete the expired license by checking its checkbox, then click DELETE.

3.0.0
License Is About To Expire Medium global-manager, manager

A license is about to expire.
When event detected: "The {license_edition_type} license key ending with {displayed_license_key}, is about to expire. "
When event resolved: "The expiring {license_edition_type} license key ending with {displayed_license_key}, has been removed, updated or is no longer about to expire. "

The license is about to expire in several days. Plan to add a new, non-expiring license using the NSX UI by navigating to System | Licenses, then click ADD and specify the key of the new license. Delete the expiring license by checking its checkbox, then click DELETE.
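The two license alarms above differ only in when they fire relative to the expiry date. A sketch of that mapping; the 60-day warning window is an illustrative assumption, not NSX's documented window:

```python
from datetime import date

def license_alarm_severity(expiry, today, warn_days=60):
    """Critical once expired, Medium inside the warning window, else no
    alarm. warn_days=60 is an assumption for illustration."""
    remaining = (expiry - today).days
    if remaining < 0:
        return "CRITICAL"
    if remaining <= warn_days:
        return "MEDIUM"
    return None
```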

3.0.0

Load Balancer Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
LB CPU Very High Medium edge

Load balancer CPU usage is very high.
When event detected: "The CPU usage of load balancer {entity_id} is very high. The threshold is {system_usage_threshold}%. "
When event resolved: "The CPU usage of load balancer {entity_id} is low enough. The threshold is {system_usage_threshold}%. "

If the load balancer CPU utilization is higher than system usage threshold, the workload is too high for this load balancer. Rescale the load balancer service by changing the load balancer size from small to medium or from medium to large. If the CPU utilization of this load balancer is still high, consider adjusting the Edge appliance form factor size or moving load balancer services to other Edge nodes for the applicable workload.

3.0.0
LB Status Degraded Medium manager

Load balancer service is degraded.
When event detected: "The load balancer service {entity_id} is degraded. "
When event resolved: "The load balancer service {entity_id} is not degraded. "

For centralized load balancer: Check the load balancer status on standby Edge node as the degraded status means the load balancer status on standby Edge node is not ready. On standby Edge node, invoke the NSX CLI command get load-balancer <lb-uuid> status. If the LB-State of load balancer service is not_ready or there is no output, make the Edge node enter maintenance mode, then exit maintenance mode. For distributed load balancer:
1. Get detailed status by invoking NSX API GET /policy/api/v1/infra/lb-services/<LBService>/detailed-status?source=realtime
2. From API output, find ESXi host reporting a non-zero instance_number with status NOT_READY or CONFLICT.
3. On the ESXi host node, invoke the NSX CLI command get load-balancer <lb-uuid> status. If 'Conflict LSP' is reported, check whether this LSP is attached to another load balancer service and whether the conflict is acceptable. If 'Not Ready LSP' is reported, check the status of this LSP by invoking the NSX CLI command get logical-switch-port status. NOTE: Ignore the alarm if it resolves automatically within 5 minutes, because the degraded status can be transient.
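Step 2 above filters the detailed-status response for hosts worth investigating. A sketch of that filter; the 'per_node_status'/'node_id' field names are assumptions based on the fields the procedure names, not the exact API schema:

```python
def hosts_needing_attention(detailed_status):
    """Return host IDs reporting a non-zero instance_number with status
    NOT_READY or CONFLICT, per step 2 of the procedure above. The
    response shape is an illustrative assumption."""
    return [
        node["node_id"]
        for node in detailed_status.get("per_node_status", [])
        if node.get("instance_number", 0) > 0
        and node.get("status") in ("NOT_READY", "CONFLICT")
    ]
```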

3.1.2
DLB Status Down Critical manager

Distributed load balancer service is down.
When event detected: "The distributed load balancer service {entity_id} is down. "
When event resolved: "The distributed load balancer service {entity_id} is up. "

On the ESXi host node, invoke the NSX CLI command get load-balancer <lb-uuid> status. If 'Conflict LSP' is reported, check whether this LSP is attached to another load balancer service and whether the conflict is acceptable. If 'Not Ready LSP' is reported, check the status of this LSP by invoking the NSX CLI command get logical-switch-port status.

3.1.2
LB Status Down Critical edge

Centralized load balancer service is down.
When event detected: "The centralized load balancer service {entity_id} is down. "
When event resolved: "The centralized load balancer service {entity_id} is up. "

On active Edge node, check load balancer status by invoking the NSX CLI command get load-balancer <lb-uuid> status. If the LB-State of load balancer service is not_ready or there is no output, make the Edge node enter maintenance mode, then exit maintenance mode.

3.0.0
Virtual Server Status Down Medium edge

Load balancer virtual service is down.
When event detected: "The load balancer virtual server {entity_id} is down. "
When event resolved: "The load balancer virtual server {entity_id} is up. "

Consult the load balancer pool to determine its status and verify its configuration. If it is incorrectly configured, reconfigure it, remove the load balancer pool from the virtual server, and then re-add it.

3.0.0
Pool Status Down Medium edge

Load balancer pool is down.
When event detected: "The load balancer pool {entity_id} status is down. "
When event resolved: "The load balancer pool {entity_id} status is up "

Consult the load balancer pool to determine which members are down by invoking the NSX CLI command get load-balancer <lb-uuid> pool <pool-uuid> status or the NSX API GET /policy/api/v1/infra/lb-services/<lb-service-id>/lb-pools/<lb-pool-id>/detailed-status. If DOWN or UNKNOWN is reported, verify the pool member. Check network connectivity from the load balancer to the impacted pool members, and validate the application health of each pool member using the configured monitor. When a member's health is established, its status is updated to healthy based on the 'Rise Count' configuration in the monitor. Remediate the issue by rebooting the pool member, or by making the Edge node enter and then exit maintenance mode.
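The 'Rise Count' behavior mentioned above can be sketched as a simple state rule: a member is only reported UP after a run of consecutive successful health checks. The default of 3 here is an illustrative assumption, not the NSX default:

```python
def member_status(check_results, rise_count=3):
    """A pool member is UP only after rise_count consecutive successful
    health checks (newest result last). rise_count=3 is an assumed value."""
    streak = 0
    for ok in check_results:
        streak = streak + 1 if ok else 0
    return "UP" if streak >= rise_count else "DOWN"
```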

3.0.0
LB Edge Capacity In Use High Medium edge

Load balancer usage is high.
When event detected: "The usage of load balancer service in Edge node {entity_id} is high. The threshold is {system_usage_threshold}%. "
When event resolved: "The usage of load balancer service in Edge node {entity_id} is low enough. The threshold is {system_usage_threshold}%. "

If multiple LB instances have been configured in this Edge node, deploy a new Edge node and move some LB instances to it. If only a single LB instance (small/medium/etc.) has been configured in an Edge node of the same size, deploy a new Edge node of a bigger size and move the LB instance to it.

3.1.2
LB Pool Member Capacity In Use Very High Critical edge

Load balancer pool member usage is very high.
When event detected: "The usage of pool members in Edge node {entity_id} is very high. The threshold is {system_usage_threshold}%. "
When event resolved: "The usage of pool members in Edge node {entity_id} is low enough. The threshold is {system_usage_threshold}%. "

Deploy a new Edge node and move the load balancer service from existing Edge nodes to the newly deployed Edge node.

3.1.2
Load Balancing Configuration Not Realized Due To Lack Of Memory Medium edge

Load balancer configuration is not realized due to high memory usage on Edge node.
When event detected: "The load balancer configuration {entity_id} is not realized, due to high memory usage on Edge node {transport_node_id}. "
When event resolved: "The load balancer configuration {entity_id} is realized on {transport_node_id}. "

Prefer defining small and medium sized load balancers over large ones. Spread load balancer services among the available Edge nodes. Reduce the number of virtual servers defined.

3.2.0

Logging Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Log Retention Time Too Low Info esx, edge, manager, public-cloud-gateway, global-manager

Log files will be deleted before the set retention period.
When event detected: "One or more log files on the node will be deleted before the set retention period due to excessive logging. {log_context} "
When event resolved: "Estimated log maximum duration is now equal to or larger than the expected duration. "

Follow these steps to back up the log files before they are deleted:
1. Get the detailed report of log files on the node: {report_file_path}.
2. Review the Estimated Maximum Duration and Desired Duration in the detailed report. The Estimated Maximum Duration indicates whether the log files will be deleted before the retention period indicated by the Desired Duration. If needed, back up old log files.
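The alarm condition behind this event is a simple comparison of the two durations from the detailed report. A sketch, assuming both values are expressed in days (the report states the actual units):

```python
def retention_at_risk(estimated_max_days, desired_days):
    """Logs will be deleted early when the Estimated Maximum Duration
    falls short of the Desired Duration. Day units are an assumption."""
    return estimated_max_days < desired_days
```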

4.1.1
Remote Logging Not Configured Medium global-manager, manager

Remote logging not configured.
When event detected: "One or more {node_type_name} nodes are not currently configured to forward log messages to a remote logging server. "
When event resolved: "All {node_type_name} nodes are configured to forward log messages to at least one remote logging server currently. "

1. Invoke API GET /api/v1/configs/central-config/logging-servers to see the nodes on which remote logging is not configured.
2. For NSX Manager, Global Manager and Edge nodes, use NSX CLI set logging-server <hostname-or-ip-address[:port]> proto <proto> level <level> to configure a remote logging server and use NSX CLI get logging server to confirm if a remote logging server has been configured.
3. For ESXi nodes, use ESXi CLI esxcli system syslog config set --loghost=<str> and then esxcli system syslog reload to configure a remote logging server and use ESXi CLI esxcli system syslog config get and check Remote Host in response to confirm if the remote logging server has been configured.
4. If your setup contains only NSX Manager and Edge nodes, we recommend using a Node Profile to configure the remote logging server. Go to System | Fabric | Profiles | Node Profiles | All NSX Nodes and configure the remote logging server in the Syslog Servers section. The configuration will be applied to all NSX Manager and Edge nodes.
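Step 1 above scans the central-config response for nodes with no remote logging server. A sketch of that scan against a parsed response; the 'results'/'node_id'/'logging_servers' field names are assumptions for illustration, not the exact API schema:

```python
def nodes_without_remote_logging(config):
    """List node IDs that have no remote logging server configured,
    per step 1 of the procedure above. Field names are assumed."""
    return [
        node.get("node_id")
        for node in config.get("results", [])
        if not node.get("logging_servers")
    ]
```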

4.1.2

Malware Prevention Health Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Service Status Down High manager

Service status is down.
When event detected: "Service {mps_service_name} is not running on {transport_node_name}. "
When event resolved: "Service {mps_service_name} is running properly on {transport_node_name}. "

On the {transport_node_type} transport node identified by {transport_node_name}, invoke the NSX CLI command get services to check the status of {mps_service_name}. Inspect /var/log/syslog for any suspicious errors. Refer to the sections for the {transport_node_type} transport node in the KB.

4.0.1
File Extraction Service Unreachable High manager

Service status is degraded.
When event detected: "Service {mps_service_name} is degraded on {transport_node_name}. Unable to communicate with file extraction functionality. All file extraction abilities on the {transport_node_name} are paused. "
When event resolved: "Service {mps_service_name} is running properly on {transport_node_name}. "

On the {transport_node_type} transport node identified by {transport_node_name}, check the status of {mps_service_name}, which is responsible for file extraction. Inspect /var/log/syslog for any suspicious errors. Refer to the sections for the {transport_node_type} transport node in the KB.

4.0.1
Database Unreachable High manager

Service status is degraded.
When event detected: "Service {mps_service_name} is degraded on NSX Application Platform. It is unable to communicate with Malware Prevention database. "
When event resolved: "Service {mps_service_name} is running properly on NSX Application Platform. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services to check which service is degraded. Invoke the NSX API GET /napp/api/v1/platform/monitor/feature/health to check which specific service is down and the reason behind it. Invoke the following CLI command to restart the degraded service: kubectl rollout restart <statefulset/deployment> <service_name> -n <namespace>. Determine the status of the Malware Prevention database service.

4.0.1
Analyst API Service Unreachable High manager

Service status is degraded.
When event detected: "Service {mps_service_name} is degraded on NSX Application Platform. It is unable to communicate with analyst_api service. Inspected file verdicts may not be up to date. "
When event resolved: "Service {mps_service_name} is running properly on NSX Application Platform. "

The Analyst API service external to the datacenter is unreachable. Check connectivity to the internet. This could be temporary and may recover on its own. If it does not recover within a few minutes, collect the NSX Application Platform support bundle and raise a support ticket with the VMware support team.

4.0.1
NTICS Reputation Service Unreachable High manager

Service status is degraded.
When event detected: "Service {mps_service_name} is degraded on NSX Application Platform. It is unable to communicate with NTICS reputation service. Inspected file reputations may not be up to date. "
When event resolved: "Service {mps_service_name} is running properly on NSX Application Platform. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services to check which service is degraded. Invoke the NSX API GET /napp/api/v1/platform/monitor/feature/health to check which specific service is down and the reason behind it. Invoke the following CLI command to restart the degraded service: kubectl rollout restart <statefulset/deployment> <service_name> -n <namespace>. Determine whether access to the NTICS service is down.

4.1.0
Service Disk Usage Very High High manager

Service disk usage is very high.
When event detected: "The {disk_purpose} disk usage for service {mps_service_name} on {transport_node_name} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The {disk_purpose} disk usage for service {mps_service_name} on {transport_node_name} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. "

On the {transport_node_type} transport node identified by {transport_node_name}, you may reduce the file retention period, or, in the case of a host node, reduce the Malware Prevention load by moving some VMs to another host node. Refer to the sections for the {transport_node_type} transport node in the KB.

4.1.2
Service Disk Usage High Medium manager

Service disk usage is high.
When event detected: "The {disk_purpose} disk usage for service {mps_service_name} on {transport_node_name} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The {disk_purpose} disk usage for service {mps_service_name} on {transport_node_name} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. "

On the {transport_node_type} transport node identified by {transport_node_name}, you may reduce the file retention period, or, in the case of a host node, reduce the Malware Prevention load by moving some VMs to another host node. Refer to the sections for the {transport_node_type} transport node in the KB.

4.1.2
Service VM CPU Usage High Medium manager

Malware Prevention Service VM CPU usage is high.
When event detected: "The CPU usage on Malware Prevention Service VM {entity_id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage on Malware prevention Service VM {entity_id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. "

Migrate VMs out of the ESXi host {nsx_esx_tn_name} containing SVM {entity_id}, which is reporting high usage, to reduce the load on the SVM.

4.1.2
Service VM CPU Usage Very High High manager

Malware Prevention Service VM CPU usage is very high.
When event detected: "The CPU usage on Malware Prevention Service VM {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage on Malware Prevention Service VM {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. "

Migrate VMs out of the ESXi host {nsx_esx_tn_name} containing SVM {entity_id}, which is reporting high usage, to reduce the load on the SVM.

4.1.2
Service VM Memory Usage High Medium manager

Malware Prevention Service VM memory usage is high.
When event detected: "The memory usage on Malware Prevention Service VM {entity_id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage on Malware prevention Service VM {entity_id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. "

Migrate VMs out of the ESXi host {nsx_esx_tn_name} containing SVM {entity_id}, which is reporting high usage, to reduce the load on the SVM.

4.1.2
Service VM Memory Usage Very High High manager

Malware Prevention Service VM memory usage is very high.
When event detected: "The memory usage on Malware Prevention Service VM {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage on Malware Prevention Service VM {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. "

Migrate VMs out of the ESXi host {nsx_esx_tn_name} containing SVM {entity_id}, which is reporting high usage, to reduce the load on the SVM.

4.1.2

Manager Health Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Manager CPU Usage Very High Critical global-manager, manager

Manager node CPU usage is very high.
When event detected: "The CPU usage on Manager node {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage on Manager node {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. "

Review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size.

3.0.0
Manager CPU Usage High Medium global-manager, manager

Manager node CPU usage is high.
When event detected: "The CPU usage on Manager node {entity_id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage on Manager node {entity_id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. "

Review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size.

3.0.0
Manager Memory Usage Very High Critical global-manager, manager

Manager node memory usage is very high.
When event detected: "The memory usage on Manager node {entity_id} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage on Manager node {entity_id} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. "

Review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size.

3.0.0
Manager Memory Usage High Medium global-manager, manager

Manager node memory usage is high.
When event detected: "The memory usage on Manager node {entity_id} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage on Manager node {entity_id} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. "

Review the configuration, running services and sizing of this Manager node. Consider adjusting the Manager appliance form factor size.

3.0.0
Manager Disk Usage Very High Critical global-manager, manager

Manager node disk usage is very high.
When event detected: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. "

Examine the partition with high usage and see if there are any unexpected large files that can be removed.

3.0.0
Manager Disk Usage High Medium global-manager, manager

Manager node disk usage is high.
When event detected: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage for the Manager node disk partition {disk_partition_name} has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. "

Examine the partition with high usage and see if there are any unexpected large files that can be removed.

3.0.0
Manager Config Disk Usage Very High Critical global-manager, manager

Manager node config disk usage is very high.
When event detected: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. This can be an indication of high disk usage by the NSX Datastore service under the /config/corfu directory. "
When event resolved: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. "

Run the following tool and contact GSS if any issues are reported: /opt/vmware/tools/support/inspect_checkpoint_issues.py

3.0.0
Manager Config Disk Usage High Medium global-manager, manager

Manager node config disk usage is high.
When event detected: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. This can be an indication of rising disk usage by the NSX Datastore service under the /config/corfu directory. "
When event resolved: "The disk usage for the Manager node disk partition /config has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. "

Run the following tool and contact GSS if any issues are reported: /opt/vmware/tools/support/inspect_checkpoint_issues.py

3.0.0
Operations Db Disk Usage Very High Critical manager

Manager node nonconfig disk usage is very high.
When event detected: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}% which is at or above the very high threshold value of {system_usage_threshold}%. This can be an indication of high disk usage by the NSX Datastore service under the /nonconfig/corfu directory. "
When event resolved: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}% which is below the very high threshold value of {system_usage_threshold}%. "

Run the following tool and contact GSS if any issues are reported: /opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig

3.0.1
Operations Db Disk Usage High Medium manager

Manager node nonconfig disk usage is high.
When event detected: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}% which is at or above the high threshold value of {system_usage_threshold}%. This can be an indication of rising disk usage by the NSX Datastore service under the /nonconfig/corfu directory. "
When event resolved: "The disk usage for the Manager node disk partition /nonconfig has reached {system_resource_usage}% which is below the high threshold value of {system_usage_threshold}%. "

Run the following tool and contact GSS if any issues are reported: /opt/vmware/tools/support/inspect_checkpoint_issues.py --nonconfig

3.0.1
Duplicate IP Address Medium manager

Manager node's IP address is in use by another device.
When event detected: "Manager node {entity_id} IP address {duplicate_ip_address} is currently being used by another device in the network. "
When event resolved: "The device using the IP address assigned to Manager node {entity_id} appears to no longer be using {duplicate_ip_address}. "

1. Determine which device is using the Manager's IP address and assign that device a new IP address. Note that reconfiguring the Manager to use a new IP address is not supported.
2. Ensure the static IP address pool/DHCP server is configured correctly.
3. Correct the IP address of the device if it is manually assigned.

3.0.0
Storage Error Critical global-manager, manager

Manager node disk is read-only.
When event detected: "The following disk partition on the Manager node {entity_id} is in read-only mode: {disk_partition_name} "
When event resolved: "The following disk partition on the Manager node {entity_id} has recovered from read-only mode: {disk_partition_name} "

Examine the read-only partition to see whether a reboot resolves the issue or the disk needs to be replaced. Contact GSS for more information.

3.0.2
Missing DNS Entry For Manager FQDN Critical global-manager, manager

The DNS entry for the Manager FQDN is missing.
When event detected: "The DNS configuration for Manager node {manager_node_name} ({entity_id}) is incorrect. The Manager node is dual-stack and/or a CA-signed API certificate is in use, but the IP address(es) of the Manager node do not resolve to an FQDN, or resolve to different FQDNs. "
When event resolved: "The DNS configuration for Manager node {manager_node_name} ({entity_id}) is correct. Either the Manager node is not dual-stack and a CA-signed API certificate is no longer used, or the IP address(es) of the Manager node resolve to the same FQDN. "

1. Ensure proper DNS servers are configured in the Manager node.
2. Ensure proper A records and PTR records are configured in the DNS servers such that reverse lookup of the IP addresses of the Manager node returns the same FQDN, and forward lookup of the FQDN returns all IP addresses of the Manager node.
3. Alternatively, if the Manager node is not dual-stack, replace the CA-signed certificate for API service type with a self-signed certificate.

4.1.0
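The consistency rule in step 2 of the recommended action above can be expressed as a small check over the lookup results. A hedged Python sketch using hypothetical hostnames and addresses (this is not an NSX tool):

```python
# Sketch of the DNS consistency rule: reverse lookup of every Manager IP
# must yield one single FQDN, and forward lookup of that FQDN must return
# every Manager IP. Lookup tables stand in for real PTR and A/AAAA queries.
def dns_config_ok(manager_ips, reverse, forward):
    """reverse: ip -> fqdn (PTR records); forward: fqdn -> set of ips (A/AAAA)."""
    fqdns = {reverse.get(ip) for ip in manager_ips}
    if len(fqdns) != 1 or None in fqdns:
        return False                      # IPs resolve to different (or no) FQDNs
    fqdn = fqdns.pop()
    return forward.get(fqdn, set()) == set(manager_ips)

# Hypothetical dual-stack Manager with consistent records.
ips = {"10.0.0.5", "fd00::5"}
reverse = {"10.0.0.5": "mgr.example.com", "fd00::5": "mgr.example.com"}
forward = {"mgr.example.com": {"10.0.0.5", "fd00::5"}}
print(dns_config_ok(ips, reverse, forward))  # True
```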
Missing DNS Entry For Vip FQDN Critical manager

Missing FQDN entry for the Manager VIP.
When event detected: "In the case of a dual-stack or CA-signed API certificate for an NSX Manager, virtual IPv4 address {ipv4_address} and virtual IPv6 address {ipv6_address} for Manager node {entity_id} should resolve to the same FQDN. "
When event resolved: "Dual-stack VIP addresses for Manager node {entity_id} resolve to the same FQDN. "

Examine the DNS entries for the VIP addresses to see whether they resolve to the same FQDN.

4.1.0
Different Manager IP Configuration In Cluster Critical global-manager, manager

Not all NSX Managers in the cluster have the same IPv4 and/or IPv6 address families configuration.
When event detected: "Not all NSX Managers in the cluster have the same IPv4 and/or IPv6 address families configuration. "
When event resolved: "All nodes in the cluster have the same IPv4 and/or IPv6 address families configuration. "

1. Invoke the NSX CLI command 'get cluster status' to view the status.
2. Ensure all NSX Manager nodes in the cluster have the same IP configuration: IPv4-only, dual-stack, or IPv6-only.

4.2.0

MTU Check Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
MTU Mismatch Within Transport Zone High manager

MTU configuration mismatch between Transport Nodes attached to the same Transport Zone.
When event detected: "MTU configuration mismatch between Transport Nodes (ESXi, KVM and Edge) attached to the same Transport Zone. Inconsistent MTU values across the switches attached to the same Transport Zone will cause connectivity issues. "
When event resolved: "All MTU values between Transport Nodes attached to the same Transport Zone are consistent now. "

1. Navigate to System | Fabric | Settings | MTU Configuration Check | Inconsistent on the NSX UI to view details about the mismatch.
2. Set the same MTU value on all switches attached to the same Transport Zone by invoking the NSX API PUT /api/v1/host-switch-profiles/<host-switch-profile-id> with mtu in the request body, or API PUT /api/v1/global-configs/SwitchingGlobalConfig with physical_uplink_mtu in request body.

3.2.0
Global Router MTU Too Big Medium manager

The global router MTU configuration is bigger than the MTU of overlay Transport Zone.
When event detected: "The global router MTU configuration is bigger than the MTU of switches in the overlay Transport Zone which connects to Tier0 or Tier1. The global router MTU value should be at least 100 less than the MTU of all switches, because 100 bytes are reserved for Geneve encapsulation. "
When event resolved: "The global router MTU is less than the MTU of overlay Transport Zone now. "

1. Navigate to System | Fabric | Settings | MTU Configuration Check | Inconsistent on the NSX UI to view details about the mismatch.
2. Set a bigger MTU value on the switches by navigating to System | Fabric | Profiles | Uplink Profiles on the NSX UI, or by invoking the NSX API PUT /api/v1/host-switch-profiles/<host-switch-profile-id> with mtu in the request body, or the API PUT /api/v1/global-configs/SwitchingGlobalConfig with physical_uplink_mtu in the request body.
3. Alternatively, set a smaller MTU value in the global router configuration by navigating to Networking | Global Networking Config | Global Gateway Configuration | Gateway Interface MTU on the NSX UI, or by invoking the NSX API PUT /api/v1/global-configs/RoutingGlobalConfig with logical_uplink_mtu in the request body.
4. The Tier0 uplink interface MTU, when connected to VLAN networks, does not have to be smaller than the overlay switching MTU or the same as the global router MTU; this MTU should match the MTU of the BGP peer.

3.2.0
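The 100-byte Geneve headroom rule described above reduces to a simple validation. An illustrative Python sketch; the MTU values used below are assumptions for the example, not NSX defaults:

```python
# Check that the global router MTU leaves at least 100 bytes of headroom
# below every overlay switch MTU, to account for Geneve encapsulation.
GENEVE_OVERHEAD = 100  # bytes reserved for Geneve encapsulation

def global_router_mtu_ok(global_router_mtu: int, switch_mtus: list) -> bool:
    """True if the router MTU fits under the smallest switch MTU minus overhead."""
    return global_router_mtu <= min(switch_mtus) - GENEVE_OVERHEAD

# With an uplink MTU of 1700 on all switches, a 1500-byte router MTU fits,
# but 1601 does not (1700 - 100 = 1600 is the ceiling).
print(global_router_mtu_ok(1500, [1700, 1700]))  # True
print(global_router_mtu_ok(1601, [1700, 1700]))  # False
```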

NAT Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
SNAT Port Usage On Gateway Is High Critical edge, public-cloud-gateway

SNAT port usage on the Gateway is high.
When event detected: "SNAT ports usage on logical router {entity_id} for SNAT IP {snat_ip_address} has reached the high threshold value of {system_usage_threshold}%. New flows will not be SNATed when usage reaches the maximum limit. "
When event resolved: "SNAT ports usage on logical router {entity_id} for SNAT IP {snat_ip_address} has reached below the high threshold value of {system_usage_threshold}%. "

Log in as the admin user on the Edge node, invoke the NSX CLI command get firewall <LR_INT_UUID> connection state with the correct interface UUID, and check the SNAT mappings for the SNAT IP {snat_ip_address}. Verify that the traffic going through the gateway is not a denial-of-service attack or an anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider adding more SNAT IP addresses to distribute the load or routing new traffic to another Edge node.

3.2.0
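To see why adding SNAT IP addresses relieves port pressure: each SNAT IP contributes a finite source-port pool, so usage scales inversely with the number of IPs. An illustrative Python sketch; the pool size of 64000 ports per IP is an assumption for the example, not an NSX constant:

```python
# Rough model of SNAT port pressure: usage percentage for a given number
# of active SNAT flows spread across a pool of SNAT IP addresses.
PORTS_PER_SNAT_IP = 64000  # assumed pool size per SNAT IP, for illustration

def snat_usage_pct(active_snat_flows: int, snat_ip_count: int) -> float:
    """Percentage of the total SNAT port pool currently in use."""
    return 100.0 * active_snat_flows / (snat_ip_count * PORTS_PER_SNAT_IP)

# Adding a second SNAT IP halves the usage percentage for the same flow count,
# which can bring it back under a high-threshold such as 80%.
print(snat_usage_pct(57_600, 1))  # 90.0
print(snat_usage_pct(57_600, 2))  # 45.0
```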

NCP Health Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
NCP Plugin Down Critical manager

Manager Node has detected the NCP is down or unhealthy.
When event detected: "Manager Node has detected the NCP is down or unhealthy. "
When event resolved: "Manager Node has detected the NCP is up or healthy again. "

To find the clusters that are having issues, use the NSX UI and navigate to the Alarms page. The Entity name value for this alarm instance identifies the cluster name. Alternatively, invoke the NSX API GET /api/v1/systemhealth/container-cluster/ncp/status to fetch all cluster statuses and determine the name of any clusters that report DOWN or UNKNOWN. Then, on the NSX UI Inventory | Container | Clusters page, find the cluster by name and click the Nodes tab, which lists all Kubernetes and PAS cluster members.
For Kubernetes clusters:
1. Check NCP Pod liveness by finding the K8s leader node from all the cluster members and logging onto the leader node. Then invoke the kubectl command kubectl get pods --all-namespaces. If there is an issue with the NCP Pod, use the kubectl logs command to identify and fix the error.
2. Check the connection between NCP and the Kubernetes API server. The NSX CLI can be used inside the NCP Pod to check this connection status by invoking the following commands from the leader VM: kubectl exec -it <NCP-Pod-Name> -n nsx-system bash, then nsxcli, then get ncp-k8s-api-server status. If there is an issue with the connection, check both the network and NCP configurations.
3. Check the connection between NCP and NSX Manager. The NSX CLI can be used inside the NCP Pod to check this connection status by invoking the following commands from the leader VM: kubectl exec -it <NCP-Pod-Name> -n nsx-system bash, then nsxcli, then get ncp-nsx status. If there is an issue with the connection, check both the network and NCP configurations.
For PAS clusters:
1. Check the network connections between virtual machines and fix any network issues.
2. Check the status of both nodes and services and fix crashed nodes or services. Invoke the commands bosh vms and bosh instances -p to check the status of nodes and services.

3.0.0

Node Agents Health Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Node Agents Down On DPU High dpu

The agents running inside the Node VM appear to be down on DPU.
When event detected: "The agents running inside the Node VM appear to be down on DPU {dpu_id}. "
When event resolved: "The agents inside the Node VM are running on DPU {dpu_id}. "

1. If Vmk50 on DPU {dpu_id} is missing, refer to this Knowledge Base article https://kb.vmware.com/s/article/67432 .
2. If Hyperbus 4094 on DPU {dpu_id} is missing, restarting nsx-cfgagent on DPU {dpu_id} or restarting the container host VM may help.
3. If container host VIF is blocked, check the connection to the Controller to make sure all configurations are sent down.
4. If nsx-cfg-agent on DPU {dpu_id} has stopped, restart nsx-cfgagent on DPU {dpu_id} .
5. If the node-agent package is missing, check whether the node-agent package has been successfully installed in the container host VM.
6. If the interface for node-agent in the container host VM is down, check the eth1 interface status inside the container host VM.

4.0.0
Node Agents Down High esx, kvm

The agents running inside the Node VM appear to be down.
When event detected: "The agents running inside the Node VM appear to be down. "
When event resolved: "The agents inside the Node VM are running. "

For ESX:
1. If Vmk50 is missing, refer to this Knowledge Base article https://kb.vmware.com/s/article/67432 .
2. If Hyperbus 4094 is missing, restarting nsx-cfgagent or restarting the container host VM may help.
3. If container host VIF is blocked, check the connection to the Controller to make sure all configurations are sent down.
4. If nsx-cfg-agent has stopped, restart nsx-cfgagent.
For KVM:
1. If Hyperbus namespace is missing, restarting the nsx-opsagent may help recreate the namespace.
2. If Hyperbus interface is missing inside the hyperbus namespace, restarting the nsx-opsagent may help.
3. If nsx-agent has stopped, restart nsx-agent.
For both ESX and KVM:
1. If the node-agent package is missing, check whether the node-agent package has been successfully installed in the container host VM.
2. If the interface for node-agent in the container host VM is down, check the eth1 interface status inside the container host VM.

3.0.0

NSX Application Platform Communication Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Manager Disconnected High manager, intelligence

The NSX Application Platform cluster is disconnected from the NSX management cluster.
When event detected: "The NSX Application Platform cluster {napp_cluster_id} is disconnected from the NSX management cluster. "
When event resolved: "The NSX Application Platform cluster {napp_cluster_id} is reconnected to the NSX management cluster. "

Check whether the manager cluster certificate, manager node certificates, Kafka certificate, and ingress certificate match on both NSX Manager and the NSX Application Platform cluster. Check the expiration dates of these certificates to make sure they are valid. Check the network connection between NSX Manager and the NSX Application Platform cluster and resolve any network connection failures.

3.2.0
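For the certificate-expiry portion of the check above, Python's standard library can parse the notAfter field of a certificate. A hedged helper sketch (not an NSX tool; the date strings below are examples):

```python
import ssl
import time

# Given the notAfter string from a certificate (the text form returned by
# Python's ssl.SSLSocket.getpeercert()), report whether it is still valid.
def cert_still_valid(not_after, now=None):
    """not_after uses the OpenSSL text form, e.g. 'Jun 27 22:15:16 2030 GMT'."""
    expiry = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    current = time.time() if now is None else now
    return current < expiry

print(cert_still_valid("Jan 01 00:00:00 2000 GMT"))  # False (long expired)
```

In practice the notAfter value would come from the deployed manager, Kafka, or ingress certificate rather than a literal string.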
Delay Detected In Messaging Rawflow Critical manager, intelligence

Slow data processing detected in messaging topic Raw Flow.
When event detected: "The number of pending messages in the messaging topic Raw Flow is above the pending message threshold of {napp_messaging_lag_threshold}. "
When event resolved: "The number of pending messages in the messaging topic Raw Flow is below the pending message threshold of {napp_messaging_lag_threshold}. "

Add nodes and then scale up the NSX Application Platform cluster. If the bottleneck can be attributed to a specific service, for example, the analytics service, then scale up the specific service when the new nodes are added.

3.2.0
Delay Detected In Messaging Overflow Critical manager, intelligence

Slow data processing detected in messaging topic Over Flow.
When event detected: "The number of pending messages in the messaging topic Over Flow is above the pending message threshold of {napp_messaging_lag_threshold}. "
When event resolved: "The number of pending messages in the messaging topic Over Flow is below the pending message threshold of {napp_messaging_lag_threshold}. "

Add nodes and then scale up the NSX Application Platform cluster. If the bottleneck can be attributed to a specific service, for example, the analytics service, then scale up the specific service when the new nodes are added.

3.2.0
TN Flow Exp Disconnected High esx, kvm, bms

A Transport node is disconnected from its NSX messaging broker.
When event detected: "The flow exporter on Transport node {entity_id} is disconnected from its messaging broker {messaging_broker_info}. Data collection is affected. "
When event resolved: "The flow exporter on Transport node {entity_id} has reconnected to its messaging broker {messaging_broker_info}. "

Restart the messaging service if it is not running. Resolve the network connection failure between the Transport node flow exporter and its NSX messaging broker.

3.2.0
TN Flow Exp Disconnected On DPU High dpu

A Transport node is disconnected from its NSX messaging broker.
When event detected: "The flow exporter on Transport node {entity_id} DPU {dpu_id} is disconnected from its messaging broker {messaging_broker_info}. Data collection is affected. "
When event resolved: "The flow exporter on Transport node {entity_id} DPU {dpu_id} has reconnected to its messaging broker {messaging_broker_info}. "

Restart the messaging service if it is not running. Resolve the network connection failure between the Transport node flow exporter and its NSX messaging broker.

4.0.0

NSX Application Platform Health Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Cluster CPU Usage Very High Critical manager, intelligence

NSX Application Platform cluster CPU usage is very high.
When event detected: "The CPU usage of NSX Application Platform cluster {napp_cluster_id} is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of NSX Application Platform cluster {napp_cluster_id} is below the very high threshold value of {system_usage_threshold}%. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the System Load field of individual services to see which service is under pressure. See if the load can be reduced. If more computing power is required, click on the Scale Out button to request more resources.

3.2.0
Cluster CPU Usage High Medium manager, intelligence

NSX Application Platform cluster CPU usage is high.
When event detected: "The CPU usage of NSX Application Platform cluster {napp_cluster_id} is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of NSX Application Platform cluster {napp_cluster_id} is below the high threshold value of {system_usage_threshold}%. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the System Load field of individual services to see which service is under pressure. See if the load can be reduced. If more computing power is required, click on the Scale Out button to request more resources.

3.2.0
Cluster Memory Usage Very High Critical manager, intelligence

NSX Application Platform cluster memory usage is very high.
When event detected: "The memory usage of NSX Application Platform cluster {napp_cluster_id} is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of NSX Application Platform cluster {napp_cluster_id} is below the very high threshold value of {system_usage_threshold}%. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Memory field of individual services to see which service is under pressure. See if the load can be reduced. If more memory is required, click on the Scale Out button to request more resources.

3.2.0
Cluster Memory Usage High Medium manager, intelligence

NSX Application Platform cluster memory usage is high.
When event detected: "The memory usage of NSX Application Platform cluster {napp_cluster_id} is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of NSX Application Platform cluster {napp_cluster_id} is below the high threshold value of {system_usage_threshold}%. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Memory field of individual services to see which service is under pressure. See if the load can be reduced. If more memory is required, click on the Scale Out button to request more resources.

3.2.0
Cluster Disk Usage Very High Critical manager, intelligence

NSX Application Platform cluster disk usage is very high.
When event detected: "The disk usage of NSX Application Platform cluster {napp_cluster_id} is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of NSX Application Platform cluster {napp_cluster_id} is below the very high threshold value of {system_usage_threshold}%. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Storage field of individual services to see which service is under pressure. See if the load can be reduced. If more disk storage is required, click on the Scale Out button to request more resources. If the Data Storage service is under strain, you can instead click the Scale Up button to increase the disk size.

3.2.0
Cluster Disk Usage High Medium manager, intelligence

NSX Application Platform cluster disk usage is high.
When event detected: "The disk usage of NSX Application Platform cluster {napp_cluster_id} is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of NSX Application Platform cluster {napp_cluster_id} is below the high threshold value of {system_usage_threshold}%. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Storage field of individual services to see which service is under pressure. See if the load can be reduced. If more disk storage is required, click on the Scale Out button to request more resources. If the Data Storage service is under strain, you can instead click the Scale Up button to increase the disk size.

3.2.0
Napp Status Degraded Medium manager, intelligence

NSX Application Platform cluster overall status is degraded.
When event detected: "NSX Application Platform cluster {napp_cluster_id} overall status is degraded. "
When event resolved: "NSX Application Platform cluster {napp_cluster_id} is running properly. "

Get more information from alarms of nodes and services.

3.2.0
Napp Status Down High manager, intelligence

NSX Application Platform cluster overall status is down.
When event detected: "NSX Application Platform cluster {napp_cluster_id} overall status is down. "
When event resolved: "NSX Application Platform cluster {napp_cluster_id} is running properly. "

Get more information from alarms of nodes and services.

3.2.0
Node CPU Usage Very High Critical manager, intelligence

NSX Application Platform node CPU usage is very high.
When event detected: "The CPU usage of NSX Application Platform node {napp_node_name} is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of NSX Application Platform node {napp_node_name} is below the very high threshold value of {system_usage_threshold}%. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the System Load field of individual services to see which service is under pressure. See if load can be reduced. If only a small minority of the nodes have high CPU usage, by default, Kubernetes will reschedule services automatically. If most nodes have high CPU usage and load cannot be reduced, click on the Scale Out button to request more resources.

3.2.0
Node CPU Usage High Medium manager, intelligence

NSX Application Platform node CPU usage is high.
When event detected: "The CPU usage of NSX Application Platform node {napp_node_name} is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of NSX Application Platform node {napp_node_name} is below the high threshold value of {system_usage_threshold}%. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the System Load field of individual services to see which service is under pressure. See if load can be reduced. If only a small minority of the nodes have high CPU usage, by default, Kubernetes will reschedule services automatically. If most nodes have high CPU usage and load cannot be reduced, click on the Scale Out button to request more resources.

3.2.0
Node Memory Usage Very High Critical manager, intelligence

NSX Application Platform node memory usage is very high.
When event detected: "The memory usage of NSX Application Platform node {napp_node_name} is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of NSX Application Platform node {napp_node_name} is below the very high threshold value of {system_usage_threshold}%. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Memory field of individual services to see which service is under pressure. See if load can be reduced. If only a small minority of the nodes have high Memory usage, by default, Kubernetes will reschedule services automatically. If most nodes have high Memory usage and load cannot be reduced, click on the Scale Out button to request more resources.

3.2.0
Node Memory Usage High Medium manager, intelligence

NSX Application Platform node memory usage is high.
When event detected: "The memory usage of NSX Application Platform node {napp_node_name} is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of NSX Application Platform node {napp_node_name} is below the high threshold value of {system_usage_threshold}%. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Memory field of individual services to see which service is under pressure. See if load can be reduced. If only a small minority of the nodes have high Memory usage, by default, Kubernetes will reschedule services automatically. If most nodes have high Memory usage and load cannot be reduced, click on the Scale Out button to request more resources.

3.2.0
Node Disk Usage Very High Critical manager, intelligence

NSX Application Platform node disk usage is very high.
When event detected: "The disk usage of NSX Application Platform node {napp_node_name} is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of NSX Application Platform node {napp_node_name} is below the very high threshold value of {system_usage_threshold}%. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Storage field of individual services to see which service is under pressure. Clean up unused data or logs to free up disk resources and see if the load can be reduced. If more disk storage is required, Scale Out the service under pressure. If the Data Storage service is under strain, you can instead click the Scale Up button to increase the disk size.

3.2.0
Node Disk Usage High Medium manager, intelligence

NSX Application Platform node disk usage is high.
When event detected: "The disk usage of NSX Application Platform node {napp_node_name} is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of NSX Application Platform node {napp_node_name} is below the high threshold value of {system_usage_threshold}%. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services and check the Storage field of individual services to see which service is under pressure. Clean up unused data or logs to free up disk resources and see if the load can be reduced. If more disk storage is required, Scale Out the service under pressure. If the Data Storage service is under strain, you can instead click the Scale Up button to increase the disk size.

3.2.0
Node Status Degraded Medium manager, intelligence

NSX Application Platform node status is degraded.
When event detected: "NSX Application Platform node {napp_node_name} is degraded. "
When event resolved: "NSX Application Platform node {napp_node_name} is running properly. "

In the NSX UI, navigate to System | NSX Application Platform | Resources to check which node is degraded. Check network, memory and CPU usage of the node. Reboot the node if it is a worker node.

3.2.0
Node Status Down High manager, intelligence

NSX Application Platform node status is down.
When event detected: "NSX Application Platform node {napp_node_name} is not running. "
When event resolved: "NSX Application Platform node {napp_node_name} is running properly. "

In the NSX UI, navigate to System | NSX Application Platform | Resources to check which node is down. Check network, memory and CPU usage of the node. Reboot the node if it is a worker node.

3.2.0
Datastore CPU Usage Very High Critical manager, intelligence

Data Storage service CPU usage is very high.
When event detected: "The CPU usage of Data Storage service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of Data Storage service is below the very high threshold value of {system_usage_threshold}%. "

Scale out all services or the Data Storage service.

3.2.0
Datastore CPU Usage High Medium manager, intelligence

Data Storage service CPU usage is high.
When event detected: "The CPU usage of Data Storage service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of Data Storage service is below the high threshold value of {system_usage_threshold}%. "

Scale out all services or the Data Storage service.

3.2.0
Messaging CPU Usage Very High Critical manager, intelligence

Messaging service CPU usage is very high.
When event detected: "The CPU usage of Messaging service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of Messaging service is below the very high threshold value of {system_usage_threshold}%. "

Scale out all services or the Messaging service.

3.2.0
Messaging CPU Usage High Medium manager, intelligence

Messaging service CPU usage is high.
When event detected: "The CPU usage of Messaging service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of Messaging service is below the high threshold value of {system_usage_threshold}%. "

Scale out all services or the Messaging service.

3.2.0
Configuration Db CPU Usage Very High Critical manager, intelligence

Configuration Database service CPU usage is very high.
When event detected: "The CPU usage of Configuration Database service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of Configuration Database service is below the very high threshold value of {system_usage_threshold}%. "

Scale out all services.

3.2.0
Configuration Db CPU Usage High Medium manager, intelligence

Configuration Database service CPU usage is high.
When event detected: "The CPU usage of Configuration Database service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of Configuration Database service is below the high threshold value of {system_usage_threshold}%. "

Scale out all services.

3.2.0
Metrics CPU Usage Very High Critical manager, intelligence

Metrics service CPU usage is very high.
When event detected: "The CPU usage of Metrics service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of Metrics service is below the very high threshold value of {system_usage_threshold}%. "

Scale out all services.

3.2.0
Metrics CPU Usage High Medium manager, intelligence

Metrics service CPU usage is high.
When event detected: "The CPU usage of Metrics service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of Metrics service is below the high threshold value of {system_usage_threshold}%. "

Scale out all services.

3.2.0
Analytics CPU Usage Very High Critical manager, intelligence

Analytics service CPU usage is very high.
When event detected: "The CPU usage of Analytics service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of Analytics service is below the very high threshold value of {system_usage_threshold}%. "

Scale out all services or the Analytics service.

3.2.0
Analytics CPU Usage High Medium manager, intelligence

Analytics service CPU usage is high.
When event detected: "The CPU usage of Analytics service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of Analytics service is below the high threshold value of {system_usage_threshold}%. "

Scale out all services or the Analytics service.

3.2.0
Platform CPU Usage Very High Critical manager, intelligence

Platform Services service CPU usage is very high.
When event detected: "The CPU usage of Platform Services service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of Platform Services service is below the very high threshold value of {system_usage_threshold}%. "

Scale out all services.

3.2.0
Platform CPU Usage High Medium manager, intelligence

Platform Services service CPU usage is high.
When event detected: "The CPU usage of Platform Services service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The CPU usage of Platform Services service is below the high threshold value of {system_usage_threshold}%. "

Scale out all services.

3.2.0
Datastore Memory Usage Very High Critical manager, intelligence

Data Storage service memory usage is very high.
When event detected: "The memory usage of Data Storage service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of Data Storage service is below the very high threshold value of {system_usage_threshold}%. "

Scale out all services or the Data Storage service.

3.2.0
Datastore Memory Usage High Medium manager, intelligence

Data Storage service memory usage is high.
When event detected: "The memory usage of Data Storage service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of Data Storage service is below the high threshold value of {system_usage_threshold}%. "

Scale out all services or the Data Storage service.

3.2.0
Messaging Memory Usage Very High Critical manager, intelligence

Messaging service memory usage is very high.
When event detected: "The memory usage of Messaging service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of Messaging service is below the very high threshold value of {system_usage_threshold}%. "

Scale out all services or the Messaging service.

3.2.0
Messaging Memory Usage High Medium manager, intelligence

Messaging service memory usage is high.
When event detected: "The memory usage of Messaging service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of Messaging service is below the high threshold value of {system_usage_threshold}%. "

Scale out all services or the Messaging service.

3.2.0
Configuration Db Memory Usage Very High Critical manager, intelligence

Configuration Database service memory usage is very high.
When event detected: "The memory usage of Configuration Database service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of Configuration Database service is below the very high threshold value of {system_usage_threshold}%. "

Scale out all services.

3.2.0
Configuration Db Memory Usage High Medium manager, intelligence

Configuration Database service memory usage is high.
When event detected: "The memory usage of Configuration Database service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of Configuration Database service is below the high threshold value of {system_usage_threshold}%. "

Scale out all services.

3.2.0
Metrics Memory Usage Very High Critical manager, intelligence

Metrics service memory usage is very high.
When event detected: "The memory usage of Metrics service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of Metrics service is below the very high threshold value of {system_usage_threshold}%. "

Scale out all services.

3.2.0
Metrics Memory Usage High Medium manager, intelligence

Metrics service memory usage is high.
When event detected: "The memory usage of Metrics service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of Metrics service is below the high threshold value of {system_usage_threshold}%. "

Scale out all services.

3.2.0
Analytics Memory Usage Very High Critical manager, intelligence

Analytics service memory usage is very high.
When event detected: "The memory usage of Analytics service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of Analytics service is below the very high threshold value of {system_usage_threshold}%. "

Scale out all services or the Analytics service.

3.2.0
Analytics Memory Usage High Medium manager, intelligence

Analytics service memory usage is high.
When event detected: "The memory usage of Analytics service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of Analytics service is below the high threshold value of {system_usage_threshold}%. "

Scale out all services or the Analytics service.

3.2.0
Platform Memory Usage Very High Critical manager, intelligence

Platform Services service memory usage is very high.
When event detected: "The memory usage of Platform Services service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of Platform Services service is below the very high threshold value of {system_usage_threshold}%. "

Scale out all services.

3.2.0
Platform Memory Usage High Medium manager, intelligence

Platform Services service memory usage is high.
When event detected: "The memory usage of Platform Services service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The memory usage of Platform Services service is below the high threshold value of {system_usage_threshold}%. "

Scale out all services.

3.2.0
Datastore Disk Usage Very High Critical manager, intelligence

Data Storage service disk usage is very high.
When event detected: "The disk usage of Data Storage service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of Data Storage service is below the very high threshold value of {system_usage_threshold}%. "

Scale out or scale up the Data Storage service.

3.2.0
Datastore Disk Usage High Medium manager, intelligence

Data Storage service disk usage is high.
When event detected: "The disk usage of Data Storage service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of Data Storage service is below the high threshold value of {system_usage_threshold}%. "

Scale out or scale up the Data Storage service.

3.2.0
Messaging Disk Usage Very High Critical manager, intelligence

Messaging service disk usage is very high.
When event detected: "The disk usage of Messaging service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of Messaging service is below the very high threshold value of {system_usage_threshold}%. "

Clean up files not needed. Scale out all services or the Messaging service.

3.2.0
Messaging Disk Usage High Medium manager, intelligence

Messaging service disk usage is high.
When event detected: "The disk usage of Messaging service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of Messaging service is below the high threshold value of {system_usage_threshold}%. "

Clean up files not needed. Scale out all services or the Messaging service.

3.2.0
Configuration Db Disk Usage Very High Critical manager, intelligence

Configuration Database service disk usage is very high.
When event detected: "The disk usage of Configuration Database service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of Configuration Database service is below the very high threshold value of {system_usage_threshold}%. "

Clean up files not needed. Scale out all services.

3.2.0
Configuration Db Disk Usage High Medium manager, intelligence

Configuration Database service disk usage is high.
When event detected: "The disk usage of Configuration Database service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of Configuration Database service is below the high threshold value of {system_usage_threshold}%. "

Clean up files not needed. Scale out all services.

3.2.0
Metrics Disk Usage Very High Critical manager, intelligence

Metrics service disk usage is very high.
When event detected: "The disk usage of Metrics service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of Metrics service is below the very high threshold value of {system_usage_threshold}%. "

Contact VMware support to review storage usage.

3.2.0
Metrics Disk Usage High Medium manager, intelligence

Metrics service disk usage is high.
When event detected: "The disk usage of Metrics service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of Metrics service is below the high threshold value of {system_usage_threshold}%. "

Contact VMware support to review storage usage.

3.2.0
Analytics Disk Usage Very High Critical manager, intelligence

Analytics service disk usage is very high.
When event detected: "The disk usage of Analytics service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of Analytics service is below the very high threshold value of {system_usage_threshold}%. "

Clean up files not needed. Scale out all services or the Analytics service.

3.2.0
Analytics Disk Usage High Medium manager, intelligence

Analytics service disk usage is high.
When event detected: "The disk usage of Analytics service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of Analytics service is below the high threshold value of {system_usage_threshold}%. "

Clean up files not needed. Scale out all services or the Analytics service.

3.2.0
Platform Disk Usage Very High Critical manager, intelligence

Platform Services service disk usage is very high.
When event detected: "The disk usage of Platform Services service is above the very high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of Platform Services service is below the very high threshold value of {system_usage_threshold}%. "

Clean up files not needed. Scale out all services.

3.2.0
Platform Disk Usage High Medium manager, intelligence

Platform Services service disk usage is high.
When event detected: "The disk usage of Platform Services service is above the high threshold value of {system_usage_threshold}%. "
When event resolved: "The disk usage of Platform Services service is below the high threshold value of {system_usage_threshold}%. "

Clean up files not needed. Scale out all services.

3.2.0
Service Status Degraded Medium manager, intelligence

Service status is degraded.
When event detected: "Service {napp_service_name} is degraded. The service may still be able to reach a quorum while pods associated with {napp_service_name} are not all stable. Resources consumed by these unstable pods may be released. "
When event resolved: "Service {napp_service_name} is running properly. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services to check which service is degraded. Invoke the NSX API GET /napp/api/v1/platform/monitor/feature/health to check which specific service is degraded and the reason behind it. If necessary, restart the degraded service with the following CLI command: kubectl rollout restart <statefulset/deployment> <service_name> -n <namespace>. Degraded services can still function, but performance is suboptimal.
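The kubectl restart step in the recommended action can be scripted as a small helper. This is a minimal sketch: the workload kind, the service name "analytics", and the namespace "nsxi-platform" are illustrative placeholders, not values from this catalog; substitute the names from your own NSX Application Platform deployment.

```python
import subprocess

def restart_napp_service(kind, service_name, namespace, dry_run=True):
    """Compose (and optionally run) the 'kubectl rollout restart' command
    suggested in the recommended action. With dry_run=True the command is
    only returned, so it can be inspected before execution."""
    cmd = ["kubectl", "rollout", "restart",
           f"{kind}/{service_name}", "-n", namespace]
    if not dry_run:
        # Requires kubectl on PATH and access to the NAPP cluster.
        subprocess.run(cmd, check=True)
    return cmd

# Example with hypothetical names: a 'analytics' deployment in 'nsxi-platform'.
print(restart_napp_service("deployment", "analytics", "nsxi-platform"))
```

Keeping `dry_run=True` by default makes it safe to preview the exact command before restarting a production service.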

3.2.0
Service Status Down High manager, intelligence

Service status is down.
When event detected: "Service {napp_service_name} is not running. "
When event resolved: "Service {napp_service_name} is running properly. "

In the NSX UI, navigate to System | NSX Application Platform | Core Services to check which service is down. Invoke the NSX API GET /napp/api/v1/platform/monitor/feature/health to check which specific service is down and the reason behind it. Restart the service with the following CLI command: kubectl rollout restart <statefulset/deployment> <service_name> -n <namespace>

3.2.0
Flow Storage Growth High Medium manager, intelligence

Analytics and Data Storage disk usage is growing faster than expected.
When event detected: "Analytics and Data Storage disks are expected to be full in {predicted_full_period} days, less than current data retention period {current_retention_period} days. "
When event resolved: "Analytics and Data Storage disk usage growth is normal. "

Connect fewer transport nodes or configure narrower private IP ranges to reduce the number of unique flows. Filter out broadcast and/or multicast flows. Scale out the Analytics and Data Storage services to get more storage.

4.1.1

Password Management Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Password Expired Critical global-manager, manager, edge, public-cloud-gateway

User password has expired.
When event detected: "The password for user {username} has expired. "
When event resolved: "The password for user {username} has been changed successfully or is no longer expired or the user is no longer active. "

The password for user {username} must be changed now to access the system. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body: PUT /api/v1/node/users/<userid>, where <userid> is the ID of the user. If the admin user (with <userid> 10000) password has expired, the admin must log in to the system via SSH (if enabled) or the console in order to change the password. Upon entering the current expired password, the admin is prompted to enter a new password.
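As a sketch, the PUT /api/v1/node/users/<userid> call from the recommended action can be composed like this. The manager hostname and password are placeholders, and the request body assumes a "password" field as accepted by the node users API; verify against the NSX API reference for your release before use.

```python
import json

def build_password_change_request(manager_host, userid, new_password):
    """Compose the URL and JSON body for changing a node user's password
    via PUT /api/v1/node/users/<userid>."""
    url = f"https://{manager_host}/api/v1/node/users/{userid}"
    body = json.dumps({"password": new_password})
    return url, body

# Hypothetical manager host; userid 10000 is the admin user per the text above.
url, body = build_password_change_request("nsx-mgr.example.com", 10000, "NewP@ssw0rd!234")
print(url)
# Send with your preferred HTTP client authenticated as an admin, e.g.:
#   requests.put(url, data=body, auth=(user, pwd),
#                headers={"Content-Type": "application/json"}, verify=ca_bundle)
```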

3.0.0
Password Is About To Expire High global-manager, manager, edge, public-cloud-gateway

User password is about to expire.
When event detected: "The password for user {username} is about to expire in {password_expiration_days} days. "
When event resolved: "The password for the user {username} has been changed successfully or is no longer expired or the user is no longer active. "

Ensure the password for the user {username} is changed immediately. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body: PUT /api/v1/node/users/<userid> where <userid> is the ID of the user.

3.0.0
Password Expiration Approaching Medium global-manager, manager, edge, public-cloud-gateway

User password is approaching expiration.
When event detected: "The password for user {username} is approaching expiration in {password_expiration_days} days. "
When event resolved: "The password for the user {username} has been changed successfully or is no longer expired or the user is no longer active. "

The password for the user {username} needs to be changed soon. For example, to apply a new password to a user, invoke the following NSX API with a valid password in the request body: PUT /api/v1/node/users/<userid> where <userid> is the ID of the user.

3.0.0

Physical Server Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Physical Server Install Failed Critical manager

Physical Server (BMS) installation failed.
When event detected: "Physical Server {transport_node_name} ({entity_id}) installation failed. "
When event resolved: "Physical Server {transport_node_name} ({entity_id}) installation completed. "

Navigate to System > Fabric > Nodes > Host Transport Nodes and resolve the error on the node.

4.0.0
Physical Server Upgrade Failed Critical manager

Physical Server (BMS) upgrade failed.
When event detected: "Physical Server {transport_node_name} ({entity_id}) upgrade failed. "
When event resolved: "Physical Server {transport_node_name} ({entity_id}) upgrade completed. "

Navigate to System > Upgrade and resolve the error, then re-trigger the upgrade.

4.0.0
Physical Server Uninstall Failed Critical manager

Physical Server (BMS) uninstallation failed.
When event detected: "Physical Server {transport_node_name} ({entity_id}) uninstallation failed. "
When event resolved: "Physical Server {transport_node_name} ({entity_id}) uninstallation completed. "

Navigate to System > Fabric > Nodes > Host Transport Nodes and resolve the error on the node.

4.0.0

Policy Constraint Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Creation Count Limit Reached Medium manager

Entity count has reached the policy constraint limit.
When event detected: "Entity count for type {constraint_type} in {constraint_type_path} is currently at {current_count} which reached the maximum limit of {constraint_limit}. "
When event resolved: "{constraint_type} Count is below threshold. "

Review {constraint_type} usage. Update the constraint to increase the limit or delete unused {constraint_type}.

4.1.0
Creation Count Limit Reached For Project Medium manager

Entity count has reached the policy constraint limit.
When event detected: "Entity count for type {constraint_type} in {project_path} is currently at {current_count} which reached the maximum limit of {constraint_limit}. "
When event resolved: "{constraint_type} Count is below threshold. "

Review {constraint_type} usage. Update the constraint to increase the limit or delete unused {constraint_type}.

4.1.1

Routing Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
BFD Down On External Interface High edge, autonomous-edge, public-cloud-gateway

BFD session is down.
When event detected: "In router {lr_id}, BFD session for peer {peer_address} is down. "
When event resolved: "In router {lr_id}, BFD session for peer {peer_address} is up. "

1. Invoke the NSX CLI command get logical-routers.
2. Switch to the service router {sr_id}.
3. Invoke the NSX CLI command ping {peer_address} to verify the connectivity.

3.0.0
Static Routing Removed High edge, autonomous-edge, public-cloud-gateway

Static route removed.
When event detected: "In router {lr_id}, static route {entity_id} ({static_address}) was removed because BFD was down. "
When event resolved: "In router {lr_id}, static route {entity_id} ({static_address}) was re-added as BFD recovered. "

The static routing entry was removed because the BFD session was down.
1. Invoke the NSX CLI command get logical-routers.
2. Switch to the service-router {sr_id}.
3. Invoke the NSX CLI command ping <BFD peer IP address> to verify the connectivity. Also, verify the configuration in both NSX and the BFD peer to ensure that timers have not been changed.

3.0.0
BGP Down High edge, autonomous-edge, public-cloud-gateway

BGP neighbor down.
When event detected: "In Router {lr_id}, BGP neighbor {entity_id} ({bgp_neighbor_ip}) is down. Reason: {failure_reason}. "
When event resolved: "In Router {lr_id}, BGP neighbor {entity_id} ({bgp_neighbor_ip}) is up. "

1. Invoke the NSX CLI command get logical-routers.
2. Switch to the service-router {sr_id}.
3. If the reason indicates a network or configuration error, invoke the NSX CLI command get bgp neighbor summary to check the BGP neighbor status.
4. If the reason indicates the Edge is not ready, invoke the NSX CLI command get edge-cluster status to check why the Edge node is not in a good state.
5. Invoke the NSX CLI commands get bfd-config and get bfd-sessions to check if BFD is running well.
6. Check any Edge health related alarms for more information. Check /var/log/syslog to see if there are any errors related to BGP connectivity.

3.0.0
Proxy ARP Not Configured For Service IP Critical manager

Proxy ARP is not configured for Service IP.
When event detected: "Proxy ARP for Service IP {service_ip} and Service entity {entity_id} is not configured as the number of ARP proxy entries generated due to overlap of the Service IP with subnet of lrport {lrport_id} on Router {lr_id} has exceeded the allowed threshold limit of 16384. "
When event resolved: "Proxy ARP for Service entity {entity_id} is generated successfully as the overlap of service IP with subnet of lrport {lrport_id} on Router {lr_id} is within the allowed limit of 16384 entries. "

Reconfigure the Service IP {service_ip} for the Service entity {entity_id}, or change the subnet of the lrport {lrport_id} on Router {lr_id}, so that the number of proxy ARP entries generated by the overlap between the Service IP and the lrport subnet stays below the allowed threshold of 16384.
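To see why the 16384-entry threshold matters, the overlap size can be estimated from the prefixes involved. This is an illustrative model (proxy ARP entries scale with the addresses in the overlap), not the exact NSX algorithm, and the prefixes below are hypothetical examples.

```python
import ipaddress

PROXY_ARP_LIMIT = 16384  # threshold cited in the alarm

def overlap_entry_count(service_net, lrport_subnet):
    """Estimate proxy ARP entries as the number of addresses in the
    overlap between the service range and the lrport subnet. For two
    prefixes that overlap, the overlap is the more specific prefix."""
    a = ipaddress.ip_network(service_net)
    b = ipaddress.ip_network(lrport_subnet)
    if not a.overlaps(b):
        return 0
    inner = a if a.prefixlen >= b.prefixlen else b
    return inner.num_addresses

# A /17 overlapping a /16 yields 2^15 = 32768 candidate entries,
# exceeding the limit; a /18 overlap yields exactly 16384.
print(overlap_entry_count("10.0.0.0/17", "10.0.0.0/16"))  # 32768
print(overlap_entry_count("10.0.0.0/18", "10.0.0.0/16") <= PROXY_ARP_LIMIT)
```

Shrinking either prefix (or removing the overlap entirely) is what brings the entry count back under the threshold.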

3.0.3
Routing Down High edge, autonomous-edge, public-cloud-gateway

All BGP/BFD sessions are down.
When event detected: "All BGP/BFD sessions are down. "
When event resolved: "At least one BGP/BFD session up. "

Invoke the NSX CLI command get logical-routers to get the Tier-0 service router, switch to its VRF, and then invoke the following NSX CLI commands.
1. ping <BFD peer IP address> to verify connectivity.
2. get bfd-config and get bfd-sessions to check if BFD is running well.
3. get bgp neighbor summary to check if BGP is running well. Also check /var/log/syslog to see if there are any errors related to BGP connectivity.

3.0.0
OSPF Neighbor Went Down High edge, autonomous-edge, public-cloud-gateway

OSPF neighbor moved from full to another state.
When event detected: "OSPF neighbor {peer_address} moved from full to another state. "
When event resolved: "OSPF neighbor {peer_address} moved to full state. "

1. Invoke the NSX CLI command get logical-routers to get the vrf id and switch to TIER0 service router.
2. Run get ospf neighbor to check the current state of this neighbor. If the neighbor is not listed in the output, the neighbor has gone down or is no longer reachable on the network.
3. Invoke the NSX CLI command ping <OSPF neighbor IP address> to verify the connectivity.
4. Also, verify the configuration for both NSX and peer router to ensure that timers and area-id match.
5. Check /var/log/syslog to see if there are any errors related to connectivity.

3.1.1
Maximum IPv4 Route Limit Approaching Medium edge, autonomous-edge, public-cloud-gateway

Maximum IPv4 Routes limit is approaching on Edge node.
When event detected: "IPv4 routes limit has reached {route_limit_threshold} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. "
When event resolved: "IPv4 routes are within the limit of {route_limit_threshold} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. "

1. Check route redistribution policies and routes received from all external peers.
2. Consider reducing the number of routes by applying routing policies and filters accordingly.
3. Check if inter-SR routing is enabled or route leaking between VRFs is enabled. This can increase the total number of routes on the edge.
4. To get the total IPv4 route count across all VRFs on the Edge, use the NSX CLI command get route vrf all ipv4 or the Policy API https://<policy-ip>/policy/api/v1/infra/tier-0s/<tier-0>/number-of-routes?edge_path=<edge-path>&include_child_vrf=true
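The Policy API URL from step 4 can be composed programmatically. This is a sketch: the policy IP, Tier-0 ID, and edge path below are hypothetical placeholders to be replaced with values from your environment.

```python
from urllib.parse import urlencode

def route_count_url(policy_ip, tier0_id, edge_path, include_child_vrf=True):
    """Compose the Policy API URL for the total route count across all
    VRFs on an Edge node, as given in the recommended action."""
    query = urlencode({
        "edge_path": edge_path,  # urlencode percent-escapes the slashes
        "include_child_vrf": str(include_child_vrf).lower(),
    })
    return (f"https://{policy_ip}/policy/api/v1/infra/tier-0s/"
            f"{tier0_id}/number-of-routes?{query}")

# Hypothetical identifiers; fetch the URL with an authenticated GET.
print(route_count_url(
    "203.0.113.10", "T0-GW",
    "/infra/sites/default/enforcement-points/default/edge-clusters/ec1/edge-nodes/0"))
```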

4.0.0
Maximum IPv6 Route Limit Approaching Medium edge, autonomous-edge, public-cloud-gateway

Maximum IPv6 Routes limit is approaching on Edge node.
When event detected: "IPv6 routes limit has reached {route_limit_threshold} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. "
When event resolved: "IPv6 routes are within the limit of {route_limit_threshold} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. "

1. Check route redistribution policies and routes received from all external peers.
2. Consider reducing the number of routes by applying routing policies and filters accordingly.
3. Check if inter-SR routing is enabled or route leaking between VRFs is enabled. This can increase the total number of routes on the edge.
4. To get the total IPv6 route count across all VRFs on the Edge, use the NSX CLI command get route vrf all ipv6 or the Policy API https://<policy-ip>/policy/api/v1/infra/tier-0s/<tier-0>/number-of-routes?edge_path=<edge-path>&include_child_vrf=true

4.0.0
Maximum IPv4 Route Limit Exceeded Critical edge, autonomous-edge, public-cloud-gateway

Maximum IPv4 Routes limit has been exceeded on Edge node.
When event detected: "IPv4 routes has exceeded limit of {route_limit_maximum} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. "
When event resolved: "IPv4 routes are within the limit of {route_limit_maximum} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. "

1. Check route redistribution policies and routes received from all external peers.
2. Consider reducing the number of routes by applying routing policies and filters accordingly.
3. Check if inter-SR routing is enabled or route leaking between VRFs is enabled. This can increase the total number of routes on the edge.
4. To get the total IPv4 route count across all VRFs on the Edge, use the NSX CLI command get route vrf all ipv4 or the Policy API https://<policy-ip>/policy/api/v1/infra/tier-0s/<tier-0>/number-of-routes?edge_path=<edge-path>&include_child_vrf=true

4.0.0
Maximum IPv6 Route Limit Exceeded Critical edge, autonomous-edge, public-cloud-gateway

Maximum IPv6 Routes limit has been exceeded on Edge node.
When event detected: "IPv6 routes has exceeded limit of {route_limit_maximum} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. "
When event resolved: "IPv6 routes are within the limit of {route_limit_maximum} on Tier0 Gateway and all Tier0 VRFs on Edge node {edge_node}. The limit also includes inter-SR routes. "

1. Check route redistribution policies and routes received from all external peers.
2. Consider reducing the number of routes by applying routing policies and filters accordingly.
3. Check if inter-SR routing is enabled or route leaking between VRFs is enabled. This can increase the total number of routes on the edge.
4. To get the total IPv6 route count across all VRFs on the Edge, use the NSX CLI command get route vrf all ipv6 or the Policy API https://<policy-ip>/policy/api/v1/infra/tier-0s/<tier-0>/number-of-routes?edge_path=<edge-path>&include_child_vrf=true

4.0.0
Maximum IPv4 Prefixes From BGP Neighbor Approaching Medium edge, autonomous-edge, public-cloud-gateway

Maximum IPv4 Prefixes received from BGP neighbor is approaching.
When event detected: "Number of IPv4 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} reaches {prefixes_count_threshold}. Limit defined for this peer is {prefixes_count_max}. "
When event resolved: "Number of IPv4 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} is within the limit {prefixes_count_threshold}. "

1. Check the BGP routing policies in the external router.
2. Consider reducing the number of routes advertised by the BGP peer by applying routing policies and filters to the external router.
3. If required, increase the maximum prefixes settings under the BGP neighbor configuration section.
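Step 3 can also be done through the policy API rather than the UI. The fragment below is a hedged sketch of the request body only; the 'route_filtering' and 'maximum_routes' field names reflect the BGP neighbor schema as commonly documented, but should be verified against the API reference for your NSX release before use.

```python
# Sketch: build a PATCH body to raise the maximum-prefixes setting on a
# BGP neighbor. Field names (route_filtering, maximum_routes) are
# assumptions -- verify against the BGP neighbor schema for your release.
import json

def max_prefixes_patch(address_family: str, new_maximum: int) -> str:
    body = {
        "route_filtering": [
            {
                "address_family": address_family,  # e.g. "IPV4" or "IPV6"
                "maximum_routes": new_maximum,     # the per-peer prefix limit
            }
        ]
    }
    return json.dumps(body)

# The resulting JSON would be sent (with authentication) to a path such as:
# PATCH /policy/api/v1/infra/tier-0s/<tier-0>/locale-services/<ls>/bgp/neighbors/<neighbor-id>
```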

4.0.0
Maximum IPv6 Prefixes From BGP Neighbor Approaching Medium edge, autonomous-edge, public-cloud-gateway

The number of IPv6 prefixes received from a BGP neighbor is approaching the maximum.
When event detected: "Number of IPv6 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} reaches {prefixes_count_threshold}. Limit defined for this peer is {prefixes_count_max}. "
When event resolved: "Number of IPv6 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} is within the limit {prefixes_count_threshold}. "

1. Check the BGP routing policies in the external router.
2. Consider reducing the number of routes advertised by the BGP peer by applying routing policies and filters to the external router.
3. If required, increase the maximum prefixes settings under the BGP neighbor configuration section.

4.0.0
Maximum IPv4 Prefixes From BGP Neighbor Exceeded Critical edge, autonomous-edge, public-cloud-gateway

The number of IPv4 prefixes received from a BGP neighbor has exceeded the maximum.
When event detected: "Number of IPv4 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} exceeded the limit defined for this peer of {prefixes_count_max}. "
When event resolved: "Number of IPv4 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} is within the limit {prefixes_count_max}. "

1. Check the BGP routing policies in the external router.
2. Consider reducing the number of routes advertised by the BGP peer by applying routing policies and filters to the external router.
3. If required, increase the maximum prefixes settings under the BGP neighbor configuration section.

4.0.0
Maximum IPv6 Prefixes From BGP Neighbor Exceeded Critical edge, autonomous-edge, public-cloud-gateway

The number of IPv6 prefixes received from a BGP neighbor has exceeded the maximum.
When event detected: "Number of IPv6 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} exceeded the limit defined for this peer of {prefixes_count_max}. "
When event resolved: "Number of IPv6 {subsequent_address_family} prefixes received from {bgp_neighbor_ip} is within the limit {prefixes_count_max}. "

1. Check the BGP routing policies in the external router.
2. Consider reducing the number of routes advertised by the BGP peer by applying routing policies and filters to the external router.
3. If required, increase the maximum prefixes settings under the BGP neighbor configuration section.

4.0.0

Security Compliance Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Trigger NDcPP Non-Compliance Critical manager

The NSX security status is not NDcPP compliant.
When event detected: "One of the NDcPP compliance requirements is being violated. That means the NSX status is currently non-compliant with regards to NDcPP. "
When event resolved: "The NDcPP compliance issues have all been resolved. "

Run the compliance report from the UI Home - Monitoring & Dashboard - Compliance Report menu and resolve all the issues that are marked with the NDcPP compliance name.

4.1.0
Trigger EAL4 Non-Compliance Critical manager

The NSX security status is not EAL4+ compliant.
When event detected: "One of the EAL4+ compliance requirements is being violated. That means the NSX status is currently non-compliant with regards to EAL4+. "
When event resolved: "The EAL4+ compliance issues have all been resolved. "

Run the compliance report from the UI Home - Monitoring & Dashboard - Compliance Report menu and resolve all the issues that are marked with the EAL4+ compliance name.

4.1.0
Poll NDcPP Non-Compliance Critical manager

The NSX security configuration is not NDcPP compliant.
When event detected: "One of the NDcPP compliance requirements is being violated. That means the NSX configuration is currently non-compliant with regards to NDcPP. "
When event resolved: "The NDcPP compliance issues have all been resolved. "

Run the compliance report from the UI Home - Monitoring & Dashboard - Compliance Report menu and resolve all the issues that are marked with the NDcPP compliance name.

4.1.0
Poll EAL4 Non-Compliance Critical manager

The NSX security configuration is not EAL4+ compliant.
When event detected: "One of the EAL4+ compliance requirements is being violated. That means the NSX configuration is currently non-compliant with regards to EAL4+. "
When event resolved: "The EAL4+ compliance issues have all been resolved. "

Run the compliance report from the UI Home - Monitoring & Dashboard - Compliance Report menu and resolve all the issues that are marked with the EAL4+ compliance name.

4.1.0

Service Insertion Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Service Deployment Succeeded Info manager

Service deployment succeeded.
When event detected: "The service deployment {entity_id} for service {service_name} on cluster {vcenter_cluster_id} has succeeded. "
When event resolved: "The service deployment {entity_id} on cluster {vcenter_cluster_id} has succeeded, no action needed. "

No action needed.

4.0.0
Service Deployment Failed Critical manager

Service deployment failed.
When event detected: "The service deployment {entity_id} for service {service_name} on cluster {vcenter_cluster_id} has failed. Reason : {failure_reason} "
When event resolved: "The failed service deployment {entity_id} has been removed. "

Delete the service deployment using the NSX UI or API. Perform any corrective action from the KB and retry the service deployment.

4.0.0
Service Undeployment Succeeded Info manager

Service deployment deletion succeeded.
When event detected: "The deletion of service deployment {entity_id} for service {service_name} on cluster {vcenter_cluster_id} has succeeded. "
When event resolved: "The deletion of service deployment {entity_id} on cluster {vcenter_cluster_id} has succeeded, no action needed. "

No action needed.

4.0.0
Service Undeployment Failed Critical manager

Service deployment deletion failed.
When event detected: "The deletion of service deployment {entity_id} for service {service_name} on cluster {vcenter_cluster_id} has failed. Reason : {failure_reason} "
When event resolved: "The failed service deployment name {entity_id} has been removed. "

Delete the service deployment using the NSX UI or API. Perform any corrective action from the KB and retry deleting the service deployment. Resolve the alarm manually after verifying that all VMs and objects are deleted.
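The final manual-resolution step can be performed with the alarms API referenced at the top of this catalog (GET /api/v1/alarms). The sketch below builds the corresponding set_status request URL; confirm the exact action and parameter names against the API reference for your NSX release.

```python
# Sketch: resolve an alarm manually through the NSX alarms API once
# cleanup is verified. The set_status action and new_status parameter
# follow the NSX alarms API; confirm against your release's API reference.
def resolve_alarm_url(manager: str, alarm_id: str) -> str:
    """URL to POST (with authentication) to mark an alarm RESOLVED."""
    return (f"https://{manager}/api/v1/alarms/{alarm_id}"
            "?action=set_status&new_status=RESOLVED")
```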

4.0.0
SVM Health Status Up Info manager

The SVM is working in the service.
When event detected: "The health check for SVM {entity_id} for service {service_name} is working correctly on {hostname_or_ip_address_with_port}. "
When event resolved: "The SVM {entity_id} is working correctly, no action needed. "

No action needed.

4.0.0
SVM Health Status Down High manager

The SVM is not working in the service.
When event detected: "The health check for SVM {entity_id} for service {service_name} is not working correctly on {hostname_or_ip_address_with_port}. Reason : {failure_reason}. "
When event resolved: "The SVM {entity_id} with wrong state has been removed. "

Delete the service deployment using the NSX UI or API. Perform any corrective action from the KB and retry the service deployment if necessary.

4.0.0
Service Insertion Infra Status Down Critical esx

Service insertion infrastructure status is down and not enabled on the host.
When event detected: "SPF not enabled at port level on host {transport_node_id} and the status is down. Reason : {failure_reason}. "
When event resolved: "Service insertion infrastructure status is up and has been correctly enabled on host. "

Perform any corrective action from the KB and check if the status is up. Resolve the alarm manually after checking the status.

4.0.0
SVM Liveness State Down Critical manager

SVM liveness state down.
When event detected: "SVM liveness state is down on {entity_id} and traffic flow is impacted. "
When event resolved: "SVM liveness state is up and configured as expected. "

Perform any corrective action from the KB and check if the state is up.

4.0.0
Service Chain Path Down Critical manager

Service chain path down.
When event detected: "Service chain path is down on {entity_id} and traffic flow is impacted. "
When event resolved: "Service chain path is up and configured as expected. "

Perform any corrective action from the KB and check if the status is up.

4.0.0
New Host Added Info esx

New host added to cluster.
When event detected: "New host added in cluster {vcenter_cluster_id} and SVM will be deployed. "
When event resolved: "New host added successfully. "

Check the VM deployment status and wait until the VM powers on.

4.0.0

TEP Health Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Faulty TEP Medium esx

TEP is unhealthy.
When event detected: "TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id}. Overlay workloads using this TEP will face network outage. Reason: {vtep_fault_reason}. "
When event resolved: "TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} is healthy. "

1. Check whether the TEP has a valid IP or any other underlay connectivity issues.
2. Enable TEP HA to fail over workloads to other healthy TEPs.

4.1.0
TEP Ha Activated Info esx

TEP HA activated.
When event detected: "TEP HA activated for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id}. "
When event resolved: "TEP HA cleared for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id}. "

Enable AutoRecover or invoke Manual Recover for TEP:{vtep_name} on VDS:{dvs_name} at Transport node:{transport_node_id}.

4.1.0
TEP Autorecover Success Info esx

AutoRecover is successful.
When event detected: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} is successful. "
When event resolved: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} is cleared. "

None.

4.1.0
TEP Autorecover Failure Medium esx

AutoRecover failed.
When event detected: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} failed. Overlay workloads using this TEP will failover to other healthy TEPs. If no other healthy TEPs, overlay workloads will face network outage. "
When event resolved: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} is cleared. "

Check whether the TEP has a valid IP or any other underlay connectivity issues.

4.1.0
Faulty TEP On DPU Medium dpu

TEP is unhealthy on DPU.
When event detected: "TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id}. Overlay workloads using this TEP will face network outage. Reason: {vtep_fault_reason}. "
When event resolved: "TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id} is healthy. "

1. Check whether the TEP has a valid IP or any other underlay connectivity issues.
2. Enable TEP HA to fail over workloads to other healthy TEPs.

4.1.0
TEP Ha Activated On DPU Info dpu

TEP HA activated on DPU.
When event detected: "TEP HA activated for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id}. "
When event resolved: "TEP HA cleared for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id}. "

Enable AutoRecover or invoke Manual Recover for TEP:{vtep_name} on VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id}.

4.1.0
TEP Autorecover Success On DPU Info dpu

AutoRecover is successful on DPU.
When event detected: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id} is successful. "
When event resolved: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id} is cleared. "

None.

4.1.0
TEP Autorecover Failure On DPU Medium dpu

AutoRecover failed on DPU.
When event detected: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id} failed. Overlay workloads using this TEP will failover to other healthy TEPs. If no other healthy TEPs, overlay workloads will face network outage. "
When event resolved: "Auto Recover for TEP:{vtep_name} of VDS:{dvs_name} at Transport node:{transport_node_id} on DPU {dpu_id} is cleared. "

Check whether the TEP has a valid IP or any other underlay connectivity issues.

4.1.0

Transport Node Health Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Monitoring Framework Unhealthy Medium global-manager, bms, edge, esx, kvm, manager, public-cloud-gateway

Monitoring Service framework on transport node is unhealthy.
When event detected: "Monitoring Service framework on the host with UUID {entity_id} is unhealthy for more than 5 minutes. Stats and status will not be collected from this host. "
When event resolved: "Monitoring Service framework on the host with UUID {entity_id} is healthy. "

1. On a problematic nsx-edge or nsx-public-gateway node, invoke 'systemctl restart nsx-edge-exporter'.
2. On a problematic nsx-manager or global-manager node, invoke 'systemctl restart nsx-host-node-status-reporter'.
3. On other problematic nodes, invoke '/etc/init.d/nsx-exporter restart'.
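The node-type-to-command mapping in steps 1-3 can be captured in a small helper, useful when automating remediation across a fleet. A minimal sketch; the commands are taken verbatim from the steps above:

```python
# Sketch: map an NSX node type to the monitoring-framework restart
# command given in steps 1-3 above.
def exporter_restart_command(node_type: str) -> str:
    """Return the restart command for the node's monitoring exporter."""
    if node_type in ("edge", "public-cloud-gateway"):
        return "systemctl restart nsx-edge-exporter"
    if node_type in ("manager", "global-manager"):
        return "systemctl restart nsx-host-node-status-reporter"
    # bms, esx, kvm and other transport nodes use the init script
    return "/etc/init.d/nsx-exporter restart"
```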

4.1.1
Transport Node Uplink Down On DPU Medium dpu

Uplink on DPU is going down.
When event detected: "Uplink on DPU {dpu_id} is going down. "
When event resolved: "Uplink on DPU {dpu_id} is going up. "

Check the status of the physical NICs backing the uplinks on DPU {dpu_id}. Find the mapped name of the physical NIC on the host, then verify in the UI.
1. In the NSX UI navigate to Fabric | Nodes | Transport Nodes | Host Transport Nodes.
2. In the Host Transport Nodes list, check the Node Status column. Find the Transport node with the degraded or down Node Status.
3. Select <transport node> | Monitor. Check the status details of the bond (uplink) which is reporting degraded or down. To avoid a degraded state, ensure all uplink interfaces are connected and up regardless of whether they are in use or not.

4.0.0
LAG Member Down On DPU Medium dpu

LACP on DPU reporting member down.
When event detected: "LACP on DPU {dpu_id} reporting member down. "
When event resolved: "LACP on DPU {dpu_id} reporting member up. "

Check the connection status of the LAG members on DPU {dpu_id}. Find the mapped name of the related physical NIC on the host, then verify in the UI.
1. In the NSX UI navigate to Fabric | Nodes | Transport Nodes | Host Transport Nodes.
2. In the Host Transport Nodes list, check the Node Status column. Find the Transport node with the degraded or down Node Status.
3. Select <transport node> | Monitor. Find the bond (uplink) which is reporting degraded or down.
4. Check the LACP member status details by logging into the failed DPU {dpu_id} and invoking esxcli network vswitch dvs vmware lacp status get.

4.0.0
NVDS Uplink Down (deprecated) Medium esx, kvm, bms

Uplink is going down.
When event detected: "Uplink is going down. "
When event resolved: "Uplink is going up. "

Check the status of the physical NICs backing the uplinks on the hosts.
1. In the NSX UI navigate to Fabric | Nodes | Transport Nodes | Host Transport Nodes.
2. In the Host Transport Nodes list, check the Node Status column. Find the Transport node with the degraded or down Node Status.
3. Select <transport node> | Monitor. Check the status details of the bond (uplink) which is reporting degraded or down. To avoid a degraded state, ensure all uplink interfaces are connected and up regardless of whether they are in use or not.

3.0.0
Transport Node Uplink Down Medium esx, kvm, bms

Uplink is going down.
When event detected: "Uplink is going down. "
When event resolved: "Uplink is going up. "

Check the status of the physical NICs backing the uplinks on the hosts.
1. In the NSX UI navigate to Fabric | Nodes | Transport Nodes | Host Transport Nodes.
2. In the Host Transport Nodes list, check the Node Status column. Find the Transport node with the degraded or down Node Status.
3. Select <transport node> | Monitor. Check the status details of the bond (uplink) which is reporting degraded or down. To avoid a degraded state, ensure all uplink interfaces are connected and up regardless of whether they are in use or not.

3.2.0
LAG Member Down Medium esx, kvm, bms

LACP reporting member down.
When event detected: "LACP reporting member down. "
When event resolved: "LACP reporting member up. "

Check the connection status of the LAG members on the hosts.
1. In the NSX UI navigate to Fabric | Nodes | Transport Nodes | Host Transport Nodes.
2. In the Host Transport Nodes list, check the Node Status column. Find the Transport node with the degraded or down Node Status.
3. Select <transport node> | Monitor. Find the bond (uplink) which is reporting degraded or down.
4. Check the LACP member status details by logging into the failed host and invoking esxcli network vswitch dvs vmware lacp status get on an ESXi host or ovs-appctl bond/show and ovs-appctl lacp/show on a KVM host.

3.0.0

Transport Node Pending Action Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Maintenance Mode Critical manager

The host has a pending user action, i.e., PENDING_HOST_MAINTENANCE_MODE.
When event detected: "Host {host_name} - {host_uuid} has PENDING_HOST_MAINTENANCE_MODE user action. It means high performance configuration is not yet realized on the host. "
When event resolved: "The host {host_name} - {host_uuid} no longer has PENDING_HOST_MAINTENANCE_MODE user action. "

Move host {host_name} - {host_uuid} to maintenance mode from vCenter. This starts realization of the high performance configuration on the host. If processed successfully, transportNodeState will no longer have PENDING_HOST_MAINTENANCE_MODE inside the pending_user_actions field. If realization of the high performance configuration on the host fails, transportNodeState is updated with the failure message and the host will no longer be in pending maintenance mode.
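The pending_user_actions check described above can be expressed as a small predicate when polling transport node state programmatically. A minimal sketch, assuming transportNodeState is available as a parsed JSON dict (the surrounding JSON shape is an assumption; the field name pending_user_actions comes from the text):

```python
# Sketch: check a transport node state payload for the pending
# maintenance-mode user action described above.
def needs_maintenance_mode(transport_node_state: dict) -> bool:
    """True if the host still awaits PENDING_HOST_MAINTENANCE_MODE."""
    return "PENDING_HOST_MAINTENANCE_MODE" in transport_node_state.get(
        "pending_user_actions", [])
```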

4.1.2

VMC App Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
Transit Connect Failure Medium manager

Transit Connect fails to be fully realized.
When event detected: "Transit Connect related configuration is not fully correctly realized. Possible issues could be failing to retrieve provider information or some transient provider communication error. "
When event resolved: "Transit Connect failure is remediated. "

If this alarm is not auto-resolved within 10 minutes, retry the most recent transit connect related request(s). For example, if a TGW attachment API request triggered this alarm, retry that request. If the alarm still does not resolve, try the following steps:
1. Check whether the task keeps failing or has recovered.
a) Identify the leader Manager node. After logging in to one of the nodes, run 'su admin' and then 'get cluster status verbose'. This shows the leader Manager node.
b) Log in to the NSX leader Manager node and check vmc-app.log: 'tail -f /var/log/proton/vmc-app.log'.
c) Check the logs for the following messages. If any of these error messages keeps appearing every two minutes, the task keeps failing:
- Failed to get TGW route table for []. Error: []
- Failed to get TGW routes for attachment [] in route table []. Error
- Failed to get TGW attachment VPC ID for []. Error: []
- Failed to get TGW attachment resource ID for []. Error: Unknown resource type
- Failed to get TGW attachments for TGW []. Error: []
- Failed to get local TGW attachment []. Error: []
- Failed to find correct TgwAttachment state in AWS, state: [], skipping TGW route update task
- TGW attachment [] is not associated with any route table
- No local TGW SDDC attachment found for []
2. Check whether all AWS calls from the NSX Manager fail. On the leader Manager node, run:
- export HTTP_PROXY=http://<pop ip>:3128
- export HTTPS_PROXY=http://<pop ip>:3128
- export NO_PROXY=169.254.169.254
- aws ec2 describe-instances --region <region>
If the aws command fails with an error, there might be a problem with the HTTP reverse proxy configuration on the pop, or an AWS service-side issue.
3. Check whether the TGW attachment still exists in AWS.
a) The TGW attachment ID can be found with GET cloud-service/api/v1/infra/associated-groups. Then run:
- aws ec2 describe-transit-gateway-attachments --region <region> --transit-gateway-attachment-id <TGW attachment ID>
If the TGW attachment has been deleted, contact VMware Support and share the SDDC ID and TGW attachment ID. After VMware Support identifies the issue, manually delete any object left behind, if needed.
b) Check whether this TGW attachment exists in the AWS console.
c) Alternatively, log in to the NSX Manager and use the aws command to check the state of the TGW attachment:
- aws ec2 describe-transit-gateway-attachments --region <region> --transit-gateway-attachment-id <TGW attachment ID>
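The log scan in step 1c is easy to automate. A minimal sketch in Python, matching lines purely by the error-message substrings listed in the text (no NSX access required; feed it lines from vmc-app.log):

```python
# Sketch: scan vmc-app.log lines for the recurring TGW-task error
# messages listed in step 1c. Pure local string matching.
TGW_ERROR_MARKERS = (
    "Failed to get TGW route table for",
    "Failed to get TGW routes for attachment",
    "Failed to get TGW attachment VPC ID for",
    "Failed to get TGW attachment resource ID for",
    "Failed to get TGW attachments for TGW",
    "Failed to get local TGW attachment",
    "Failed to find correct TgwAttachment state in AWS",
    "is not associated with any route table",
    "No local TGW SDDC attachment found for",
)

def tgw_task_errors(log_lines):
    """Return the log lines matching any known TGW failure message."""
    return [line for line in log_lines
            if any(marker in line for marker in TGW_ERROR_MARKERS)]
```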

4.1.0
Traffic Group Prefix List Deletion Failure High manager

Failure in deletion of Traffic Group Prefix list.
When event detected: "When prefix list mode for the connected VPC is activated, the customer can choose to program all connected VPC prefix lists (including the multi-edge prefix list) into non-main route tables of their choice. By policy, the SDDC software does not add or delete prefix lists in non-main routing tables. If a customer has programmed a traffic group prefix list into connected VPC non-main routing tables and later deletes that traffic group, the SDDC software cannot delete the TG prefix list because it is programmed into a non-main routing table, possibly several at once. An alarm is therefore raised to notify the SRE team, who ask the customer to delete the prefix list from the non-main routing tables. Once it is removed, the SDDC software can delete the prefix list. "
When event resolved: "Traffic Group Prefix list deletion failure is remediated. "

If this alarm is not auto-resolved within 10 minutes, then execute the following steps:
1. Navigate to the 'Connected VPC' tab under 'Networking' in the NSX Manager UI. Note down the prefix list ID and route table IDs for all prefix lists in 'Waiting for Deletion' state.
2. Ask the customer to delete the prefix lists identified in step 1 from all route tables identified in step 1.
3. Once the customer removes all prefix list IDs from all connected VPC route tables, refresh the 'Connected VPC' UI page. The entry should no longer be visible. If the entry is removed from the UI, remediation is complete and no further action is required.

4.1.2
Prefix List Capacity Issue Failure High manager

Prefix list capacity issue failure.
When event detected: "VMC App cannot program the AWS managed prefix list with a route/prefix because the number of entries in the AWS managed prefix list has reached the size of the prefix list. "
When event resolved: "Capacity issue with Prefix list is remediated. "

1. Run the API GET 'cloud-service/api/v1/infra/sddc/provider-resource-info?resource_type=managed_prefix_list' to get a list of all prefix lists from the SDDC.
a) Check the 'state' and 'status_message' of each prefix list in the API output.
b) If the state of any prefix list is 'modify-failed' and the status message contains the string 'The following VPC Route Table resources do not have sufficient capacity', the prefix list has run into a resizing failure. The 'status_message' specifies which route table IDs have to be increased in size.
c) If the API output contains an 'issues' field, it specifies which routes are missing from the managed prefix list. Calculate the number of missing routes from the 'issues' field.
d) File an AWS ticket to increase the size of the route tables identified in (b) by at least the minimum size identified in (c).
e) After AWS increases the route table limit, wait at least 1 hour and then invoke the API GET 'cloud-service/api/v1/infra/sddc/provider-resource-info?resource_type=managed_prefix_list' again. Make sure no prefix list has a 'state' of 'modify-failed'.
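Steps (a) and (b) can be scripted against the provider-resource-info output described above. A minimal sketch; the 'results' wrapper key is an assumption about the response envelope, while the 'state' and 'status_message' fields come from the text:

```python
# Sketch: pick out prefix lists stuck in 'modify-failed' state due to
# route-table capacity, per steps a) and b) above. The "results" wrapper
# key is a hypothetical assumption about the API response envelope.
def failed_prefix_lists(api_output: dict) -> list:
    """Return prefix-list entries that hit the capacity resizing failure."""
    capacity_marker = ("The following VPC Route Table resources do not have "
                       "sufficient capacity")
    return [pl for pl in api_output.get("results", [])
            if pl.get("state") == "modify-failed"
            and capacity_marker in pl.get("status_message", "")]
```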

4.1.2
Prefix List Resource Share Customer Failure Medium manager

Failure with prefix list resource share.
When event detected: "This issue occurs when the customer accidentally or intentionally clicks 'Leave resource share' in the customer account. "
When event resolved: "Failure with prefix list resource share is remediated. "

If this alarm is not auto-resolved within 10 minutes, then execute the following steps:
1. Get the current resource share ARN shared with the customer connected VPC account:
a) curl -ik -X GET https://<nsx-mgr-vip>/cloud-service/api/v1/infra/linked-vpcs
The resource-share-arn in the response is the resource share shared with the customer account. Make sure it is in ACTIVE state; there should not be a case where it is not active.
b) Run the following AWS command to get the current status of the resource share, using the resource-share-arn fetched in step a. The AWS CLI command can be run from the NSX Manager console or any other platform where AWS CLIs can be run:
- /usr/local/bin/aws ram get-resource-share-associations --association-type PRINCIPAL --resource-share-arns <resource-share-arn> --region <region>
where resource-share-arn is the value fetched in a) and region is the AWS region where the SDDC is deployed.
- If the resource share status is ASSOCIATED, nothing needs to be done.
- If the resource share status is DISASSOCIATED, remediate as follows:
- Inform the customer that 'Leave resource share' has been specified.
- If the customer does not want to use prefix list mode again, ask the customer to deactivate prefix list mode from the connected VPC UI.
- If the customer wants it remediated and to keep using prefix list mode, ask the customer to deactivate prefix list mode from the UI, and after that succeeds, activate prefix list mode again.

4.1.2
Resource Share Sanity Check Failure High manager

Failure in resource share check.
When event detected: "Customer activates prefix list mode for connected VPC from connected VPC page in networking tab of NSX manager UI. After this step, customer would be prompted to accept resource share if prefix list mode needs to be activated. If the customer does not want to activate prefix list mode, the customer could deactivate prefix list mode from connected VPC account. If the customer does not accept or reject the resource share after more than 24 hours, then this alarm would be raised. "
When event resolved: "Resource share check failure is remediated. "

If this alarm is not auto-resolved within 10 minutes, then execute the following steps:
1. Get the current resource share ARN shared with the customer connected VPC account:
a) curl -ik -X GET https://<nsx-mgr-vip>/cloud-service/api/v1/infra/linked-vpcs
The resource-share-arn in the response is the resource share shared with the customer account. Make sure it is in ACTIVE state; there should not be a case where it is not active.
b) Run the following AWS command to get the current status of the resource share, using the resource-share-arn fetched in step a. The AWS CLI command can be run from the NSX Manager console or any other platform where AWS CLIs can be run:
- export HTTPS_PROXY=http://<pop ip>:3128
- /usr/local/bin/aws ram get-resource-share-associations --association-type PRINCIPAL --resource-share-arns <resource-share-arn> --region <region>
c) If the customer wants to activate prefix list mode, ask the customer to accept the resource share.
d) If the customer wants to deactivate the resource share, ask the customer to reject it.
e) Once the customer performs either c or d, wait 5 to 10 minutes and then refresh the connected VPC UI page. The prefix list mode for the connected VPC should be in the activated or deactivated state based on the customer action.

4.1.2
TGW Get Attachment Failure High manager

Failure in fetching TGW attachment.
When event detected: "The background TGW routes update task failed to get TGW attachment related info. The possibility of hitting this alert is very low, barring a regression in the service or on the AWS side. This alarm identifies the issue before the customer notices any connectivity problem. "
When event resolved: "Failure with getting TGW attachment is remediated. "

1. Log in to the NSX Manager. There are three Manager nodes; find the leader node. After logging in to one node, run:
- su admin
- get cluster status verbose
Find the TGW leader node.
2. Check whether all AWS calls from the NSX Manager fail.
a) Log in to the NSX Manager leader node and run:
- export HTTPS_PROXY=http://<pop ip>:3128
- aws ec2 describe-instances --region <region>
b) If the aws command fails with an error, there might be a problem with the HTTP reverse proxy configuration on the pop, or an AWS service-side issue.
3. Check whether the TGW attachment still exists in AWS.
a) The TGW attachment ID can be found with the API 'GET cloud-service/api/v1/infra/associated-groups' (tgw_attachment_id -> TGW attachment ID).
b) Run the following CLI from the NSX Manager console:
- aws ec2 describe-transit-gateway-attachments --region <region> --transit-gateway-attachment-id <TGW attachment ID>
c) If the TGW attachment does not exist, an error is returned: 'An error occurred (InvalidTransitGatewayAttachmentID.NotFound) when calling the DescribeTransitGatewayAttachments operation: Transit Gateway Attachment tgw-attach-0db05afa627b82f08 was deleted or does not exist.'
d) If the TGW attachment has been deleted, contact the Skynet team at [email protected] (Srikanth Garimella) and share the SDDC ID and TGW attachment ID.

4.1.2
TGW Attachment Mismatch Failure High manager

Failure due to mismatch of TGW attachments.
When event detected: "The background TGW routes update task failed to get TGW attachment related info. The possibility of hitting this alert is very low, barring a regression in the service or on the AWS side. This alarm identifies the issue before the customer notices any connectivity problem. "
When event resolved: "Failure due to mismatch of TGW attachments is remediated. "

1. Log in to the NSX Manager. There are three Manager nodes; find the leader node. After logging in to one node, run:
- su admin
- get cluster status verbose
Find the TGW leader node.
2. Check whether all AWS calls from the NSX Manager fail.
a) Log in to the NSX Manager leader node and run:
- export HTTPS_PROXY=http://<pop ip>:3128
- aws ec2 describe-instances --region <region>
b) If the aws command fails with an error, there might be a problem with the HTTP reverse proxy configuration on the pop, or an AWS service-side issue.
3. Check whether the TGW attachment still exists in AWS.
a) The TGW attachment ID can be found with the API 'GET cloud-service/api/v1/infra/associated-groups' (tgw_attachment_id -> TGW attachment ID).
b) Run the following CLI from the NSX Manager console:
- aws ec2 describe-transit-gateway-attachments --region <region> --transit-gateway-attachment-id <TGW attachment ID>
c) If the TGW attachment does not exist, an error is returned: 'An error occurred (InvalidTransitGatewayAttachmentID.NotFound) when calling the DescribeTransitGatewayAttachments operation: Transit Gateway Attachment tgw-attach-0db05afa627b82f08 was deleted or does not exist.'
d) If the TGW attachment has been deleted, contact the Skynet team at [email protected] (Srikanth Garimella) and share the SDDC ID and TGW attachment ID.

4.1.2
TGW Route Table Max Failure High manager

TGW Route table max entries failure.
When event detected: "TGW route capacity limit is reached which results in failure. "
When event resolved: "TGW Route table max entries failure is remediated. "

1. Log in to the NSX Manager UI, open the 'Networking & Security' tab, and then navigate to the 'Transit Connect' tab.
2. Check whether the 'Learned Routes' page contains a route with 'Failure' status, and whether the failure is due to reaching route table limits.
3. Log in to the ESX host and run 'vmc-cli -s'. Note down the on-prem table ID and the egress route table ID.
4. If the failure is due to route table limits, do the following: a) Check the number of route failures due to route table limits in the 'Learned Routes' tab. This is the minimum value to which the route table limit should be increased; call this the threshold. b) Create an AWS support request to increase the VPC route table limit to at least the threshold value; increasing the limit beyond the threshold is recommended. The VPC route table can only be increased up to 1,000 entries (an AWS hard limit). c) Once the AWS route table limits are increased, check the 'Learned Routes' tab to confirm that all failures related to route table limits are eliminated.
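The sizing arithmetic in step 4 can be sketched as a small calculation: the new VPC route table limit must cover the current entries plus the failed routes, ideally with some headroom, and can never exceed AWS's 1,000-entry hard limit. The function name, the dict-free interface, and the 20% headroom figure are illustrative assumptions; combining the failure count with the current limit is an interpretation of the step's intent.

```python
AWS_HARD_LIMIT = 1000  # VPC route table entries (AWS hard limit, per step 4b)

def recommended_route_table_limit(current_limit: int, failed_routes: int,
                                  headroom: float = 0.2) -> int:
    """Suggest a new VPC route table limit for the AWS support request:
    current entries plus the failures seen in 'Learned Routes', padded by
    headroom, capped at the AWS hard limit of 1000 entries."""
    needed = current_limit + failed_routes
    if needed > AWS_HARD_LIMIT:
        raise ValueError(
            f"Need {needed} entries, above the AWS hard limit of {AWS_HARD_LIMIT}; "
            "reduce the number of advertised routes instead.")
    padded = int(needed * (1 + headroom))
    return min(padded, AWS_HARD_LIMIT)
```

If the required entry count already exceeds 1,000, no limit increase can help; route aggregation or advertising fewer prefixes is the only remedy.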

4.1.2
TGW Route Update Failure High manager

TGW Route update fails due to wrong TGW attachment size.
When event detected: "Transit Connect related configuration is not fully and correctly realized. Possible issues include: -The background TGW routes update task ran into an issue. -A stale associated-groups object caused a failure in the TGW route task. SDDC route advertisement and learning will be stuck, causing connectivity issues. "
When event resolved: "TGW Route update failure is remediated. "

1. Run the API 'GET /cloud-service/api/v1/infra/associated-groups'. The API should return 0 or 1 associated groups. a) If the API returns more than one associated group, do the following: -Log in to the VMC UI and navigate to the 'SDDC Groups' tab. -Find the correct SDDC group by checking the members of each group for the SDDC in question. -Remove stale associations by running the API 'DELETE /cloud-service/api/v1/infra/associated-groups/<association-id>'.
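The pruning logic described above can be sketched as follows: given the associated-groups API response and the ID of the SDDC group that actually contains this SDDC (found via the VMC UI), every other association is stale and is a candidate for the DELETE call. The response shape assumed here (a `results` list of objects with `id` and `group_id` fields) is an illustration; verify it against the actual API output before scripting anything.

```python
def stale_association_ids(api_response: dict, correct_group_id: str) -> list:
    """Return association IDs whose group is not the SDDC group that actually
    contains this SDDC; each is a candidate for
    DELETE /cloud-service/api/v1/infra/associated-groups/<association-id>."""
    results = api_response.get("results", [])
    if len(results) <= 1:
        return []  # 0 or 1 associations is the healthy state
    return [a["id"] for a in results if a.get("group_id") != correct_group_id]
```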

4.1.2
TGW Tagging Mismatch Failure High manager

Failure due to mismatch of TGW tags.
When event detected: "For prefix list managed routes, source and region info need to be retrieved from prefix list tags. If tagging mismatch happens, check the tagging format below and create a Jira ticket to Skynet with detailed error messages and missing tags. "
When event resolved: "Failure due to mismatch of TGW tags is remediated. "

If this alarm is not auto-resolved within 10 minutes, then execute the following steps:
1. Find the number of TGW prefix lists in VMC-APP as follows: a) Run the API 'GET cloud-service/api/v1/infra/associated-groups' and note down all the TGW prefix list IDs listed under 'aws_prefix_list'. b) For each prefix list obtained above, check whether the managed prefix list contains all the required tags. Each attachment type requires a different set of tags, as listed below:
SDDC: SDDC_ID: <SDDC id>, RESOURCE_ID: <SDDC id>, RESOURCE_TYPE: SDDC, GROUP_ID: <group id>, RESOURCE_REGION: <SDDC region>
VPC: RESOURCE_ID: <vpc id>, RESOURCE_TYPE: VPC, GROUP_ID: <group id>, RESOURCE_REGION: <VPC attachment region>
CUSTOMER_TGW: RESOURCE_ID: <TGW id>, RESOURCE_TYPE: CUSTOMER_TGW, GROUP_ID: <group id>, RESOURCE_REGION: <TGW attachment region>
c) If any tag is missing from any prefix list, contact [email protected] and Srikanth Garimella, and provide the SDDC group ID, the prefix list ID, and its region.
2. After the Skynet team has identified and confirmed that the issue is valid, manually add the correct tags using the AWS console.
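The per-type tag check in step 1b reduces to a lookup table plus a set difference. The required tag names come straight from the list above; the `missing_tags` function name and the dict-based interface are illustrative assumptions.

```python
# Required tag keys per attachment type, as listed in step 1b above.
REQUIRED_TAGS = {
    "SDDC": {"SDDC_ID", "RESOURCE_ID", "RESOURCE_TYPE",
             "GROUP_ID", "RESOURCE_REGION"},
    "VPC": {"RESOURCE_ID", "RESOURCE_TYPE", "GROUP_ID", "RESOURCE_REGION"},
    "CUSTOMER_TGW": {"RESOURCE_ID", "RESOURCE_TYPE",
                     "GROUP_ID", "RESOURCE_REGION"},
}

def missing_tags(attachment_type: str, tags: dict) -> set:
    """Return the required tag keys absent from an AWS managed prefix list's
    tags; an empty set means the prefix list is tagged correctly."""
    required = REQUIRED_TAGS[attachment_type]
    return required - set(tags)
```

Any non-empty result is what should be reported to the Skynet team along with the SDDC group ID, prefix list ID, and region.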

4.1.2

VPN Events

Event Name Severity Node Type Alert Message Recommended Action Release Introduced
IPsec Service Down Medium edge, autonomous-edge, public-cloud-gateway

IPsec service is down.
When event detected: "The IPsec service {entity_id} is down. Reason: {service_down_reason}. "
When event resolved: "The IPsec service {entity_id} is up. "

1. Deactivate and reactivate the IPsec service from the NSX Manager UI.
2. If the issue still persists, check syslog for error logs and contact VMware support.

3.2.0
IPsec Policy Based Session Down Medium edge, autonomous-edge, public-cloud-gateway

Policy based IPsec VPN session is down.
When event detected: "The policy based IPsec VPN session {entity_id} is down. Reason: {session_down_reason}. "
When event resolved: "The policy based IPsec VPN session {entity_id} is up. "

Check IPsec VPN session configuration and resolve errors based on the session down reason.

3.0.0
IPsec Route Based Session Down Medium edge, autonomous-edge, public-cloud-gateway

Route based IPsec VPN session is down.
When event detected: "The route based IPsec VPN session {entity_id} is down. Reason: {session_down_reason}. "
When event resolved: "The route based IPsec VPN session {entity_id} is up. "

Check IPsec VPN session configuration and resolve errors based on the session down reason.

3.0.0
IPsec Policy Based Tunnel Down Medium edge, autonomous-edge, public-cloud-gateway

Policy based IPsec VPN tunnels are down.
When event detected: "One or more policy based IPsec VPN tunnels in session {entity_id} are down. "
When event resolved: "All policy based IPsec VPN tunnels in session {entity_id} are up. "

Check IPsec VPN session configuration and resolve errors based on the tunnel down reason.

3.0.0
IPsec Route Based Tunnel Down Medium edge, autonomous-edge, public-cloud-gateway

Route based IPsec VPN tunnel is down.
When event detected: "The route based IPsec VPN tunnel in session {entity_id} is down. Reason: {tunnel_down_reason}. "
When event resolved: "The route based IPsec VPN tunnel in session {entity_id} is up. "

Check IPsec VPN session configuration and resolve errors based on the tunnel down reason.

3.0.0
L2Vpn Session Down Medium edge, autonomous-edge, public-cloud-gateway

L2VPN session is down.
When event detected: "The L2VPN session {entity_id} is down. "
When event resolved: "The L2VPN session {entity_id} is up. "

Check L2VPN session status for session down reason and resolve errors based on the reason.

3.0.0

