Management Pack for NSX for vSphere Alert Definitions

Controller Alert Definitions

The following alert definitions are defined on the Controller objects.

Table 1. Controller Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
Controller is down	Fault: nsx.event.controller.status.unknown	Verify that the NSX Controller virtual machine is powered on. Verify that the NSX Controller virtual machine is connected to the network.	Health	Critical
Controller resource usage is high	Metric: CPU\|Usage (%) above 80% Metric: Memory\|Usage (%) above 80% Metric: Disk\|Usage (%) above 80% Metric: Network\|Usage (%) above 80%	Investigate CPU and memory utilization on the NSX Controller to determine if there is an issue. Add additional NSX Controllers to the Cluster.	Health	Warning
No syslog server is configured	Property: Configuration\|Syslog server = none	Configure a syslog server on the NSX Controller.	Risk	Immediate
The Controller VM has been removed from the vCenter	Fault: nsx.event.controller.vm.status.deleted	Redeploy the controller.	Health	Critical

Controller Cluster Alert Definitions

The following alert definitions are defined on the Controller Cluster objects.

Table 2. Controller Cluster Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
No cluster majority can be established	Metric: Status\|Active (%) <50	Verify that all NSX Controller virtual machines are powered on. Verify that all NSX Controller virtual machines are connected to the network.	Health	Critical
Less than three controllers are active	Metric: Status\|Active Controllers is less than 3	Verify that all NSX Controller VMs are powered on. Verify that all NSX Controller VMs are connected to the network.	Risk	Immediate
Less than three controllers are deployed	Metric: Status\|Controllers is less than 3	Verify that all NSX Controller VMs are powered on. Deploy at least 3 controllers in the cluster.	Risk	Immediate
All Controller VMs are deployed on the same host	Fault: nsx.event.controller.cluster.all.on.same.host	Move the NSX Controller nodes to different hosts.	Risk	Warning
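
The three metric-based rows in Table 2 can be read as independent threshold checks. The following is a minimal sketch of that reading, not the management pack's implementation: the class and function names are hypothetical, and the sketch assumes each alert is evaluated on its own so that more than one can be active at once.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ControllerClusterSample:
    """Hypothetical container for the three metrics named in Table 2."""
    active_percent: float       # Status|Active (%)
    active_controllers: int     # Status|Active Controllers
    deployed_controllers: int   # Status|Controllers


def triggered_alerts(sample: ControllerClusterSample) -> List[Tuple[str, str]]:
    """Return (alert name, severity) pairs whose Table 2 symptom is met."""
    alerts = []
    if sample.active_percent < 50:
        alerts.append(("No cluster majority can be established", "Critical"))
    if sample.active_controllers < 3:
        alerts.append(("Less than three controllers are active", "Immediate"))
    if sample.deployed_controllers < 3:
        alerts.append(("Less than three controllers are deployed", "Immediate"))
    return alerts


# Example: three controllers deployed, only one of them active.
example = triggered_alerts(ControllerClusterSample(33.3, 1, 3))
```

In this example the first two alerts are returned, because one active controller out of three is below both the 50% majority threshold and the three-controller minimum.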

Manager Alert Definitions

The following alert definitions are defined on the Manager objects.

Table 3. Manager Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
Manager resource usage is high	Metric: CPU\|Usage (%) above 80% Metric: Memory\|Usage (%) above 80% Metric: Disk\|Usage (%) above 80%	Investigate CPU and memory utilization on the NSX Manager to determine if there is an issue.	Health	Warning
vCenter inventory connection has been lost	Property: Status\|vCenter Connection Status = Disconnected	Verify that there is network connectivity between the NSX Manager and vCenter. Verify that the vCenter configuration is correct in the NSX Manager.	Health	Critical
No backup of the environment has been recorded	Property: Status\|Last Backup Time = None	Start a backup of the environment from the NSX Manager.	Risk	Immediate
Scheduled backups are not enabled	Property: Configuration\|Backup Scheduled = false	Configure scheduled backups of the environment from the NSX Manager.	Risk	Immediate
Manager API calls are failing	Fault: nsx.event.manager.api.non.responsive	Verify that the NSX Manager virtual machine is powered on. Verify that the NSX Manager virtual machine is connected to the network. Verify that all services are running on the NSX Manager.	Health	Critical
VXLAN segment range has been exhausted	Metric: VXLAN\|Usage (%) = 100	Add additional logical segments to the Transport Zone.	Risk	Warning
NSX Manager is violating NSX Hardening Guide	Property: config\|backup\|transfer_protocol != SFTP Property: firewall\|layer3\|default_rule = allow Property: config\|backup\|audit_logs_excluded = true Property: network\|ipv6_address != none Property: config\|ntp_servers = none Property: config\|syslog_server = none Property: config\|backup\|sys_events_excluded = true Fault: nsx.event.manager.pwd.not.changed Metric: service:ssh service\|enabled = true Property: network\|ipv6_dns_servers != none Property: controller\|ipSecEnabled != true	Fix the violations against NSX Hardening Guide Rules as per the recommendations in the NSX Hardening Guide.	Risk	Warning
The RabbitMQ service is not running	Metric: Service:rabbitmq\|Enabled is true Metric: Service:rabbitmq\|Status is not RUNNING	Log on to the CLI for the NSX Manager and restart any services that are down. Restart the NSX Manager.	Health	Immediate
The vPostgres service is not running	Metric: Service:vpostgres\|Enabled is true Metric: Service:vpostgres\|Status is not RUNNING	Log on to the CLI for the NSX Manager and restart any services that are down. Restart the NSX Manager.	Health	Immediate
The Management service is not running	Metric: Service:nsx manager\|Enabled is true Metric: Service:nsx manager\|Status is not RUNNING	Log on to the CLI for the NSX Manager and restart any services that are down. Restart the NSX Manager.	Health	Immediate
The Replicator service is not running	Metric: Service:nsx replicator\|Enabled is true Metric: Service:nsx replicator\|Status is not RUNNING Metric: Configuration\|Role is PRIMARY	Log on to the CLI for the NSX Manager and restart any services that are down. Restart the NSX Manager.	Health	Immediate
NSX Manager is down	API calls are failing. Either the VM that the manager is running on is powered off, or it is not responding to a ping.	Verify that the NSX Manager VM is powered on. Verify that the NSX Manager VM is connected to the network.	Health	Critical
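
The four "service is not running" rows in Table 3 each list an Enabled metric and a Status metric, and the Replicator row adds a Role condition. The sketch below shows one way to read those rows, assuming that all listed symptoms must hold together (the table does not state whether symptoms are combined with AND or OR). The function name and parameters are hypothetical.

```python
from typing import Optional


def service_alert_fires(enabled: bool, status: str,
                        role: Optional[str] = None,
                        primary_only: bool = False) -> bool:
    """Return True when every listed symptom holds (assumed AND semantics).

    enabled      -- Service:<name>|Enabled
    status       -- Service:<name>|Status
    role         -- Configuration|Role (used by the Replicator row only)
    primary_only -- set for the Replicator row, which also requires PRIMARY
    """
    if not enabled or status == "RUNNING":
        return False
    if primary_only and role != "PRIMARY":
        return False
    return True


# RabbitMQ row: enabled but stopped, so the alert would fire.
rabbitmq_alert = service_alert_fires(True, "STOPPED")
# Replicator row on a non-primary manager: the alert would not fire.
replicator_alert = service_alert_fires(True, "STOPPED", role="SECONDARY",
                                       primary_only=True)
```

The status value "STOPPED" and the role value "SECONDARY" are placeholders for illustration; the table only names RUNNING and PRIMARY.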

NSX Edge Alert Definitions

The following alert definitions are defined on the NSX Edge objects.

Table 4. Edge Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
Edge resource usage is high	Metric: CPU\|Usage (%) above 80% Metric: Memory\|Usage (%) above 80% Metric: Disk\|Usage Rate (%) above 80% Metric: Network\|Usage (%) above 80%	Investigate CPU and memory utilization on the NSX Edge to determine if there is an issue. Scale up the NSX Edge to a larger size.	Health	Warning
Edge is not highly available	Property: Status\|HA Status = Down Property: Status\|HA Status = Unstable Property: Status\|HA Status = Misconfiguration	Enable high availability on the NSX Edge. Verify that the virtual machines are running on different hosts. Verify that both virtual machines are running.	Risk	Immediate
High availability is not enabled on Edge	Property: Status\|HA Status = Off	Enable High Availability on the NSX Edge.	Risk	Immediate
One or more Edge interfaces are down	Metric: Interface\|Status = down	Check the admin status of the interfaces on the NSX Edge.	Health	Critical
One or more Edges in the ECMP Cluster are down	Metric: Edge\|Active (%) > 50 Metric: Edge\|Active (%) < 100	Verify that the VM for each NSX Edge is running. Verify that the VM for each NSX Edge is connected to the network.	Health	Warning
All Edges in the ECMP Cluster are down	Metric: Edge\|Active (%) = 0	Verify that the VM for each NSX Edge is running. Verify that the VM for each NSX Edge is connected to the network.	Health	Critical
Edge VM is not responding to health check	Fault: nsx.event.edge.vm.not.responding.to.health.check	Restart the virtual machine for this NSX Edge.	Health	Critical
Edge is not deployed	Fault: nsx.event.gatewayservice.status.unknown Metric: Status\|Status = undeployed	Check for errors that occurred during deployment of the NSX Edge. Redeploy the NSX Edge.	Health	Critical
All of the Edge VMs are powered off	Metric: Status\|Running = 0	Power on at least one of the virtual machines.	Health	Critical
Edge API calls are failing	Fault: nsx.event.edge.gateway.api.failure	Investigate the log files for more information. Redeploy the NSX Edge.	Health	Critical
Edge VM is not responding to health check	Fault: nsx.event.edge.vm.not.responding.to.heath.check	Restart the virtual machine for this NSX Edge.	Health	Critical
A firewall, NAT, load balancer, or VPN service is running on this NSX Edge with ECMP enabled	Child (Firewall Edge Service) Metric: Status\|Service Status is not DOWN Child (Load Balancer Edge Service) Metric: Status\|Service Status is not DOWN Child (IPSec VPN Edge Service) Metric: Status\|Service Status is not DOWN Child (L2 VPN Edge Service) Metric: Status\|Service Status is not DOWN Child (SSL VPN Edge Service) Metric: Status\|Service Status is not DOWN Child (NAT Edge Service) Metric: Status\|Service Status is not DOWN	Disable all Stateful Services (Firewall, NAT, Load Balancer, and VPN) on this NSX Edge.	Health	Warning
The MTU of one or more interfaces does not match the next hop router	Metric: Interface\|MTU Mismatch is true	Configure the same MTU on all routes.	Health	Warning
One or more OSPF neighbors are not in the full state	The alert condition will be triggered using the OSPF Neighbor Down event from Log Insight. Metric: route\|ospf\|adjacency\|status = down	Check the status of all OSPF neighbors. Manually clear this alert once the issue is resolved.	Health	Immediate
One or more BGP neighbors are down	The alert condition will be triggered using the following event from Log Insight: VMW_NSX_BGP Neighbor Down	Check the status of all BGP neighbors. Manually clear this alert once the issue is resolved.	Health	Immediate
Network utilization on the high availability interface is high	The alert condition will be triggered using the following event from Log Insight: Metric: HA Interface\|Total Traffic(KBps) > 10000	Ensure that there is no virtual traffic at the high availability interface.	Risk	Immediate
NSX Edge Services Gateway is violating the NSX Hardening guide	Property: service\|ssh\|status = RUNNING	Fix the violations against NSX Hardening Guide Rules as per the recommendations in the NSX Hardening Guide.	Risk	Warning

Logical Router Alert Definitions

The following alert definitions are defined on the Logical Router objects.

Table 5. Logical Router Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
Interface to OSPF area mapping configuration missing or incomplete.	Fault: nsx.event.logical.router.no.neighbors.relations	Verify the dynamic routing protocol configuration on the NSX Logical Router and physical routers.	Health	Immediate
The backing port group has been removed from vCenter	Fault: nsx.event.logical.switch.port.group.removed	Redeploy the logical switch.	Health	Critical
One or more Logical Router interfaces are down	Metric: Interface\|Status = down	Check the admin status of the interfaces on the NSX Logical Router.	Health	Critical
Logical Router is not deployed	Fault: nsx.event.logical.router.status.unknown	Check for errors that occurred during deployment of the NSX Logical Router. Redeploy the NSX Logical Router.	Health	Critical
Logical Router does not have an uplink interface configured	Fault: nsx.event.\|router.no.connected.uplink.iface	Check if the router configuration is only for routing between internal networks or external access is required. If external access is required then configure an uplink interface on the NSX Logical Router.	Health	Warning
Number of learned routes is below normal	Metric: route\|ospf\|used is below dynamic threshold Metric: route\|bgp\|used is below dynamic threshold Metric: route\|connected\|used is below dynamic threshold	Run the "Check routing configuration" action and verify that the current routing table is correct.	Risk	Warning
One or more OSPF areas are using insecure authentication	Fault: OSPF Area\|Authentication Type!=MD5	Configure the Logical Router to use MD5 authentication for all OSPF areas.	Risk	Immediate
Logical Router is deployed to the same host as one or more ECMP Edges	Fault: nsx.event.lrouter.deployed.on.ecmp.edge.host	Move the virtual machines for this Logical Router to a different host.	Risk	Immediate
The MTU of one or more interfaces does not match the next hop router	Metric: Interface\|MTU Mismatch is true	Configure the same MTU on all routes.	Health	Warning
One or more OSPF neighbors are down	The alert condition will be triggered using the OSPF Neighbor Down event from Log Insight. Metric: route\|ospf\|adjacency\|status = down	Check the status of all OSPF neighbors. Manually clear this alert once the issue is resolved.	Health	Immediate
One or more BGP neighbors are down	The alert condition will be triggered using the following event from Log Insight: VMW_NSX_BGP Neighbor Down	Check the status of all BGP neighbors. Manually clear this alert once the issue is resolved.	Health	Immediate
NSX Logical Router is violating NSX Hardening guide	Metric: area\|auth_type != md5 Property: config\|ospf\|enabled = true Property: service\|ssh\|status = RUNNING	Fix the violations against NSX Hardening Guide Rules as per the recommendations in the NSX Hardening Guide.	Risk	Warning

Host System Alert Definitions

The following alert definitions are defined on the Host System objects.

Table 6. Host System Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
Host's NSX messaging infrastructure is reporting an issue	Fault: nsx.event.hostssystem.message.infra.status.down	Verify that the host is connected to the management network. Verify that the NSX message queue service (vsfwd) is running on the host. Verify that the RabbitMQ service is running on the NSX Manager.	Health	Immediate
Distributed Firewall CPU usage is high	Fault: nsx.event.firewall.cpu.above.threshold	Investigate the network utilization of all virtual machines on this host and migrate those with high traffic in order to reduce the load on the firewall.	Health	Immediate
Distributed Firewall memory usage is high	Fault: nsx.event.firewall.mem.above.threshold	Investigate the network utilization of all virtual machines on this host and migrate those with high traffic in order to reduce the load on the firewall.	Health	Immediate
Distributed Firewall connection rate is high	Fault: nsx.event.firewall.conn.rate.above.threshold	Investigate the network utilization of all virtual machines on this host and migrate those with high traffic in order to reduce the load on the firewall.	Health	Immediate
A duplicate IP address was found for one or more physical NICs on this host	Fault: nsx.event.hostsystem.ip.conflict.exists	Reconfigure the physical NIC with an IP address that is unique on the network.	Health	Warning
The MTU on one or more physical NICs is less than 1600	Fault: nsx.event.hostsystem.mtu.unexpected	Set the MTU on each physical NIC to at least 1600.	Health	Warning
The VTEP VMK is configured with a static IP address that is not known to the NSX Manager	Fault: nsx.event.vtep.vnic.ip.not.in.pool	Configure the VTEP VMK with a static IP address from the VTEP IP pool on the NSX Manager.	Health	Warning
The network configuration of the VTEP VMK does not match the VXLAN configuration in NSX	Fault: nsx.event.vtep.vnic.misconfigured	Modify the IP configuration of the VTEP VMK to match the VXLAN configuration in NSX Manager.	Health	Warning
There is a communication issue between the host and the NSX Manager which may cause network configuration to become out of sync	Fault: nsx.event.firewall.agent.connection.down Fault: nsx.event.controlplane.agent.connection.down "Manager - Host Communication Errors" event from Log Insight	Check the network connection between the host and the manager.	Health	Warning
There is a communication issue between the host and the NSX Controller, which may cause network configuration to become out of sync	Fault: nsx.event.host.controller.connection.down	Check the network connection between the host and the controller.	Health	Warning
Failed to create VXLAN interface	The alert condition will be triggered using any of the following events from Log Insight: Failed to create VTEP interface VXLAN tcp/ip stack not created	Investigate the status of VXLAN installation in NSX. Manually clear this alert once the issue is resolved.	Health	Critical
An error occurred while loading VXLAN module	The alert condition will be triggered using the following events from Log Insight: Failed to create VTEP interface VXLAN tcp/ip stack not created	Investigate the status of VXLAN installation in NSX. Manually clear this alert once the issue is resolved.	Health	Critical
Lost connection to NSX Controller	The alert condition will be triggered using the "VXLAN dataplane lost connection to controller" event from Log Insight.	Check the network connectivity between the host and the NSX Controller. Verify the network control plane agent on the host is running. Manually clear this alert once the issue is resolved.	Health	Critical
Distributed routing configuration is out of sync	The alert condition will be triggered using the following events from Log Insight: Failed to create control plane socket Failed to create/delete a routing related object	Check the network connectivity between the Logical Router and the host. Reboot the host. Manually clear this alert once the issue is resolved.	Health	Critical
Distributed firewall error occurred	The alert condition will be triggered using the following events from Log Insight: Firewall critical errors Firewall Service Profile errors Filter Config errors Dataplane incompatible with ESX version	Verify that the firewall agent and module on the host are installed and running. Manually clear this alert once the issue is resolved.	Health	Warning
Spoofguard error occurred	The alert condition will be triggered using the "Spoofguard errors by severity" event from Log Insight.	Verify that Spoofguard module is installed and running on the host. Manually clear this alert once the issue is resolved.	Risk	Warning
Logical network bridging configuration error occurred	The alert condition will be triggered using the following events from Log Insight: Failed to delete a bridge instance Attempt to add a bridge LIF to a non-existing Logical Router Bridge Delete Errors Bridge Create Errors Bridge Config Errors Attempt to a add bridge LIF failed	Check network connectivity between the host and the NSX Controller. Verify that the logical routing/bridging module on the host is installed and running. Manually clear this alert once the issue is resolved.	Health	Immediate
The VTEP VMK is configured with a subnet mask that is not known to NSX Manager	The alert condition will be triggered using a fault of score 25 that is raised on the host when this condition is detected.	Configure the VTEP VMK with the same subnet mask as the VTEP IP pool on the NSX Manager.	Health	Warning
The VTEP VMK is configured with an MTU that is not known to NSX Manager	The alert condition will be triggered using a fault of score 25 that is raised on the host when this condition is detected.	Configure the VTEP VMK with the same MTU as what the host was prepared with.	Health	Warning
The VTEP VMK on the host has been deleted	The alert condition will be triggered using the following event from Log Insight: Fault: nsx.event.vtep.vmk.on.host.missing	Prepare the host for VXLAN traffic in order to recreate the VTEP.	Health	Critical

Virtual Machine Alert Definitions

The following alert definitions are defined on the Virtual Machine objects.

Table 7. Virtual Machine Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
Virtual machine IP address is not in the same subnet as the logical router	Fault: nsx.event.vm.vnic.ip.not.in.lrouter.subnet	Change the IP address of the virtual NIC so that it is in the same subnet as the NSX Logical Router.	Health	Warning
Virtual machine default gateway does not match the Logical Router	Fault: nsx.event.vm.gateway.no.route.to.lrouter	Change the gateway address of the virtual NIC to match the IP address of the NSX Logical Router.	Health	Warning
NSX Edge VM is in a bad state	Fault: nsx.event.edge.vm.state.status	Run a force sync on the NSX Edge.	Health	Critical
Edge VM is not responding to health check	Fault: nsx.event.edge.vm.not.responding.to.health.check	Restart the virtual machine for this NSX Edge.	Health	Critical

DNS Edge Service Alert Definitions

The following alert definitions are defined on the DNS Edge Service objects.

Table 8. DNS Edge Service Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
DNS Service is not running	Fault: nsx.event.dns.service.status.down	Restart the DNS service.	Health	Critical

DHCP Edge Service Alert Definitions

The following alert definitions are defined on the DHCP Edge Service objects.

Table 9. DHCP Edge Service Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
DHCP Service is not running	Fault: nsx.event.dhcp.service.status.down	Restart the DHCP service.	Health	Critical
One or more IP pools have reached capacity	Metric: IP Pool\|Usage (%) = 100	Add more IP addresses to the IP pool.	Risk	Warning
IP renewals are higher than normal	Metric: IP Pool\|IP Addresses Renewed (last interval) above dynamic threshold	Check the status of all virtual machines connected to the NSX DHCP service.	Health	Warning

IPSec VPN Edge Service Alert Definitions

The following alert definitions are defined on the IPSec VPN Edge Service objects.

Table 10. IPSec VPN Edge Service Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
IPSec VPN Service is not running.	Fault: nsx.event.ip.sec.service.status.down	Restart the IPSec VPN service.	Health	Critical
One or more IPSec channels are down.	Property: Channel\|Status = down	Check the status and configuration of all IPSec channels.	Health	Critical

L2 VPN Edge Service Alert Definitions

The following alert definitions are defined on the L2 VPN Edge Service objects.

Table 11. L2 VPN Edge Service Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
L2 VPN Service is not running	Fault: nsx.event.l2.vpn.service.status.down	Restart the L2 VPN service.	Health	Critical
One or more tunnels are down	Fault: nsx.event.l2vpn.tunnel.status.down	Check the status of the L2 VPN tunnel.	Health	Critical

Load Balancer Edge Service Alert Definitions

The following alert definitions are defined on the Load Balancer Edge Service objects.

Table 12. Load Balancer Edge Service Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
Load Balancer Service is not running.	Fault: nsx.event.lb.service.status.down	Restart the Load Balancer service.	Health	Critical
One or more Load Balancer pool members are down	Metric: Pool\|Member\|Status = DOWN Metric: Pool\|Active (%) = 0	Verify that all pool members are powered on. Verify that all pool members are connected to the network. Verify that the IP address of each pool member matches the NSX Load Balancer pool configuration.	Health	Warning Immediate
All members of a Load Balancer pool are down	Metric: Pool\|Active(%) = 0	Verify that all pool members are powered on. Verify that all pool members are connected to the network. Verify that the IP address of each pool member matches the NSX Load Balancer pool configuration.	Health	Immediate
One or more Virtual Servers are down	Metric: Virtual Server\|Active(%) = 0	Verify that all pool members are powered on. Verify that all pool members are connected to the network. Verify that the IP address of each pool member matches the NSX Load Balancer pool configuration.	Health	Critical

NAT Edge Service Alert Definitions

The following alert definitions are defined on the NAT Edge Service objects.

Table 13. NAT Edge Service Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
One or more NAT rules have no destination	Fault: nsx.event.nat.rule.ip.with.no.corresponding.vm	Configure virtual machine IP addresses for all configured NAT rules. Remove any NAT rules that do not correspond to a virtual machine on the logical network.	Health	Warning

Routing Edge Service Alert Definitions

The following alert definitions are defined on the Routing Edge Service objects.

Table 14. Routing Edge Service Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
Interface to OSPF area mapping configuration missing or incomplete	Fault: nsx.event.edge.gateway.no.neighbors.relations	Verify the dynamic routing protocol configuration on the NSX Edge and physical routers.	Health	Immediate
One or more interfaces do not have an OSPF area to interface mapping	Property: Configuration\|OSPF\|Enabled is true Parent (NSX Edge) Property: Configuration\|OSPF\|Enabled is true Parent (NSX Edge) Metric: Interface\|OSPF\|Area Mapping is incorrect	Configure OSPF area to interface mappings on all interfaces that are connected to OSPF routers.	Health	Immediate
Number of learned routes is below normal	Metric: route\|ospf\|used is below dynamic threshold Metric: route\|bgp\|used is below dynamic threshold Metric: route\|connected\|used is below dynamic threshold	Run the "Check routing configuration" action and verify that the current routing table is correct.	Risk	Warning
NSX Routing Edge Service is violating NSX Hardening guide	Metric: config\|ospf\|area\|auth_type != md5 Property: config\|ospf\|enabled = true	Fix the violations against NSX Hardening Guide Rules as per the recommendations in the NSX Hardening Guide.	Risk	Warning

SSL VPN Edge Service Alert Definitions

The following alert definitions are defined on the SSL VPN Edge Service objects.

Table 15. SSL VPN Edge Service Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
SSL VPN Service is not running	Fault: nsx.event.sslvpn.service.status.down	Restart the SSL VPN service.	Health	Critical

ECMP Cluster Alert Definitions

The following alert definitions are defined on the ECMP Cluster objects.

Table 16. ECMP Cluster Alert Definitions
Alert Name	Symptom	Recommendations	Impact	Severity
One or more Edges in the ECMP Cluster are down	Metric: Edge\|Active (%) > 50 Metric: Edge\|Active (%) < 100	Verify that the virtual machine for each NSX Edge is running. Verify that the virtual machine for each NSX Edge is connected to the network.	Health	Warning
The majority of Edges in the ECMP Cluster are down	Metric: Edge\|Active (%) > 0 Metric: Edge\|Active (%) < 50	Verify that the virtual machine for each NSX Edge is running. Verify that the virtual machine for each NSX Edge is connected to the network.	Health	Immediate
All Edges in the ECMP Cluster are down	Metric: Edge\|Active (%) = 0	Verify that the virtual machine for each NSX Edge is running. Verify that the virtual machine for each NSX Edge is connected to the network.	Health	Critical
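
The three rows in Table 16 divide the Edge|Active (%) metric into severity bands. The sketch below maps a metric value to the severity listed in the table. It is an illustration only: the function name is hypothetical, and it assumes the two conditions in each row must both hold. It follows the table literally, so a value of exactly 50 matches none of the rows.

```python
from typing import Optional


def ecmp_cluster_severity(active_percent: float) -> Optional[str]:
    """Map Edge|Active (%) to the severity listed in Table 16.

    Returns None when no row matches (for example, 100, or exactly 50,
    which falls between the strict bounds given in the table).
    """
    if active_percent == 0:
        return "Critical"      # All Edges in the ECMP Cluster are down
    if 0 < active_percent < 50:
        return "Immediate"     # The majority of Edges are down
    if 50 < active_percent < 100:
        return "Warning"       # One or more Edges are down
    return None


# One Edge of four active, three of four active, and all four active.
severities = [ecmp_cluster_severity(p) for p in (25.0, 75.0, 100.0)]
```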