The Management Pack for NSX for vSphere monitors the objects in your environment and raises an alert when it detects a problem. Alerts are based on alert definitions, each a combination of symptoms and recommendations that identifies a problem area in your environment. An alert is generated when the collected data is compared to the alert definition and the alert symptoms evaluate as true.
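The evaluation described above can be sketched in a few lines of Python. This is only an illustration, not the management pack's implementation: the `Symptom` and `AlertDefinition` classes, the `evaluate` function, and the rule that every symptom must be true are assumptions made for the example. The metric key and the impact and severity values are taken from the Controller Cluster table.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Symptom:
    """Hypothetical symptom: a metric or property key plus a condition on its value."""
    key: str
    condition: Callable[[Any], bool]


@dataclass(frozen=True)
class AlertDefinition:
    """Hypothetical alert definition mirroring the table columns."""
    name: str
    symptoms: Tuple[Symptom, ...]
    recommendation: str
    impact: str
    severity: str


def evaluate(definition: AlertDefinition, collected: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Return an alert record if all symptoms are true for the collected data.

    Assumption: a missing key means the symptom cannot be evaluated, so no
    alert is raised, and a definition without symptoms never fires.
    """
    if not definition.symptoms:
        return None
    for symptom in definition.symptoms:
        if symptom.key not in collected:
            return None
        if not symptom.condition(collected[symptom.key]):
            return None
    return {
        "alert": definition.name,
        "impact": definition.impact,
        "severity": definition.severity,
        "recommendation": definition.recommendation,
    }


# Definition taken from the Controller Cluster table; the table lists no
# recommendation for this alert, so the field is left empty.
FEWER_THAN_THREE_ACTIVE = AlertDefinition(
    name="Less than three controllers are active",
    symptoms=(Symptom("Status|Active Controllers", lambda value: value < 3),),
    recommendation="",
    impact="Risk",
    severity="Immediate",
)

if __name__ == "__main__":
    sample = {"Status|Active Controllers": 2}  # made-up collected data
    print(evaluate(FEWER_THAN_THREE_ACTIVE, sample))
```

With three or more active controllers in the sample data, `evaluate` returns `None` and no alert is produced.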
Controller Alert Definitions
The following alert definitions are defined on the Controller objects.
Alert Name | Symptoms | Recommendations | Impact | Severity |
---|---|---|---|---|
Controller is down | | | Health | Critical |
Controller resource usage is high | | | Health | Warning |
No syslog server is configured | | Configure a syslog server on the NSX Controller. | Risk | Immediate |
The Controller VM has been removed from the vCenter | | | Health | Critical |
Controller Cluster Alert Definitions
The following alert definitions are defined on the Controller Cluster objects.
Alert Name | Symptoms | Recommendations | Impact | Severity |
---|---|---|---|---|
No cluster majority can be established | | | Health | Critical |
Less than three controllers are active | Metric: Status\|Active Controllers is less than 3 | | Risk | Immediate |
Less than three controllers are deployed | Metric: Status\|Controllers is less than 3 | | Risk | Immediate |
All Controller VMs are deployed on the same host | | | Risk | Warning |
Manager Alert Definitions
The following alert definitions are defined on the Manager objects.
Alert Name | Symptom | Recommendations | Impact | Severity |
---|---|---|---|---|
Manager resource usage is high | | Investigate CPU and memory utilization on the NSX Manager to determine if there is an issue. | Health | Warning |
vCenter inventory connection has been lost | Property: Status\|vCenter Connection Status = Disconnected | | Health | Critical |
No backup of the environment has been recorded | Property: Status\|Last Backup Time = None | Start a backup of the environment from the NSX Manager. | Risk | Immediate |
Scheduled backups are not enabled | Property: Configuration\|Backup Scheduled = false | Configure scheduled backups of the environment from the NSX Manager. | Risk | Immediate |
Manager API calls are failing | Fault: nsx.event.manager.api.non.responsive | | Health | Critical |
VXLAN segment range has been exhausted | Metric: VXLAN\|Usage (%) = 100 | Add additional logical segments to the Transport Zone. | Risk | Warning |
NSX Manager is violating NSX Hardening Guide | | Fix the violations against NSX Hardening Guide Rules as per the recommendations in the NSX Hardening Guide. | Risk | Warning |
The RabbitMQ service is not running | | | Health | Immediate |
The vPostgres service is not running | | | Health | Immediate |
The Management service is not running | | | Health | Immediate |
The Replicator service is not running | | | Health | Immediate |
NSX Manager is down | | | Health | Critical |
NSX Edge Alert Definitions
The following alert definitions are defined on the NSX Edge objects.
Alert Name | Symptom | Recommendations | Impact | Severity |
---|---|---|---|---|
Edge resource usage is high | | | Health | Warning |
Edge is not highly available | | | Risk | Immediate |
High availability is not enabled on Edge | Property: Status\|HA Status = Off | Enable High Availability on the NSX Edge. | Risk | Immediate |
One or more Edge interfaces are down | Metric: Interface\|Status = down | Check the admin status of the interfaces on the NSX Edge. | Health | Critical |
One or more Edges in the ECMP Cluster are down | | | Health | Warning |
All Edges in the ECMP Cluster are down | Metric: Edge\|Active (%) = 0 | | Health | Critical |
Edge VM is not responding to health check | Fault: nsx.event.edge.vm.not.responding.to.health.check | Restart the virtual machine for this NSX Edge. | Health | Critical |
Edge is not deployed | | | Health | Critical |
All of the Edge VMs are powered off | Metric: Status\|Running = 0 | Power on at least one of the virtual machines. | Health | Critical |
Edge API calls are failing | Fault: nsx.event.edge.gateway.api.failure | | Health | Critical |
Edge VM is not responding to health check | Fault: nsx.event.edge.vm.not.responding.to.heath.check | Restart the virtual machine for this NSX Edge. | Health | Critical |
A firewall, NAT, load balancer, or VPN service is running on this NSX Edge with ECMP enabled | | Disable all Stateful Services (Firewall, NAT, Load Balancer, and VPN) on this NSX Edge. | Health | Warning |
The MTU of one or more interfaces does not match the next hop router | Metric: Interface\|MTU Mismatch is true | Configure the same MTU on all routes. | Health | Warning |
One or more OSPF neighbors are not in the full state | | | Health | Immediate |
One or more BGP neighbors are down | The alert condition will be triggered using the following event from Log Insight: | | Health | Immediate |
Network utilization on the high availability interface is high | The alert condition will be triggered using the following event from Log Insight: | | Risk | Immediate |
NSX Edge Services Gateway is violating the NSX Hardening Guide | Property: service\|ssh\|status = RUNNING | Fix the violations against NSX Hardening Guide Rules as per the recommendations in the NSX Hardening Guide. | Risk | Warning |
Logical Router Alert Definitions
The following alert definitions are defined on the Logical Router objects.
Alert Name | Symptom | Recommendations | Impact | Severity |
---|---|---|---|---|
Interface to OSPF area mapping configuration missing or incomplete | Fault: nsx.event.logical.router.no.neighbors.relations | Verify the dynamic routing protocol configuration on the NSX Logical Router and physical routers. | Health | Immediate |
The backing port group has been removed from vCenter | Fault: nsx.event.logical.switch.port.group.removed | Redeploy the logical switch. | Health | Critical |
One or more Logical Router interfaces are down | Metric: Interface\|Status = down | Check the admin status of the interfaces on the NSX Logical Router. | Health | Critical |
Logical Router is not deployed | Fault: nsx.event.logical.router.status.unknown | | Health | Critical |
Logical Router does not have an uplink interface configured | Fault: nsx.event.lrouter.no.connected.uplink.iface | Check whether the router is configured only for routing between internal networks or whether external access is required. If external access is required, configure an uplink interface on the NSX Logical Router. | Health | Warning |
Number of learned routes is below normal | | Run the "Check routing configuration" action and verify that the current routing table is correct. | Risk | Warning |
One or more OSPF areas are using insecure authentication | Fault: OSPF Area\|Authentication Type != MD5 | Configure the Logical Router to use MD5 authentication for all OSPF areas. | Risk | Immediate |
Logical Router is deployed to the same host as one or more ECMP Edges | Fault: nsx.event.lrouter.deployed.on.ecmp.edge.host | Move the virtual machines for this Logical Router to a different host. | Risk | Immediate |
The MTU of one or more interfaces does not match the next hop router | Metric: Interface\|MTU Mismatch is true | Configure the same MTU on all routes. | Health | Warning |
One or more OSPF neighbors are down | | | Health | Immediate |
One or more BGP neighbors are down | The alert condition will be triggered using the following event from Log Insight: | | Health | Immediate |
NSX Logical Router is violating NSX Hardening Guide | | Fix the violations against NSX Hardening Guide Rules as per the recommendations in the NSX Hardening Guide. | Risk | Warning |
Host System Alert Definitions
The following alert definitions are defined on the Host System objects.
Alert Name | Symptom | Recommendations | Impact | Severity |
---|---|---|---|---|
Host's NSX messaging infrastructure is reporting an issue | Fault: nsx.event.hostssystem.message.infra.status.down | | Health | Immediate |
Distributed Firewall CPU usage is high | Fault: nsx.event.firewall.cpu.above.threshold | Investigate the network utilization of all virtual machines on this host and migrate those with high traffic in order to reduce the load on the firewall. | Health | Immediate |
Distributed Firewall memory usage is high | Fault: nsx.event.firewall.mem.above.threshold | Investigate the network utilization of all virtual machines on this host and migrate those with high traffic in order to reduce the load on the firewall. | Health | Immediate |
Distributed Firewall connection rate is high | Fault: nsx.event.firewall.conn.rate.above.threshold | Investigate the network utilization of all virtual machines on this host and migrate those with high traffic in order to reduce the load on the firewall. | Health | Immediate |
A duplicate IP address was found for one or more physical NICs on this host | Fault: nsx.event.hostsystem.ip.conflict.exists | Reconfigure the physical NIC with an IP address that is unique on the network. | Health | Warning |
The MTU on one or more physical NICs is less than 1600 | Fault: nsx.event.hostsystem.mtu.unexpected | Set the MTU on each physical NIC to at least 1600. | Health | Warning |
The VTEP VMK is configured with a static IP address that is not known to the NSX Manager | Fault: nsx.event.vtep.vnic.ip.not.in.pool | Configure the VTEP VMK with a static IP address from the VTEP IP pool on the NSX Manager. | Health | Warning |
The network configuration of the VTEP VMK does not match the VXLAN configuration in NSX | Fault: nsx.event.vtep.vnic.misconfigured | Modify the IP configuration of the VTEP VMK to match the VXLAN configuration in NSX Manager. | Health | Warning |
There is a communication issue between the host and the NSX Manager which may cause network configuration to become out of sync | | Check the network connection between the host and the manager. | Health | Warning |
There is a communication issue between the host and the NSX Controller, which may cause network configuration to become out of sync | Fault: nsx.event.host.controller.connection.down | Check the network connection between the host and the controller. | Health | Warning |
Failed to create VXLAN interface | The alert condition will be triggered using any of the following events from Log Insight: | | Health | Critical |
An error occurred while loading VXLAN module | The alert condition will be triggered using the following events from Log Insight: | | Health | Critical |
Lost connection to NSX Controller | The alert condition will be triggered using the "VXLAN dataplane lost connection to controller" event from Log Insight. | | Health | Critical |
Distributed routing configuration is out of sync | The alert condition will be triggered using the following events from Log Insight: | | Health | Critical |
Distributed firewall error occurred | The alert condition will be triggered using the following events from Log Insight: | | Health | Warning |
Spoofguard error occurred | The alert condition will be triggered using the "Spoofguard errors by severity" event from Log Insight. | | Risk | Warning |
Logical network bridging configuration error occurred | The alert condition will be triggered using the following events from Log Insight: | | Health | Immediate |
The VTEP VMK is configured with a subnet mask that is not known to NSX Manager | The alert condition will be triggered using a fault of score 25 that is raised on the host when this condition is detected. | Configure the VTEP VMK with the same subnet mask as the VTEP IP pool on the NSX Manager. | Health | Warning |
The VTEP VMK is configured with an MTU that is not known to NSX Manager | The alert condition will be triggered using a fault of score 25 that is raised on the host when this condition is detected. | Configure the VTEP VMK with the same MTU that the host was prepared with. | Health | Warning |
The VTEP VMK on the host has been deleted | The alert condition will be triggered using the following event from Log Insight: | | Health | Critical |
Virtual Machine Alert Definitions
The following alert definitions are defined on the Virtual Machine objects.
Alert Name | Symptom | Recommendations | Impact | Severity |
---|---|---|---|---|
Virtual machine IP address is not in the same subnet as the logical router | Fault: nsx.event.vm.vnic.ip.not.in.lrouter.subnet | Change the IP address of the virtual NIC so that it is in the same subnet as the NSX Logical Router. | Health | Warning |
Virtual machine default gateway does not match the Logical Router | Fault: nsx.event.vm.gateway.no.route.to.lrouter | Change the gateway address of the virtual NIC to match the IP address of the NSX Logical Router. | Health | Warning |
NSX Edge VM is in a bad state | Fault: nsx.event.edge.vm.state.status | Run a force sync on the NSX Edge. | Health | Critical |
Edge VM is not responding to health check | Fault: nsx.event.edge.vm.not.responding.to.health.check | Restart the virtual machine for this NSX Edge. | Health | Critical |
DNS Edge Service Alert Definitions
The following alert definitions are defined on the DNS Edge Service objects.
Alert Name | Symptom | Recommendations | Impact | Severity |
---|---|---|---|---|
DNS Service is not running | Fault: nsx.event.dns.service.status.down | Restart the DNS service. | Health | Critical |
DHCP Edge Service Alert Definitions
The following alert definitions are defined on the DHCP Edge Service objects.
Alert Name | Symptom | Recommendations | Impact | Severity |
---|---|---|---|---|
DHCP Service is not running | Fault: nsx.event.dhcp.service.status.down | Restart the DHCP service. | Health | Critical |
One or more IP pools have reached capacity | Metric: IP Pool\|Usage (%) = 100 | Add more IP addresses to the IP pool. | Risk | Warning |
IP renewals are higher than normal | Metric: IP Pool\|IP Addresses Renewed (last interval) above dynamic threshold | Check the status of all virtual machines connected to the NSX DHCP service. | Health | Warning |
IPSec VPN Edge Service Alert Definitions
The following alert definitions are defined on the IPSec VPN Edge Service objects.
Alert Name | Symptom | Recommendations | Impact | Severity |
---|---|---|---|---|
IPSec VPN Service is not running | Fault: nsx.event.ip.sec.service.status.down | Restart the IPSec VPN service. | Health | Critical |
One or more IPSec channels are down | Property: Channel\|Status = down | Check the status and configuration of all IPSec channels. | Health | Critical |
L2 VPN Edge Service Alert Definitions
The following alert definitions are defined on the L2 VPN Edge Service objects.
Alert Name | Symptom | Recommendations | Impact | Severity |
---|---|---|---|---|
L2 VPN Service is not running | Fault: nsx.event.l2.vpn.service.status.down | Restart the L2 VPN service. | Health | Critical |
One or more tunnels are down | Fault: nsx.event.l2vpn.tunnel.status.down | Check the status of the L2 VPN tunnel. | Health | Critical |
Load Balancer Edge Service Alert Definitions
The following alert definitions are defined on the Load Balancer Edge Service objects.
Alert Name | Symptom | Recommendations | Impact | Severity |
---|---|---|---|---|
Load Balancer Service is not running | Fault: nsx.event.lb.service.status.down | Restart the Load Balancer service. | Health | Critical |
One or more Load Balancer pool members are down | | | Health | |
All members of a Load Balancer pool are down | Metric: Pool\|Active(%) = 0 | | Health | Immediate |
One or more Virtual Servers are down | Metric: Virtual Server\|Active(%) = 0 | | Health | Critical |
NAT Edge Service Alert Definitions
The following alert definitions are defined on the NAT Edge Service objects.
Alert Name | Symptom | Recommendations | Impact | Severity |
---|---|---|---|---|
One or more NAT rules have no destination | Fault: nsx.event.nat.rule.ip.with.no.corresponding.vm | | Health | Warning |
Routing Edge Service Alert Definitions
The following alert definitions are defined on the Routing Edge Service objects.
Alert Name | Symptom | Recommendations | Impact | Severity |
---|---|---|---|---|
Interface to OSPF area mapping configuration missing or incomplete | Fault: nsx.event.edge.gateway.no.neighbors.relations | Verify the dynamic routing protocol configuration on the NSX Edge and physical routers. | Health | Immediate |
One or more interfaces do not have an OSPF area to interface mapping | | Configure OSPF area to interface mappings on all interfaces that are connected to OSPF routers. | Health | Immediate |
Number of learned routes is below normal | | Run the "Check routing configuration" action and verify that the current routing table is correct. | Risk | Warning |
NSX Routing Edge Service is violating NSX Hardening Guide | | Fix the violations against NSX Hardening Guide Rules as per the recommendations in the NSX Hardening Guide. | Risk | Warning |
SSL VPN Edge Service Alert Definitions
The following alert definitions are defined on the SSL VPN Edge Service objects.
Alert Name | Symptom | Recommendations | Impact | Severity |
---|---|---|---|---|
SSL VPN Service is not running | Fault: nsx.event.sslvpn.service.status.down | Restart the SSL VPN service. | Health | Critical |
ECMP Cluster Alert Definitions
The following alert definitions are defined on the ECMP Cluster objects.
Alert Name | Symptom | Recommendations | Impact | Severity |
---|---|---|---|---|
One or more Edges in the ECMP Cluster are down | | | Health | Warning |
The majority of Edges in the ECMP Cluster are down | | | Health | Immediate |
All Edges in the ECMP Cluster are down | Metric: Edge\|Active (%) = 0 | | Health | Critical |
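The three ECMP Cluster alerts escalate in severity as more Edges go down. As a rough sketch of that escalation, the Python function below maps a value of the `Edge|Active (%)` metric to the severity of the alert that would apply. Only the "all Edges down" condition (`Edge|Active (%) = 0`) is stated in the table. The thresholds for "one or more" and "the majority" are assumptions for illustration, as is the function name `ecmp_cluster_severity`.

```python
from typing import Optional


def ecmp_cluster_severity(active_percent: float) -> Optional[str]:
    """Map the percentage of active Edges to an alert severity.

    Documented condition: 0% active raises the Critical alert.
    Assumed conditions: fewer than half active counts as "the majority
    are down" (Immediate), and anything below 100% counts as "one or
    more are down" (Warning).
    """
    if not 0 <= active_percent <= 100:
        raise ValueError("active_percent must be between 0 and 100")
    if active_percent == 0:
        return "Critical"   # all Edges in the ECMP Cluster are down
    if active_percent < 50:
        return "Immediate"  # assumed: more than half of the Edges are down
    if active_percent < 100:
        return "Warning"    # assumed: at least one Edge is down
    return None             # every Edge is active, no alert


if __name__ == "__main__":
    for percent in (100, 75, 25, 0):
        print(percent, ecmp_cluster_severity(percent))
```

The checks run from most to least severe so that a single value maps to one alert. The product may instead raise the lower-severity alerts alongside the higher ones.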