An alert is a stateful record for a problem. SDDC Manager raises an alert based on the detection of problem conditions in the hardware or virtual resources. Problem detection can occur during the Power On System Validation (POSV) portion of the Cloud Foundation bring-up process and during ongoing operations.

During ongoing operations, SDDC Manager raises alerts for problems detected as a result of its periodic polling of hardware status or from alert-raising events. Alerts are not generated for fleeting conditions or for problems that the system can resolve itself. Alerts are raised for issues that:

  • Persist

  • Require human intervention to resolve

The software periodically polls the status of the hardware resources and raises alerts when analysis of the results indicates a problem condition exists.

  • Every 30 minutes, the servers and switches are polled to verify that they are discoverable and to obtain their power status. This 30-minute polling ensures that any status change of a server or switch is captured, if it has not already been captured by generated events.

  • Every 24 hours, the hardware resources are polled to determine which hardware is currently present, and the software refreshes its hardware inventory information with the results. This 24-hour polling ensures that any hardware change that has occurred in the system in the last 24 hours is captured.
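The two polling intervals described above can be pictured with a small sketch. The 30-minute and 24-hour intervals come from the text; the names `POLL_INTERVALS`, `status_poll`, `inventory_poll`, and `polls_due` are hypothetical and are not part of SDDC Manager.

```python
from datetime import timedelta

# Intervals are taken from the text above; the poll names are hypothetical.
POLL_INTERVALS = {
    "status_poll": timedelta(minutes=30),   # discoverability and power status
    "inventory_poll": timedelta(hours=24),  # hardware inventory refresh
}


def polls_due(elapsed: timedelta) -> list[str]:
    """Return the polls whose interval divides the elapsed time exactly."""
    if elapsed <= timedelta(0):
        return []
    return [
        name
        for name, interval in POLL_INTERVALS.items()
        if elapsed % interval == timedelta(0)
    ]
```

In this sketch, the 24-hour mark triggers both polls, because 24 hours is also a multiple of 30 minutes.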

In addition to alerts raised as a result of conditions found by the periodic polling, certain events raise alerts at the time the events are generated. Unless noted otherwise in the following table, the name of an event-initiated alert is the event's name with the suffix _ALERT appended. For example, the BMC_AUTHENTICATION_FAILURE event raises the alert named BMC_AUTHENTICATION_FAILURE_ALERT. See Event Catalog for a list of the event definitions that you can view in the Event Catalog user interface.
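The naming convention can be expressed as a one-line mapping. This is only an illustrative sketch: the function `alert_name_for` and the `NAME_OVERRIDES` dictionary are hypothetical, and the overrides are left empty because the exceptions are whatever the table notes.

```python
# Hypothetical place to record alerts whose names do not follow the convention.
# Left empty here; any exceptions are the ones noted in the alerts table.
NAME_OVERRIDES: dict[str, str] = {}


def alert_name_for(event_name: str) -> str:
    """Derive an alert name from an event name by appending the _ALERT suffix."""
    return NAME_OVERRIDES.get(event_name, f"{event_name}_ALERT")


print(alert_name_for("BMC_AUTHENTICATION_FAILURE"))
```

The printed value is BMC_AUTHENTICATION_FAILURE_ALERT, matching the example in the text.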

Some of the alerts are more likely to be raised during the POSV portion of the bring-up process. For example, the alert named VMWARE_CLOUD_FOUNDATION_BUNDLE_INCOMPLETE_ALERT is raised during POSV if the system detects that elements are missing from the software ISO file. For the list of alerts that are raised during POSV, see the VMware Cloud Foundation Overview and Bring-Up Guide.

After each polling interval, the built-in problem-detection service is called to analyze the updated status and inventory information and determine whether a persistent condition exists. If a problem that requires human intervention exists, an alert is raised. Even though multiple events can be generated for a particular outstanding problem, only one alert is created about the persistent problem. You then verify and resolve the reported problem and clear the alert using the SDDC Manager Dashboard.
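The "one alert per persistent problem" behavior can be sketched as follows. This is an assumption-laden illustration, not the product's implementation: the `AlertStore` class, its methods, and the choice to key an alert by name and resource are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AlertStore:
    """Keeps at most one open alert per (alert name, resource) pair.

    Keying by name and resource is an assumption made for this sketch.
    """

    open_alerts: dict = field(default_factory=dict)

    def raise_alert(self, name: str, resource: str) -> bool:
        """Open an alert; return False if one is already open for this problem."""
        key = (name, resource)
        if key in self.open_alerts:
            self.open_alerts[key] += 1  # count the repeat, do not duplicate
            return False
        self.open_alerts[key] = 1
        return True

    def clear_alert(self, name: str, resource: str) -> None:
        """Remove the alert once the problem has been verified and resolved."""
        self.open_alerts.pop((name, resource), None)
```

Repeated events for the same outstanding problem only increment a counter in this sketch; a new alert is opened again only after the existing one has been cleared.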

You can use the Alerts Catalog page in the SDDC Manager Dashboard to view the SDDC Manager alert definitions. To open the Alerts Catalog page, click Catalog on the System Alerts page. For more information about using the Alerts Catalog page, see Alert Catalog.

Table 1. SDDC Manager Alerts

| Alert Name | Short Description | Severity | Detected By |
|---|---|---|---|
| CONFIGURATION_BACKUP_TRIGGER_ALERT | This alert is raised when one of the following operations is performed: domain creation, domain deletion, password rotation, domain expansion, host addition, rack addition, or host decommission. | INFO | Event, 30-minute poll, 24-hour poll |
| COORDINATION_SERVICE_DOWN_ALERT | The system cannot establish a connection to the virtual machines that provide the required coordination service. The bring-up process requires connection to the coordination service. | CRITICAL | Event, 30-minute poll, 24-hour poll |
| CPU_EXTRA_ALERT | The polling found an additional CPU that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| CPU_INVALID_ALERT | The polling detected a type of CPU in the server that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| CPU_UNDETECTED_ALERT | The polling did not detect a CPU that matches what is expected according to the manifest. | ERROR | 24-hour poll |
| HDD_DOWN_ALERT | Operational status is down for an HDD. This alert is initiated by the HDD_DOWN event. | ERROR | Event |
| HDD_EXCESSIVE_READ_ERRORS_ALERT | Excessive read errors reported for an HDD. This alert is initiated by the HDD_EXCESSIVE_READ_ERRORS event. | WARNING | Event |
| HDD_EXCESSIVE_WRITE_ERRORS_ALERT | Excessive write errors reported for an HDD. This alert is initiated by the HDD_EXCESSIVE_WRITE_ERRORS event. | WARNING | Event |
| HDD_EXTRA_ALERT | The polling found an additional HDD that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| HDD_INVALID_ALERT | The polling detected a type of HDD that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| HDD_TEMPERATURE_ABOVE_THRESHOLD_ALERT | HDD temperature has reached its maximum safe operating temperature. This alert is initiated by the HDD_TEMPERATURE_ABOVE_THRESHOLD event. | WARNING | Event |
| HDD_UNDETECTED_ALERT | The polling did not detect an HDD that matches what is expected according to the manifest. | WARNING | 24-hour poll |
| HDD_WEAROUT_ABOVE_THRESHOLD_ALERT | Wear-out state of an HDD is above its defined threshold. This alert is initiated by the HDD_WEAROUT_ABOVE_THRESHOLD event. | WARNING | Event |
| HMS_AGENT_DOWN_ALERT | The Hardware Management Services (HMS) aggregator cannot communicate with the HMS agent on the rack's management switch through the private management network, either because the agent is down or the network is not available. This alert is initiated by the HMS_AGENT_DOWN event or by polling. | CRITICAL | 30-minute poll, 24-hour poll, Event |
| HMS_DOWN_ALERT | SDDC Manager cannot communicate with the HMS aggregator. | CRITICAL | 30-minute poll, 24-hour poll, Event |
| HOST_AGENT_NOT_ALIVE_ALERT | This alert is raised when the polling detects that an ESXi host does not have its hostd process running or when the system is unable to determine if the hostd process is running. The hostd (host daemon) is an infrastructure service agent in the ESXi operating system. | CRITICAL | 30-minute poll, 24-hour poll |
| HOST_CANNOT_BE_USED_ALERT | Host {SERVER} in rack {RACK} was not configured correctly when it was added to the inventory. For information on why the host was not configured, see the VMware Cloud Foundation Overview and Bring-Up Guide. | CRITICAL | Event |
| HOST_DISKS_UNUSABLE_ALERT | This alert is raised if the disks in the host are not suitable for use with vSAN, either due to physical health or configuration issues. Check the physical connectivity of the disks in the host to ensure that all are present. Also check the configuration of the disks on the ESXi host. For additional assistance, contact support. | CRITICAL (DEBUG in logs) | 30-minute poll, 24-hour poll |
| HOST_VSAN_UNUSABLE_ALERT | This alert is raised if the vSAN status in the host is not suitable for use, either due to physical health or configuration issues. Check the physical connectivity of the disks in the host to ensure that all are present, and check the configuration of the disks on the ESXi host. For additional assistance, contact support. | CRITICAL | 30-minute poll, 24-hour poll |
| LICENSE_PRESENT_CHECK_FAILED_ALERT | The check for the license for a particular bundle failed. | WARNING | Event |
| MANAGEMENT_SWITCH_DOWN_ALERT | Operational status is down for a physical rack's management switch. This alert is initiated by the periodic polling and by the MANAGEMENT_SWITCH_DOWN event. | WARNING | Event, 30-minute poll, 24-hour poll |
| MANAGEMENT_SWITCH_EXTRA_ALERT | The polling found an additional management switch that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| MANAGEMENT_SWITCH_INVALID_ALERT | The polling detected a type of management switch that does not match what is expected according to the manifest. | CRITICAL | 24-hour poll |
| MANAGEMENT_SWITCH_PORT_DOWN_ALERT | Operational status is down for a switch port in a physical rack's management switch. This alert is initiated by the MANAGEMENT_SWITCH_PORT_DOWN event. | WARNING | Event |
| MEMORY_EXTRA_ALERT | The polling found additional memory that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| MEMORY_INVALID_ALERT | The polling detected a type of memory that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| MEMORY_UNDETECTED_ALERT | The polling did not detect memory that matches what is expected according to the manifest. | WARNING | 24-hour poll |
| NETWORK_DOWN_ALERT | Network is down. The data connectivity among the servers transiting the switches cannot be assured. This alert is raised when the inter-switch connectivity of the deployment is incorrect. Connectivity loss may be due to: (1) the switch port is physically down (no cable connected, wrong cable type, bad cable, loose connection, unsupported SFP, bad port); (2) the switch port has been administratively shut down; (3) the switch port has an error, such as a bad or unsupported SFP, a duplex mismatch, UDLD detecting a one-way link, or BPDU port-guard and portfast configured simultaneously. | CRITICAL | 30-minute poll, 24-hour poll |
| NIC_EXTRA_ALERT | The polling found an additional NIC that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| NIC_INVALID_ALERT | The polling detected a type of NIC that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| NIC_PORT_DOWN_ALERT | Operational status is down for a NIC port in a rack's server. This alert is initiated by the NIC_PORT_DOWN event. | WARNING | Event |
| NIC_UNDETECTED_ALERT | The polling did not detect a NIC that matches what is expected according to the manifest. | WARNING | 24-hour poll |
| POSTGRES_DOWN_ALERT | The system cannot connect to an internal database. | CRITICAL | Event |
| SDDC_MANAGER_NON_OPERATIONAL_ALERT | SDDC Manager is non-operational. A service in the SDDC Manager controller VM has failed and could not be restarted successfully. | ERROR | Event |
| SERVER_DOWN_ALERT | Server is in the powered-down state. This alert is initiated by the SERVER_DOWN event. | CRITICAL | Event, 30-minute poll, 24-hour poll |
| SERVER_EXTRA_ALERT | The polling detected an additional server that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| SERVER_INVALID_ALERT | The polling detected a type of server that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| SERVER_UNDETECTED_ALERT | The polling did not detect a server that matches what is expected according to the manifest. | ERROR | 30-minute poll, 24-hour poll |
| SPINE_SWITCH_DOWN_ALERT | Operational status is down for a physical rack's inter-rack switch. This alert is initiated by the periodic polling and by the SPINE_SWITCH_DOWN event. | ERROR | Event, 30-minute poll, 24-hour poll |
| SPINE_SWITCH_EXTRA_ALERT | The polling detected an additional inter-rack switch that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| SPINE_SWITCH_INVALID_ALERT | The polling detected a type of inter-rack switch that does not match what is expected according to the manifest. | ERROR | 24-hour poll |
| SPINE_SWITCH_PORT_DOWN_ALERT | Operational status is down for a switch port in a physical rack's inter-rack switch. This alert is initiated by the SPINE_SWITCH_PORT_DOWN event. | WARNING | Event |
| SSD_DOWN_ALERT | Operational status is down for an SSD. This alert is initiated by the SSD_DOWN event. | ERROR | Event |
| SSD_EXCESSIVE_READ_ERRORS_ALERT | Excessive read errors reported for an SSD. This alert is initiated by the SSD_EXCESSIVE_READ_ERRORS event. | WARNING | Event |
| SSD_EXCESSIVE_WRITE_ERRORS_ALERT | Excessive write errors reported for an SSD. This alert is initiated by the SSD_EXCESSIVE_WRITE_ERRORS event. | WARNING | Event |
| SSD_EXTRA_ALERT | The polling found an additional SSD that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| SSD_INVALID_ALERT | The polling detected a type of SSD that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| SSD_TEMPERATURE_ABOVE_THRESHOLD_ALERT | SSD temperature has reached its maximum safe operating temperature. This alert is initiated by the SSD_TEMPERATURE_ABOVE_THRESHOLD event. | WARNING | Event |
| SSD_UNDETECTED_ALERT | The polling did not detect an SSD that matches what is expected according to the manifest. | WARNING | 24-hour poll |
| SSD_WEAROUT_ABOVE_THRESHOLD_ALERT | Wear-out state of an SSD is above its defined threshold. This alert is initiated by the SSD_WEAROUT_ABOVE_THRESHOLD event. | WARNING | Event |
| STORAGE_CONTROLLER_DOWN_ALERT | Operational status is down for a storage adapter. This alert is initiated by the STORAGE_CONTROLLER_DOWN event. | ERROR | Event |
| STORAGE_CONTROLLER_EXTRA_ALERT | The polling detected an additional storage adapter that does not match what is expected according to the manifest. The alert message includes the PCI ID of the controller. | WARNING | 24-hour poll |
| STORAGE_CONTROLLER_INVALID_ALERT | The polling detected a type of storage adapter that does not match what is expected according to the manifest. The alert message includes the PCI ID of the controller. | WARNING | 24-hour poll |
| STORAGE_CONTROLLER_UNDETECTED_ALERT | The polling did not detect a storage adapter that matches what is expected according to the manifest. The alert message includes the PCI ID of the controller. | WARNING | 24-hour poll |
| TOR_SWITCH_DOWN_ALERT | Operational status is down for a physical rack's ToR switch. This alert is initiated by the periodic polling and by the TOR_SWITCH_DOWN event. | ERROR | Event, 30-minute poll, 24-hour poll |
| TOR_SWITCH_EXTRA_ALERT | The polling found an additional ToR switch that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| TOR_SWITCH_INVALID_ALERT | The polling detected a type of ToR switch that does not match what is expected according to the manifest. | ERROR | 24-hour poll |
| TOR_SWITCH_PORT_DOWN_ALERT | Operational status is down for a switch port in a physical rack's ToR switch. This alert is initiated by the TOR_SWITCH_PORT_DOWN event. | WARNING | Event |
| VCF_CONFIGURATION_BACKUP_NOT_CONFIGURED_ALERT | There is no scheduled backup of the VMware Cloud Foundation configuration. | WARNING | Event |
| VMWARE_CLOUD_FOUNDATION_BUNDLE_INCOMPLETE_ALERT | The ISO file is missing items, according to its manifest. | CRITICAL | Event |
| VMWARE_CLOUD_FOUNDATION_BUNDLE_INVALID_ALERT | Checksum validation for the ISO file failed. | CRITICAL | Event |
| VMWARE_CLOUD_FOUNDATION_BUNDLE_MISSING_ALERT | A required ISO file or its expected checksum file or manifest file is missing. | CRITICAL | Event |
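
When triaging, it can help to ask which alerts a given detection method can raise. The sketch below encodes four rows of Table 1 as data and filters them; the `AlertDef` type, the `CATALOG` list, and the `detected_by` function are hypothetical names used for illustration, and the sample is deliberately not the full table.

```python
from typing import NamedTuple


class AlertDef(NamedTuple):
    name: str
    severity: str
    detected_by: frozenset


# A small sample of rows copied from Table 1, not the complete catalog.
CATALOG = [
    AlertDef("CPU_EXTRA_ALERT", "WARNING", frozenset({"24-hour poll"})),
    AlertDef("HDD_DOWN_ALERT", "ERROR", frozenset({"Event"})),
    AlertDef(
        "SERVER_DOWN_ALERT",
        "CRITICAL",
        frozenset({"Event", "30-minute poll", "24-hour poll"}),
    ),
    AlertDef(
        "SERVER_UNDETECTED_ALERT",
        "ERROR",
        frozenset({"30-minute poll", "24-hour poll"}),
    ),
]


def detected_by(source: str) -> list[str]:
    """Return the names of sample alerts that the given source can raise."""
    return [alert.name for alert in CATALOG if source in alert.detected_by]
```

For this sample, `detected_by("Event")` returns HDD_DOWN_ALERT and SERVER_DOWN_ALERT, in catalog order.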