An alert is a stateful record for a problem. SDDC Manager raises an alert when it detects a problem condition in the hardware or virtual resources. Problem detection can occur during the Power On System Validation (POSV) portion of the Cloud Foundation bring-up process and during ongoing operations.

During ongoing operations, SDDC Manager raises alerts for problems detected as a result of its periodic polling of hardware status or from alert-raising events. Alerts are not generated for fleeting conditions or for problems that the environment can resolve itself. Alerts are raised for issues that:

  • Persist

  • Require human intervention to resolve

The software periodically polls the status of the hardware resources and raises alerts when analysis of the results indicates a problem condition exists.

  • Every 30 minutes, the servers and switches are polled to verify that they are discoverable and to obtain their power status. This 30-minute polling ensures that any status change of a server or switch is captured if it has not already been captured by generated events.

  • Every 24 hours, the hardware resources are polled to determine the current hardware inventory, and the stored hardware inventory information is refreshed with the results. This 24-hour polling ensures that any hardware change that has occurred in the installation in the last 24 hours is captured. Inventory validation alerts are raised when mismatches are found between the actual inventory obtained by polling and the expected inventory. The expected inventory is defined by the installation's manifest.
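
The inventory comparison performed by the 24-hour poll can be pictured with a short sketch. The following Python is illustrative only and is not SDDC Manager code: the function name `validate_inventory`, the dictionary layout (a slot identifier mapped to a component model), and the sample data are assumptions made for the example. Only the EXTRA, INVALID, and UNDETECTED alert suffixes come from the alert table in this topic.

```python
def validate_inventory(component, expected, actual):
    """Compare polled inventory with the manifest for one component type.

    `expected` and `actual` map a hypothetical slot ID to a model string.
    Returns a sorted list of (alert_name, slot) tuples.
    """
    alerts = []
    for slot, model in expected.items():
        if slot not in actual:
            # The manifest lists a component that the poll did not find.
            alerts.append((f"{component}_UNDETECTED_ALERT", slot))
        elif actual[slot] != model:
            # A component was found, but it is not the expected type.
            alerts.append((f"{component}_INVALID_ALERT", slot))
    for slot in actual:
        if slot not in expected:
            # The poll found a component that the manifest does not list.
            alerts.append((f"{component}_EXTRA_ALERT", slot))
    return sorted(alerts)


# Hypothetical sample data: one mismatched CPU and one unexpected CPU.
expected = {"cpu0": "model-a", "cpu1": "model-a"}
actual = {"cpu0": "model-a", "cpu1": "model-b", "cpu2": "model-a"}
print(validate_inventory("CPU", expected, actual))
```

In this sketch a missing component, a mismatched component, and an unexpected component each map to a different alert name, which mirrors how the table groups the inventory validation alerts by suffix.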

After each polling interval, the built-in problem-detection service is called to analyze the updated status and inventory information and determine whether a persistent condition exists. If a problem that requires human intervention exists, an alert is raised. Even though multiple events can be generated for a particular outstanding problem, only one alert is created about the persistent problem. You then verify and resolve the reported problem and clear the alert using the SDDC Manager client.
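
To illustrate the one-alert-per-problem behavior, here is a minimal sketch in Python. It is not the product's implementation: the `AlertStore` class, its method names, the (alert name, resource ID) key, and the resource ID `rack1-n3` are hypothetical choices made for the example.

```python
class AlertStore:
    """Keeps at most one open alert per (alert name, resource) pair."""

    def __init__(self):
        # Maps (alert_name, resource_id) to the number of detections seen.
        self._open = {}

    def raise_alert(self, alert_name, resource_id):
        """Record a detection; return True only if a new alert was created."""
        key = (alert_name, resource_id)
        is_new = key not in self._open
        self._open[key] = self._open.get(key, 0) + 1
        return is_new

    def clear_alert(self, alert_name, resource_id):
        """Clear an alert after the problem is resolved.

        Returns True if an open alert existed for the pair.
        """
        return self._open.pop((alert_name, resource_id), None) is not None

    def open_alerts(self):
        """Return the open alerts as a sorted list of keys."""
        return sorted(self._open)


store = AlertStore()
first = store.raise_alert("SERVER_DOWN_ALERT", "rack1-n3")
second = store.raise_alert("SERVER_DOWN_ALERT", "rack1-n3")
print(first, second, store.open_alerts())
```

The second detection of the same problem does not create a second alert; only after the alert is cleared would a new detection open a fresh one.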

In addition to alerts raised as a result of conditions found by the periodic polling, certain events raise alerts at the time those events are generated. Unless noted otherwise in the following table, the name of an event-initiated alert is the event's name with the suffix _ALERT appended. For example, the BMC_AUTHENTICATION_FAILURE event raises the alert named BMC_AUTHENTICATION_FAILURE_ALERT. See Event Catalog for a list of the event definitions that you can view in the Event Catalog user interface.
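
The naming convention can be expressed as a small lookup. This is a sketch under stated assumptions, not a product API: the function `alert_name_for_event` and the `NAME_EXCEPTIONS` dictionary are hypothetical names, and the single exception listed is the one visible in the alert table in this topic (the CPU_CAT_ERROR event raises CPU_CAT_FAILURE_ALERT).

```python
# Events whose alert name does not follow the default "<EVENT>_ALERT" rule.
# Hypothetical structure; the one entry is taken from the alert table.
NAME_EXCEPTIONS = {"CPU_CAT_ERROR": "CPU_CAT_FAILURE_ALERT"}


def alert_name_for_event(event_name):
    """Return the name of the alert that an event raises."""
    return NAME_EXCEPTIONS.get(event_name, event_name + "_ALERT")


print(alert_name_for_event("BMC_AUTHENTICATION_FAILURE"))
print(alert_name_for_event("CPU_CAT_ERROR"))
```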

Some of the alerts are more likely to be raised during the Power On System Validation (POSV) portion of the bring-up process. For example, the alert named VMWARE_CLOUD_FOUNDATION_BUNDLE_INCOMPLETE_ALERT is raised during POSV if the system detects that elements are missing from the software ISO file. For the list of alerts that are raised during POSV, see the VMware Cloud Foundation Overview and Bring-Up Guide.

You can use the Alerts Catalog page in the SDDC Manager client to view the SDDC Manager alert definitions. You open the Alert Catalog from the System Alerts page by clicking Catalog. For more information about using the Alerts Catalog page, see Alert Catalog.

Table 1. SDDC Manager Alerts

| Alert Name | Short Description | Severity | Detected By |
| --- | --- | --- | --- |
| BMC_AUTHENTICATION_FAILURE_ALERT | The system is unable to authenticate to the server's OOB management port. This alert is initiated by the BMC_AUTHENTICATION_FAILURE event. | WARNING | Event |
| BMC_MANAGEMENT_FAILURE_ALERT | The system failed to perform a management operation using the server's OOB management port. This alert is initiated by the BMC_MANAGEMENT_FAILURE event. | WARNING | Event |
| BMC_NOT_REACHABLE_ALERT | The system is unable to communicate with the server's OOB management port. This alert is initiated by the BMC_NOT_REACHABLE event. | WARNING | Event |
| COORDINATION_SERVICE_DOWN_ALERT | The system cannot establish a connection to the virtual machines that provide the required coordination service. This service is provided by the ISVM virtual machines that run in the N0 ESXi host in the environment's primary rack. The bring-up process requires connection to the coordination service. | CRITICAL | Event, 30-minute poll, 24-hour poll |
| CPU_CAT_FAILURE_ALERT | A CPU has shut down due to the processor's catastrophic error (CATERR) signal. This alert is initiated by the CPU_CAT_ERROR event. | ERROR | Event |
| CPU_EXTRA_ALERT | The polling found an additional CPU that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| CPU_INITIALIZATION_ERROR_ALERT | The system detected that a CPU startup initialization error has occurred. This alert is initiated by the CPU_INITIALIZATION_ERROR event. | ERROR | Event |
| CPU_INVALID_ALERT | The polling detected a type of CPU in the server that does not match what is expected according to the manifest. | ERROR | 24-hour poll |
| CPU_MACHINE_CHECK_ERROR_ALERT | A server CPU has failed due to a CPU Machine Check Error. This alert is initiated by the CPU_MACHINE_CHECK_ERROR event. | ERROR | Event |
| CPU_POST_FAILURE_ALERT | A server CPU has shut down due to a POST failure. This alert is initiated by the CPU_POST_FAILURE event. | ERROR | Event |
| CPU_TEMPERATURE_ABOVE_UPPER_THRESHOLD_ALERT | A CPU temperature has reached its maximum safe operating temperature. This alert is initiated by the CPU_TEMPERATURE_ABOVE_UPPER_THRESHOLD event. | WARNING | Event |
| CPU_TEMPERATURE_BELOW_LOWER_THRESHOLD_ALERT | A CPU temperature has reached its minimum safe operating temperature. This alert is initiated by the CPU_TEMPERATURE_BELOW_LOWER_THRESHOLD event. | WARNING | Event |
| CPU_THERMAL_TRIP_ERROR_ALERT | A server CPU has shut down due to a thermal error. This alert is initiated by the CPU_THERMAL_TRIP_ERROR event. | ERROR | Event |
| CPU_UNDETECTED_ALERT | The polling did not detect a CPU that matches what is expected according to the manifest. | ERROR | 24-hour poll |
| DIMM_ECC_MEMORY_ERROR_ALERT | The system detected an uncorrectable Error Correction Code (ECC) error for a server's memory. This alert is initiated by the DIMM_ECC_MEMORY_ERROR event. | ERROR | Event |
| DIMM_TEMPERATURE_ABOVE_THRESHOLD_ALERT | Memory temperature has reached its maximum safe operating temperature. This alert is initiated by the DIMM_TEMPERATURE_ABOVE_THRESHOLD event. | WARNING | Event |
| DIMM_THERMAL_TRIP_ALERT | Memory has shut down due to a thermal error. This alert is initiated by the DIMM_THERMAL_TRIP event. | ERROR | Event |
| HDD_DOWN_ALERT | Operational status is down for an HDD. This alert is initiated by the HDD_DOWN event. | ERROR | Event |
| HDD_EXCESSIVE_READ_ERRORS_ALERT | Excessive read errors reported for an HDD. This alert is initiated by the HDD_EXCESSIVE_READ_ERRORS event. | WARNING | Event |
| HDD_EXCESSIVE_WRITE_ERRORS_ALERT | Excessive write errors reported for an HDD. This alert is initiated by the HDD_EXCESSIVE_WRITE_ERRORS event. | WARNING | Event |
| HDD_EXTRA_ALERT | The polling found an additional HDD that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| HDD_INVALID_ALERT | The polling detected a type of HDD that does not match what is expected according to the manifest. | ERROR | 24-hour poll |
| HDD_TEMPERATURE_ABOVE_THRESHOLD_ALERT | HDD temperature has reached its maximum safe operating temperature. This alert is initiated by the HDD_TEMPERATURE_ABOVE_THRESHOLD event. | WARNING | Event |
| HDD_UNDETECTED_ALERT | The polling did not detect an HDD that matches what is expected according to the manifest. | ERROR | 24-hour poll |
| HDD_WEAROUT_ABOVE_THRESHOLD_ALERT | Wear-out state of an HDD is above its defined threshold. This alert is initiated by the HDD_WEAROUT_ABOVE_THRESHOLD event. | WARNING | Event |
| HMS_AGENT_DOWN_ALERT | The Hardware Management Services (HMS) aggregator cannot communicate with the HMS agent on the rack's management switch through the private management network, either because the agent is down or the network is not available. This alert is initiated by the HMS_AGENT_DOWN event or by polling. | CRITICAL | 30-minute poll, 24-hour poll, Event |
| HMS_DOWN_ALERT | The SDDC Manager cannot communicate with the HMS aggregator. | CRITICAL | 30-minute poll, 24-hour poll, Event |
| HOST_AGENT_NOT_ALIVE_ALERT | This alert is raised when the polling detects that an ESXi host does not have its hostd process running or when the system is unable to determine if the hostd process is running. The hostd (host daemon) is an infrastructure service agent in the ESXi operating system. | CRITICAL | 30-minute poll, 24-hour poll |
| LICENSE_PRESENT_CHECK_FAILED_ALERT | The check for the license for a particular bundle failed. | WARNING | Event |
| MANAGEMENT_SWITCH_DOWN_ALERT | Operational status is down for a physical rack's management switch. This alert is initiated by the periodic polling and by the MANAGEMENT_SWITCH_DOWN event. | WARNING | Event, 30-minute poll, 24-hour poll |
| MANAGEMENT_SWITCH_EXTRA_ALERT | The polling found an additional management switch that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| MANAGEMENT_SWITCH_INVALID_ALERT | The polling detected a type of management switch that does not match what is expected according to the manifest. | CRITICAL | 24-hour poll |
| MANAGEMENT_SWITCH_PORT_DOWN_ALERT | Operational status is down for a switch port in a physical rack's management switch. This alert is initiated by the MANAGEMENT_SWITCH_PORT_DOWN event. | WARNING | Event |
| MEMORY_EXTRA_ALERT | The polling found additional memory that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| MEMORY_INVALID_ALERT | The polling detected a type of memory that does not match what is expected according to the manifest. | ERROR | 24-hour poll |
| MEMORY_UNDETECTED_ALERT | The polling did not detect memory that matches what is expected according to the manifest. | ERROR | 24-hour poll |
| NIC_EXTRA_ALERT | The polling found an additional NIC that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| NIC_INVALID_ALERT | The polling detected a type of NIC that does not match what is expected according to the manifest. | ERROR | 24-hour poll |
| NIC_PORT_DOWN_ALERT | Operational status is down for a NIC port in a rack's server. This alert is initiated by the NIC_PORT_DOWN event. | WARNING | Event |
| NIC_UNDETECTED_ALERT | The polling did not detect a NIC that matches what is expected according to the manifest. | ERROR | 24-hour poll |
| PCH_TEMPERATURE_ABOVE_THRESHOLD_ALERT | Platform controller hub (PCH) temperature has reached its maximum safe operating temperature. This alert is initiated by the PCH_TEMPERATURE_ABOVE_THRESHOLD event. | WARNING | Event |
| POSTGRES_DOWN_ALERT | The system cannot connect to an internal database. | CRITICAL | Event |
| SERVER_DOWN_ALERT | Server is in the powered-down state. This alert is initiated by the SERVER_DOWN event. | ERROR | Event, 30-minute poll, 24-hour poll |
| SERVER_EXTRA_ALERT | The polling detected an additional server that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| SERVER_INVALID_ALERT | The polling detected a type of server that does not match what is expected according to the manifest. | ERROR | 24-hour poll |
| SERVER_PCIE_ERROR_ALERT | A server's system has PCIe errors. This alert is initiated by the SERVER_PCIE_ERROR event. | ERROR | Event |
| SERVER_POST_ERROR_ALERT | A server has POST failures. | ERROR | Event |
| SERVER_UNDETECTED_ALERT | The polling did not detect a server that matches what is expected according to the manifest. | ERROR | 30-minute poll, 24-hour poll |
| SPINE_SWITCH_DOWN_ALERT | Operational status is down for a physical rack's spine switch. This alert is initiated by the periodic polling and by the SPINE_SWITCH_DOWN event. | ERROR | Event, 30-minute poll, 24-hour poll |
| SPINE_SWITCH_EXTRA_ALERT | The polling detected an additional spine switch that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| SPINE_SWITCH_INVALID_ALERT | The polling detected a type of spine switch that does not match what is expected according to the manifest. | ERROR | 24-hour poll |
| SPINE_SWITCH_PORT_DOWN_ALERT | Operational status is down for a switch port in a physical rack's spine switch. This alert is initiated by the SPINE_SWITCH_PORT_DOWN event. | WARNING | Event |
| SSD_DOWN_ALERT | Operational status is down for an SSD. This alert is initiated by the SSD_DOWN event. | ERROR | Event |
| SSD_EXCESSIVE_READ_ERRORS_ALERT | Excessive read errors reported for an SSD. This alert is initiated by the SSD_EXCESSIVE_READ_ERRORS event. | WARNING | Event |
| SSD_EXCESSIVE_WRITE_ERRORS_ALERT | Excessive write errors reported for an SSD. This alert is initiated by the SSD_EXCESSIVE_WRITE_ERRORS event. | WARNING | Event |
| SSD_EXTRA_ALERT | The polling found an additional SSD that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| SSD_INVALID_ALERT | The polling detected a type of SSD that does not match what is expected according to the manifest. | ERROR | 24-hour poll |
| SSD_TEMPERATURE_ABOVE_THRESHOLD_ALERT | SSD temperature has reached its maximum safe operating temperature. This alert is initiated by the SSD_TEMPERATURE_ABOVE_THRESHOLD event. | WARNING | Event |
| SSD_UNDETECTED_ALERT | The polling did not detect an SSD that matches what is expected according to the manifest. | ERROR | 24-hour poll |
| SSD_WEAROUT_ABOVE_THRESHOLD_ALERT | Wear-out state of an SSD is above its defined threshold. This alert is initiated by the SSD_WEAROUT_ABOVE_THRESHOLD event. | WARNING | Event |
| STORAGE_CONTROLLER_DOWN_ALERT | Operational status is down for a storage adapter. This alert is initiated by the STORAGE_CONTROLLER_DOWN event. | ERROR | Event |
| STORAGE_CONTROLLER_EXTRA_ALERT | The polling detected an additional storage adapter that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| STORAGE_CONTROLLER_INVALID_ALERT | The polling detected a type of storage adapter that does not match what is expected according to the manifest. | ERROR | 24-hour poll |
| STORAGE_CONTROLLER_UNDETECTED_ALERT | The polling did not detect a storage adapter that matches what is expected according to the manifest. | ERROR | 24-hour poll |
| TOR_SWITCH_DOWN_ALERT | Operational status is down for a physical rack's ToR switch. This alert is initiated by the periodic polling and by the TOR_SWITCH_DOWN event. | ERROR | Event, 30-minute poll, 24-hour poll |
| TOR_SWITCH_EXTRA_ALERT | The polling found an additional ToR switch that does not match what is expected according to the manifest. | WARNING | 24-hour poll |
| TOR_SWITCH_INVALID_ALERT | The polling detected a type of ToR switch that does not match what is expected according to the manifest. | ERROR | 24-hour poll |
| TOR_SWITCH_PORT_DOWN_ALERT | Operational status is down for a switch port in a physical rack's ToR switch. This alert is initiated by the TOR_SWITCH_PORT_DOWN event. | WARNING | Event |
| VMWARE_CLOUD_FOUNDATION_BUNDLE_INCOMPLETE_ALERT | The ISO file is missing items, according to its manifest. | CRITICAL | Event |
| VMWARE_CLOUD_FOUNDATION_BUNDLE_INVALID_ALERT | Checksum validation for the ISO file failed. | CRITICAL | Event |
| VMWARE_CLOUD_FOUNDATION_BUNDLE_MISSING_ALERT | A required ISO file or its expected checksum file or manifest file is missing. | CRITICAL | Event |