Monitoring your environments is a vital part of maintaining normal operations of Data Management for VMware Tanzu.
Data Management for VMware Tanzu collects health and metric data for the network and datastores associated with each vSphere cluster or VMC cluster that you onboard. You can view this data and use it to track the resource consumption and performance of your environments.
You can view the health of a Data Management for VMware Tanzu environment on the Health tab of the Environment view.
Data Management for VMware Tanzu reports the connectivity or capacity status of the following Agent VM components, and triggers an alert when a connectivity issue occurs or a capacity threshold is reached:
Alert Name | Threshold | Status | Alert Level | Triggered When | Impact |
---|---|---|---|---|---|
AGENT_CONNECTIVITY | N/A | Degraded | CRITICAL | The Agent VM lost connectivity to the control plane. | All operations with Data Management for VMware Tanzu are affected. |
CLOUDSTORAGE | N/A | Degraded | CRITICAL | The Agent VM is not able to connect to the Cloud Storage Repo. | Affected operations are database creation, restore, and recovery, deletion of database backups, and resyncing of cloud backups. Log bundle generation and S3 configuration are also affected. |
DATA_DISK_HEALTH_ALERT | >= 90% | Degraded | CRITICAL | The data disk on the Agent VM has reached 90% capacity. | All operations with Data Management for VMware Tanzu are affected. |
DATA_DISK_HEALTH_ALERT | >= 70% | Warning | WARNING | The data disk on the Agent VM has reached 70% capacity. | No impact. |
INFLUXDB | N/A | Warning | CRITICAL | The InfluxDB service running in the Agent VM is down. | Metrics collected from vCenter and Database VMs are not stored in InfluxDB. Therefore, all metrics are affected. |
KAPACITOR | N/A | Degraded | CRITICAL | The Kapacitor service running in the Agent VM is down. | The Alert Engine for Data Management for VMware Tanzu is down. CPU, Memory, Data Disk, System Disk, and Max Connections alerts are not triggered. |
LOCALSTORAGE | N/A | Degraded | CRITICAL | The Agent VM is not able to connect to the Local Storage Repo. | Affected operations are database creation, restore, backup, and recovery. Log bundle generation, Engine and OS updates, and S3 configuration are also affected. |
MONITORING | N/A | Warning | CRITICAL | The monitoring service running in the Agent VM is down. | Data Management for VMware Tanzu is not able to process alerts and metrics. However, no operations in Data Management for VMware Tanzu are affected. |
ONBOARDING | N/A | Warning | CRITICAL | The onboarding service running in the Agent VM is down. | Data Management for VMware Tanzu cannot re-onboard or recover an Agent VM. Therefore, Agent APIs, such as the APIs for custom certificates, changing the log level, and onboarding, do not work. |
RCLONE | N/A | Degraded | CRITICAL | The rclone service running in the Agent VM is down. | Data Management for VMware Tanzu cannot upload transaction logs or download database templates. Other affected operations are log bundle generation, database recovery and PITR, resyncing of cloud backups, and Engine and OS updates. |
SYSTEM_DISK_HEALTH_ALERT | >= 90% | Degraded | CRITICAL | The system disk on the Agent VM has reached 90% capacity. | All operations with Data Management for VMware Tanzu are affected. Data Management for VMware Tanzu recommends extending existing storage, adding new storage, and deleting databases that are not required. |
SYSTEM_DISK_HEALTH_ALERT | >= 70% | Warning | WARNING | The system disk on the Agent VM has reached 70% capacity. | No impact. Data Management for VMware Tanzu recommends extending existing storage, adding new storage, and deleting databases that are not required. |
TELEGRAF | N/A | Degraded | CRITICAL | The Telegraf service running in the Agent VM is down. | The Agent VM is not able to collect vCenter metrics. No operations with Data Management for VMware Tanzu are affected apart from vCenter and Agent VM metric collection. |
VCENTER | N/A | Degraded | CRITICAL | Data Management for VMware Tanzu is unable to connect to vCenter. | Data Management for VMware Tanzu is not able to connect to vCenter for any Database VM-related operations. Infrastructure metrics are also affected. |
VCENTER | N/A | Warning | WARNING | The vCenter password expires in less than 15 days. | If the vCenter password has expired, update the password in vCenter and then make the same update in Data Management for VMware Tanzu. |
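The two disk alerts in the table use the same pair of capacity thresholds. As a rough illustration of how a usage percentage maps to a status, here is a minimal shell sketch. The `disk_status` function is a hypothetical helper written for this example and is not part of the product; the sample percentages are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a disk usage percentage with the
# thresholds from the alert table above.
#   >= 90%  -> Degraded (alert level CRITICAL)
#   >= 70%  -> Warning  (alert level WARNING)
#   below   -> OK
disk_status() {
  local used_pct="$1"
  if [ "$used_pct" -ge 90 ]; then
    echo "Degraded"
  elif [ "$used_pct" -ge 70 ]; then
    echo "Warning"
  else
    echo "OK"
  fi
}

# Sample values only; real usage would come from the Agent VM disks.
for pct in 45 70 93; do
  echo "${pct}% -> $(disk_status "$pct")"
done
```

Checking the higher threshold first matters here: a disk at 93% also satisfies the 70% condition, so the order of the tests decides which status is reported.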
If you find that an environment is in a DEGRADED or ERROR state, you may be able to rectify the situation by restarting certain services on the Agent VM. The following table lists the service associated with each alert:
Alert Name | Affected VM OS Service Name |
---|---|
INFLUXDB | influxdb |
KAPACITOR | kapacitor |
MONITORING | monitoring |
ONBOARDING | onboarding |
RCLONE | rclone |
TELEGRAF | telegraf |
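The mapping in the table above could be captured in a small lookup function when scripting against alert names. This is only a sketch: `service_for_alert` is a hypothetical name chosen for this example, and alerts that have no entry in the table (such as VCENTER) return a non-zero exit code.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate an alert name into the OS service name
# listed in the table above. Returns 1 for alerts without a listed service.
service_for_alert() {
  case "$1" in
    INFLUXDB)   echo "influxdb" ;;
    KAPACITOR)  echo "kapacitor" ;;
    MONITORING) echo "monitoring" ;;
    ONBOARDING) echo "onboarding" ;;
    RCLONE)     echo "rclone" ;;
    TELEGRAF)   echo "telegraf" ;;
    *)          return 1 ;;
  esac
}

service_for_alert KAPACITOR
service_for_alert VCENTER || echo "VCENTER: no service listed for restart"
```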
Perform the following procedure to clear alerts for an environment with ERROR or DEGRADED status:
1. Validate the connection between the Agent VM and vCenter.
2. Examine the environment health, and identify the component associated with each critical or warning alert.
3. SSH into the Agent VM.
4. Restart each affected service. For example:
   user@agentvm$ systemctl restart influxdb.service
5. Check the environment status and health again.
6. Ensure that all components display the OK status.
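When several services are affected, the restart step can be looped. The following is a minimal sketch, assuming the services are systemd units named `<service>.service` as in the example above and that the logged-in user has sufficient privileges to restart them. The `restart_services` function and the `DRY_RUN` variable are hypothetical names introduced for this example. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restart each named service on the Agent VM.
# DRY_RUN=1 (the default in this sketch) prints the commands instead of
# running them; set DRY_RUN=0 to actually restart the services.
DRY_RUN="${DRY_RUN:-1}"

restart_services() {
  local svc
  for svc in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "systemctl restart ${svc}.service"
    else
      systemctl restart "${svc}.service"
      # is-active prints the unit state, for example "active" or "failed"
      echo "${svc}: $(systemctl is-active "${svc}.service")"
    fi
  done
}

# Example: the services behind INFLUXDB and TELEGRAF alerts.
restart_services influxdb telegraf
```

Running it once in dry-run mode makes it easy to confirm the service list before anything is restarted. Afterwards, the environment health still needs to be checked in the Health tab, because a unit that reports `active` does not by itself guarantee that the alert has cleared.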