Monitoring the Infrastructure

Monitoring the infrastructure is vital to maintaining normal operations of Data Management for VMware Tanzu.

Data Management for VMware Tanzu collects health and metric data for the network and datastores associated with each vSphere cluster that you onboard with DMS. You can view this data and use it to track the resource consumption and performance of your environments.

Note: Data Management for VMware Tanzu does not report storage metrics for infrastructures that use vSAN datastores.

Alerts and Infrastructure Health

You can view the health of Data Management for VMware Tanzu infrastructure on the Health tab of the Infrastructure view.

Data Management for VMware Tanzu reports the connectivity or capacity status of the following Agent VM components, and triggers an alert when a component loses connectivity or crosses a capacity threshold:

Alert Name	Threshold	Status	Alert Level	Triggered When
AGENT_CONNECTIVITY	N/A	Degraded	CRITICAL	The Agent VM lost connectivity to the control plane.
CLOUDSTORAGE	N/A	Degraded	CRITICAL	The Agent VM is not able to connect to the Cloud Storage Repo.
DATA_DISK_HEALTH_ALERT	>= 90%	Degraded	CRITICAL	The data disk on the Agent VM has reached 90% capacity.
DATA_DISK_HEALTH_ALERT	>= 70%	Warning	WARNING	The data disk on the Agent VM has reached 70% capacity.
INFLUXDB	N/A	Degraded	CRITICAL	The InfluxDB service running in the Agent VM is down.
KAPACITOR	N/A	Degraded	CRITICAL	The Kapacitor service running in the Agent VM is down.
LOCALSTORAGE	N/A	Degraded	CRITICAL	The Agent VM is not able to connect to the Local Storage Repo.
MONITORING	N/A	Degraded	CRITICAL	The monitoring service running in the Agent VM is down.
ONBOARDING	N/A	Warning	CRITICAL	The onboarding service running in the Agent VM is down.
RCLONE	N/A	Degraded	CRITICAL	The rclone service running in the Agent VM is down.
SYSTEM_DISK_HEALTH_ALERT	>= 90%	Degraded	CRITICAL	The system disk on the Agent VM has reached 90% capacity.
SYSTEM_DISK_HEALTH_ALERT	>= 70%	Warning	WARNING	The system disk on the Agent VM has reached 70% capacity.
TELEGRAF	N/A	Degraded	CRITICAL	The Telegraf service running in the Agent VM is down.
VCENTER	N/A	Degraded	CRITICAL	Data Management for VMware Tanzu is unable to connect to vCenter.
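
The two disk thresholds in the table above can be read as a simple classification rule. The sketch below is illustrative only and is not how the product itself computes status. The function name `disk_alert_status` is hypothetical, and the `OK` label for usage below 70% is an assumption based on the status named at the end of the troubleshooting procedure.

```shell
# Hypothetical helper: classify a disk-usage percentage using the
# thresholds from the table above (>= 90% and >= 70%).
# Prints "<Status> <Alert Level>"; "OK" below 70% is an assumption.
disk_alert_status() {
    pct=$1
    if [ "$pct" -ge 90 ]; then
        echo "Degraded CRITICAL"
    elif [ "$pct" -ge 70 ]; then
        echo "Warning WARNING"
    else
        echo "OK"
    fi
}
```

For example, `disk_alert_status 72` prints `Warning WARNING`. The check for 90% comes first so that a disk above both thresholds is reported only at the higher level.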

Troubleshooting

If you find that an infrastructure is in a DEGRADED or ERROR state, you may be able to rectify the situation by restarting certain services on the Agent VM.

Alert Name	Affected VM OS Service Name
INFLUXDB	influxdb
KAPACITOR	kapacitor
MONITORING	monitoring
ONBOARDING	onboarding
RCLONE	rclone
TELEGRAF	telegraf
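
The mapping above lends itself to a small lookup. This is a minimal sketch, assuming that each service name in the table corresponds to a systemd unit with a `.service` suffix, as the `influxdb.service` example in the procedure below suggests. `service_for_alert` is a hypothetical helper name and is not part of the product.

```shell
# Hypothetical helper: translate an alert name from the table above
# into the systemd unit to restart on the Agent VM.
# Returns non-zero for alerts that have no service listed in the table.
service_for_alert() {
    case "$1" in
        INFLUXDB)   echo "influxdb.service" ;;
        KAPACITOR)  echo "kapacitor.service" ;;
        MONITORING) echo "monitoring.service" ;;
        ONBOARDING) echo "onboarding.service" ;;
        RCLONE)     echo "rclone.service" ;;
        TELEGRAF)   echo "telegraf.service" ;;
        *)          return 1 ;;
    esac
}
```

For example, `service_for_alert TELEGRAF` prints `telegraf.service`. Alerts such as VCENTER or CLOUDSTORAGE return a non-zero status, because the table lists no service to restart for them.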

Perform the following procedure to clear alerts for an infrastructure with ERROR or DEGRADED status:

Validate the connection between the Agent VM and vCenter.
Examine the infrastructure health, and identify the component associated with each critical or warning alert.
SSH into the Agent VM.

Restart each affected service. For example:

user@agentvm$ systemctl restart influxdb.service

Check the infrastructure status and health again.

Ensure that all components display the OK status.
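
When several services are affected, the restart step can be scripted. The following is a sketch under stated assumptions: it runs on the Agent VM after you have connected over SSH, and the account is permitted to run `systemctl` (prefix the commands with `sudo` if it is not). `restart_services` and the `DRY_RUN` variable are hypothetical names introduced for illustration. `systemctl restart` and `systemctl is-active --quiet` are standard systemd commands.

```shell
# Hypothetical helper: restart each named service, then confirm that
# systemd reports it as active. With DRY_RUN=1 the restart commands are
# only printed, so the list can be reviewed before anything changes.
restart_services() {
    failed=0
    for svc in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "systemctl restart ${svc}.service"
            continue
        fi
        systemctl restart "${svc}.service"
        if systemctl is-active --quiet "${svc}.service"; then
            echo "${svc}: active"
        else
            echo "${svc}: not active" >&2
            failed=1
        fi
    done
    # Non-zero if any service failed to come back up.
    return "$failed"
}
```

Running `DRY_RUN=1 restart_services influxdb telegraf` prints the two restart commands without executing them. An active service does not by itself confirm that the alert has cleared, so still check the infrastructure status and health afterward as described above.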