Monitoring the Environment

Monitoring the environment is a vital part of maintaining normal operations of VMware Data Services Manager.

VMware Data Services Manager collects health and metric data for the network and datastores associated with each vSphere cluster or VMC cluster that you onboard. You can view this data and use it to track the resource consumption and performance of your environments.

Note: VMware Data Services Manager does not report storage metrics for environments that use vSAN datastores.

Alerts and Environment Health

You can view the health of a VMware Data Services Manager environment from the Health tab of the Environment view.

VMware Data Services Manager reports the connectivity or capacity Status of the following Agent VM components, and triggers alerts when connectivity issues or thresholds warrant:

Alert Name	Threshold	→ Status	Alert Level	Triggered When	Impact
AGENT_CONNECTIVITY	N/A	Degraded	CRITICAL	The Agent VM lost connectivity to the control plane.	All operations with VMware Data Services Manager are affected.
DATA_DISK_HEALTH_ALERT	>= 90%	Degraded	CRITICAL	The data disk on the Agent VM has reached 90% capacity.	All operations with VMware Data Services Manager are affected.
DATA_DISK_HEALTH_ALERT	>= 70%	Warning	WARNING	The data disk on the Agent VM has reached 70% capacity.	No impact.
INFLUXDB	N/A	Warning	CRITICAL	The InfluxDB service running in the Agent VM is down.	Metrics collected from vCenter and Database VMs are not stored in InfluxDB. Therefore, all metrics are affected.
MONITORING	N/A	Warning	CRITICAL	The monitoring service running in the Agent VM is down.	VMware Data Services Manager is not able to process alerts and metrics. However, no operations in VMware Data Services Manager are affected.
ONBOARDING	N/A	Warning	CRITICAL	The onboarding service running in the Agent VM is down.	VMware Data Services Manager cannot re-onboard or recover an Agent VM. Therefore, APIs for the Agent, such as APIs for custom certificates, APIs for changing the log level, and APIs for onboarding, do not work.
RCLONE	N/A	Degraded	CRITICAL	The rclone service running in the Agent VM is down.	VMware Data Services Manager cannot upload transaction logs or download database templates. Other affected operations are log bundle generation, database recovery and PITR, resyncing of cloud backups, and Engine and OS updates.
SYSTEM_DISK_HEALTH_ALERT	>= 90%	Degraded	CRITICAL	The system disk on the Agent VM has reached 90% capacity.	All operations with VMware Data Services Manager are affected. VMware Data Services Manager recommends extending existing storage, adding new storage, and deleting databases that are not required.
SYSTEM_DISK_HEALTH_ALERT	>= 70%	Warning	WARNING	The system disk on the Agent VM has reached 70% capacity.	No impact. VMware Data Services Manager recommends extending existing storage, adding new storage, and deleting databases that are not required.
TELEGRAF	N/A	Degraded	CRITICAL	The Telegraf service running in the Agent VM is down.	Agent VM is not able to collect vCenter metrics. No operations with VMware Data Services Manager are affected apart from vCenter and Agent VM metric collection.
VCENTER	N/A	Degraded	CRITICAL	VMware Data Services Manager is unable to connect to vCenter.	VMware Data Services Manager is not able to connect to vCenter for any Database VM related operations. Infrastructure metrics are also affected.
VCENTER	N/A	Warning	WARNING	The vCenter password expires in less than 15 days.	If the vCenter password has expired, update the password in vCenter and then update it in VMware Data Services Manager.
VM PASSWORD EXPIRY	<= 15 days	Warning	WARNING	The password of the Agent VM expires in 15 days or less.	No impact.
VM PASSWORD EXPIRY	< 0 days	Degraded	CRITICAL	The password of the Agent VM has expired.	All the functions that involve the Agent VM and its environment are impacted.
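
The threshold rows in the table can be read as a simple classification. The following is a minimal sketch of that logic, not the product's implementation: the function names disk_health_status and password_expiry_status are hypothetical, and the "OK" result for values below every threshold is an assumption based on the OK status mentioned in the troubleshooting procedure.

```shell
# Hypothetical helpers that mirror the threshold rows of the alert table.
# Each prints "<Status> <Alert Level>", or "OK" (assumed) when no threshold is met.

# $1: disk usage as an integer percentage
disk_health_status() {
    if [ "$1" -ge 90 ]; then
        echo "Degraded CRITICAL"
    elif [ "$1" -ge 70 ]; then
        echo "Warning WARNING"
    else
        echo "OK"
    fi
}

# $1: whole days until the Agent VM password expires (negative once expired)
password_expiry_status() {
    if [ "$1" -lt 0 ]; then
        echo "Degraded CRITICAL"
    elif [ "$1" -le 15 ]; then
        echo "Warning WARNING"
    else
        echo "OK"
    fi
}

# Example (GNU df; "/" stands in for whichever mount point holds the disk):
#   disk_health_status "$(df --output=pcent / | tail -n 1 | tr -dc '0-9')"
```

The checks run from the most severe threshold down, so a disk at 95% is reported once as Degraded rather than also matching the 70% row.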

Troubleshooting

If you find that an environment is in a DEGRADED or ERROR state, you may be able to rectify the situation by restarting certain services on the Agent VM.

Alert Name	Affected VM OS Service Name
INFLUXDB	influxdb
KAPACITOR	kapacitor
MONITORING	monitoring
ONBOARDING	onboarding
RCLONE	rclone
TELEGRAF	telegraf
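
If you script the lookup, the table above reduces to a lowercase mapping. The sketch below assumes that each service runs as a systemd unit named after the service with a .service suffix; the restart example in the procedure confirms this for influxdb.service only. The function name alert_to_unit is hypothetical.

```shell
# Hypothetical helper: print the systemd unit name for an alert in the table.
# The ".service" suffix is an assumption for every unit except influxdb.service.
alert_to_unit() {
    case "$1" in
        INFLUXDB|KAPACITOR|MONITORING|ONBOARDING|RCLONE|TELEGRAF)
            # The service name is the alert name in lowercase.
            printf '%s.service\n' "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
            ;;
        *)
            # Alerts such as VCENTER have no service listed in the table.
            echo "no restartable service listed for alert: $1" >&2
            return 1
            ;;
    esac
}

# Example:
#   alert_to_unit RCLONE
```

Alerts that are not in the table return a non-zero status, so a calling script does not try to restart a unit that does not exist.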

Perform the following procedure to clear alerts for an environment with ERROR or DEGRADED status:

Validate the connection between the Agent VM and vCenter.
Examine the environment health, and identify the component associated with each critical or warning alert.
SSH into the Agent VM.

Restart each affected service. For example:

user@agentvm$ systemctl restart influxdb.service

Check the environment status and health again.

Ensure that all components display the OK status.
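
When several services are affected, the restart step can be looped. The following is a rough sketch to run on the Agent VM after you connect over SSH. It assumes that each service is a systemd unit named <service>.service, as in the influxdb.service example. The restart_services function and the SYSTEMCTL override variable are hypothetical and exist only so that the loop can be dry-run.

```shell
# Hypothetical helper: restart each named service, then confirm it is active.
# SYSTEMCTL can be set to another command for a dry run; it defaults to systemctl.
restart_services() {
    local svc
    local failed=0
    for svc in "$@"; do
        if "${SYSTEMCTL:-systemctl}" restart "${svc}.service" &&
           "${SYSTEMCTL:-systemctl}" is-active --quiet "${svc}.service"; then
            echo "OK ${svc}"
        else
            echo "FAILED ${svc}"
            failed=1
        fi
    done
    # Non-zero when at least one service did not restart or is not active.
    return "$failed"
}

# Example:
#   restart_services influxdb telegraf
```

A service that reports as active here can still show an alert until the environment health refreshes, so check the Health tab again as described in the final steps.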