About Monitoring Service Instances

Monitoring is a vital component of maintaining the availability and performance of your Service Instances.

Data Management for VMware Tanzu collects health and metric data for each Service Instance. You can view this data and use it to track the resource consumption, performance, and activity of your instances.

Service Instance Status

Data Management for VMware Tanzu uses the Service Instance Status to reflect the availability of the instance, and in some cases to identify an in-progress operation or a critical operation that failed. The Status of a Service Instance is also affected by alerts that Data Management for VMware Tanzu may trigger on the instance.

The Status of a Service Instance may be one of the following values:

Status	Description
BACKUP_IN_PROGRESS	A backup of the Service Instance is in progress.
CONTROL_PLANE_UPDATE_IN_PROGRESS	A control plane software update is in progress.
CONTROL_PLANE_UPDATE_FAILED	A control plane software update failed.
CRITICAL	The Service Instance has at least one CRITICAL-level alert and no LOST_CONNECTIVITY or FATAL alerts.
DB_ENGINE_UPDATE_IN_PROGRESS	A Service Engine software update is in progress.
DB_ENGINE_UPDATE_FAILED	A Service Engine software update failed.
DELETED	The Service Instance has been deleted.
DELETING	Deletion of the Service Instance has been initiated.
ERROR	Creation of the Service Instance failed.
FATAL	The Service Instance has at least one FATAL-level alert and no LOST_CONNECTIVITY alerts.
INIT	A create operation has been initiated for the Service Instance.
LOST_CONNECTIVITY	The Service Instance has at least one LOST_CONNECTIVITY alert.
ONLINE	The Service Instance was created successfully, is operating, and has no outstanding alerts.
OS_UPDATE_IN_PROGRESS	An operating system software update is in progress.
OS_UPDATE_FAILED	An operating system software update failed.
POWEREDOFF	The Service Instance is powered off.
POWEREDON	The Service Instance VM is running, but awaiting a health check.
WARNING	The Service Instance has at least one WARNING-level alert and no LOST_CONNECTIVITY, FATAL, or CRITICAL alerts.
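
The alert-related rows in the table above imply a precedence order: LOST_CONNECTIVITY outranks FATAL, FATAL outranks CRITICAL, CRITICAL outranks WARNING, and ONLINE applies only when no alerts are outstanding. The following Python sketch illustrates that precedence. The function name `status_from_alerts` is hypothetical, and the sketch does not model operation statuses such as BACKUP_IN_PROGRESS or INIT; it is an illustration of the table, not the product's implementation.

```python
# Alert-driven statuses, highest precedence first, as implied by the table above.
ALERT_STATUS_PRECEDENCE = ["LOST_CONNECTIVITY", "FATAL", "CRITICAL", "WARNING"]


def status_from_alerts(alert_levels):
    """Return the alert-driven status for a collection of outstanding alert levels.

    Hypothetical helper for illustration only. Operation statuses such as
    BACKUP_IN_PROGRESS or INIT are not modeled here.
    """
    outstanding = set(alert_levels)
    for level in ALERT_STATUS_PRECEDENCE:
        if level in outstanding:
            return level
    # No outstanding alerts: the instance is operating normally.
    return "ONLINE"


print(status_from_alerts(["WARNING", "CRITICAL"]))  # CRITICAL
print(status_from_alerts([]))                       # ONLINE
```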

Viewing Service Instance Status

The Status of a Service Instance is an indicator of the overall health of the instance. You can view instance status using the Data Management for VMware Tanzu console or API.

Perform the following procedure to examine the status of a Service Instance:

Select Databases from the left navigation pane.

This action displays the Databases view, a table that lists the provisioned database Service Instances.
Examine the databases listed in the table, identify the database of interest, and navigate to that table row.
Examine the Status.

Viewing Service Instance Health

The health of a Service Instance reflects the status of the services running in the VM, the status of certain resources that it consumes, and its connectivity to internal and external components. Service Instance health is directly related to alerts that Data Management for VMware Tanzu (DMS) may have triggered for the instance. Refer to Service Instance Alerts for details on the types and severity of alerts that DMS may trigger for an instance.

Perform the following procedure to view the health of a Service Instance:

Select Databases from the left navigation pane.

This action displays the Databases view, a table that lists the provisioned database Service Instances.
Examine the databases listed in the table, identify the database of interest, and navigate to that table row.
Click the database Instance Name.

The database information Details tab displays.
Select the Monitoring tab.

This action displays monitoring data for the Service Instance.
View the Health Status information.
Select the drop-down menu in the top right corner to change the time-series aggregation period.

Service Instance Alerts

Data Management for VMware Tanzu triggers an alert on a database Service Instance when it encounters connectivity or resource issues on the instance. You monitor these alerts in the Database Alerts view.

The Database Alerts view displays the following information:

The Instance Name identifies the name of the Service Instance that has alerts.
The Status column identifies the status of the Service Instance.
The Critical column identifies the number of CRITICAL-level alerts associated with the Service Instance.
The Warning column identifies the number of WARNING-level alerts associated with the Service Instance.
The Environment column identifies the infrastructure on which the instance is running.
The Triggered Time identifies the time at which the most recent alert was triggered.
The Owner column identifies the Data Management for VMware Tanzu user that owns the Service Instance.

About the Alert Levels

Data Management for VMware Tanzu triggers alerts of the following levels:

OK
ONLINE
WARNING
CRITICAL
FATAL
LOST_CONNECTIVITY

About the Alert Types

Alerts that Data Management for VMware Tanzu may trigger on a Service Instance include the following:

Alert Name	Threshold	Resulting Status	Alert Level	Triggered When
LOST_CONNECTIVITY	N/A	Lost Connectivity	CRITICAL	The Service Instance is unreachable by DMS.
CPU_HEALTH_ALERT	90%	Critical	CRITICAL	vCPU utilization has reached 90%.
CPU_HEALTH_ALERT	70%	Warning	WARNING	vCPU utilization has reached 70%.
DATA_DISK_HEALTH_ALERT	90%	Fatal	CRITICAL	The data disk has reached 90% capacity.
DATA_DISK_HEALTH_ALERT	70%	Warning	WARNING	The data disk has reached 70% capacity.
DATABASE_BIN_LOG_ALERT	N/A	Critical	CRITICAL	The transaction logs on the Service Instance are not getting copied to local storage.
DATABASE_BIN_LOG_CLOUD_SYNC_ALERT	N/A	Warning	WARNING	The transaction logs on local and cloud storage are not in sync.
DATABASE_SERVICE_ALERT	N/A	Fatal	CRITICAL	The Service Instance database engine is down.
MAX_CONNECTIONS_ALERT	90%	Critical	CRITICAL	The number of open connections to the Service Instance has reached 90% of the maximum.
MAX_CONNECTIONS_ALERT	80%	Warning	WARNING	The number of open connections to the Service Instance has reached 80% of the maximum.
METRICS_ALERT	N/A	Warning	WARNING	DMS cannot pull metrics from the Service Instance.
NTP_SYNC_ALERT	N/A	Critical	CRITICAL	The Service Instance NTP service is down or not in sync.
SYSTEM_DISK_HEALTH_ALERT	90%	Critical	CRITICAL	The system disk has reached 90% capacity.
SYSTEM_DISK_HEALTH_ALERT	70%	Warning	WARNING	The system disk has reached 70% capacity.
TELEGRAF_SERVICE_ALERT	N/A	Warning	WARNING	The Telegraf service is not responding.
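
For the threshold-based alerts in the table above, the relationship between utilization and the resulting status can be sketched in Python as follows. The thresholds and statuses are copied from the table. The function name `resulting_status` is hypothetical, and treating "has reached" as an inclusive comparison (`>=`) is an assumption, not something the table confirms.

```python
# Thresholds (percent) and resulting statuses copied from the alert table above.
# Each list is ordered from the highest threshold to the lowest.
THRESHOLDS = {
    "CPU_HEALTH_ALERT": [(90, "Critical"), (70, "Warning")],
    "DATA_DISK_HEALTH_ALERT": [(90, "Fatal"), (70, "Warning")],
    "SYSTEM_DISK_HEALTH_ALERT": [(90, "Critical"), (70, "Warning")],
    "MAX_CONNECTIONS_ALERT": [(90, "Critical"), (80, "Warning")],
}


def resulting_status(alert_name, percent_used):
    """Return the status an alert would produce at this utilization, or None.

    Hypothetical helper. The inclusive comparison (>=) is an assumption.
    """
    for threshold, status in THRESHOLDS[alert_name]:
        if percent_used >= threshold:
            return status
    # Below the lowest threshold: no alert of this type.
    return None


print(resulting_status("DATA_DISK_HEALTH_ALERT", 93.5))  # Fatal
print(resulting_status("MAX_CONNECTIONS_ALERT", 75))     # None
```

Note that the same 90% threshold produces a Fatal status for the data disk but a Critical status for the system disk, so the mapping has to be kept per alert name rather than as one global scale.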

Clearing Alerts

In some cases, you can clear certain alerts by restarting the affected service on the Service Instance VM.

Alert Name	Affected VM OS Service Name
NTP_SYNC_ALERT	systemd-timesyncd
TELEGRAF_SERVICE_ALERT	telegraf
METRICS_ALERT	telegraf
METRICS_ALERT¹	influxdb
DATABASE_SERVICE_ALERT	dbengine

¹ If a METRICS_ALERT is raised on all Service Instances, restart the influxdb.service on the Agent VM.

To clear an alert:

SSH into the Service Instance or Agent VM.

Restart the affected service. For example:

user@servinstvm$ systemctl restart telegraf.service
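
The lookup from alert name to restart command described above can be sketched in Python. The mapping is copied from the table and its footnote. The function name `restart_plan` and the `raised_on_all_instances` parameter are hypothetical, and appending `.service` to every unit name (as the example does for `telegraf`) is an assumption. The sketch only builds the command text; it does not connect to any VM or run anything.

```python
# Alert-to-service mapping for the Service Instance VM, copied from the table above.
SERVICE_BY_ALERT = {
    "NTP_SYNC_ALERT": "systemd-timesyncd",
    "TELEGRAF_SERVICE_ALERT": "telegraf",
    "METRICS_ALERT": "telegraf",
    "DATABASE_SERVICE_ALERT": "dbengine",
}


def restart_plan(alert_name, raised_on_all_instances=False):
    """Return (target VM, command) pairs to consider for an alert.

    Hypothetical helper. Returns an empty list for alerts that the
    table does not cover.
    """
    # Footnote case: a METRICS_ALERT on every instance points at influxdb
    # on the Agent VM rather than telegraf on one Service Instance.
    if alert_name == "METRICS_ALERT" and raised_on_all_instances:
        return [("Agent VM", "systemctl restart influxdb.service")]
    service = SERVICE_BY_ALERT.get(alert_name)
    if service is None:
        return []
    return [("Service Instance VM", f"systemctl restart {service}.service")]


for target, command in restart_plan("NTP_SYNC_ALERT"):
    print(f"{target}: {command}")
```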

Addressing Other Alerts

If the Agent VM triggers a DATABASE_BIN_LOG_ALERT on a Service Instance, verify the connection to Local Storage, the data disk space, and the Service Engine status.

Service Instance Metrics

Data Management for VMware Tanzu collects metric data for each Service Instance. You can view this data and use it to track the resource consumption, performance, and activity of your instances.

A Service Instance is created with NORMAL or ENHANCED monitoring. The metrics that Data Management for VMware Tanzu collects in ENHANCED monitoring mode include the NORMAL metrics, plus additional service-specific metrics.

NORMAL Monitoring

Data Management for VMware Tanzu displays the following DB Metrics when NORMAL monitoring is enabled for a Service Instance:

System Uptime - The time since the service or Service Instance VM restarted.
Mysql Uptime (MySQL)
Max Connections - The connection limit to the Service Instance.
Active Connections per Second (PostgreSQL)
Thread Resource Utilization (MySQL)
CPU Usage % - The ratio of used to allocated CPU.
Memory Usage % - The ratio of used to allocated memory.
Disk Usage % - The ratio of used to allocated disk.

ENHANCED Monitoring

Additional PostgreSQL statistics displayed when ENHANCED monitoring is in effect include:

Write Throughput
Read Throughput
Commits & Rollbacks
Deadlocks & Conflicts

Additional MySQL statistics displayed when ENHANCED monitoring is in effect include:

Innodb Pool Size
Queries & Questions
Bytes Received & Sent
Slow Queries per Second
InnoDB Buffer Usage %
InnoDB Reads & Writes
Command Reads & Writes
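
The relationship between the two monitoring modes (ENHANCED is the NORMAL set plus engine-specific additions) can be sketched in Python. The metric names are copied from the lists above. The function name `displayed_metrics` is hypothetical, and the sketch assumes that NORMAL metrics without an engine tag apply to both MySQL and PostgreSQL. The returned order is for illustration and does not reflect the console layout.

```python
# NORMAL metrics without an engine tag (assumed to apply to both engines).
NORMAL_COMMON = [
    "System Uptime",
    "Max Connections",
    "CPU Usage %",
    "Memory Usage %",
    "Disk Usage %",
]

# NORMAL metrics tagged with a specific engine in the list above.
NORMAL_BY_ENGINE = {
    "MySQL": ["Mysql Uptime", "Thread Resource Utilization"],
    "PostgreSQL": ["Active Connections per Second"],
}

# Additional metrics displayed only with ENHANCED monitoring.
ENHANCED_BY_ENGINE = {
    "PostgreSQL": [
        "Write Throughput",
        "Read Throughput",
        "Commits & Rollbacks",
        "Deadlocks & Conflicts",
    ],
    "MySQL": [
        "Innodb Pool Size",
        "Queries & Questions",
        "Bytes Received & Sent",
        "Slow Queries per Second",
        "InnoDB Buffer Usage %",
        "InnoDB Reads & Writes",
        "Command Reads & Writes",
    ],
}


def displayed_metrics(engine, mode):
    """Return the metric names shown for an engine and monitoring mode."""
    if mode not in ("NORMAL", "ENHANCED"):
        raise ValueError(f"unknown monitoring mode: {mode}")
    metrics = NORMAL_COMMON + NORMAL_BY_ENGINE[engine]
    if mode == "ENHANCED":
        # ENHANCED includes everything in NORMAL plus the engine-specific extras.
        metrics = metrics + ENHANCED_BY_ENGINE[engine]
    return metrics


print(len(displayed_metrics("MySQL", "NORMAL")))  # 7
```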

Viewing the Metrics

To view metric data for a Service Instance, open the Databases view, select the instance's Monitoring tab, and examine the DB Metrics pane.

By default, Data Management for VMware Tanzu displays the last 3 hours of aggregated metric data. You can change this time period (calculated from the current time) using the drop-down menu in the upper right corner of the DB Metrics pane.
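
Because the displayed period is calculated backward from the current time, the default window can be illustrated with a short Python sketch. The function name `metric_window` is hypothetical, and the use of UTC is an assumption made only to keep the example deterministic; the console may present times differently.

```python
from datetime import datetime, timedelta, timezone

# Default display period stated above: the last 3 hours.
DEFAULT_PERIOD = timedelta(hours=3)


def metric_window(now=None, period=DEFAULT_PERIOD):
    """Return the (start, end) of the metric window ending at the current time."""
    end = now if now is not None else datetime.now(timezone.utc)
    return end - period, end


start, end = metric_window(datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
print(start.isoformat())  # 2024-01-01T09:00:00+00:00
```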