Monitoring is a vital component of maintaining the availability and performance of your Service Instances.
Data Management for VMware Tanzu collects health and metric data for each Service Instance. You can view this data and use it to track the resource consumption, performance, and activity of your instances.
Data Management for VMware Tanzu uses the Service Instance Status to reflect the availability of the instance, and in some cases to identify an in-progress operation or a critical operation that failed. The Status of a Service Instance is also affected by alerts that Data Management for VMware Tanzu may trigger on the instance.
The Status of a Service Instance may be one of the following values:
Status | Description |
---|---|
BACKUP_IN_PROGRESS | A backup of the Service Instance is in progress. |
CONTROL_PLANE_UPDATE_IN_PROGRESS | A control plane software update is in progress. |
CONTROL_PLANE_UPDATE_FAILED | A control plane software update failed. |
CRITICAL | The Service Instance has at least one CRITICAL-level alert and no LOST_CONNECTIVITY or FATAL alerts. |
DB_ENGINE_UPDATE_IN_PROGRESS | A Service Engine software update is in progress. |
DB_ENGINE_UPDATE_FAILED | A Service Engine software update failed. |
DELETED | The Service Instance has been deleted. |
DELETING | Deletion of the Service Instance has been initiated. |
ERROR | Creation of the Service Instance failed. |
FATAL | The Service Instance has at least one FATAL-level alert and no LOST_CONNECTIVITY alerts. |
INIT | A create operation has been initiated for the Service Instance. |
LOST_CONNECTIVITY | The Service Instance has at least one LOST_CONNECTIVITY alert. |
ONLINE | The Service Instance was created successfully, is operating, and has no outstanding alerts. |
OS_UPDATE_IN_PROGRESS | An operating system software update is in progress. |
OS_UPDATE_FAILED | An operating system software update failed. |
POWEREDOFF | The Service Instance is powered off. |
POWEREDON | The Service Instance VM is running, but awaiting a health check. |
WARNING | The Service Instance has at least one WARNING-level alert and no LOST_CONNECTIVITY, FATAL, or CRITICAL alerts. |
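The alert-driven statuses in this table follow a precedence: LOST_CONNECTIVITY overrides FATAL, which overrides CRITICAL, which overrides WARNING, and ONLINE applies only when no alerts are outstanding. The Python sketch below illustrates that precedence only. The function name `derive_status` and its input (a collection of severity strings for the outstanding alerts) are assumptions made for illustration and are not part of the product API. Operation statuses such as BACKUP_IN_PROGRESS are not modeled.

```python
# Precedence of alert-driven statuses, highest first, as read from the table.
ALERT_STATUS_PRECEDENCE = ("LOST_CONNECTIVITY", "FATAL", "CRITICAL", "WARNING")


def derive_status(alert_severities):
    """Return the status implied by the outstanding alert severities.

    `alert_severities` is a hypothetical input: an iterable of strings such
    as "WARNING" or "FATAL". An empty collection maps to ONLINE.
    """
    outstanding = set(alert_severities)
    # The first (highest-precedence) severity that is present wins.
    for status in ALERT_STATUS_PRECEDENCE:
        if status in outstanding:
            return status
    return "ONLINE"
```

For example, an instance with one WARNING alert and one CRITICAL alert would report CRITICAL under this reading of the table.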
The Status of a Service Instance is an indicator of the overall health of the instance. You can view instance status using the Data Management for VMware Tanzu console or API.
Perform the following procedure to examine the status of a Service Instance:
Select Databases from the left navigation pane.
This action displays the Databases view, a table that lists the provisioned database Service Instances.
Examine the databases listed in the table, identify the database of interest, and navigate to that table row.
Examine the Status.
The health of a Service Instance reflects the status of the services running in the VM, the status of certain resources that it consumes, and its connectivity to internal and external components. Service Instance health is directly related to alerts that Data Management for VMware Tanzu may have triggered for the instance. Refer to Service Instance Alerts for details on the types and severity of alerts that DMS may trigger for an instance.
Perform the following procedure to view the health of a Service Instance:
Select Databases from the left navigation pane.
This action displays the Databases view, a table that lists the provisioned database Service Instances.
Examine the databases listed in the table, identify the database of interest, and navigate to that table row.
Click the database Instance Name.
The database information Details tab displays.
Select the Monitoring tab.
This action displays monitoring data for the Service Instance.
View the Health Status information.
Select the drop-down menu in the top right corner to change the time-series aggregation period.
Data Management for VMware Tanzu triggers an alert on a database Service Instance when it encounters connectivity or resource issues on the instance. You monitor these alerts in the Database Alerts view.
The Database Alerts view displays the following information:
Data Management for VMware Tanzu triggers alerts of the following levels:
Alerts that Data Management for VMware Tanzu may trigger on a Service Instance include the following:
Alert Name | Threshold | → Status | Alert Level | Triggered When |
---|---|---|---|---|
LOST_CONNECTIVITY | N/A | Lost Connectivity | CRITICAL | The Service Instance is unreachable by DMS. |
CPU_HEALTH_ALERT | 90% | Critical | CRITICAL | vCPU utilization has reached 90%. |
CPU_HEALTH_ALERT | 70% | Warning | WARNING | vCPU utilization has reached 70%. |
DATA_DISK_HEALTH_ALERT | 90% | Fatal | CRITICAL | The data disk has reached 90% capacity. |
DATA_DISK_HEALTH_ALERT | 70% | Warning | WARNING | The data disk has reached 70% capacity. |
DATABASE_BIN_LOG_ALERT | N/A | Critical | CRITICAL | The transaction logs on the Service Instance are not getting copied to local storage. |
DATABASE_BIN_LOG_CLOUD_SYNC_ALERT | N/A | Warning | WARNING | The transaction logs on local and cloud storage are not in sync. |
DATABASE_SERVICE_ALERT | N/A | Fatal | CRITICAL | The Service Instance database engine is down. |
MAX_CONNECTIONS_ALERT | 90% | Critical | CRITICAL | The number of open connections to the Service Instance is approaching the maximum. |
MAX_CONNECTIONS_ALERT | 80% | Warning | WARNING | The number of open connections to the Service Instance has reached 80% of the maximum. |
METRICS_ALERT | N/A | Warning | WARNING | DMS cannot pull metrics from the Service Instance. |
NTP_SYNC_ALERT | N/A | Critical | CRITICAL | The Service Instance NTP service is down or not in sync. |
SYSTEM_DISK_HEALTH_ALERT | 90% | Critical | CRITICAL | The system disk has reached 90% capacity. |
SYSTEM_DISK_HEALTH_ALERT | 70% | Warning | WARNING | The system disk has reached 70% capacity. |
TELEGRAF_SERVICE_ALERT | N/A | Warning | WARNING | The Telegraf service is not responding. |
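The threshold-based rows in this table can be read as a simple lookup: the highest threshold that a measurement has reached determines the resulting status. The Python sketch below encodes only the thresholds and statuses listed in the table. The names `THRESHOLDS` and `classify` are hypothetical, and treating "has reached" as greater than or equal to the threshold is an assumption.

```python
# (threshold percent, resulting status) pairs from the alert table,
# listed highest threshold first for each alert name.
THRESHOLDS = {
    "CPU_HEALTH_ALERT": ((90, "Critical"), (70, "Warning")),
    "DATA_DISK_HEALTH_ALERT": ((90, "Fatal"), (70, "Warning")),
    "MAX_CONNECTIONS_ALERT": ((90, "Critical"), (80, "Warning")),
    "SYSTEM_DISK_HEALTH_ALERT": ((90, "Critical"), (70, "Warning")),
}


def classify(alert_name, percent_used):
    """Return the status for a measurement, or None if no threshold is reached."""
    for threshold, status in THRESHOLDS[alert_name]:
        # Assumption: an alert fires when the value is at or above the threshold.
        if percent_used >= threshold:
            return status
    return None
```

Under these assumptions, a data disk at 90% capacity maps to Fatal, while the same utilization on the system disk maps to Critical.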
You can clear certain alerts by restarting the affected service on the Service Instance VM.
Alert Name | Affected VM OS Service Name |
---|---|
NTP_SYNC_ALERT | systemd-timesyncd |
TELEGRAF_SERVICE_ALERT | telegraf |
METRICS_ALERT | telegraf |
METRICS_ALERT¹ | influxdb |
DATABASE_SERVICE_ALERT | dbengine |
¹ If a METRICS_ALERT is raised on all Service Instances, restart the influxdb.service on the Agent VM.
To clear an alert:
SSH into the Service Instance or Agent VM.
Restart the affected service. For example:
user@servinstvm$ systemctl restart telegraf.service
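As a rough aid for the procedure above, the following Python sketch looks up the service associated with an alert and builds the matching `systemctl restart` command string. It only formats text and does not run anything. The names `ALERT_SERVICES` and `restart_commands` are hypothetical, and appending `.service` to every service name (including `dbengine`) is an assumption based on the `telegraf.service` example.

```python
# Alert-to-service mapping copied from the table of affected VM OS services.
# For METRICS_ALERT, influxdb applies on the Agent VM when the alert is
# raised on all Service Instances.
ALERT_SERVICES = {
    "NTP_SYNC_ALERT": ["systemd-timesyncd"],
    "TELEGRAF_SERVICE_ALERT": ["telegraf"],
    "METRICS_ALERT": ["telegraf", "influxdb"],
    "DATABASE_SERVICE_ALERT": ["dbengine"],
}


def restart_commands(alert_name):
    """Return candidate restart commands for an alert, or [] if none is listed."""
    services = ALERT_SERVICES.get(alert_name, [])
    # Assumption: every unit name follows the "<name>.service" pattern.
    return [f"systemctl restart {name}.service" for name in services]
```

Alerts that are not in the table, such as CPU_HEALTH_ALERT, return an empty list because the source does not describe a service restart for them.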
If the Agent VM triggers a DATABASE_BIN_LOG_ALERT on a Service Instance, verify the connection to Local Storage, the data disk space, and the Service Engine status.
Data Management for VMware Tanzu collects metric data for each Service Instance. You can view this data and use it to track the resource consumption, performance, and activity of your instances.
A Service Instance is created with NORMAL or ENHANCED monitoring. The metrics that Data Management for VMware Tanzu collects in ENHANCED monitoring mode include the NORMAL metrics, plus additional service-specific metrics.
Data Management for VMware Tanzu displays the following DB Metrics when NORMAL monitoring is enabled for a Service Instance:
Additional PostgreSQL statistics displayed when ENHANCED monitoring is in effect include:
Additional MySQL statistics displayed when ENHANCED monitoring is in effect include:
You view metric data for a Service Instance in the Databases view, instance Monitoring tab, DB Metrics pane.
By default, Data Management for VMware Tanzu displays the last 3 hours of aggregated metric data. You can change this time period (calculated from the current time) using the drop-down menu in the upper right corner of the DB Metrics pane.
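Because the display period is calculated backward from the current time, the default window can be pictured as a simple time calculation. The Python sketch below shows this under stated assumptions: `metric_window` and `DEFAULT_WINDOW` are hypothetical names, and the use of UTC is an arbitrary choice for the example rather than documented product behavior.

```python
from datetime import datetime, timedelta, timezone

# Default display period stated in the text: the last 3 hours.
DEFAULT_WINDOW = timedelta(hours=3)


def metric_window(now=None, period=DEFAULT_WINDOW):
    """Return (start, end) of a display window that ends at the current time."""
    # Assumption: UTC is used when no explicit time is supplied.
    end = now if now is not None else datetime.now(timezone.utc)
    return end - period, end
```

Passing a different `period`, for example `timedelta(hours=1)`, corresponds to choosing another entry from the drop-down menu.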