Monitoring and KPIs for VMware SQL with MySQL for Tanzu Application Service

You can monitor the health of the VMware SQL with MySQL for Tanzu Application Service service using logs, metrics, and Key Performance Indicators (KPIs) that are generated by VMware Tanzu for MySQL component VMs.

For more information about logging and metrics in VMware Tanzu Application Service for VMs, see Logging and metrics.

About metrics

Metrics are regularly generated log entries that report measured component states. The default metrics polling interval is 30 seconds for MySQL instances. For the service broker, the default is 60 seconds.

You can configure the MySQL instance interval in Configure Monitoring in the VMware Tanzu for MySQL tile. For more information, see Configure monitoring.

Metrics are long, single lines of text with the format:

origin:"p.mysql" eventType:ValueMetric timestamp:1496776477930669688 deployment:"service-instance_2b5a001f-2bf3-460c-aee6-fd2253f9fb0c" job:"mysql" index:"b09df494-b731-4d06-a4b0-c2985ceedf4c" ip:"10.0.8.4" valueMetric:<name:"/p.mysql/performance/open_files" value:24 unit:"file" >

Access MySQL metrics

To access MySQL metrics:

Use Grafana. This method requires Healthwatch v2.x.
Use log cache

Use Grafana

You can use Grafana to visually view metrics for VMware Tanzu for MySQL service instances.

This section requires Healthwatch v2.x.

Install the Healthwatch tile in Tanzu Operations Manager. For information about installing Healthwatch, see the Healthwatch for VMware Tanzu documentation.
View the Grafana dashboard by going to:
```
https://grafana.YOUR-SYSTEM-DOMAIN
```

For more information about using Grafana dashboards, see the Healthwatch for VMware Tanzu documentation.

Use Log Cache

To access metrics for VMware Tanzu for MySQL service instances, you can use Loggregator’s Log Cache feature with the Log Cache CLI plug-in. Log Cache is enabled by default.

To access metrics for on-demand service instances:

Install the cf CLI plug-in by running:

cf install-plugin -r CF-Community "log-cache"

To access metrics for a service instance, run:

cf tail SERVICE-INSTANCE-NAME

Two useful flags to append to this command are:

-f, --follow: Append metrics logs to stdout as they are generated, rather than returning a fixed number of metrics or metrics over a fixed interval.
--json: Output metrics logs as envelopes in JSON format. For a complete list of cf tail options, see the Log Cache CLI repository in GitHub.

Where SERVICE-INSTANCE-NAME is the name of your service instance.

For example:

  $ cf tail -f my-instance | egrep 'connected|available|persistent_disk_used_percent'
  2019-05-17T11:25:59.48-0700 [my-instance] GAUGE /p.mysql/available:1.000000 boolean
  2019-05-17T11:26:29.49-0700 [my-instance] GAUGE /p.mysql/system/persistent_disk_used_percent:17.000000 percentage
  2019-05-17T11:26:29.49-0700 [my-instance] GAUGE /p.mysql/performance/threads_connected:**6**.000000 connection
  2019-05-17T11:26:59.50-0700 [my-instance] GAUGE /p.mysql/available:1.000000 boolean
  2019-05-17T11:27:29.50-0700 [my-instance] GAUGE /p.mysql/system/persistent_disk_used_percent:17.000000 percentage
  2019-05-17T11:27:29.50-0700 [my-instance] GAUGE /p.mysql/performance/threads_connected:**7**.000000 connection

For more information about the metrics output, see

For more information about how to enable Log Cache and about the cf tail command, see Enable Log Cache.

KPIs for MySQL service instances

KPIs are metrics for MySQL service instances that you can monitor for two purposes:

To ensure high performance.
To discover emerging issues.

KPIs can be either raw component metrics or derived metrics generated by applying formulas to raw metrics.

VMware provides the following KPIs as general alerting and response guidance for typical VMware Tanzu for MySQL installations. VMware recommends that you alter the alert measures by observing historical trends. You can also create your KPIs that are specific to your environment using the available component metrics.

For a list of all the VMware Tanzu for MySQL component metrics, see Component Metrics.

Server availability

/p.mysql/available
Description	If the MySQL Server is responding to requests. This indicates if the component is available. Use: If the server does not emit heartbeats, it is offline. Origin: Doppler/Firehose Type: Boolean Frequency: 30 s
Recommended measurement	Average over last 5 minutes
Recommended alert thresholds	Yellow warning: N/A Red critical: < 1
Recommended response	Check the MySQL Server logs for errors. You can find the instance by targeting your MySQL deployment with BOSH and inspecting logs for the instance. For more information, see Failing Jobs and Unhealthy Instances. If your service plan is an highly available (HA) cluster, you can also run `mysql-diag` to check logs for errors.

Persistent Disk Used

/p.mysql/system/persistent_disk_used_percent
Description	The percentage of disk used on the persistent file system. Use: MySQL cannot function correctly if there is not sufficient free space on the file systems. Use these metrics to ensure that you have disks large enough for your user base. Origin: Doppler/Firehose Type: Percent Frequency: 30 s (default)
Recommended measurement	Maximum of persistent disk used of all of nodes
Recommended alert thresholds	Single Node and Leader Follower: Yellow warning: > 25% Red critical: > 30% Highly Available Cluster: Yellow warning: > 80% Red critical: > 90%
Recommended response	Upgrade the service instance to a plan with larger disk capacity. For Tanzu SQL for VMs v2.9 and later, if you set the `optimize_for_short_words` parameter to `true`, then see Troubleshooting VMware Tanzu SQL with MySQL for VMs before upgrading the service.

Ephemeral Disk Used

/p.mysql/system/ephemeral_disk_used_percent
Description	The percentage of disk used on the ephemeral file system. Use: MySQL cannot function correctly if there is not sufficient free space on the file systems. Use these metrics to ensure that you have disks large enough for your user base. Origin: Doppler/Firehose Type: Percent Frequency: 30 s (default)
Recommended measurement	Maximum disk used of all nodes
Recommended alert thresholds	Yellow warning: > 80% Red critical: > 95%
Recommended response	Upgrade the service instance to a plan with larger disk capacity.

CPU use percentage

/p.mysql/performance/cpu_utilization_percent
Description	CPU time being consumed by the MySQL service. Use: A node that experiences context switching or high CPU use becomes unresponsive. This also affects the ability of the node to report metrics. Origin: Doppler/Firehose Type: Percent Frequency: 30 s (default)
Recommended measurement	Average over last 10 minutes
Recommended alert thresholds	Yellow warning: > 80 Red critical: > 90
Recommended response	Discover what is using so much CPU. If it is from normal processes, update the service instance to use a plan with larger CPU capacity.

Connections

/p.mysql/variables/max_connections /p.mysql/net/max_used_connections
Description	The maximum number of connections used over the maximum permitted number of simultaneous client connections. Use: If the number of connections drastically changes or if apps are unable to connect, there might be a network or app issue. Origin: Doppler/Firehose Type: count Frequency: 30 s
Recommended measurement	`max_used_connections` / `max_connections`
Recommended alert thresholds	Yellow warning: > 80 % Red critical: > 90 %
Recommended response	If this measurement meets or exceeds 80% with exponential growth, monitor app use to ensure that everything is working. When approaching 100% of maximum connections, apps might not always connect to the database. The connections/second for a service instance vary based on app instances and app use.

Queries Delta

/p.mysql/performance/queries_delta
Description	The number of statements executed by the server over the last 30 seconds. Use: The server always processes queries. If the server does not process queries, the server is non-functional. Origin: Doppler/Firehose Type: count Frequency: 30 s
Recommended measurement	Average over last 2 minutes
Recommended alert thresholds	Red critical: 0
Recommended response	Investigate the MySQL server logs, such as the audit log, to understand why query rate changed and decide on appropriate action.

Highly Available Cluster WSREP Ready

/p.mysql/galera/wsrep_ready
Description	Shows whether each cluster node can accept queries. Returns only 0 or 1. When this metric is 0, almost all queries to that node fail with the error: `ERROR 1047 (08501) Unknown Command` Use: Discover when nodes of a cluster were unable to communicate and accept transactions. Origin: Doppler/Firehose Type: Boolean Frequency: 30 s (default)
Recommended measurement	Average of values of each cluster node, over the last 5 minutes
Recommended alert thresholds	Yellow warning: < 1 Red critical: 0 (cluster is down)
Recommended response	Run `mysql-diag` and check the MySQL Server logs for errors. Ensure that no infrastructure event is affecting intra-cluster communication. Ensure that `wsrep_ready` is not set to off by using the query: `SHOW STATUS LIKE 'wsrep_ready';`.

Highly Available Cluster WSREP Cluster Size

/p.mysql/galera/wsrep_cluster_size
Description	The number of cluster nodes with which each node is communicating normally. Use: When running in a multi-node configuration, this metric indicates if each member of the cluster is communicating normally with all other nodes. Origin: Doppler/Firehose Type: count Frequency: 30 s (default)
Recommended measurement	(Average of the values of each node / cluster size), over the last 5 minutes
Recommended alert thresholds	Yellow warning: < 3 (availability compromised) Red critical: < 1 (cluster unavailable)
Recommended response	Run `mysql-diag` and check the MySQL Server logs for errors.

Highly Available Cluster WSREP Cluster Status

/p.mysql/galera/wsrep_cluster_status
Description	Shows the primary status of the cluster component that the node is in. Values are: `Primary`: Node has a quorum. `Non-primary`: Node has lost a quorum. `Disconnected`: Node is unable to connect to other nodes. Use: Any value other than `Primary` indicates that the node is part of a non-operational component. This occurs in cases of multiple membership changes that cause a loss of quorum. Origin: Doppler/Firehose Type: integer (see above) Frequency: 30 s (default)
Recommended measurement	Sum of each of the nodes, over the last 5 minutes
Recommended alert thresholds	Yellow warning: < 3 Red critical: < 1
Recommended response	Verify that all nodes are in working order and can receive write-sets Run `mysql-diag` and check the MySQL Server logs for errors

Hours Since Last Successful Backup

/p.mysql/p.mysql/last_successful_backup
Description	Using the configured backup schedule for the service instance as a threshold, this metric shows how many hours have passed since the last successful backup.
Recommended measurement	Hours elapsed since the last successful backup
Recommended alert thresholds	Red critical: Metric exceeds your organization's policy for maximum backup interval.
Recommended response	Check the adbr-agent logs for the service instance to determine why recent automated backups are failing.

Component Metrics

In addition to the KPIs, the MySQL service emits the followings metrics for monitoring and alerting:

MySQL metrics
Disk metrics
Leader-Follower metrics
Highly Available cluster Metrics

MySQL metrics

The metrics that all VMware Tanzu for MySQL service instances emit:

/p.mysql/available
Description	Indicates if the local database server is available and responding.
Unit	boolean

/p.mysql/variables/max_connections
Description	The maximum permitted number of simultaneous client connections.
Unit	count

/p.mysql/variables/open_files_limit
Description	The number of files that the operating system permits mysqld to open.
Unit	files

/p.mysql/variables/read_only
Description	Whether the server is in read-only mode
Unit	boolean

/p.mysql/performance/questions
Description	The number of statements run by the server since the server started or the last `FLUSH STATUS`. This includes only statements sent to the server by clients and not statements run within stored programs, unlike the `queries` variable.
Unit	count

/p.mysql/performance/queries
Description	The number of statements run by the server, excluding `COM_PING` and `COM_STATISTICS`. Differs from `questions` in that it also counts statements run within stored programs. Not affected by `FLUSH STATUS`.
Unit	count

/p.mysql/performance/queries_delta
Description	The change in the `/performance/queries` metric since the last time it was emitted.
Unit	integer greater than or equal to zero

/p.mysql/innodb/buffer_pool_pages_free
Description	The amount of free space, measured in pages, in the InnoDB Buffer Pool.
Unit	count

/p.mysql/innodb/buffer_pool_pages_total
Description	The total amount of free space, measured in pages, in the InnoDB Buffer Pool containing data.
Unit	count

/p.mysql/innodb/buffer_pool_pages_data
Description	The number of pages in the InnoDB Buffer Pool containing data. The number includes both dirty and clean pages.
Unit	count

/p.mysql/innodb/row_lock_current_waits
Description	The number of row locks currently being waited for by operations on InnoDB tables.
Unit	count

/p.mysql/innodb/data_read
Description	The amount of data read since the server started.
Unit	bytes

/p.mysql/innodb/data_written
Description	The amount of data written since the server started.
Unit	bytes

/p.mysql/innodb/mutex_os_waits
Description	The number of mutex OS waits.
Unit	events/second

/p.mysql/innodb/mutex_spin_rounds
Description	The number of mutex spin rounds.
Unit	events/second

/p.mysql/innodb/mutex_spin_waits
Description	The number of mutex spin waits.
Unit	events/second

/p.mysql/innodb/os_log_fsyncs
Description	The number of fsync() writes done to the InnoDB redo log files.
Unit	count

/p.mysql/innodb/row_lock_time
Description	Total time spent in acquiring row locks.
Unit	milliseconds

/p.mysql/innodb/row_lock_waits
Description	The number of times a row lock had to be waited for since the server started.
Unit	count

/p.mysql/net/connections
Description	The number of connection attempts to the server, both successful and unsuccessful, to the MySQL server.
Unit	count

/p.mysql/net/max_used_connections
Description	The maximum number of connections that have been in use simultaneously since the server started.
Unit	count

/p.mysql/performance/com_delete
Description	The number of delete commands since the server started or the last `FLUSH STATUS`.
Unit	count

/p.mysql/performance/com_delete_multi
Description	The number of delete-multi commands since the server started or the last `FLUSH STATUS`. Applies to DELETE statements that use multiple-table syntax.
Unit	count

/p.mysql/performance/com_insert
Description	The number of insert commands since the server started or the last `FLUSH STATUS`.
Unit	count

/p.mysql/performance/com_insert_select
Description	The number of insert-select commands since the server started or the last `FLUSH STATUS`.
Unit	count

/p.mysql/performance/com_replace_select
Description	The number of replace-select commands since the server started or the last `FLUSH STATUS`.
Unit	count

/p.mysql/performance/com_select
Description	The number of select commands since the server started or the last `FLUSH STATUS`. If a query result is returned from query cache, the server increments the `qcache_hits` status variable, not `com_select`.
Unit	count

/p.mysql/performance/com_update
Description	The number of update commands since the server started or the last `FLUSH STATUS`.
Unit	count

/p.mysql/performance/com_update_multi
Description	The number of update-multi commands since the server started or the last `FLUSH STATUS`. Applies to UPDATE statements that use multiple-table syntax.
Unit	count

/p.mysql/performance/created_tmp_disk_tables
Description	The number of internal on-disk temporary tables created by the server while executing statements.
Unit	count

/p.mysql/performance/created_tmp_files
Description	The number of temporary files created by mysqld.
Unit	count

/p.mysql/performance/created_tmp_tables
Description	The number of internal temporary tables created by the server while executing statements.
Unit	count

/p.mysql/performance/cpu_utilization_percent
Description	The percent of the CPU in use by all processes on the MySQL node.
Unit	percent

/p.mysql/performance/open_files
Description	The number of regular files currently open, which were opened by the server.
Unit	count

/p.mysql/performance/open_tables
Description	The number of tables that are currently open.
Unit	count

/p.mysql/performance/opened_tables
Description	The number of tables that have been opened.
Unit	count

/p.mysql/performance/open_table_definitions
Description	The number of currently cached table definitions or `.frm` files.
Unit	count

/p.mysql/performance/opened_table_definitions
Description	The number of `.frm` files that have been cached.
Unit	count

/p.mysql/performance/qcache_hits
Description	The number of query cache hits. The query cache and qcache_hits metric is deprecated as of MySQL v5.7.20.
Unit	count

/p.mysql/performance/slow_queries
Description	The number of queries that have taken more than `long_query_time` seconds.
Unit	count

/p.mysql/performance/table_locks_waited
Description	The total number of times that a request for a table lock could not be granted immediately and a wait was needed.
Unit	count

/p.mysql/performance/threads_connected
Description	The number of currently open connections.
Unit	count

/p.mysql/performance/threads_running
Description	The number of threads that are not sleeping.
Unit	count

/p.mysql/rpl_semi_sync_master_tx_avg_wait_time
Description	The average time the leader has waited for the follower to accept transactions.
Unit	microseconds

/p.mysql/rpl_semi_sync_master_no_tx
Description	The number of transactions committed without follower acknowledgement.
Unit	count

/p.mysql/rpl_semi_sync_master_wait_sessions
Description	The current number of connections waiting for a sync commit. For more information about sync replication, see About Synchronous Replication.
Unit	count

Disk metrics

The disk usage metrics that all VMware Tanzu for MySQL services emit:

/p.mysql/system/persistent_disk_used_percent
Description	The percentage of disk used on the persistent file system.
Unit	percent

/p.mysql/system/persistent_disk_used
Description	The amount of space used on the persistent disk.
Unit	KB

/p.mysql/system/persistent_disk_free
Description	The amount of space available on the persistent disk.
Unit	KB

/p.mysql/system/persistent_disk_inodes_used_percent
Description	The percentage of persistent disk inodes used by both the system and user applications.
Unit	percent

/p.mysql/system/persistent_disk_inodes_used
Description	The number of inodes used on the persistent disk.
Unit	count

/p.mysql/system/persistent_disk_inodes_free
Description	The number of inodes available on the persistent disk.
Unit	count

/p.mysql/system/ephemeral_disk_used_percent
Description	The percentage of disk used on the ephemeral file system.
Unit	percent

/p.mysql/system/ephemeral_disk_used
Description	The amount of space used on the ephemeral disk.
Unit	KB

/p.mysql/system/ephemeral_disk_free
Description	The amount of space available on the ephemeral disk.
Unit	KB

/p.mysql/system/ephemeral_disk_inodes_used_percent
Description	The percentage of ephemeral disk inodes used by both the system and user applications.
Unit	percent

/p.mysql/system/ephemeral_disk_inodes_used
Description	The number of inodes used on the ephemeral disk.
Unit	count

/p.mysql/system/ephemeral_disk_inodes_free
Description	The number of inodes available on the ephemeral disk.
Unit	count

Leader-Follower metrics

The metrics that leader-follower VMs emit:

/p.mysql/follower/is_follower
Description	Shows whether a node is the follower VM.
Unit	boolean

/p.mysql/follower/seconds_behind_master
Description	The number of seconds the follower is behind in applying writes from the leader. For example, a follower might have copied writes from the leader that are timestamped up to 11:00am, but has only applied writes up to 8:00am. Normal values for this metric depend on your apps.
Unit	seconds

/p.mysql/follower/seconds_since_leader_heartbeat
Description	The number of seconds that elapse between the leader heartbeat and the replication of the heartbeat in the follower. You can use this metric to determine how far behind the follower is from the leader. Normal values for this metric depends on your app.
Unit	seconds

/p.mysql/follower/relay_log_space
Description	The total size of all existing relay log files.
Unit	bytes

/p.mysql/follower/slave_io_running
Description	Shows whether the I/O thread has started and has connected to the leader VM.
Unit	boolean

/p.mysql/follower/slave_sql_running
Description	Shows whether the SQL thread has started.
Unit	boolean

Highly Available Cluster Metrics

The metrics that HA clusters emit:

/p.mysql/galera/wsrep_cluster_size
Description	The current number of nodes in the HA cluster.
Unit	count

/p.mysql/galera/wsrep_local_recv_queue
Description	The current length of the local receive queue, in messages.
Unit	count

/p.mysql/galera/wsrep_local_send_queue
Description	The current length of the local send queue, in messages.
Unit	count

/p.mysql/galera/wsrep_local_index
Description	This node index in the cluster (base 0).
Unit	count

/p.mysql/galera/wsrep_local_state
Description	The local state of the node. Possible states include: 1 = `JOINING` 2 = `DONOR/DESYNCED` 3 = `JOINED` 4 = `SYNCED`
Unit	integer

/p.mysql/galera/wsrep_ready
Description	Shows whether the node can accept queries.
Unit	boolean

/p.mysql/galera/wsrep_cluster_status
Description	Shows the primary status of the cluster component that the node is in. Values are: `Primary`: Node has a quorum. `Non-primary`: Node has lost a quorum. `Disconnected`: Node is unable to connect to other nodes.
Unit	Status code

/p.mysql/galera/wsrep_flow_control_paused
Description	Proportion of time, as a unit interval (0 to 1), that replication was paused due to flow control since the server started or last `FLUSH STATUS`. This metric is a measure of how much replication lag is slowing down the cluster.
Unit	float

/p.mysql/galera/wsrep_flow_control_sent
Description	Number of `FC_PAUSE` or flow control pause events sent by this node. Unlike many status variables, the counter for this metric does not reset every time you run the query.
Unit	count

/p.mysql/galera/wsrep_flow_control_recv
Description	Number of `FC_PAUSE` or flow control pause events received by this node. Unlike many status variables, the counter for this metric does not reset every time you run the query.
Unit	count