The KPIs described here give operators general guidance on monitoring a TAS for VMs deployment using platform component and system (BOSH) metrics. Although the platform emits many metrics, the KPIs are high-signal-value metrics that can indicate emerging platform issues.
This alerting and response guidance has been shown to apply to most deployments. VMware recommends that operators continue to fine-tune the alert measures to their deployment by observing historical trends. VMware also recommends that operators expand beyond this guidance and create new, deployment-specific monitoring metrics, thresholds, and alerts based on learning from their deployments.
Thresholds noted as “dynamic” in the following tables indicate that while a metric is highly important to watch, the values at which to set warning thresholds are specific to a given TAS for VMs deployment and its use cases. Revisit these dynamic thresholds occasionally, because the foundation and its usage continue to evolve. For more information, see Warning and Critical Thresholds in Selecting and Configuring a Monitoring System.
While the performance impact on TAS for VMs is considered when building new features, VMware does not perform discrete load and scale testing on a regular basis. It is impossible to test each configuration of TAS for VMs with every type of infrastructure and app workload. Therefore, VMware recommends using the following KPIs to identify when to scale your system.
For more information about accessing metrics used in these key performance indicators, see Overview of Logging and Metrics.
These sections describe Diego Auctioneer metrics.
auctioneer.AuctioneerLRPAuctionsFailed |
|
---|---|
Description | The number of Long Running Process (LRP) instances that the Auctioneer failed to place on Diego Cells. This metric is cumulative over the lifetime of the Auctioneer job. Use: This metric can indicate that TAS for VMs is out of container space or that there is a lack of resources within your environment. This indicator also increases when the LRP is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work. This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no app instances being scheduled. This error most commonly results from capacity issues, for example, when Diego Cells do not have enough resources or when Diego Cells go back and forth between a healthy and unhealthy state. Origin: Firehose Type: Counter (Integer) Frequency: During each auction |
Recommended measurement | Per minute delta averaged over a 5-minute window |
Recommended alert thresholds | Yellow warning: ≥ 0.5 Red critical: ≥ 1 |
Recommended response |
|
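
The recommended measurement for this counter can be sketched as follows. This is a minimal illustration, not part of TAS for VMs: the function names, the `(timestamp, value)` sample format, and the handling of counter resets are assumptions. Only the 5-minute window and the 0.5 and 1 thresholds come from the table above.

```python
from typing import List, Tuple

# Hypothetical sample format: (unix timestamp in seconds, cumulative counter value),
# sorted by timestamp.
Sample = Tuple[float, float]


def per_minute_delta(samples: List[Sample], window_s: float = 300.0) -> float:
    """Average per-minute increase of a cumulative counter over the trailing window."""
    if len(samples) < 2:
        return 0.0
    latest = samples[-1][0]
    recent = [s for s in samples if s[0] >= latest - window_s]
    increase = 0.0
    for (_, prev), (_, cur) in zip(recent, recent[1:]):
        # Assumption: a drop means the job restarted and the counter began again at 0.
        increase += cur - prev if cur >= prev else cur
    return increase / (window_s / 60.0)


def classify(rate: float, warn: float = 0.5, crit: float = 1.0) -> str:
    """Map a per-minute rate onto the yellow and red thresholds."""
    if rate >= crit:
        return "critical"
    if rate >= warn:
        return "warning"
    return "ok"
```

Because the metric is emitted on event, a window with fewer than two samples is treated here as a rate of 0 rather than as missing data.
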
auctioneer.AuctioneerFetchStatesDuration |
|
---|---|
Description | Time in ns that the Auctioneer took to fetch state from all the Diego Cells when running its auction. Use: Indicates how the Diego Cells themselves are performing. Alerting on this metric helps surface that app staging requests to Diego might be failing. Origin: Firehose Type: Gauge, integer in ns Frequency: During event, during each auction |
Recommended measurement | Maximum over the last 5 minutes divided by 1,000,000,000 |
Recommended alert thresholds | Yellow warning: ≥ 2 s Red critical: ≥ 5 s |
Recommended response |
|
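
The nanosecond-to-second conversion used by this measurement can be sketched as below. The function names are hypothetical and the input is assumed to be the list of gauge values received in the last 5 minutes. The default thresholds of 2 s and 5 s come from the table above.

```python
from typing import Sequence

NS_PER_SECOND = 1_000_000_000


def max_seconds(ns_values: Sequence[int]) -> float:
    """Largest gauge value in the window, converted from nanoseconds to seconds."""
    return max(ns_values) / NS_PER_SECOND if ns_values else 0.0


def classify_duration(seconds: float, warn_s: float = 2.0, crit_s: float = 5.0) -> str:
    """Map a duration in seconds onto the yellow and red thresholds."""
    if seconds >= crit_s:
        return "critical"
    if seconds >= warn_s:
        return "warning"
    return "ok"
```

The same helpers apply to the other duration gauges in this topic that are reported in ns, by passing that metric's own thresholds as `warn_s` and `crit_s`.
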
auctioneer.AuctioneerLRPAuctionsStarted |
|
---|---|
Description | The number of LRP instances that the Auctioneer successfully placed on Diego Cells. This metric is cumulative over the lifetime of the Auctioneer job. Use: Provides a sense of running system activity levels in your environment. Can also give you a sense of how many app instances have been started over time. The measurement VMware recommends can help indicate a significant amount of container churn. However, for capacity planning purposes, it is more helpful to observe deltas over a long time window. This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no app instances being scheduled. Origin: Firehose Type: Counter (Integer) Frequency: During event, during each auction |
Recommended measurement | Per minute delta averaged over a 5-minute window |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response | When observing a significant amount of container churn:
|
auctioneer.AuctioneerTaskAuctionsFailed |
|
---|---|
Description | The number of Tasks that the Auctioneer failed to place on Diego Cells. This metric is cumulative over the lifetime of the Auctioneer job. Use: Failing Task auctions indicate a lack of resources within your environment and that you likely need to scale. This indicator also increases when the Task is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work. This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no tasks being scheduled. This error most commonly results from capacity issues, for example, when Diego Cells do not have enough resources or when Diego Cells go back and forth between a healthy and unhealthy state. Origin: Firehose Type: Counter (Float) Frequency: During event, during each auction |
Recommended measurement | Per minute delta averaged over a 5-minute window |
Recommended alert thresholds | Yellow warning: ≥ 0.5 Red critical: ≥ 1 |
Recommended response |
|
These sections describe Diego BBS metrics.
bbs.ConvergenceLRPDuration |
|
---|---|
Description | Time in ns that the BBS took to run its LRP convergence pass. Use: If the convergence run begins taking too long, apps or tasks might fail without restarting. This symptom can also indicate loss of connectivity to the BBS database. Origin: Firehose Type: Gauge (Integer in ns) Frequency: During event, every 30 seconds when LRP convergence runs, emission should be near-constant on a running deployment |
Recommended measurement | Maximum over the last 15 minutes divided by 1,000,000,000 |
Recommended alert thresholds | Yellow warning: ≥ 10 s Red critical: ≥ 20 s |
Recommended response |
|
bbs.RequestLatency |
|
---|---|
Description | The maximum observed latency time over the past 60 seconds that the BBS took to handle requests across all its API endpoints. Diego now aggregates this metric to emit the maximum value observed over 60 seconds. Use: If this metric rises, the TAS for VMs API is slowing. Response to certain cf CLI commands is slow if request latency is high. Origin: Firehose Type: Gauge (Integer in ns) Frequency: 60 s |
Recommended measurement | Average over the last 15 minutes divided by 1,000,000,000 |
Recommended alert thresholds | Yellow warning: ≥ 5 s Red critical: ≥ 10 s |
Recommended response |
|
bbs.Domain.cf-apps |
|
---|---|
Description | Indicates if the cf-apps Domain is up-to-date, meaning that TAS for VMs app requests from Cloud Controller are synchronized to bbs.LRPsDesired (Diego-desired AIs) for execution. Use: If the cf-apps Domain does not stay up-to-date, changes requested in the Cloud Controller are not guaranteed to propagate throughout the system. If the Cloud Controller and Diego are out of sync, then apps running can vary from those desired. Origin: Firehose Type: Gauge (Float) Frequency: 30 s |
Recommended measurement | Value over the last 5 minutes |
Recommended alert thresholds | Yellow warning: N/A Red critical: ≠ 1 The threshold value VMware recommends represents a state where an up-to-date value of 1 has not been received for the entire 5-minute window. |
Recommended response |
|
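
The red threshold above fires only when no value of 1 arrives during the whole window. A minimal sketch of that check follows, assuming samples are `(timestamp, value)` pairs and that `domain_alert` is a hypothetical helper in your own alerting code.

```python
from typing import Iterable, Tuple


def domain_alert(samples: Iterable[Tuple[float, float]], now: float,
                 window_s: float = 300.0) -> bool:
    """Return True (critical) when no value of 1 was received in the trailing window."""
    return not any(
        value == 1 and now - window_s <= ts <= now
        for ts, value in samples
    )
```

Treating a missing metric the same as a stale one is a design choice in this sketch: with a 30 s emission frequency, an empty 5-minute window is itself a sign that something is wrong.
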
bbs.LRPsExtra |
|
---|---|
Description | Total number of LRP instances that are no longer desired but still have a BBS record. When Diego wants to add more apps, the BBS sends a request to the Auctioneer to spin up additional LRPs. Use: If Diego has more LRPs running than expected, there might be problems with the BBS. Deleting an app with many instances can temporarily spike this metric. However, a sustained spike in bbs.LRPsExtra is unusual and should be investigated. Origin: Firehose Type: Gauge (Float) Frequency: 30 s |
Recommended measurement | Average over the last 5 minutes |
Recommended alert thresholds | Yellow warning: ≥ 5 Red critical: ≥ 10 |
Recommended response |
|
bbs.LRPsMissing |
|
---|---|
Description | Total number of LRP instances that are desired but have no record in the BBS. When Diego wants to add more apps, the BBS sends a request to the Auctioneer to spin up additional LRPs. Use: If Diego has fewer LRPs running than expected, there might be problems with the BBS. An app push with many instances can temporarily spike this metric. However, a sustained spike in bbs.LRPsMissing is unusual and should be investigated. Origin: Firehose Type: Gauge (Float) Frequency: 30 s |
Recommended measurement | Average over the last 5 minutes |
Recommended alert thresholds | Yellow warning: ≥ 5 Red critical: ≥ 10 |
Recommended response |
|
bbs.CrashedActualLRPs |
|
---|---|
Description | Total number of LRP instances that have crashed. Use: Indicates how many instances in the deployment are in a crashed state. An increase in bbs.CrashedActualLRPs can indicate several problems, from a bad app with many instances associated, to a platform issue that is resulting in app crashes. Use this metric to help create a baseline for your deployment. After you have a baseline, you can create a deployment-specific alert to notify of a spike in crashes higher than the trend line. Tune alert values to your deployment. Origin: Firehose Type: Gauge (Float) Frequency: 30 s |
Recommended measurement | Average over the last 5 minutes |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response |
|
1 hr average of bbs.LRPsRunning – prior 1 hr average of bbs.LRPsRunning |
|
---|---|
Description | Rate of change in app instances being started or stopped on the platform. It is derived from bbs.LRPsRunning, which represents the total number of LRP instances that are running on Diego Cells. Use: Delta reflects upward or downward trend for app instances started or stopped. Helps to provide a picture of the overall growth trend of the environment for capacity planning. You might want to alert on delta values outside of the expected range. Origin: Firehose Type: Gauge (Float) Frequency: During event, emission must be constant on a running deployment |
Recommended measurement | derived = (1-hour average of bbs.LRPsRunning – prior 1-hour average of bbs.LRPsRunning) |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response | Scale components as necessary. |
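
The derived measurement above compares the most recent hour with the hour before it. A minimal sketch under stated assumptions: samples are `(timestamp, value)` pairs for bbs.LRPsRunning, and `hourly_trend` is a hypothetical helper name.

```python
from statistics import mean
from typing import Iterable, Optional, Tuple


def hourly_trend(samples: Iterable[Tuple[float, float]], now: float,
                 hour_s: float = 3600.0) -> Optional[float]:
    """1-hour average minus the prior 1-hour average, or None if either hour is empty."""
    samples = list(samples)
    current = [v for t, v in samples if now - hour_s <= t < now]
    prior = [v for t, v in samples if now - 2 * hour_s <= t < now - hour_s]
    if not current or not prior:
        return None
    return mean(current) - mean(prior)
```

A positive result indicates net growth in running app instances and a negative result indicates net shrinkage. The range that counts as expected is deployment-specific, which is why the thresholds are dynamic.
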
bbs.BBSMasterElected |
|
---|---|
Description | Indicates when there is a BBS master election. A BBS master election takes place when a BBS instance has taken over as the active instance. A value of 1 is emitted when the election takes place. Use: This metric emits when a redeployment of the BBS occurs. If this metric is emitted frequently outside of a deployment, this might be a signal of underlying problems that should be investigated. If the active BBS is continually changing, this can cause app push downtime. Origin: Firehose Type: Gauge (Float) Frequency: On event |
Recommended measurement | N/A, the most effective visualization is as a stacked bar chart |
Recommended alert thresholds | Yellow warning: N/A Red critical: N/A |
Recommended response |
|
These sections describe Diego Cell metrics.
rep.CapacityRemainingMemory |
|
---|---|
Description | Remaining amount of memory in MiB available for this Diego Cell to allocate to containers. Use: Indicates the available Diego Cell memory. Insufficient Diego Cell memory can prevent pushing and scaling apps. The strongest operational value of this metric is to interpret a deployment's average app size and to monitor and alert on whether at least some Diego Cells have enough capacity to accept pushes of standard-size apps. For example, if pushing a 4 GB app, Diego can have trouble placing that app if no single Diego Cell has free capacity of 4 GB or greater. As an example, VMware Cloud Ops uses a standard of 4 GB, and computes and monitors for the number of Diego Cells with at least 4 GB free. When the number of Diego Cells with at least 4 GB falls below a defined threshold, this is a scaling indicator alert to increase capacity. This free chunk count threshold should be tuned to the deployment size and the standard size of apps being pushed to the deployment. Origin: Firehose Type: Gauge (Integer in MiB) Frequency: 60 s |
Recommended measurement | For alerting:
Looking at this metric ( rep.CapacityRemainingMemory ) as a minimum value per Diego Cell has more informational value than alerting value. It can be an interesting heatmap visualization, showing average variance and density over time. |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response |
|
Alternative Metric | If you are using Healthwatch, VMware recommends using the metric healthwatch.Diego.AvailableFreeChunks. For more information, see Healthwatch Metrics in the Healthwatch documentation. |
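
If you compute the free chunk count yourself rather than through Healthwatch, the approach described above can be sketched as follows. The function names, the cell names, and the dictionary input are illustrative assumptions. The 4 GB (4096 MiB) chunk size mirrors the VMware Cloud Ops example, and `min_cells` stands for the deployment-specific threshold.

```python
from typing import Mapping


def cells_with_free_chunk(remaining_mib_by_cell: Mapping[str, int],
                          chunk_mib: int = 4096) -> int:
    """Count Diego Cells whose rep.CapacityRemainingMemory is at least one chunk."""
    return sum(1 for remaining in remaining_mib_by_cell.values()
               if remaining >= chunk_mib)


def needs_scaling(remaining_mib_by_cell: Mapping[str, int], min_cells: int,
                  chunk_mib: int = 4096) -> bool:
    """True when fewer than min_cells Diego Cells can accept a standard-size app."""
    return cells_with_free_chunk(remaining_mib_by_cell, chunk_mib) < min_cells
```

For example, `needs_scaling({"cell-0": 8192, "cell-1": 4096, "cell-2": 1024}, min_cells=3)` reports that capacity should be increased, because only two Diego Cells have 4096 MiB or more free.
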
rep.CapacityRemainingMemory (Alternative Use) |
|
---|---|
Description | Remaining amount of memory in MiB available for this Diego Cell to allocate to containers. Use: Can indicate low memory capacity overall in the platform. Low memory can prevent app scaling and new deployments. The overall sum of capacity can indicate that you need to scale the platform. Observing capacity consumption trends over time helps with capacity planning. Origin: Firehose Type: Gauge (Integer in MiB) Frequency: 60 s |
Recommended measurement | Minimum over the last 5 minutes divided by 1024 (across all instances) |
Recommended alert thresholds | Yellow warning: ≤ 64 GB Red critical: ≤ 32 GB |
Recommended response |
|
Alternative Metric | If you are using Healthwatch, VMware recommends the metric healthwatch.Diego.AvailableFreeChunks for this purpose. For more information, see Healthwatch Metrics in the Healthwatch documentation. |
rep.CapacityRemainingDisk |
|
---|---|
Description | Remaining amount of disk in MiB available for this Diego Cell to allocate to containers. Use: Indicates the available Diego Cell disk. Insufficient free disk on Diego Cells prevents the staging or starting of apps or tasks, resulting in error messages like ERR Failed to stage app: insufficient resources. Because Diego fails to stage without at least 6 GB free, unreserved disk space on a given Diego Cell, the strongest operational value of this metric is to ensure that at least some Diego Cells have a large enough disk capacity to support the staging of apps and tasks. VMware recommends computing and monitoring for the number of Diego Cells with at least 6 GB Disk free. When the number of Diego Cells with at least 6 GB falls below a defined threshold, this is a scaling indicator alert to increase capacity. The alerting threshold value for the amount of free chunks of Disk should be tuned to the deployment size and the standard size of apps being pushed to the deployment. Origin: Firehose Type: Gauge (Integer in MiB) Frequency: 60 s |
Recommended measurement | For alerting:
Looking at this metric ( rep.CapacityRemainingDisk ) as a minimum value per Diego Cell has more informational value than alerting value. It can be an interesting heatmap visualization, showing average variance and density over time. |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response |
|
Alternative Metric | If you are using Healthwatch, VMware recommends the metric healthwatch.Diego.AvailableFreeChunks for this purpose. For more information, see Healthwatch Metrics in the Healthwatch documentation. |
rep.CapacityRemainingDisk (Alternative Use) |
|
---|---|
Description | Remaining amount of disk in MiB available for this Diego Cell to allocate to containers. Use: Low disk capacity can prevent app scaling and new deployments. Because Diego staging Tasks can fail without at least 6 GB free, the red threshold VMware recommends is based on the minimum disk capacity across the deployment falling below 6 GB in the previous 5 minutes. It can also be advantageous to assess how many chunks of free disk space are higher than a given threshold, similar to rep.CapacityRemainingMemory. Origin: Firehose Type: Gauge (Integer in MiB) Frequency: 60 s |
Recommended measurement | Minimum over the last 5 minutes divided by 1024 (across all instances) |
Recommended alert thresholds | Yellow warning: ≤ 12 GB Red critical: ≤ 6 GB |
Recommended response |
|
Alternative Metric | If you are using Healthwatch, VMware recommends the metric healthwatch.Diego.AvailableFreeChunks for this purpose. For more information, see Healthwatch Metrics in the Healthwatch documentation. |
rep.RepBulkSyncDuration |
|
---|---|
Description | Time in ns that the Diego Cell Rep took to sync the ActualLRPs that it claimed with its actual Garden containers. Use: Sync times that are too high can indicate issues with the BBS. Origin: Firehose Type: Gauge (Float in ns) Frequency: 30 s |
Recommended measurement | Maximum over the last 15 minutes divided by 1,000,000,000 |
Recommended alert thresholds | Yellow warning: ≥ 5 s Red critical: ≥ 10 s |
Recommended response |
|
rep.GardenHealthCheckFailed |
|
---|---|
Description | The Diego Cell periodically checks its health against the Garden back end. For Diego Cells, 0 means healthy, and 1 means unhealthy. Use: Set an alert for further investigation if multiple unhealthy Diego Cells are detected in the given time window. If one Diego Cell is impacted, it does not participate in auctions, but end-user impact is usually low. If multiple Diego Cells are impacted, this can indicate a larger problem with Diego, and should be considered a more critical investigation need. The suggested alert threshold is based on multiple unhealthy Diego Cells in the given time window. Although end-user impact is usually low if only one Diego Cell is impacted, this should still be investigated. Particularly in a lower capacity environment, this situation can result in negative end-user impact if left unresolved. Origin: Firehose Type: Gauge (Float, 0-1) Frequency: 30 s |
Recommended measurement | Maximum over the last 5 minutes |
Recommended alert thresholds | Yellow warning: = 1 Red critical: > 1 |
Recommended response |
|
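
One way to read the thresholds above is as a count of unhealthy Diego Cells in the window: one unhealthy Diego Cell is a warning, and more than one is critical. The sketch below implements that interpretation, which is an assumption, as are the function names and the per-cell dictionary of gauge values.

```python
from typing import List, Mapping, Sequence


def unhealthy_cells(window_by_cell: Mapping[str, Sequence[float]]) -> List[str]:
    """Names of Diego Cells whose maximum gauge value in the window was 1 (unhealthy)."""
    return sorted(cell for cell, values in window_by_cell.items()
                  if values and max(values) >= 1)


def classify_garden_health(window_by_cell: Mapping[str, Sequence[float]]) -> str:
    """Warning for exactly one unhealthy Diego Cell, critical for more than one."""
    count = len(unhealthy_cells(window_by_cell))
    if count > 1:
        return "critical"
    if count == 1:
        return "warning"
    return "ok"
```

Using the per-cell maximum means that a Diego Cell that was unhealthy at any point in the last 5 minutes still counts, even if it has since recovered.
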
These sections describe Diego Locket metrics.
bbs.LockHeld |
|
---|---|
Description | Whether a BBS instance holds the expected BBS lock (in Locket). 1 means the active BBS server holds the lock, and 0 means the lock was lost. Use: This metric is complementary to Active Locks, and it offers a BBS-level version of the Locket metrics. Although it is emitted per BBS instance, only 1 active lock is held by BBS. Therefore, the expected value is 1. The metric might be 0 when the BBS instances are performing a leader transition, but a prolonged value of 0 indicates an issue with BBS. Origin: Firehose Type: Gauge Frequency: Periodically |
Recommended measurement | Maximum over the last 5 minutes |
Recommended alert thresholds | Yellow warning: N/A Red critical: ≠ 1 |
Recommended response |
|
auctioneer.LockHeld |
|
---|---|
Description | Whether an Auctioneer instance holds the expected Auctioneer lock (in Locket). 1 means the active Auctioneer holds the lock, and 0 means the lock was lost. Use: This metric is complementary to Active Locks, and it offers an Auctioneer-level version of the Locket metrics. Although it is emitted per Auctioneer instance, only 1 active lock is held by Auctioneer. Therefore, the expected value is 1. The metric might be 0 when the Auctioneer instances are performing a leader transition, but a prolonged value of 0 indicates an issue with Auctioneer. Origin: Firehose Type: Gauge Frequency: Periodically |
Recommended measurement | Maximum over the last 5 minutes |
Recommended alert thresholds | Yellow warning: N/A Red critical: ≠ 1 |
Recommended response |
|
locket.ActivePresences |
|
---|---|
Description | Total count of active presences. Presences are defined as the registration records that the Diego Cells maintain to advertise themselves to the platform. Use: If the Active Presences count is far from the expected, there might be a problem with Diego. The number of active presences varies according to the number of Diego Cells deployed. Therefore, during purposeful scale adjustments to TAS for VMs, this alerting threshold should be adjusted. Establish an initial threshold by observing the historical trends for the deployment over a brief period of time. Increase the threshold as more Diego Cells are deployed. During a rolling deploy, this metric shows variance during the BOSH lifecycle when Diego Cells are evacuated and restarted. Tolerable variance is within the bounds of the BOSH maximum in-flight range for the instance group. Origin: Firehose Type: Gauge Frequency: 60 s |
Recommended measurement | Maximum over the last 15 minutes |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response |
|
These sections describe Diego Route Emitter metrics.
route_emitter.RouteEmitterSyncDuration |
|
---|---|
Description | Time in ns that the active Route Emitter took to perform its synchronization pass. Use: Increases in this metric indicate that the Route Emitter might have trouble maintaining an accurate routing table to broadcast to the Gorouters. Tune alerting values to your deployment based on historical data and adjust based on observations over time. The suggested starting point is ≥ 5 s for the yellow threshold and ≥ 10 s for the critical threshold. Origin: Firehose Type: Gauge (Float in ns) Frequency: 60 s |
Recommended measurement | Maximum, per job, over the last 15 minutes divided by 1,000,000,000 |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response | If all or many jobs show as impacted, there is likely an issue with Diego.
|
These sections describe TAS for VMs MySQL KPIs.
When TAS for VMs uses an internal MySQL database, as configured in the Databases pane of the TAS for VMs tile, the database cluster generates KPIs as described here.
This section assumes you are using the Internal Databases - MySQL - Percona XtraDB Cluster option as your system database.
/mysql/available |
|
---|---|
Description | If the MySQL Server is responding to requests. This indicates if the component is available. Use: If the server does not emit heartbeats, it is offline. Origin: Doppler/Firehose Type: Boolean Frequency: 30 s |
Recommended measurement | Average over last 5 minutes |
Recommended alert thresholds | Yellow warning: N/A Red critical: < 1 |
Recommended response | Check the MySQL Server logs for errors. You can find the instance by targeting your MySQL deployment with BOSH and inspecting logs for the instance. For more information, see Failing Jobs and Unhealthy Instances. If your service plan is a highly available (HA) cluster, you can also run mysql-diag to check logs for errors. |
/mysql/system/persistent_disk_used_percent |
|
---|---|
Description | The percentage of disk used on the persistent file system. Use: MySQL cannot function correctly if there is not sufficient free space on the file systems. Use these metrics to ensure that you have disks large enough for your user base. Origin: Doppler/Firehose Type: Percent Frequency: 30 s (default) |
Recommended measurement | Maximum persistent disk used across all nodes |
Recommended alert thresholds | Single Node and Leader Follower:
|
Recommended response | Upgrade the service instance to a plan with larger disk capacity. For Tanzu SQL for VMs v2.9 and later, if you set the optimize_for_short_words parameter to true , then see Troubleshooting VMware Tanzu SQL with MySQL for VMs before upgrading the service. |
/mysql/system/ephemeral_disk_used_percent |
|
---|---|
Description | The percentage of disk used on the ephemeral file system. Use: MySQL cannot function correctly if there is not sufficient free space on the file systems. Use these metrics to ensure that you have disks large enough for your user base. Origin: Doppler/Firehose Type: Percent Frequency: 30 s (default) |
Recommended measurement | Maximum ephemeral disk used across all nodes |
Recommended alert thresholds | Yellow warning: > 80% Red critical: > 95% |
Recommended response | Upgrade the service instance to a plan with larger disk capacity. |
/mysql/performance/cpu_utilization_percent |
|
---|---|
Description | CPU time being consumed by the MySQL service. Use: A node that experiences context switching or high CPU use becomes unresponsive. This also affects the ability of the node to report metrics. Origin: Doppler/Firehose Type: Percent Frequency: 30 s (default) |
Recommended measurement | Average over last 10 minutes |
Recommended alert thresholds | Yellow warning: > 80% Red critical: > 90% |
Recommended response | Discover what is using so much CPU. If it is from normal processes, update the service instance to use a plan with larger CPU capacity. |
/mysql/variables/max_connections /p.mysql/net/max_used_connections |
|
---|---|
Description | The maximum number of connections used over the maximum permitted number of simultaneous client connections. Use: If the number of connections drastically changes or if apps are unable to connect, there might be a network or app issue. Origin: Doppler/Firehose Type: count Frequency: 30 s |
Recommended measurement | max_used_connections / max_connections |
Recommended alert thresholds | Yellow warning: > 80% Red critical: > 90% |
Recommended response | If this measurement meets or exceeds 80% with exponential growth, monitor app use to ensure that everything is working. When approaching 100% of maximum connections, apps might not always connect to the database. The connections/second for a service instance vary based on app instances and app use. |
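
The ratio in the recommended measurement can be expressed as a percentage and compared against the thresholds above. This is a minimal sketch. The function names are hypothetical and the inputs are assumed to be the latest values of the two metrics.

```python
def connection_usage_percent(max_used_connections: int, max_connections: int) -> float:
    """max_used_connections / max_connections, expressed as a percentage."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 100.0 * max_used_connections / max_connections


def classify_connection_usage(percent: float, warn: float = 80.0,
                              crit: float = 90.0) -> str:
    """Thresholds are strict: exactly 80% does not yet trigger a warning."""
    if percent > crit:
        return "critical"
    if percent > warn:
        return "warning"
    return "ok"
```
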
/mysql/performance/queries_delta |
|
---|---|
Description | The number of statements executed by the server over the last 30 seconds. Use: The server always processes queries. If the server does not process queries, the server is non-functional. Origin: Doppler/Firehose Type: count Frequency: 30 s |
Recommended measurement | Average over last 2 minutes |
Recommended alert thresholds | Red critical: 0 |
Recommended response | Investigate the MySQL server logs, such as the audit log, to understand why query rate changed and decide on appropriate action. |
/mysql/galera/wsrep_ready |
|
---|---|
Description | Shows whether each cluster node can accept queries. Returns only 0 or 1. When this metric is 0, almost all queries to that node fail with the error: ERROR 1047 (08S01) Unknown Command Use: Discover when nodes of a cluster were unable to communicate and accept transactions. Origin: Doppler/Firehose Type: Boolean Frequency: 30 s (default) |
Recommended measurement | Average of values of each cluster node, over the last 5 minutes |
Recommended alert thresholds | Yellow warning: < 1 Red critical: 0 (cluster is down) |
Recommended response |
|
/mysql/galera/wsrep_cluster_size |
|
---|---|
Description | The number of cluster nodes with which each node is communicating normally. Use: When running in a multi-node configuration, this metric indicates if each member of the cluster is communicating normally with all other nodes. Origin: Doppler/Firehose Type: count Frequency: 30 s (default) |
Recommended measurement | (Average of the values of each node / cluster size), over the last 5 minutes |
Recommended alert thresholds | Yellow warning: < 3 (availability compromised) Red critical: < 1 (cluster unavailable) |
Recommended response | Run mysql-diag and check the MySQL Server logs for errors. |
/mysql/galera/wsrep_cluster_status |
|
---|---|
Description | Shows the primary status of the cluster component that the node is in. Values are:
Use: Any value other than Primary indicates that the node is part of a non-operational component. This occurs in cases of multiple membership changes that cause a loss of quorum. Origin: Doppler/Firehose Type: integer (see above) Frequency: 30 s (default) |
Recommended measurement | Sum of each of the nodes, over the last 5 minutes |
Recommended alert thresholds | Yellow warning: < 3 Red critical: < 1 |
Recommended response |
|
These sections describe Gorouter metrics.
gorouter.file_descriptors |
|
---|---|
Description | The number of file descriptors currently used by the Gorouter job. Use: Indicates an impending issue with the Gorouter. Without proper mitigation, it is possible for an unresponsive app to eventually exhaust available Gorouter file descriptors and cause route starvation for other apps running on TAS for VMs. Under heavy load, this unmitigated situation can also result in the Gorouter losing its connection to NATS and all routes being pruned. While a drop in gorouter.total_routes or an increase in gorouter.ms_since_last_registry_update helps to surface that the issue might already be occurring, alerting on gorouter.file_descriptors indicates that such an issue is impending. The Gorouter limits the number of file descriptors to 100,000 per job. Once the limit is met, the Gorouter is unable to establish any new connections. To reduce the risk of DDoS attacks, VMware recommends doing one or both of the following:
Type: Gauge Frequency: 5 s |
Recommended measurement | Maximum, per Gorouter job, over the last 5 minutes |
Recommended alert thresholds | Yellow warning: 50,000 per job Red critical: 60,000 per job |
Recommended response |
|
gorouter.backend_exhausted_conns |
|
---|---|
Description | The lifetime number of requests that have been rejected by the Gorouter VM due to the Max Connections Per Backend limit being reached across all tried back ends. The limit controls the number of concurrent TCP connections to any particular app instance and is configured within TAS for VMs. Use: Indicates that TAS for VMs is mitigating risk to other apps by self-protecting the platform against one or more unresponsive apps. Increases in this metric indicate the need to investigate and resolve issues with potentially unresponsive apps. A rapid rate of change upward is concerning and should be assessed further. Origin: Firehose Type: Counter (Integer) Frequency: 5 s |
Recommended measurement | Maximum delta per minute, per Gorouter job, over a 5-minute window |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response |
|
gorouter.total_requests |
|
---|---|
Description | The lifetime number of requests completed by the Gorouter VM, emitted per Gorouter instance. Use: The aggregation of these values across all Gorouters provides insight into the overall traffic flow of a deployment. Unusually high spikes, if not known to be associated with an expected increase in demand, can indicate a DDoS risk. For performance and capacity management, consider this metric a measure of router throughput per job by converting it to requests per second: take the delta value of gorouter.total_requests and derive back to 1 s, or (gorouter.total_requests.delta)/5, per Gorouter instance. This helps you see trends in the throughput rate that indicate a need to scale the Gorouter instances. Use the trends you observe to tune the threshold alerts for this metric. Origin: Firehose Type: Counter (Integer) Frequency: 5 s |
Recommended measurement | Average over the last 5 minutes of the derived per second calculation |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response | For optimizing the Gorouter, consider the requests-per-second derived metric in the context of router latency and Gorouter VM CPU utilization. From performance and load testing of the Gorouter, VMware has observed that at approximately 2500 simple requests per second, latency can begin to increase. This number changes based on your traffic profile and VM capabilities. To increase throughput and maintain low latency, scale the Gorouters either horizontally or vertically and ensure that the system.cpu.user metric for the Gorouter stays in the suggested range of 60-70% CPU Utilization. For more information about the system.cpu.user metric, see VM CPU Utilization. |
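
The requests-per-second derivation described for this metric can be sketched as follows, assuming consecutive readings of gorouter.total_requests from one Gorouter instance taken at the 5 s emission interval. The function names and the choice to skip readings after a counter reset are illustrative assumptions.

```python
from typing import List, Sequence


def requests_per_second(counter_values: Sequence[float],
                        interval_s: float = 5.0) -> List[float]:
    """Per-second rates from consecutive counter readings taken every interval_s seconds."""
    return [
        (cur - prev) / interval_s
        for prev, cur in zip(counter_values, counter_values[1:])
        if cur >= prev  # assumption: skip the pair where the counter restarted
    ]


def average_rps(counter_values: Sequence[float], interval_s: float = 5.0) -> float:
    """Average of the derived per-second rates over the supplied window."""
    rates = requests_per_second(counter_values, interval_s)
    return sum(rates) / len(rates) if rates else 0.0
```

Passing the readings from the last 5 minutes to `average_rps` yields the recommended measurement for one Gorouter instance, which you can then compare against the throughput trend you have observed for your deployment.
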
gorouter.latency |
|
---|---|
Description | The time in milliseconds that represents the length of a request from the Gorouter's point of view. This timing starts when the Gorouter receives a request and stops when the Gorouter finishes processing the response from the app. Long uploads, downloads, or app responses increase this time. This metric includes the request time to all back end endpoints, including both apps and routable system components like Cloud Controller and UAA. Use: Indicates the traffic profile of TAS for VMs. An alert value on this metric should be tuned to the specifics of the deployment and its underlying network considerations; a suggested starting point is 100 ms. Origin: Firehose Type: Gauge (Float in ms) Frequency: Emitted per Gorouter request, emission should be constant on a running deployment |
Recommended measurement | Average over the last 30 minutes |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response | Extended periods of high latency can point to several factors. The Gorouter latency measure includes network and back end latency impacts as well.
|
gorouter.ms_since_last_registry_update |
|
---|---|
Description | Time in milliseconds since the last route register was received, emitted per Gorouter instance. Use: Indicates if routes are not being registered to apps correctly. Origin: Firehose Type: Gauge (Float in ms) Frequency: 30 s |
Recommended measurement | Maximum over the last 5 minutes |
Recommended alert thresholds | Yellow warning: N/A Red critical: > 30,000 ms. This threshold is suitable for normal platform usage. It alerts if it has been at least 30 seconds since the Gorouter last received a message from an app. |
Recommended response |
|
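The measurement and threshold for gorouter.ms_since_last_registry_update can be expressed as a short check. The following Python sketch assumes a list of gauge values, in milliseconds, collected for one Gorouter instance over the last 5 minutes. The function name and return values are hypothetical.

```python
from typing import Sequence


def registry_update_alert(values_ms: Sequence[float],
                          critical_ms: float = 30_000) -> str:
    """Return "red" if the maximum gauge value in the window exceeds
    the critical threshold, and "ok" otherwise."""
    if not values_ms:
        # Assumption: missing data is reported separately, not as an alert.
        return "no data"
    return "red" if max(values_ms) > critical_ms else "ok"
```

Because the measurement uses the maximum rather than the average, a single sample above 30,000 ms within the window is enough to trigger the alert.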
gorouter.bad_gateways |
|
---|---|
Description | The lifetime number of bad gateways, or 502 responses, from the Gorouter itself, emitted per Gorouter instance. The Gorouter emits a 502 bad gateway error when it has a route in the routing table and, in attempting to make a connection to the back end, finds that the back end does not exist. Use: Indicates that route tables might be stale. Stale routing tables suggest an issue in the route register management plane, which indicates that something has likely changed with the locations of the containers. Always investigate unexpected increases in this metric. Origin: Firehose Type: Count (Integer, Lifetime) Frequency: 5 s |
Recommended measurement | Maximum delta per minute over a 5-minute window |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response |
|
gorouter.responses.5xx |
|
---|---|
Description | The lifetime number of requests completed by the Gorouter VM for HTTP status family 5xx, server errors, emitted per Gorouter instance. Use: A repeatedly crashing app is often the cause of a big increase in 5xx responses. However, response issues from apps can also cause an increase in 5xx responses. Always investigate an unexpected increase in this metric. Origin: Firehose Type: Counter (Integer) Frequency: 5 s |
Recommended measurement | Maximum delta per minute over a 5-minute window |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response |
|
gorouter.total_routes |
|
---|---|
Description | The current total number of routes registered with the Gorouter, emitted per Gorouter instance. Use: The aggregation of these values across all Gorouters indicates uptake and gives a picture of the overall growth of the environment for capacity planning. VMware also recommends alerting on this metric if the number of routes falls outside of the normal range for your deployment. Dramatic decreases in this metric volume can indicate a problem with the route registration process, such as an app outage, or that something in the route register management plane has failed. If visualizing these metrics on a dashboard, gorouter.total_routes can be helpful for visualizing dramatic drops. However, for alerting purposes, the gorouter.ms_since_last_registry_update metric is more valuable for quicker identification of Gorouter issues. Alerting thresholds for gorouter.total_routes should focus on dramatic increases or decreases out of expected range. Origin: Firehose Type: Gauge (Float) Frequency: 30 s |
Recommended measurement | 5-minute average of the per second delta |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response |
|
uaa.requests.global.completed.count |
|
---|---|
Description | The lifetime number of requests completed by the UAA VM, emitted per UAA instance. This number includes health checks. Use: For capacity planning purposes, the aggregation of these values across all UAA instances can provide insight into the overall load that UAA is processing. VMware recommends alerting on unexpected spikes per UAA instance. Unusually high spikes, if they are not associated with an expected increase in demand, might indicate a DDoS risk and should be investigated. For performance and capacity management, look at the UAA Throughput metric as either a requests-completed-per-second or requests-completed-per-minute rate to determine the throughput per UAA instance. This helps you see trends in the throughput rate that indicate a need to scale UAA instances. Use the trends you observe to tune the threshold alerts for this metric. From performance and load testing of UAA, VMware has observed that while UAA endpoints can have different throughput behavior, once throughput reaches its peak value per VM, it stays constant and latency increases. Origin: Firehose Type: Gauge (Integer), emitted value increments over the lifetime of the VM like a counter Frequency: 5 s |
Recommended measurement | Average over the last 5 minutes of the derived requests-per-second or requests-per-minute rate, per instance |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response | For optimizing UAA, consider this metric in the context of UAA request latency and UAA VM CPU utilization. To increase throughput and maintain low latency, scale the UAA VMs horizontally by editing the number of your UAA VM instances in the Resource Config pane of the TAS for VMs tile, and ensure that the system.cpu.user metric for UAA does not stay for sustained periods in the 80-90% range, which is the suggested maximum CPU utilization. For more information, see UAA Request Latency and UAA VM CPU Utilization in Key Capacity Scaling Indicators. |
gorouter.latency.uaa |
|
---|---|
Description | Time in milliseconds that UAA took to process a request that the Gorouter sent to UAA endpoints. Use: Indicates how responsive UAA has been to requests sent from the Gorouter. Some operations might take longer to process, such as creating bulk users and groups. It is important to correlate latency observed with the endpoint and evaluate this data in the context of overall historical latency from that endpoint. Unusual spikes in latency can indicate the need to scale UAA VMs. This metric is emitted only for the routers serving the UAA system component and is not emitted per isolation segment even if you are using isolated routers. Origin: Firehose Type: Gauge (Float in ms) Frequency: Emitted per Gorouter request to UAA |
Recommended measurement | Maximum, per job, over the last 5 minutes |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response | Latency depends on the endpoint and operation being used. It is important to correlate the latency with the endpoint and evaluate this data in the context of the historical latency from that endpoint.
|
uaa.server.inflight.count |
|
---|---|
Description | The number of requests UAA is currently processing (in-flight requests), emitted per UAA instance. Use: Indicates how many concurrent requests are currently in flight for the UAA instance. Unusually high spikes, if they are not associated with an expected increase in demand, might indicate a DDoS risk. From performance and load testing of the UAA component, VMware has observed that the number of concurrent requests impacts throughput and latency. The UAA Requests In Flight metric helps you see trends in the request rate that can indicate the need to scale UAA instances. Use the trends you observe to tune the threshold alerts for this metric. Origin: Firehose Type: Gauge (Integer) Frequency: 5 s |
Recommended measurement | Maximum, per job, over the last 5 minutes |
Recommended alert thresholds | Yellow warning: Dynamic Red critical: Dynamic |
Recommended response | To increase throughput and maintain low latency when the number of in-flight requests is high, scale UAA VMs horizontally by editing the UAA VM field in the Resource Config pane of the TAS for VMs tile. Ensure that the system.cpu.user metric for UAA does not stay for sustained periods in the 80-90% range, which is the suggested maximum CPU utilization. |
These sections describe system metrics, or BOSH metrics.
BOSH system metrics appear in the Firehose in two different formats. The tables in the following section list both formats.
system.healthy system_healthy |
|
---|---|
Description | 1 means the system is healthy, and 0 means the system is not healthy. Use: This is the most important BOSH metric to monitor. It indicates if the VM emitting the metric is healthy. Review this metric for all VMs to estimate the overall health of the system. Multiple unhealthy VMs signal problems with the underlying IaaS layer. Origin: Firehose Type: Gauge (Float, 0-1) Frequency: 60 s |
Recommended measurement | Average over the last 5 minutes |
Recommended alert thresholds | Yellow warning: N/A Red critical: < 1 |
Recommended response | Investigate TAS for VMs logs for the unhealthy components. |
system.disk.system.percent system_disk_system_percent |
|
---|---|
Description | System disk — Percentage of the system disk used on the VM Use: Set an alert to indicate when the system disk is almost full. Origin: Firehose Type: Gauge (%) Frequency: 60 s |
Recommended measurement | Average over the last 30 minutes |
Recommended alert thresholds | Yellow warning: ≥ 80% Red critical: ≥ 90% |
Recommended response | Investigate what is filling the job's system partition. This partition should not typically fill because BOSH deploys jobs to use ephemeral and persistent disks. |
system.disk.ephemeral.percent system_disk_ephemeral_percent |
|
---|---|
Description | Ephemeral disk — Percentage of the ephemeral disk used on the VM Use: Set an alert and investigate if the ephemeral disk usage is too high for a job over an extended period. Origin: Firehose Type: Gauge (%) Frequency: 60 s |
Recommended measurement | Average over the last 30 minutes |
Recommended alert thresholds | Yellow warning: ≥ 80% Red critical: ≥ 90% |
Recommended response |
|
system.disk.persistent.percent system_disk_persistent_percent |
|
---|---|
Description | Persistent disk — Percentage of persistent disk used on the VM Use: Set an alert and investigate further if the persistent disk usage for a job is too high over an extended period. Origin: Firehose Type: Gauge (%) Frequency: 60 s |
Recommended measurement | Average over the last 30 minutes |
Recommended alert thresholds | Yellow warning: ≥ 80% Red critical: ≥ 90% |
Recommended response |
|
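The system, ephemeral, and persistent disk metrics share the same measurement and thresholds, so one check can cover all three. The following Python sketch assumes a list of percentage samples for one VM over the last 30 minutes. The function name and return values are hypothetical.

```python
from typing import Sequence


def disk_alert_level(percent_samples: Sequence[float],
                     warning: float = 80.0,
                     critical: float = 90.0) -> str:
    """Classify average disk usage over the window as
    "ok", "yellow" (>= warning), or "red" (>= critical)."""
    if not percent_samples:
        # Assumption: an empty window is reported as unknown, not as healthy.
        return "unknown"
    average = sum(percent_samples) / len(percent_samples)
    if average >= critical:
        return "red"
    if average >= warning:
        return "yellow"
    return "ok"
```

Averaging over 30 minutes keeps short-lived spikes, such as a temporary file written during a deployment, from raising an alert, while sustained growth still does.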
This section describes key indicators for monitoring your system’s ability to handle telemetry data and provide appropriate scaling guidance.
backpressure_drain_destinations | |
---|---|
Description | Backpressure reflects the difference between data ingested and the data successfully exported. It’s a signal of congestion at the drain destinations where data may begin piling up. |
Recommended measurement |
To measure failed enqueues, track the following metrics:
|
Recommended alert thresholds | Alert when a high proportion of queue_size to queue_capacity persists over a prolonged window. |
Recommended response |
|
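The backpressure alert condition, a high proportion of queue_size to queue_capacity that persists over a prolonged window, can be sketched in Python as follows. The 0.8 ratio and the number of consecutive samples are placeholder values chosen for illustration only, and the function name is hypothetical. Tune both values to your deployment.

```python
from typing import Sequence


def backpressure_persists(queue_sizes: Sequence[float],
                          queue_capacity: float,
                          ratio_threshold: float = 0.8,
                          min_consecutive: int = 10) -> bool:
    """Return True if queue_size / queue_capacity stays at or above
    `ratio_threshold` for at least `min_consecutive` samples in a row."""
    if queue_capacity <= 0:
        raise ValueError("queue_capacity must be positive")
    streak = 0
    for size in queue_sizes:
        if size / queue_capacity >= ratio_threshold:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0  # the queue drained, so the run is broken
    return False
```

Requiring consecutive samples, rather than any single high reading, avoids alerting on brief bursts that the drain destination absorbs on its own.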
data_loss | |
---|---|
Description | Data loss indicates the amount of telemetry data (logs, metrics, spans) dropped at the VM level. |
Recommended measurement | To compute the data drop rate, use the formula: Data drop rate = 1 - (e/r) Where: e : Number of data points successfully exported. r : Number of data points successfully received. Receiver metrics:
Exporter metrics:
|
Recommended alert thresholds | Set alert thresholds based on acceptable data loss for your project. Select a narrow time window before alerting begins to avoid notifications for small losses that are within the desired reliability range and not considered outages. |
Recommended response |
|
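As a worked example of the data drop rate formula, the following Python sketch computes 1 - (e/r) from exported and received counts. The function name is hypothetical, and returning 0.0 when nothing has been received is an assumption made for this example.

```python
def data_drop_rate(exported: float, received: float) -> float:
    """Data drop rate = 1 - (e / r), where e is the number of data points
    successfully exported and r is the number successfully received."""
    if received <= 0:
        # Assumption: with nothing received, there is nothing to lose.
        return 0.0
    return 1.0 - exported / received


# 950 data points exported out of 1,000 received is a drop rate of about 5%.
rate = data_drop_rate(exported=950, received=1000)
print(f"{rate:.1%}")  # 5.0%
```

Compute e and r over the same time window so that data still in flight between the receiver and the exporter does not appear as loss.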