The KPIs described here are provided for operators to give general guidance on monitoring a TAS for VMs deployment using platform component and system (BOSH) metrics. Although many metrics are emitted from the platform, the KPIs are high-signal-value metrics that can indicate emerging platform issues.

This alerting and response guidance has been shown to apply to most deployments. VMware recommends that operators continue to fine-tune the alert measures to their deployment by observing historical trends. VMware also recommends that operators expand beyond this guidance and create new, deployment-specific monitoring metrics, thresholds, and alerts based on learning from their deployments.

Thresholds noted as "dynamic" in the following tables indicate that while a metric is highly important to watch, the relative numbers to set threshold warnings at are specific to a given TAS for VMs deployment and its use cases. These dynamic thresholds should be occasionally revisited because the foundation and its usage continue to evolve. For more information, see Warning and Critical Thresholds in Selecting and Configuring a Monitoring System.

While the performance impact on TAS for VMs is considered when building new features, VMware does not perform discrete load and scale testing on a regular basis. It is impossible to test each configuration of TAS for VMs with every type of infrastructure and app workload. Therefore, VMware recommends using the following KPIs to identify when to scale your system.

For more information about accessing metrics used in these key performance indicators, see Overview of Logging and Metrics.

Diego auctioneer metrics

These sections describe Diego Auctioneer metrics.

Auctioneer app instance (AI) placement failures

auctioneer.AuctioneerLRPAuctionsFailed
Description The number of Long Running Process (LRP) instances that the Auctioneer failed to place on Diego Cells. This metric is cumulative over the lifetime of the Auctioneer job.

Use: This metric can indicate that TAS for VMs is out of container space or that there is a lack of resources within your environment. This indicator also increases when the LRP is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work.

This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no app instances being scheduled.

This error most commonly results from capacity issues, for example, when Diego Cells do not have enough resources or when Diego Cells alternate between a healthy and an unhealthy state.

Origin: Firehose
Type: Counter (Integer)
Frequency: During each auction
Recommended measurement Per minute delta averaged over a 5-minute window
Recommended alert thresholds Yellow warning: ≥ 0.5
Red critical: ≥ 1
Recommended response
  1. To best determine the root cause, examine the Auctioneer logs. Depending on the specific error and resource constraint, you might also find a failure reason in the Cloud Controller (CC) API.
  2. Investigate the health of your Diego Cells to determine if they are the resource type causing the problem.
  3. Consider scaling additional Diego Cells using Tanzu Operations Manager.
  4. If scaling Diego Cells does not solve the problem, pull Diego Brain logs and BBS node logs and contact Support telling them that LRP auctions are failing.
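The recommended measurement (per-minute delta averaged over a 5-minute window) can be sketched as follows. This is a minimal illustration, not part of any TAS for VMs tooling. The function names and the `(timestamp, value)` sample format are assumptions, and how you retrieve samples from the Firehose depends on your monitoring system. The sketch also assumes that a drop in the counter value means the Auctioneer job restarted and the counter began again from zero.

```python
from typing import List, Tuple

# Hypothetical sample format: (unix timestamp in seconds, cumulative counter value)
Sample = Tuple[float, float]


def per_minute_delta(samples: List[Sample]) -> float:
    """Average per-minute increase of a cumulative counter across the window."""
    if len(samples) < 2:
        # The metric is emitted on event, so an empty window is not an error.
        return 0.0
    elapsed_minutes = (samples[-1][0] - samples[0][0]) / 60.0
    if elapsed_minutes <= 0:
        return 0.0
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # Assumption: a lower value means the counter restarted from zero.
        increase += current - previous if current >= previous else current
    return increase / elapsed_minutes


def auction_failure_severity(rate: float, yellow: float = 0.5, red: float = 1.0) -> str:
    """Map a per-minute failure rate to the thresholds listed above."""
    if rate >= red:
        return "red"
    if rate >= yellow:
        return "yellow"
    return "ok"
```

For example, a counter that rises from 10 to 13 over five minutes yields a rate of 0.6 per minute, which falls in the yellow range.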

Auctioneer time to fetch Diego Cell state

auctioneer.AuctioneerFetchStatesDuration
Description Time in ns that the Auctioneer took to fetch state from all the Diego Cells when running its auction.

Use: Indicates how the Diego Cells themselves are performing. An alert on this metric can warn that app staging requests to Diego might be failing.

Origin: Firehose
Type: Gauge, integer in ns
Frequency: During event, during each auction
Recommended measurement Maximum over the last 5 minutes divided by 1,000,000,000
Recommended alert thresholds Yellow warning: ≥ 2 s
Red critical: ≥ 5 s
Recommended response
  1. Check the health of the Diego Cells by reviewing the logs and looking for errors.
  2. Review IaaS console metrics.
  3. Inspect the Auctioneer logs to determine if one or more Diego Cells is taking significantly longer to fetch state than other Diego Cells. Relevant log lines have wording like `fetched Diego Cell state`.
  4. Pull Diego Brain logs, Diego Cell logs, and Auctioneer logs and contact Support telling them that fetching Diego Cell states is taking too long.
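The measurement above (maximum over the window, converted from nanoseconds to seconds) can be sketched as follows. The function names are hypothetical, and the list of values stands in for whatever samples your monitoring system collected over the last 5 minutes.

```python
from typing import Optional, Sequence

NS_PER_SECOND = 1_000_000_000


def max_seconds(values_ns: Sequence[int]) -> Optional[float]:
    """Largest duration in the window, converted from nanoseconds to seconds."""
    if not values_ns:
        return None  # nothing emitted in the window
    return max(values_ns) / NS_PER_SECOND


def duration_severity(value_s: Optional[float], yellow_s: float, red_s: float) -> str:
    """Compare a duration in seconds against yellow and red thresholds."""
    if value_s is None:
        return "no data"
    if value_s >= red_s:
        return "red"
    if value_s >= yellow_s:
        return "yellow"
    return "ok"
```

With the thresholds listed for this metric, `duration_severity(max_seconds(window), 2, 5)` returns `"yellow"` when the slowest fetch in the window took between 2 and 5 seconds.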

Auctioneer app instance starts

auctioneer.AuctioneerLRPAuctionsStarted
Description The number of LRP instances that the Auctioneer successfully placed on Diego Cells. This metric is cumulative over the lifetime of the Auctioneer job.

Use: Provides a sense of running system activity levels in your environment. Can also give you a sense of how many app instances have been started over time. The measurement VMware recommends can help indicate a significant amount of container churn. However, for capacity planning purposes, it is more helpful to observe deltas over a long time window.

This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no app instances being scheduled.

Origin: Firehose
Type: Counter (Integer)
Frequency: During event, during each auction
Recommended measurement Per minute delta averaged over a 5-minute window
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response When observing a significant amount of container churn:

  1. Look to eliminate explainable causes of temporary churn, such as a deployment or increased developer activity.
  2. If container churn appears to continue over an extended period, pull logs from the Diego Brain and BBS node before contacting Support.
When observing extended periods of high or low activity trends, scale TAS for VMs components up or down as needed.

Auctioneer task placement failures

auctioneer.AuctioneerTaskAuctionsFailed
Description The number of Tasks that the Auctioneer failed to place on Diego Cells. This metric is cumulative over the lifetime of the Auctioneer job.

Use: Failing Task auctions indicate a lack of resources within your environment and that you likely need to scale. This indicator also increases when the Task is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work.

This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no tasks being scheduled.

This error most commonly results from capacity issues, for example, when Diego Cells do not have enough resources or when Diego Cells alternate between a healthy and an unhealthy state.

Origin: Firehose
Type: Counter (Float)
Frequency: During event, during each auction
Recommended measurement Per minute delta averaged over a 5-minute window
Recommended alert thresholds Yellow warning: ≥ 0.5
Red critical: ≥ 1
Recommended response
  1. To best determine the root cause, examine the Auctioneer logs. Depending on the specific error or resource constraint, you might also find a failure reason in the CC API.
  2. Investigate the health of Diego Cells.
  3. Consider scaling additional Diego Cells using Tanzu Operations Manager.
  4. If scaling Diego Cells does not solve the problem, pull Diego Brain logs and BBS logs for troubleshooting and contact Support for additional troubleshooting. Inform Support that Task auctions are failing.

Diego BBS metrics

These sections describe Diego BBS metrics.

BBS time to run LRP convergence

bbs.ConvergenceLRPDuration
Description Time in ns that the BBS took to run its LRP convergence pass.

Use: If the convergence run begins taking too long, apps or tasks might fail without restarting. This symptom can also indicate loss of connectivity to the BBS database.

Origin: Firehose
Type: Gauge (Integer in ns)
Frequency: On event, every 30 seconds when LRP convergence runs. Emission should be near-constant on a running deployment.
Recommended measurement Maximum over the last 15 minutes divided by 1,000,000,000
Recommended alert thresholds Yellow warning: ≥ 10 s
Red critical: ≥ 20 s
Recommended response
  1. Check BBS logs for errors.
  2. Try vertically scaling the BBS VM resources up. For example, add more CPUs or memory depending on its system.cpu/system.memory metrics.
  3. Consider vertically scaling the TAS for VMs backing database, if system.cpu and system.memory metrics for the database instances are high.
  4. If that does not solve the issue, pull the BBS logs and contact Support for additional troubleshooting.

BBS time to handle requests

bbs.RequestLatency
Description The maximum observed latency time over the past 60 seconds that the BBS took to handle requests across all its API endpoints.

Diego now aggregates this metric to emit the maximum value observed over 60 seconds.

Use: If this metric rises, the TAS for VMs API is slowing. Response to certain cf CLI commands is slow if request latency is high.

Origin: Firehose
Type: Gauge (Integer in ns)
Frequency: 60 s
Recommended measurement Average over the last 15 minutes divided by 1,000,000,000
Recommended alert thresholds Yellow warning: ≥ 5 s
Red critical: ≥ 10 s
Recommended response
  1. Check CPU and memory statistics in Tanzu Operations Manager.
  2. Check BBS logs for faults and errors that can indicate issues with BBS.
  3. Try scaling the BBS VM resources up. For example, add more CPUs and memory depending on its system.cpu/system.memory metrics.
  4. Consider vertically scaling the TAS for VMs backing database, if system.cpu and system.memory metrics for the database instances are high.
  5. If the previous steps do not solve the issue, collect a sample of the Diego Cell logs from the BBS VMs and contact Support to troubleshoot further.

Cloud Controller and Diego in sync

bbs.Domain.cf-apps
Description Indicates if the cf-apps Domain is up-to-date, meaning that TAS for VMs app requests from Cloud Controller are synchronized to bbs.LRPsDesired (Diego-desired AIs) for execution.
  • 1 means cf-apps Domain is up-to-date
  • No data received means cf-apps Domain is not up-to-date
Use: If the cf-apps Domain does not stay up-to-date, changes requested in the Cloud Controller are not guaranteed to propagate throughout the system. If the Cloud Controller and Diego are out of sync, then apps running can vary from those desired.

Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
Recommended measurement Value over the last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: ≠ 1

The threshold value VMware recommends represents a state where an up-to-date metric value of 1 has not been received for the entire 5-minute window.
Recommended response
  1. Check the BBS and Clock Global (Cloud Controller clock) logs.
  2. If the problem continues, pull the BBS logs and Clock Global (Cloud Controller clock) logs and contact Support to say that the cf-apps domain is not being kept fresh.
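Because an out-of-date domain shows up as missing data rather than as a value of 0, the check is for the absence of a 1 in the window. A minimal sketch, assuming a hypothetical `(timestamp, value)` sample format supplied by your monitoring system:

```python
from typing import Iterable, Tuple


def cf_apps_domain_fresh(
    samples: Iterable[Tuple[float, float]], now: float, window_s: float = 300.0
) -> bool:
    """True if at least one value of 1 arrived within the last window_s seconds."""
    return any(
        value == 1 and now - window_s <= timestamp <= now
        for timestamp, value in samples
    )
```

A red alert corresponds to this function returning `False`.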

More app instances than expected

bbs.LRPsExtra
Description Total number of LRP instances that are no longer desired but still have a BBS record. When Diego wants to add more apps, the BBS sends a request to the Auctioneer to spin up additional LRPs.

Use: If Diego has more LRPs running than expected, there might be problems with the BBS.

Deleting an app with many instances can temporarily spike this metric. However, a sustained spike in bbs.LRPsExtra is unusual and should be investigated.

Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: ≥ 5
Red critical: ≥ 10
Recommended response
  1. Review the BBS logs for proper operation or errors, looking for detailed error messages.
  2. If the condition persists, pull the BBS logs and contact Support.

Fewer app instances than expected

bbs.LRPsMissing
Description Total number of LRP instances that are desired but have no record in the BBS. When Diego wants to add more apps, the BBS sends a request to the Auctioneer to spin up additional LRPs.

Use: If Diego has fewer LRPs running than expected, there might be problems with the BBS.

An app push with many instances can temporarily spike this metric. However, a sustained spike in bbs.LRPsMissing is unusual and should be investigated.

Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: ≥ 5
Red critical: ≥ 10
Recommended response
  1. Review the BBS logs for proper operation or errors, looking for detailed error messages.
  2. If the condition persists, pull the BBS logs and contact Support.

Crashed app instances

bbs.CrashedActualLRPs
Description Total number of LRP instances that have crashed.

Use: Indicates how many instances in the deployment are in a crashed state. An increase in bbs.CrashedActualLRPs can indicate several problems, from a bad app with many instances associated, to a platform issue that is resulting in app crashes. Use this metric to help create a baseline for your deployment. After you have a baseline, you can create a deployment-specific alert to notify of a spike in crashes higher than the trend line. Tune alert values to your deployment.

Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Look at the BBS logs for apps that are crashing and at the Diego Cell logs to see if the problem is with the apps themselves, rather than a platform issue.
  2. Before contacting Support, pull the BBS logs and, if particular apps are the problem, pull the logs from their Diego Cells too.

Running app instances, rate of change

1hr average of bbs.LRPsRunning – prior 1hr average of bbs.LRPsRunning
Description Rate of change in app instances being started or stopped on the platform. It is derived from bbs.LRPsRunning, which represents the total number of LRP instances that are running on Diego Cells.

Use: Delta reflects upward or downward trend for app instances started or stopped. Helps to provide a picture of the overall growth trend of the environment for capacity planning. You might want to alert on delta values outside of the expected range.

Origin: Firehose
Type: Gauge (Float)
Frequency: During event, emission must be constant on a running deployment
Recommended measurement derived=(1-hour average of bbs.LRPsRunning – prior 1-hour average of bbs.LRPsRunning)
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response Scale components as necessary.
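The derived measurement can be sketched as follows. The function name and the `(timestamp, value)` sample format are assumptions. The sketch returns `None` when either hour has no samples, so that a gap in data is not mistaken for a change in running instances.

```python
from statistics import mean
from typing import List, Optional, Tuple

HOUR_S = 3600


def lrps_running_delta(samples: List[Tuple[float, float]], now: float) -> Optional[float]:
    """1-hour average of bbs.LRPsRunning minus the prior 1-hour average."""
    current = [value for ts, value in samples if now - HOUR_S < ts <= now]
    prior = [value for ts, value in samples if now - 2 * HOUR_S < ts <= now - HOUR_S]
    if not current or not prior:
        return None  # not enough data to compare the two hours
    return mean(current) - mean(prior)
```

A positive result indicates growth in running app instances over the previous hour, and a negative result indicates a decline.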

BBS master elected

bbs.BBSMasterElected
Description Indicates when there is a BBS master election. A BBS master election takes place when a BBS instance has taken over as the active instance. A value of 1 is emitted when the election takes place.

Use: This metric emits when a redeployment of the BBS occurs. If this metric is emitted frequently outside of a deployment, this might be a signal of underlying problems that should be investigated. If the active BBS is continually changing, this can cause app push downtime.

Origin: Firehose
Type: Gauge (Float)
Frequency: On event
Recommended measurement N/A, the most effective visualization is as a stacked bar chart
Recommended alert thresholds Yellow warning: N/A
Red critical: N/A
Recommended response
  1. Check the BBS logs.
  2. Check the BBS VM load average and instance group size.
  3. Check the health of the connection to the backing SQL database and network latency.

Diego Cell metrics

These sections describe Diego Cell metrics.

Remaining memory available — Diego Cell memory chunks available

rep.CapacityRemainingMemory
Description Remaining amount of memory in MiB available for this Diego Cell to allocate to containers.

Use: Indicates the available Diego Cell memory. Insufficient Diego Cell memory can prevent pushing and scaling apps.

The strongest operational value of this metric is to understand a deployment's average app size and to monitor and alert so that at least some Diego Cells have enough capacity to accept standard app size pushes. For example, when you push a 4 GB app, Diego can have trouble placing that app if no single Diego Cell has 4 GB or more of free capacity.

As an example, VMware Cloud Ops uses a standard of 4 GB, and computes and monitors for the number of Diego Cells with at least 4 GB free. When the number of Diego Cells with at least 4 GB falls below a defined threshold, this is a scaling indicator alert to increase capacity. This free chunk count threshold should be tuned to the deployment size and the standard size of apps being pushed to the deployment.

Origin: Firehose
Type: Gauge (Integer in MiB)
Frequency: 60 s
Recommended measurement For alerting:
  1. Determine the size of a standard app in your deployment. This is the suggested value to calculate free chunks of Remaining Memory by.
  2. Create a script/tool that can iterate through each Diego Cell and do the following:
    1. Pull the rep.CapacityRemainingMemory metric for each Diego Cell.
    2. Divide the values received by 1024 to convert from MiB to GB (the threshold is GB-based).
    3. Compare recorded values to your minimum capacity threshold, and count the number of Diego Cells that have equal or greater than the desired amount of free chunk space.
  3. Determine a desired scaling threshold based on the minimum amount of free chunks that are acceptable in this deployment given historical trends.
  4. Set an alert to indicate the need to scale Diego Cell memory capacity when the value falls below the desired threshold number.
For visualization purposes:
Looking at this metric (rep.CapacityRemainingMemory) as a minimum value per Diego Cell has more informational value than alerting value. It can be an interesting heatmap visualization, showing average variance and density over time.
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Assign more resources to the Diego Cells or assign more Diego Cells.
  2. Scale additional Diego Cells using Tanzu Operations Manager.
Alternative metric If you are using Healthwatch, VMware recommends using the metric healthwatch.Diego.AvailableFreeChunks. For more information, see Healthwatch Metrics in the Healthwatch documentation.
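The alerting steps above can be sketched as follows. This is an illustration only: the function names, the dictionary of per-cell values, and the cell names are hypothetical, and the sketch assumes that you have already pulled the latest rep.CapacityRemainingMemory value for each Diego Cell. The MiB-to-GB divisor is a parameter, so you can match whichever conversion your tooling uses.

```python
from typing import Dict


def count_cells_with_free_chunk(
    remaining_mib_by_cell: Dict[str, float], chunk_gb: float, mib_per_gb: int = 1024
) -> int:
    """Count Diego Cells whose remaining capacity is at least one chunk."""
    return sum(
        1
        for remaining_mib in remaining_mib_by_cell.values()
        if remaining_mib / mib_per_gb >= chunk_gb
    )


def needs_scaling(free_chunk_cells: int, min_free_chunk_cells: int) -> bool:
    """True when fewer cells than the desired minimum can accept a standard app."""
    return free_chunk_cells < min_free_chunk_cells
```

For example, with cells reporting 8192, 4096, and 2048 MiB and a 4 GB chunk size, two cells qualify. If the deployment-specific minimum is three, `needs_scaling(2, 3)` returns `True`.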

Remaining memory available — Overall remaining memory available


rep.CapacityRemainingMemory
(Alternative Use)

Description Remaining amount of memory in MiB available for this Diego Cell to allocate to containers.

Use: Can indicate low memory capacity overall in the platform. Low memory can prevent app scaling and new deployments. The overall sum of capacity can indicate that you need to scale the platform. Observing capacity consumption trends over time helps with capacity planning.

Origin: Firehose
Type: Gauge (Integer in MiB)
Frequency: 60 s
Recommended measurement Minimum over the last 5 minutes divided by 1024 (across all instances)
Recommended alert thresholds Yellow warning: ≤ 64 GB
Red critical: ≤ 32 GB
Recommended response
  1. Assign more resources to the Diego Cells or assign more Diego Cells.
  2. Scale additional Diego Cells using Tanzu Operations Manager.
Alternative metric If you are using Healthwatch, VMware recommends the metric healthwatch.Diego.AvailableFreeChunks for this purpose. For more information, see Healthwatch Metrics in the Healthwatch documentation.

Remaining disk available — Diego Cell disk chunks available

rep.CapacityRemainingDisk
Description Remaining amount of disk in MiB available for this Diego Cell to allocate to containers.

Use: Indicates the available Diego Cell disk. Insufficient free disk on Diego Cells prevents the staging or starting of apps or tasks, resulting in error messages like ERR Failed to stage app: insufficient resources.

Because Diego fails to stage without at least 6 GB free, unreserved disk space on a given Diego Cell, the strongest operational value of this metric is to ensure that at least some Diego Cells have a large enough disk capacity to support the staging of apps and tasks.

VMware recommends computing and monitoring for the number of Diego Cells with at least 6 GB Disk free. When the number of Diego Cells with at least 6 GB falls below a defined threshold, this is a scaling indicator alert to increase capacity. The alerting threshold value for the amount of free chunks of Disk should be tuned to the deployment size and the standard size of apps being pushed to the deployment.

Origin: Firehose
Type: Gauge (Integer in MiB)
Frequency: 60 s
Recommended measurement For alerting:
  1. Because Diego fails to stage without at least 6 GB free, this is the suggested minimum value to calculate free chunks of Remaining Disk by.
  2. Create a script/tool that can iterate through each Diego Cell and do the following:
    1. Pull the rep.CapacityRemainingDisk metric for each Diego Cell.
    2. Divide the values received by 1024 to convert from MiB to GB (the threshold is GB-based).
    3. Compare recorded values to your minimum capacity threshold, and count the number of Diego Cells that have equal or greater than the desired amount of free chunk space.
  3. Determine a desired scaling threshold based on the minimum amount of free chunks that are acceptable in this deployment given historical trends.
  4. Set an alert to indicate the need to scale Diego Cell disk capacity when the value falls below the desired threshold number.
For visualization purposes:
Looking at this metric (rep.CapacityRemainingDisk) as a minimum value per Diego Cell has more informational value than alerting value. It can be an interesting heatmap visualization, showing average variance and density over time.
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Assign more resources to the Diego Cells or assign more Diego Cells.
  2. Scale additional Diego Cells using Tanzu Operations Manager.
Alternative metric If you are using Healthwatch, VMware recommends the metric healthwatch.Diego.AvailableFreeChunks for this purpose. For more information, see Healthwatch Metrics in the Healthwatch documentation.

Remaining disk available - Overall remaining disk available

rep.CapacityRemainingDisk
(Alternative Use)
Description Remaining amount of disk in MiB available for this Diego Cell to allocate to containers.

Use: Low disk capacity can prevent app scaling and new deployments. Because Diego staging Tasks can fail without at least 6 GB free, the red threshold VMware recommends is based on the minimum disk capacity across the deployment falling below 6 GB in the previous 5 minutes.

It can also be advantageous to assess how many chunks of free disk space are higher than a given threshold, similar to rep.CapacityRemainingMemory.

Origin: Firehose
Type: Gauge (Integer in MiB)
Frequency: 60 s
Recommended measurement Minimum over the last 5 minutes divided by 1024 (across all instances)
Recommended alert thresholds Yellow warning: ≤ 12 GB
Red critical: ≤ 6 GB
Recommended response
  1. Assign more resources to the Diego Cells or assign more Diego Cells.
  2. Scale additional Diego Cells using Tanzu Operations Manager.
Alternative metric If you are using Healthwatch, VMware recommends the metric healthwatch.Diego.AvailableFreeChunks for this purpose. For more information, see Healthwatch Metrics in the Healthwatch documentation.

Diego Cell rep time to sync

rep.RepBulkSyncDuration
Description Time in ns that the Diego Cell Rep took to sync the ActualLRPs that it claimed with its actual Garden containers.

Use: Sync times that are too high can indicate issues with the BBS.

Origin: Firehose
Type: Gauge (Float in ns)
Frequency: 30 s
Recommended measurement Maximum over the last 15 minutes divided by 1,000,000,000
Recommended alert thresholds Yellow warning: ≥ 5 s
Red critical: ≥ 10 s
Recommended response
  1. Investigate BBS logs for faults and errors.
  2. If a particular Diego Cell or Diego Cells appear problematic, pull logs for the Diego Cells and the BBS logs before contacting Support.

Garden health check failed

rep.GardenHealthCheckFailed
Description The Diego Cell periodically checks its health against the Garden back end. For Diego Cells, 0 means healthy, and 1 means unhealthy.

Use: Set an alert for further investigation if multiple unhealthy Diego Cells are detected in the given time window. If one Diego Cell is impacted, it does not participate in auctions, but end-user impact is usually low. If multiple Diego Cells are impacted, this can indicate a larger problem with Diego, and should be considered a more critical investigation need.

The suggested alert threshold is based on multiple unhealthy Diego Cells in the given time window.

Although end-user impact is usually low if only one Diego Cell is impacted, this should still be investigated. Particularly in a lower capacity environment, this situation can result in negative end-user impact if left unresolved.

Origin: Firehose
Type: Gauge (Float, 0-1)
Frequency: 30 s
Recommended measurement Maximum over the last 5 minutes
Recommended alert thresholds Yellow warning: = 1
Red critical: > 1
Recommended response
  1. Investigate Diego Cell servers for faults and errors.
  2. If a particular Diego Cell or Diego Cells appear problematic:
    1. Determine a time interval during which the metrics from the Diego Cell changed from healthy to unhealthy.
    2. Pull the logs that the Diego Cell generated over that interval. The Diego Cell ID is the same as the BOSH instance ID.
    3. Pull the BBS logs over that same time interval.
    4. Contact Support.
  3. As a last resort, if you cannot wait for Support, it sometimes helps to recreate the Diego Cell by running bosh recreate. For information about the bosh recreate command syntax, see Deployments in Commands in the BOSH documentation.

    Recreating a Diego Cell destroys its logs. To enable a root cause analysis of the Diego Cell's problem, save its logs before running bosh recreate.
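One way to read the thresholds for rep.GardenHealthCheckFailed is as a count of unhealthy Diego Cells: yellow when exactly one Diego Cell reported 1 in the window, and red when more than one did. The sketch below follows that interpretation, which is an assumption about how per-cell values are combined. The function names and input format are hypothetical.

```python
from typing import Dict, List, Sequence


def unhealthy_cells(samples_by_cell: Dict[str, Sequence[float]]) -> List[str]:
    """Cells whose maximum rep.GardenHealthCheckFailed value in the window is 1."""
    return sorted(
        cell
        for cell, values in samples_by_cell.items()
        if values and max(values) >= 1
    )


def garden_health_severity(samples_by_cell: Dict[str, Sequence[float]]) -> str:
    """Yellow for one unhealthy cell, red for several (assumed interpretation)."""
    count = len(unhealthy_cells(samples_by_cell))
    if count > 1:
        return "red"
    if count == 1:
        return "yellow"
    return "ok"
```

Returning the list of cell names, not only the count, makes it easier to decide which Diego Cell logs to pull for the interval in question.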

Diego Locket metrics

These sections describe Diego Locket metrics.

Locks held by BBS

bbs.LockHeld
Description Whether a BBS instance holds the expected BBS lock (in Locket). 1 means the active BBS server holds the lock, and 0 means the lock was lost.

Use: This metric is complementary to Active Locks, and it offers a BBS-level version of the Locket metrics. Although it is emitted per BBS instance, only 1 active lock is held by BBS. Therefore, the expected value is 1. The metric might be 0 when the BBS instances are performing a leader transition, but a prolonged value of 0 indicates an issue with BBS.

Origin: Firehose
Type: Gauge
Frequency: Periodically
Recommended measurement Maximum over the last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: ≠ 1
Recommended response
  1. Run monit status on the Diego database VM to check for failing processes.
  2. If there are no failing processes, then review the logs for BBS.
    • A healthy BBS shows obvious activity around starting or claiming LRPs.
    • An unhealthy BBS leads to the Auctioneer showing minimal or no activity. The BBS sends work to the Auctioneer.
  3. If you are unable to resolve the issue, pull logs from the Diego BBS, which include the Locket service component logs, and contact Support.

Locks held by auctioneer

auctioneer.LockHeld
Description Whether an Auctioneer instance holds the expected Auctioneer lock (in Locket). 1 means the active Auctioneer holds the lock, and 0 means the lock was lost.

Use: This metric is complementary to Active Locks, and it offers an Auctioneer-level version of the Locket metrics. Although it is emitted per Auctioneer instance, only 1 active lock is held by Auctioneer. Therefore, the expected value is 1. The metric might be 0 when the Auctioneer instances are performing a leader transition, but a prolonged value of 0 indicates an issue with Auctioneer.

Origin: Firehose
Type: Gauge
Frequency: Periodically
Recommended measurement Maximum over the last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: ≠ 1
Recommended response
  1. Run monit status on the Diego Database VM to check for failing processes.
  2. If there are no failing processes, then review the logs for Auctioneer.
    • Recent logs for Auctioneer should show all but one of its instances are currently waiting on locks, and the active Auctioneer should show a record of when it last attempted to run work. This attempt corresponds to app development activity, such as cf push.
  3. If you are unable to resolve the issue, pull logs from the Diego BBS and Auctioneer VMs, which include the Locket service component logs, and contact Support.
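The LockHeld check for either component reduces to the same test: across all instances, the maximum value in the 5-minute window should be 1. A minimal sketch follows, with a hypothetical function name and input format. Treating a window with no data as an alert is an assumption, and you might prefer to handle missing data separately.

```python
from typing import Dict, Sequence


def lock_lost(values_by_instance: Dict[str, Sequence[float]]) -> bool:
    """True when no instance reported LockHeld = 1 in the window."""
    all_values = [
        value for values in values_by_instance.values() for value in values
    ]
    # Assumption: an empty window is treated the same as a lost lock.
    return not all_values or max(all_values) != 1
```

Taking the maximum over the whole window means that a brief 0 during a leader transition does not trigger the alert, while a value of 0 that persists for the entire window does.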

Active presences

locket.ActivePresences
Description Total count of active presences. Presences are defined as the registration records that the Diego Cells maintain to advertise themselves to the platform.

Use: If the Active Presences count is far from the expected, there might be a problem with Diego.

The number of active presences varies according to the number of Diego Cells deployed. Therefore, during purposeful scale adjustments to TAS for VMs, this alerting threshold should be adjusted.
Establish an initial threshold by observing the historical trends for the deployment over a brief period of time. Increase the threshold as more Diego Cells are deployed. During a rolling deploy, this metric shows variance during the BOSH lifecycle when Diego Cells are evacuated and restarted. Tolerable variance is within the bounds of the BOSH maximum in-flight range for the instance group.

Origin: Firehose
Type: Gauge
Frequency: 60 s
Recommended measurement Maximum over the last 15 minutes
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Ensure that the variance is not the result of an active rolling deploy. Also ensure that the alert threshold is appropriate to the number of Diego Cells in the current deployment.
  2. Run monit status to inspect for failing processes.
  3. If there are no failing processes, then review the logs for the components using the Locket service itself on Diego BBS instances.
  4. If you are unable to resolve the problem, pull the logs from the Diego BBS, which include the Locket service component logs, and contact Support.

Diego Route Emitter metrics

These sections describe Diego Route Emitter metrics.

Route Emitter time to sync

route_emitter.RouteEmitterSyncDuration
Description Time in ns that the active Route Emitter took to perform its synchronization pass.

Use: Increases in this metric indicate that the Route Emitter might have trouble maintaining an accurate routing table to broadcast to the Gorouters. Tune alerting values to your deployment based on historical data and adjust based on observations over time. The suggested starting point is ≥ 5 s for the yellow threshold and ≥ 10 s for the critical threshold.

Origin: Firehose
Type: Gauge (Float in ns)
Frequency: 60 s
Recommended measurement Maximum, per job, over the last 15 minutes divided by 1,000,000,000
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response If all or many jobs show as impacted, there is likely an issue with Diego.
  1. Investigate the Route Emitter and Diego BBS logs for errors.
  2. Verify that app routes are functional by making a request to an app, pushing an app and pinging it, or if applicable, checking that your smoke tests have passed.
If only one or a few jobs show as impacted, there is likely a connectivity issue, and the impacted jobs should be investigated further.
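The per-job measurement and the distinction between "all or many" and "one or a few" impacted jobs can be sketched as follows. The function names, the job names, and the `many_fraction` cutoff are hypothetical. The document does not define how many jobs count as "many", so tune that value to your deployment.

```python
from typing import Dict, List, Sequence

NS_PER_SECOND = 1_000_000_000


def impacted_jobs(
    durations_ns_by_job: Dict[str, Sequence[float]], threshold_s: float
) -> List[str]:
    """Jobs whose maximum sync duration in the window meets the threshold."""
    return sorted(
        job
        for job, values in durations_ns_by_job.items()
        if values and max(values) / NS_PER_SECOND >= threshold_s
    )


def likely_scope(impacted_count: int, total_jobs: int, many_fraction: float = 0.5) -> str:
    """Rough triage: 'diego' for many impacted jobs, 'connectivity' for a few."""
    if impacted_count == 0:
        return "none"
    if total_jobs > 0 and impacted_count / total_jobs >= many_fraction:
        return "diego"
    return "connectivity"
```

For example, if one job out of four exceeds a 5-second threshold, `likely_scope(1, 4)` returns `"connectivity"`, which points to investigating that job first.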

TAS for VMs MySQL KPIs

These sections describe TAS for VMs MySQL KPIs.

When TAS for VMs uses an internal MySQL database, as configured in the Databases pane of the TAS for VMs tile, the database cluster generates KPIs as described here.

This section assumes you are using the Internal Databases - MySQL - Percona XtraDB Cluster option as your system database.

Server availability


/mysql/available

Description Whether the MySQL Server is responding to requests. This indicates whether the component is available.

Use: If the server does not emit heartbeats, it is offline.

Origin: Doppler/Firehose
Type: Boolean
Frequency: 30 s
Recommended measurement Average over last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: < 1
Recommended response Check the MySQL Server logs for errors. You can find the instance by targeting your MySQL deployment with BOSH and inspecting logs for the instance. For more information, see Failing Jobs and Unhealthy Instances.

If your service plan is a highly available (HA) cluster, you can also run mysql-diag to check logs for errors.

Persistent Disk Used


/mysql/system/persistent_disk_used_percent

Description The percentage of disk used on the persistent file system.

Use: MySQL cannot function correctly if there is not sufficient free space on the file systems. Use these metrics to ensure that you have disks large enough for your user base.

Origin: Doppler/Firehose
Type: Percent
Frequency: 30 s (default)
Recommended measurement Maximum persistent disk used across all nodes
Recommended alert thresholds Single Node and Leader Follower:
  • Yellow warning: > 25%
  • Red critical: > 30%
Highly Available Cluster:
  • Yellow warning: > 80%
  • Red critical: > 90%
Recommended response Upgrade the service instance to a plan with larger disk capacity.

For Tanzu SQL for VMs v2.9 and later, if you set the optimize_for_short_words parameter to true, then see Troubleshooting VMware Tanzu SQL with MySQL for VMs before upgrading the service.
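Because the persistent disk thresholds differ by topology, a check needs to know which plan type it is evaluating. The sketch below is only an illustration: the topology labels, function name, and node names are hypothetical, while the percentages are the thresholds listed above.

```python
from typing import Dict

# Topology labels are hypothetical names chosen for this sketch.
# Each entry is (yellow threshold, red threshold) in percent.
PERSISTENT_DISK_THRESHOLDS = {
    "single-node": (25.0, 30.0),
    "leader-follower": (25.0, 30.0),
    "ha-cluster": (80.0, 90.0),
}


def persistent_disk_status(percent_by_node: Dict[str, float], topology: str) -> str:
    """Classify the maximum persistent disk usage across all nodes."""
    yellow, red = PERSISTENT_DISK_THRESHOLDS[topology]
    worst = max(percent_by_node.values())
    if worst > red:
        return "red"
    if worst > yellow:
        return "yellow"
    return "ok"
```

For example, 27% usage is a yellow warning on a single-node plan but well within bounds on a highly available cluster.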

Ephemeral Disk Used


/mysql/system/ephemeral_disk_used_percent

Description The percentage of disk used on the ephemeral file system.

Use: MySQL cannot function correctly if there is not sufficient free space on the file systems. Use these metrics to ensure that you have disks large enough for your user base.

Origin: Doppler/Firehose
Type: Percent
Frequency: 30 s (default)
Recommended measurement Maximum disk used of all nodes
Recommended alert thresholds Yellow warning: > 80%
Red critical: > 95%
Recommended response Upgrade the service instance to a plan with larger disk capacity.

CPU use percentage


/mysql/performance/cpu_utilization_percent

Description CPU time being consumed by the MySQL service.

Use: A node that experiences context switching or high CPU use becomes unresponsive. This also affects the ability of the node to report metrics.

Origin: Doppler/Firehose
Type: Percent
Frequency: 30 s (default)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: > 80%
Red critical: > 90%
Recommended response Discover what is using so much CPU. If it is from normal processes, update the service instance to use a plan with larger CPU capacity.

Connections


/mysql/variables/max_connections

/mysql/net/max_used_connections

Description The maximum number of connections used over the maximum permitted number of simultaneous client connections.

Use: If the number of connections drastically changes or if apps are unable to connect, there might be a network or app issue.

Origin: Doppler/Firehose
Type: count
Frequency: 30 s
Recommended measurement max_used_connections / max_connections
Recommended alert thresholds Yellow warning: > 80 %
Red critical: > 90 %
Recommended response If this measurement meets or exceeds 80% with exponential growth, monitor app use to ensure that everything is working.

When approaching 100% of maximum connections, apps might not always connect to the database. The connections/second for a service instance vary based on app instances and app use.
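The connection measurement is a simple ratio of the two metrics. A minimal sketch, assuming both values have already been read from the Firehose (the function name and status labels are illustrative only):

```python
def connection_usage_status(max_used_connections: int, max_connections: int) -> str:
    """Compare max_used_connections / max_connections with the 80% and 90% thresholds."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    usage_percent = 100.0 * max_used_connections / max_connections
    if usage_percent > 90:
        return "red"
    if usage_percent > 80:
        return "yellow"
    return "ok"
```

With a limit of 1,500 connections, a peak of 1,300 connections (about 87%) would raise a yellow warning, and a peak of 1,400 (about 93%) would be critical.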

Queries Delta


/mysql/performance/queries_delta

Description The number of statements executed by the server over the last 30 seconds.

Use: The server always processes queries. If the server does not process queries, the server is non-functional.

Origin: Doppler/Firehose
Type: count
Frequency: 30 s
Recommended measurement Average over last 2 minutes
Recommended alert thresholds Red critical: 0
Recommended response Investigate the MySQL server logs, such as the audit log, to understand why the query rate changed, and decide on appropriate action.

Highly Available Cluster WSREP Ready


/mysql/galera/wsrep_ready

Description Shows whether each cluster node can accept queries. Returns only 0 or 1. When this metric is 0, almost all queries to that node fail with the error:
ERROR 1047 (08S01) Unknown Command

Use: Discover when nodes of a cluster were unable to communicate and accept transactions.

Origin: Doppler/Firehose
Type: Boolean
Frequency: 30 s (default)
Recommended measurement Average of values of each cluster node, over the last 5 minutes
Recommended alert thresholds Yellow warning: < 1
Red critical: 0 (cluster is down)
Recommended response
  • Run mysql-diag and check the MySQL Server logs for errors.
  • Ensure that no infrastructure event is affecting intra-cluster communication.
  • Ensure that wsrep_ready is not set to off by using the query:
    SHOW STATUS LIKE 'wsrep_ready';
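The wsrep_ready measurement averages a 0-or-1 value across nodes and time, so any value below 1 means at least one node was not ready at some point in the window. A rough sketch of that classification, with hypothetical node names and an assumed input shape:

```python
from typing import Dict, List


def wsrep_ready_status(samples_by_node: Dict[str, List[int]]) -> str:
    """Average the 0/1 wsrep_ready samples from all nodes over the window."""
    values = [v for samples in samples_by_node.values() for v in samples]
    if not values:
        return "no-data"
    average = sum(values) / len(values)
    if average == 0:
        return "red"      # no node accepted queries: cluster is down
    if average < 1:
        return "yellow"   # at least one node was not ready at some point
    return "ok"
```

Pooling all samples before averaging is one reading of "average of values of each cluster node"; averaging per node first gives the same result when every node reports the same number of samples.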

Highly Available Cluster WSREP Cluster Size


/mysql/galera/wsrep_cluster_size

Description The number of cluster nodes with which each node is communicating normally.

Use: When running in a multi-node configuration, this metric indicates if each member of the cluster is communicating normally with all other nodes.

Origin: Doppler/Firehose
Type: count
Frequency: 30 s (default)
Recommended measurement (Average of the values of each node / cluster size), over the last 5 minutes
Recommended alert thresholds Yellow warning: < 3 (availability compromised)
Red critical: < 1 (cluster unavailable)
Recommended response Run mysql-diag and check the MySQL Server logs for errors.

Highly Available Cluster WSREP Cluster Status


/mysql/galera/wsrep_cluster_status

Description Shows the primary status of the cluster component that the node is in.
Values are:
  • Primary: Node has a quorum.
  • Non-primary: Node has lost a quorum.
  • Disconnected: Node is unable to connect to other nodes.
Use: Any value other than Primary indicates that the node is part of a non-operational component. This occurs in cases of multiple membership changes that cause a loss of quorum.

Origin: Doppler/Firehose
Type: integer (see above)
Frequency: 30 s (default)
Recommended measurement Sum of each of the nodes, over the last 5 minutes
Recommended alert thresholds Yellow warning: < 3
Red critical: < 1
Recommended response
  • Verify that all nodes are in working order and can receive write-sets.
  • Run mysql-diag and check the MySQL Server logs for errors.

Gorouter metrics

These sections describe Gorouter metrics.

Router file descriptors

gorouter.file_descriptors
Description The number of file descriptors currently used by the Gorouter job.

Use: Indicates an impending issue with the Gorouter. Without proper mitigation, it is possible for an unresponsive app to eventually exhaust available Gorouter file descriptors and cause route starvation for other apps running on TAS for VMs. Under heavy load, this unmitigated situation can also result in the Gorouter losing its connection to NATS and all routes being pruned.

While a drop in gorouter.total_routes or an increase in gorouter.ms_since_last_registry_update helps to surface that the issue might already be occurring, alerting on gorouter.file_descriptors indicates that such an issue is impending.

The Gorouter limits the number of file descriptors to 100,000 per job. Once the limit is met, the Gorouter is unable to establish any new connections.

To reduce the risk of DDoS attacks, VMware recommends doing one or both of the following:

  • Within TAS for VMs, set Maximum connections per back end to define how many requests can be routed to each Gorouter app instance. This prevents a single app from using all Gorouter connections. The value specified should be determined by the operator based on the use cases for that foundation. A value of 0 sets no limit. The default number is 500.
  • Add rate limiting at the load balancer level.
Origin: Firehose
Type: Gauge
Frequency: 5 s
Recommended measurement Maximum, per Gorouter job, over the last 5 minutes
Recommended alert thresholds Yellow warning: 50,000 per job
Red critical: 60,000 per job
Recommended response
  1. Identify which app(s) are requesting excessive connections and resolve the impacting issues with these apps.
  2. If the mitigation steps have not already been taken, do so.
  3. Consider adding more Gorouter VM resources to increase the number of available file descriptors.
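As an illustration of the file descriptor measurement, the sketch below takes the maximum per Gorouter job over the window and compares it with the 50,000 and 60,000 thresholds, also reporting the remaining headroom against the 100,000 limit. The function name, input shape, and the choice of "greater than or equal" comparisons are assumptions made for this example.

```python
from typing import Dict, List, Tuple

FD_LIMIT_PER_JOB = 100_000  # limit stated in this section


def file_descriptor_status(
    samples_by_job: Dict[str, List[int]],
    yellow: int = 50_000,
    red: int = 60_000,
) -> Dict[str, Tuple[str, int]]:
    """Return (status, remaining descriptors) per Gorouter job.

    samples_by_job maps a job name to gorouter.file_descriptors samples
    collected over the last 5 minutes.
    """
    result = {}
    for job, samples in samples_by_job.items():
        if not samples:
            continue  # no data for this job in the window
        peak = max(samples)
        if peak >= red:
            status = "red"
        elif peak >= yellow:
            status = "yellow"
        else:
            status = "ok"
        result[job] = (status, FD_LIMIT_PER_JOB - peak)
    return result
```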

Router exhausted connections

gorouter.backend_exhausted_conns
Description The lifetime number of requests that have been rejected by the Gorouter VM due to the Max Connections Per Backend limit being reached across all tried back ends. The limit controls the number of concurrent TCP connections to any particular app instance and is configured within TAS for VMs.

Use: Indicates that TAS for VMs is mitigating risk to other apps by self-protecting the platform against one or more unresponsive apps. Increases in this metric indicate the need to investigate and resolve issues with potentially unresponsive apps. A rapid rate of change upward is concerning and should be assessed further.

Origin: Firehose
Type: Counter (Integer)
Frequency: 5 s
Recommended measurement Maximum delta per minute, per Gorouter job, over a 5-minute window
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. If gorouter.backend_exhausted_conns spikes, first look to the Router Throughput metric gorouter.total_requests to determine if this measure is high or low in relation to normal bounds for this deployment.
  2. If Router Throughput appears within normal bounds, it is likely that gorouter.backend_exhausted_conns is spiking due to an unresponsive app, possibly due to app code issues or underlying app dependency issues. To help determine the problematic app, look in access logs for repeated calls to one app. Then proceed to troubleshoot this app accordingly.
  3. If Router Throughput also shows unusual spikes, the cause of the increase in gorouter.backend_exhausted_conns spikes is likely external to the platform. Unusual increases in load can be due to expected business events driving additional traffic to apps. Unexpected increases in load can indicate a DDoS attack risk.
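Because gorouter.backend_exhausted_conns is a lifetime counter, alerting works on its rate of change rather than its absolute value. The following is a minimal sketch of the "maximum delta per minute" measurement, assuming samples arrive as (timestamp in seconds, counter value) pairs for one Gorouter job; the function name and the reset handling are assumptions.

```python
from typing import List, Tuple


def max_delta_per_minute(samples: List[Tuple[float, int]]) -> int:
    """Largest minute-over-minute increase of a lifetime counter.

    Keeps the last counter value seen in each one-minute bucket and
    compares consecutive buckets. If a minute has no samples, the
    delta spans the gap.
    """
    last_in_minute = {}
    for ts, value in sorted(samples):
        last_in_minute[int(ts // 60)] = value
    minutes = sorted(last_in_minute)
    deltas = []
    for prev, cur in zip(minutes, minutes[1:]):
        delta = last_in_minute[cur] - last_in_minute[prev]
        if delta < 0:
            # Counter went backwards, for example after a restart:
            # treat the new value as the increase since the reset.
            delta = last_in_minute[cur]
        deltas.append(delta)
    return max(deltas, default=0)
```

Pass in only the samples from the 5-minute window being evaluated. The same calculation applies to the other lifetime counters in this section that use a per-minute delta measurement.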

Router throughput

gorouter.total_requests
Description The lifetime number of requests completed by the Gorouter VM, emitted per Gorouter instance.

Use: The aggregation of these values across all Gorouters provides insight into the overall traffic flow of a deployment. Unusually high spikes, if not known to be associated with an expected increase in demand, can indicate a DDoS risk. For performance and capacity management, consider this metric a measure of router throughput per job. Convert it to requests per second by taking the delta value of gorouter.total_requests and deriving back to 1 s, that is, (gorouter.total_requests.delta)/5, per Gorouter instance. This helps you see trends in the throughput rate that indicate a need to scale the Gorouter instances. Use the trends you observe to tune the threshold alerts for this metric.

Origin: Firehose
Type: Counter (Integer)
Frequency: 5 s
Recommended measurement Average over the last 5 minutes of the derived per second calculation
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response For optimizing the Gorouter, consider the requests-per-second derived metric in the context of router latency and Gorouter VM CPU utilization. From performance and load testing of the Gorouter, VMware has observed that at approximately 2500 simple requests per second, latency can begin to increase. This number changes based on your traffic profile and VM capabilities.

To increase throughput and maintain low latency, scale the Gorouters either horizontally or vertically and ensure that the system.cpu.user metric for the Gorouter stays in the suggested range of 60-70% CPU Utilization. For more information about the system.cpu.user metric, see VM CPU Utilization.
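The derived requests-per-second rate can be sketched as follows, assuming consecutive gorouter.total_requests samples from one Gorouter instance taken at the 5-second emission interval. The function name and the decision to skip counter resets are choices made for this illustration.

```python
from typing import List


def average_requests_per_second(counter_samples: List[int], interval_s: float = 5.0) -> float:
    """Average per-second request rate derived from a lifetime counter.

    Each delta between consecutive samples is divided by the emission
    interval, and the resulting rates are averaged over the window.
    """
    rates = []
    for prev, cur in zip(counter_samples, counter_samples[1:]):
        delta = cur - prev
        if delta < 0:
            continue  # counter reset (for example, a VM restart): skip this pair
        rates.append(delta / interval_s)
    return sum(rates) / len(rates) if rates else 0.0
```

For example, samples of 0, 10,000, and 22,500 give rates of 2,000 and 2,500 requests per second, for an average of 2,250, which is close to the level at which latency has been observed to start increasing.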

Router handling latency

gorouter.latency
Description The time in milliseconds that represents the length of a request from the Gorouter's point of view. This timing starts when the Gorouter receives a request and stops when the Gorouter finishes processing the response from the app. Long uploads, downloads, or app responses increase this time. This metric includes the request time to all back end endpoints, including both apps and routable system components like Cloud Controller and UAA.

Use: Indicates the traffic profile of TAS for VMs. An alert value on this metric should be tuned to the specifics of the deployment and its underlying network considerations; a suggested starting point is 100 ms.

Origin: Firehose
Type: Gauge (Float in ms)
Frequency: Emitted per Gorouter request, emission should be constant on a running deployment
Recommended measurement Average over the last 30 minutes
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response Extended periods of high latency can point to several factors. The Gorouter latency measure includes network and back end latency impacts as well.

  1. First inspect logs for network issues and indications of misbehaving back ends.
  2. If it appears that the Gorouter needs to scale due to ongoing traffic congestion, do not scale on the latency metric alone. You should also look at the CPU utilization of the Gorouter VMs and keep it within a maximum 60-70% range.
  3. Resolve high utilization by scaling the Gorouter.
  4. Follow the steps in Troubleshooting Slow Requests in CF.

Time since last route register received

gorouter.ms_since_last_registry_update
Description Time in milliseconds since the last route register was received, emitted per Gorouter instance.

Use: Indicates if routes are not being registered to apps correctly.

Origin: Firehose
Type: Gauge (Float in ms)
Frequency: 30 s
Recommended measurement Maximum over the last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: > 30,000
This threshold is suitable for normal platform usage. It alerts if it has been at least 30 seconds since the Gorouter last received a message from an app.
Recommended response
  1. Search the Gorouter and Route Emitter logs for connection issues to NATS.
  2. Check the BOSH logs to see if the NATS, Gorouter, or Route Emitter VMs are failing.
  3. Look more broadly at the health of all VMs, particularly Diego-related VMs.
  4. If problems persist, pull the Gorouter and Route Emitter logs and contact Support to say there are consistently long delays in route registry.

Router Error: 502 bad gateway

gorouter.bad_gateways
Description The lifetime number of bad gateways, or 502 responses, from the Gorouter itself, emitted per Gorouter instance.
The Gorouter emits a 502 bad gateway error when it has a route in the routing table and, in attempting to make a connection to the back end, finds that the back end does not exist.

Use: Indicates that route tables might be stale. Stale routing tables suggest an issue in the route register management plane, which indicates that something has likely changed with the locations of the containers. Always investigate unexpected increases in this metric.

Origin: Firehose
Type: Count (Integer, Lifetime)
Frequency: 5 s
Recommended measurement Maximum delta per minute over a 5-minute window
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Check the Gorouter and Route Emitter logs to see if they are experiencing issues when connecting to NATS.
  2. Check the BOSH logs to see if the NATS, Gorouter, or Route Emitter VMs are failing.
  3. Look broadly at the health of all VMs, particularly Diego-related VMs.
  4. If problems persist, pull Gorouter and Route Emitter logs and contact Support to say there has been an unusual increase in Gorouter bad gateway responses.

Router Error: server error

gorouter.responses.5xx
Description The lifetime number of requests completed by the Gorouter VM for HTTP status family 5xx, server errors, emitted per Gorouter instance.

Use: A repeatedly crashing app is often the cause of a big increase in 5xx responses. However, response issues from apps can also cause an increase in 5xx responses. Always investigate an unexpected increase in this metric.

Origin: Firehose
Type: Counter (Integer)
Frequency: 5 s
Recommended measurement Maximum delta per minute over a 5-minute window
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Look for out-of-memory errors and other app-level errors.
  2. As a temporary measure, ensure that the troublesome app is scaled to more than one instance.

Number of Gorouter routes registered

gorouter.total_routes
Description The current total number of routes registered with the Gorouter, emitted per Gorouter instance.

Use: The aggregation of these values across all Gorouters indicates uptake and gives a picture of the overall growth of the environment for capacity planning.

VMware also recommends alerting on this metric if the number of routes falls outside of the normal range for your deployment. Dramatic decreases in this metric volume can indicate a problem with the route registration process, such as an app outage, or that something in the route register management plane has failed.

If visualizing these metrics on a dashboard, gorouter.total_routes can be helpful for visualizing dramatic drops. However, for alerting purposes, the gorouter.ms_since_last_registry_update metric is more valuable for quicker identification of Gorouter issues. Alerting thresholds for gorouter.total_routes should focus on dramatic increases or decreases out of expected range.

Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
Recommended measurement 5-minute average of the per second delta
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. For capacity needs, scale up or down the Gorouter VMs as necessary.
  2. For significant drops in current total routes, see the gorouter.ms_since_last_registry_update metric for additional context. For more information, see Time Since Last Route Register Received.
  3. Check the Gorouter and Route Emitter logs to see if they are experiencing issues when connecting to NATS.
  4. Check the BOSH logs to see if the NATS, Gorouter, or Route Emitter VMs are failing.
  5. Look broadly at the health of all VMs, particularly Diego-related VMs.
  6. If problems persist, pull the Gorouter and Route Emitter logs and contact Support.

UAA metrics

UAA throughput

uaa.requests.global.completed.count
Description The lifetime number of requests completed by the UAA VM, emitted per UAA instance. This number includes health checks.

Use: For capacity planning purposes, the aggregation of these values across all UAA instances can provide insight into the overall load that UAA is processing. VMware recommends alerting on unexpected spikes per UAA instance. Unusually high spikes, if they are not associated with an expected increase in demand, might indicate a DDoS risk and should be investigated.

For performance and capacity management, look at the UAA Throughput metric as either a requests-completed-per-second or requests-completed-per-minute rate to determine the throughput per UAA instance. This helps you see trends in the throughput rate that indicate a need to scale UAA instances. Use the trends you observe to tune the threshold alerts for this metric.

From performance and load testing of UAA, VMware has observed that while UAA endpoints can have different throughput behavior, once throughput reaches its peak value per VM, it stays constant and latency increases.

Origin: Firehose
Type: Gauge (Integer), emitted value increments over the lifetime of the VM like a counter
Frequency: 5 s
Recommended measurement Average over the last 5 minutes of the derived requests-per-second or requests-per-minute rate, per instance
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response For optimizing UAA, consider this metric in the context of UAA request latency and UAA VM CPU utilization. To increase throughput and maintain low latency, scale the UAA VMs horizontally by editing the number of your UAA VM instances in the Resource Config pane of the TAS for VMs tile and ensure that the system.cpu.user metric for UAA is not sustained in the suggested range of 80-90% maximum CPU utilization. For more information, see UAA Request Latency and UAA VM CPU Utilization in Key Capacity Scaling Indicators.

UAA request latency

gorouter.latency.uaa
Description Time in milliseconds that UAA took to process a request that the Gorouter sent to UAA endpoints.

Use: Indicates how responsive UAA has been to requests sent from the Gorouter. Some operations might take longer to process, such as creating bulk users and groups. It is important to correlate latency observed with the endpoint and evaluate this data in the context of overall historical latency from that endpoint. Unusual spikes in latency can indicate the need to scale UAA VMs.

This metric is emitted only for the routers serving the UAA system component and is not emitted per isolation segment even if you are using isolated routers.

Origin: Firehose
Type: Gauge (Float in ms)
Frequency: Emitted per Gorouter request to UAA
Recommended measurement Maximum, per job, over the last 5 minutes
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response Latency depends on the endpoint and operation being used. It is important to correlate the latency with the endpoint and evaluate this data in the context of the historical latency from that endpoint.
  1. Inspect which endpoints requests are hitting. Use historical data to determine if the latency is unusual for that endpoint. For a list of UAA endpoints, see the UAA API documentation.
  2. If it appears that UAA needs to be scaled due to ongoing traffic congestion, do not scale based on the latency metric alone. You should also ensure that the system.cpu.user metric for UAA stays in the suggested range of 80-90% maximum CPU utilization.
  3. Resolve high utilization by scaling UAA VMs horizontally. To scale UAA, navigate to the Resource Config pane of the TAS for VMs tile and edit the number of your UAA VM instances.

UAA requests in flight

uaa.server.inflight.count
Description The number of requests UAA is currently processing (in-flight requests), emitted per UAA instance.

Use: Indicates how many concurrent requests are currently in flight for the UAA instance. Unusually high spikes, if they are not associated with an expected increase in demand, might indicate a DDoS risk.

From performance and load testing of the UAA component, VMware has observed that the number of concurrent requests impacts throughput and latency. The UAA Requests In Flight metric helps you see trends in the request rate that can indicate the need to scale UAA instances. Use the trends you observe to tune the threshold alerts for this metric.

Origin: Firehose
Type: Gauge (Integer)
Frequency: 5 s
Recommended measurement Maximum, per job, over the last 5 minutes
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response To increase throughput and maintain low latency when the number of in-flight requests is high, scale UAA VMs horizontally by editing the UAA VM field in the Resource Config pane of the TAS for VMs tile. Ensure that the system.cpu.user metric for UAA is not sustained in the suggested range of 80-90% maximum CPU utilization.

System (BOSH) metrics

These sections describe system metrics, or BOSH metrics.

BOSH system metrics appear in the Firehose in two different formats. The tables in the following section list both formats.

Virtual Machine (VM) health

system.healthy
system_healthy
Description 1 means the system is healthy, and 0 means the system is not healthy.

Use: This is the most important BOSH metric to monitor. It indicates if the VM emitting the metric is healthy. Review this metric for all VMs to estimate the overall health of the system.

Multiple unhealthy VMs signal problems with the underlying IaaS layer.

Origin: Firehose
Type: Gauge (Float, 0-1)
Frequency: 60 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: < 1
Recommended response Investigate TAS for VMs logs for the unhealthy components.

VM disk used

system.disk.system.percent
system_disk_system_percent
Description System disk — Percentage of the system disk used on the VM

Use: Set an alert to indicate when the system disk is almost full.

Origin: Firehose
Type: Gauge (%)
Frequency: 60 s
Recommended measurement Average over the last 30 minutes
Recommended alert thresholds Yellow warning: ≥ 80%
Red critical: ≥ 90%
Recommended response Investigate what is filling the job's system partition.
This partition should not typically fill because BOSH deploys jobs to use ephemeral and persistent disks.

VM ephemeral disk used

system.disk.ephemeral.percent
system_disk_ephemeral_percent
Description Ephemeral disk — Percentage of the ephemeral disk used on the VM

Use: Set an alert and investigate if the ephemeral disk usage is too high for a job over an extended period.

Origin: Firehose
Type: Gauge (%)
Frequency: 60 s
Recommended measurement Average over the last 30 minutes
Recommended alert thresholds Yellow warning: ≥ 80%
Red critical: ≥ 90%
Recommended response
  1. Run bosh vms --details to view jobs on affected deployments.
  2. Determine cause of the data consumption, and, if appropriate, increase disk space or scale out the affected jobs.

VM persistent disk used

system.disk.persistent.percent
system_disk_persistent_percent
Description Persistent disk — Percentage of persistent disk used on the VM

Use: Set an alert and investigate further if the persistent disk usage for a job is too high over an extended period.

Origin: Firehose
Type: Gauge (%)
Frequency: 60 s
Recommended measurement Average over the last 30 minutes
Recommended alert thresholds Yellow warning: ≥ 80%
Red critical: ≥ 90%
Recommended response
  1. Run bosh vms --details to view jobs on affected deployments.
  2. Determine cause of the data consumption, and, if appropriate, increase disk space or scale out affected jobs.

OpenTelemetry Collector metrics

This section describes key indicators for monitoring your system’s ability to handle telemetry data and provide appropriate scaling guidance.

Backpressure at drain destinations

backpressure_drain_destinations
Description Backpressure reflects the difference between data ingested and the data successfully exported. It’s a signal of congestion at the drain destinations where data may begin piling up.
Recommended measurement
  • otelcol_exporter_queue_capacity indicates the total queue capacity in batches of the sending queue.
  • otelcol_exporter_queue_size indicates the current size of the sending queue.
Use these two metrics to check if the queue capacity can support your workload.

To measure failed enqueues, track the following metrics:
  • otelcol_exporter_enqueue_failed_spans
  • otelcol_exporter_enqueue_failed_metric_points
  • otelcol_exporter_enqueue_failed_log_records
These metrics show how many spans, metrics, and logs failed to be enqueued into the pipeline when the queue was full.
Recommended alert thresholds Alert when a high proportion of queue_size to queue_capacity persists over a prolonged window.
Recommended response
  • Enable a retry mechanism to mitigate potential failures caused by backpressure in production deployments.
  • Consider horizontally scaling out affected jobs or decreasing the amount of telemetry generated by them.
  • If collector scaling does not resolve the issue, consider scaling the destination system.
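One way to express "a high proportion of queue_size to queue_capacity persists over a prolonged window" is to require every sample in the window to exceed a ratio. The sketch below is illustrative only: the 0.8 ratio and the minimum of 5 samples are hypothetical defaults, not recommended values, and the function name is an assumption.

```python
from typing import List, Tuple


def queue_saturation_persists(
    samples: List[Tuple[int, int]],
    ratio_threshold: float = 0.8,  # hypothetical value, tune per deployment
    min_samples: int = 5,          # hypothetical minimum window length
) -> bool:
    """Return True if the queue stayed near capacity for the whole window.

    samples holds (otelcol_exporter_queue_size, otelcol_exporter_queue_capacity)
    pairs collected over the window.
    """
    if len(samples) < min_samples:
        return False  # not enough data to call it persistent
    return all(
        capacity > 0 and size / capacity >= ratio_threshold
        for size, capacity in samples
    )
```

Requiring every sample to exceed the ratio keeps short bursts from triggering the alert; a single dip below the threshold resets the condition.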

Data Loss

data_loss
Description Data loss indicates the amount of telemetry data (logs, metrics, spans) dropped at the VM level.
Recommended measurement To compute the data drop rate, use the formula:
Data drop rate = 1 - (e/r)
Where:
e: Number of data points successfully exported.
r: Number of data points successfully received.

Receiver metrics:

  • otelcol_receiver_accepted_log_records
  • otelcol_receiver_accepted_metric_points
  • otelcol_receiver_accepted_spans

Exporter metrics:

  • otelcol_exporter_sent_log_records
  • otelcol_exporter_sent_metric_points
  • otelcol_exporter_sent_spans
By comparing the number of data points ingested (r) to the number successfully exported (e), you can calculate the data drop rate as a value between 0 and 1.
  • If no data is dropped, e = r, and the data loss is 0.
  • If all data is dropped, e = 0, and the data loss is 1.
Recommended alert thresholds Set alert thresholds based on acceptable data loss for your project. Select a narrow time window before alerting begins to avoid notifications for small losses that are within the desired reliability range and not considered outages.
Recommended response
  • Check the CPU and memory consumption of the Collector in the affected VM.
  • Scale vertically by increasing CPU or memory allocated to the Collector to handle the data volume.
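The data drop rate formula can be written directly in code. In this minimal sketch, the function name is an assumption, and returning 0.0 when nothing has been received is a choice made here to avoid division by zero.

```python
def data_drop_rate(exported: int, received: int) -> float:
    """Data drop rate = 1 - (e / r).

    exported (e): data points successfully exported, for example the sum of
        the otelcol_exporter_sent_* metrics over the window.
    received (r): data points successfully received, for example the sum of
        the otelcol_receiver_accepted_* metrics over the same window.
    """
    if received <= 0:
        return 0.0  # nothing was received, so nothing could be dropped
    return 1.0 - exported / received
```

For example, 750 exported data points out of 1,000 received gives a drop rate of 0.25. Compute the rate separately for logs, metrics, and spans if you need per-signal alerting.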