VMware provides these indicators to operators as general guidance for capacity scaling. Each indicator is based on platform metrics from individual components.

This guidance is applicable to most TAS for VMs deployments. VMware recommends that operators fine-tune the suggested alert thresholds by observing historical trends for their deployments.

For more information about accessing metrics used in these key capacity scaling indicators, see Overview of Logging and Metrics.

Diego Cell capacity scaling indicators

There are three key capacity scaling indicators VMware recommends for a Diego Cell.


Diego Cell Memory Capacity

Description The Diego Cell Memory Capacity indicator is the percentage of remaining memory your Diego Cells can allocate to containers.
Divide the CapacityRemainingMemory metric by the CapacityTotalMemory metric to get this percentage.
The metric CapacityRemainingMemory is the remaining memory, in MiB, available to a Diego Cell.
The metric CapacityTotalMemory is the total memory, in MiB, available to a Diego Cell.
Source ID rep
Metrics CapacityRemainingMemory
CapacityTotalMemory
Recommended thresholds < average (35%)
This threshold assumes you have three AZs.
How to scale Deploy additional Diego Cells until the average free memory is at or above 35%. This threshold assumes you have three AZs.
Additional details Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells
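
As an illustration only, the following sketch (Python, with hypothetical metric values, not VMware tooling) shows one way to compute this percentage across a set of Diego Cells and compare the average against the 35% threshold. The disk and container indicators that follow are computed the same way from their respective remaining and total metrics.

    # Minimal sketch, not VMware tooling: compute the Diego Cell Memory Capacity
    # indicator from the rep metrics and decide whether to scale.
    # The metric readings below are hypothetical.

    def remaining_capacity_pct(remaining_mib: float, total_mib: float) -> float:
        """Percentage of cell capacity still available to containers."""
        return 100.0 * remaining_mib / total_mib

    # Average the per-cell percentages, then compare to the 35% threshold.
    cells = [
        {"CapacityRemainingMemory": 4096, "CapacityTotalMemory": 16384},
        {"CapacityRemainingMemory": 5120, "CapacityTotalMemory": 16384},
    ]
    avg_pct = sum(
        remaining_capacity_pct(c["CapacityRemainingMemory"], c["CapacityTotalMemory"])
        for c in cells
    ) / len(cells)

    if avg_pct < 35.0:
        print(f"Average free memory {avg_pct:.1f}% < 35%: deploy more Diego Cells")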

Diego Cell Disk Capacity

Description The Diego Cell Disk Capacity indicator is the percentage of remaining disk capacity a given Diego Cell can allocate to containers.
Divide the CapacityRemainingDisk metric by the CapacityTotalDisk metric to get this percentage.
The metric CapacityRemainingDisk is the remaining amount of disk available, in MiB, for this Diego Cell.
The metric CapacityTotalDisk indicates the total amount of disk available, in MiB, for this Diego Cell.
Source ID rep
Metrics CapacityRemainingDisk
CapacityTotalDisk
Recommended thresholds < average (35%)
This threshold assumes you have three AZs.
How to scale Deploy additional Diego Cells until the average free disk capacity is at or above 35%. This threshold assumes you have three AZs.
Additional details Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells

Diego Cell Container Capacity

Description The Diego Cell Container Capacity indicator is the percentage of containers remaining that a given Diego Cell can host.
Divide the CapacityRemainingContainers metric by the CapacityTotalContainers metric to get this percentage.
The metric CapacityRemainingContainers is the remaining number of containers this Diego Cell can host.
The metric CapacityTotalContainers is the total number of containers this Diego Cell can host.
Source ID rep
Metrics CapacityRemainingContainers
CapacityTotalContainers
Recommended thresholds < average (35%)
This threshold assumes you have three AZs.
How to scale Deploy additional Diego Cells until the average remaining container capacity is at or above 35%. This threshold assumes you have three AZs.
Additional details Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells

Firehose performance scaling indicators

VMware recommends three key capacity scaling indicators for monitoring Firehose performance.


Log Transport Loss Rate

Description The Log Transport Loss Rate indicator is the rate of messages dropped between the Dopplers and the Firehose.
Divide the dropped{direction=ingress} metric by the ingress metric to get the loss rate.

The metric ingress is the number of messages entering the Dopplers. The metric dropped is the number of messages never delivered to the Firehose.

For more information about Loggregator components, see Loggregator Architecture.
Source ID doppler
Metrics dropped
ingress
Label {direction=ingress}
Dopplers emit two separate dropped metrics, one for ingress and one for egress. The envelopes have a direction label. For this indicator, use the metric with a direction tag with a value of ingress.
Recommended thresholds Scale indicator: ≥ 0.01
If alerting:
Yellow warning: ≥ 0.005
Red critical: ≥ 0.01
Excessive dropped messages can indicate the Dopplers or Traffic Controllers are not processing messages quickly enough.
How to scale Scale up the number of Traffic Controller and Doppler instances.

At approximately 40 Doppler instances and 20 Traffic Controller instances, horizontal scaling is no longer useful for improving Firehose performance. To improve performance, add vertical scale to the existing Doppler and Traffic Controller instances by increasing CPU resources.

Additional details Type: Gauge (float)
Frequency: Base metrics are emitted every 5 s
Applies to: cf:doppler
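
As a rough illustration (not part of TAS), the sketch below computes the loss rate from two successive readings of the doppler counters; dropped and ingress are cumulative counters, so the rate comes from their deltas over the sampling window. The readings are hypothetical.

    # Minimal sketch, assuming two successive readings of the cumulative
    # dropped{direction=ingress} and ingress counters taken 5 s apart.

    def loss_rate(dropped_prev: int, dropped_now: int,
                  ingress_prev: int, ingress_now: int) -> float:
        """Fraction of envelopes dropped between the Dopplers and the Firehose."""
        ingress_delta = ingress_now - ingress_prev
        if ingress_delta <= 0:
            return 0.0
        return (dropped_now - dropped_prev) / ingress_delta

    rate = loss_rate(dropped_prev=120, dropped_now=180,
                     ingress_prev=100_000, ingress_now=104_000)

    if rate >= 0.01:
        print(f"Loss rate {rate:.3f} >= 0.01: scale Dopplers and Traffic Controllers")
    elif rate >= 0.005:
        print(f"Loss rate {rate:.3f} >= 0.005: warning")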

Doppler Message Ingress Capacity

Description The Doppler Message Ingress Capacity indicator is the average rate at which each Doppler instance ingests messages. The ingress counter is the number of messages ingressed by a Doppler instance. Divide the sum of the ingress rates across instances by the current number of Doppler instances to get this average.
Source ID doppler
Metrics ingress
Recommended thresholds Scale indicator: ≥ 16,000 envelopes per second (or 1 million envelopes per minute)
How to scale Increase the number of Doppler VMs in the Resource Config pane of the TAS for VMs tile.
Additional details Type: Counter (float)
Frequency: Emitted every 5 s
Applies to: cf:doppler
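
For illustration, the hedged sketch below derives the per-instance ingress rate in envelopes per second from successive counter readings across a hypothetical set of three Doppler instances.

    # Minimal sketch, assuming cumulative ingress counter readings from each
    # Doppler instance taken 5 s apart (values are hypothetical).

    def avg_ingress_rate(prev: list[int], curr: list[int], interval_s: float) -> float:
        """Average envelopes per second ingested per Doppler instance."""
        total_delta = sum(c - p for p, c in zip(prev, curr))
        return total_delta / len(curr) / interval_s

    rate = avg_ingress_rate(prev=[1_000_000, 980_000, 1_020_000],
                            curr=[1_090_000, 1_065_000, 1_105_000],
                            interval_s=5.0)

    if rate >= 16_000:
        print(f"{rate:,.0f} envelopes/s per Doppler: add Doppler instances")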

Reverse Log Proxy Loss Rate

Description The Reverse Log Proxy Loss Rate indicator is the rate of bound app logs dropped from the Reverse Log Proxies (RLP). Divide the dropped metric by the ingress metric to get this indicator.

This loss rate is specific to the RLP and does not impact the Firehose loss rate.
Source ID rlp
Metrics ingress
dropped
Recommended thresholds Scale indicator: ≥ 0.1
If alerting:
Yellow warning: ≥ 0.01
Red critical: ≥ 0.1
Excessive dropped messages can indicate that the RLP is overloaded and that the Traffic Controllers need to be scaled.
How to scale Scale up the number of Traffic Controller instances to further balance log load.
Additional details Type: Counter (Integer)
Frequency: Emitted every 60 s
Applies to: cf:loggregator
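
The RLP loss rate is computed the same way as the Firehose loss rate above; the sketch below (with hypothetical counter deltas) only maps it onto the yellow and red alert levels listed in this table.

    # Minimal sketch: classify the RLP loss rate against the thresholds above.
    # Inputs are per-interval increases of the rlp dropped and ingress counters.

    def rlp_alert_level(dropped_delta: int, ingress_delta: int) -> str:
        rate = dropped_delta / ingress_delta if ingress_delta else 0.0
        if rate >= 0.1:
            return "red: scale Traffic Controllers"
        if rate >= 0.01:
            return "yellow: investigate RLP load"
        return "ok"

    print(rlp_alert_level(dropped_delta=500, ingress_delta=20_000))  # yellow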

Firehose consumer scaling indicator

VMware recommends the following scaling indicator for monitoring the performance of consumers of the Firehose.


Slow Consumer Drops

Description The Slow Consumer Drops indicator is the slow_consumer metric, which the Firehose increments for each connection it closes because a consumer cannot keep up.
This indicator shows how fast a Firehose consumer, such as a monitoring tool nozzle, is ingesting data. If this number is anomalous, it may result in the downstream monitoring tool not having all expected data, even though that data was successfully transported through the Firehose.
Source ID doppler_proxy
Metrics slow_consumer
Recommended thresholds Scale indicator: VMware recommends scaling when the rate of Firehose Slow Consumer Drops is anomalous for a given environment.
How to scale Scale up the number of nozzle instances. You can scale a nozzle using the subscription ID specified when the nozzle connects to the Firehose. If you use the same subscription ID on each nozzle instance, the Firehose evenly distributes data across all instances of the nozzle. For example, if you have two nozzle instances with the same subscription ID, the Firehose sends half of the data to one nozzle instance and half to the other. Similarly, if you have three nozzle instances with the same subscription ID, the Firehose sends one-third of the data to each instance. If you want to scale a nozzle, the number of nozzle instances should match the number of Traffic Controller instances.
Additional details Type: Counter
Frequency: Emitted every 5 s
Applies to: cf:doppler
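
Because the recommended trigger is an anomalous drop rate rather than a fixed threshold, a simple baseline comparison can serve as a starting point. The sketch below is one illustrative approach, using a rolling mean and standard deviation of recent per-interval increases; the numbers are hypothetical.

    # Minimal sketch: flag an anomalous rate of slow_consumer drops by comparing
    # the latest per-interval increase against a simple rolling baseline.
    from statistics import mean, stdev

    def slow_consumer_anomalous(history: list[int], latest: int) -> bool:
        """history and latest are per-interval increases of the slow_consumer counter."""
        if len(history) < 2:
            return False
        baseline, spread = mean(history), stdev(history)
        return latest > baseline + 3 * max(spread, 1.0)

    history = [0, 1, 0, 2, 1, 0, 1]                      # hypothetical recent increments
    print(slow_consumer_anomalous(history, latest=25))   # True: consider more nozzle instances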

Reverse Log Proxy Egress Dropped Messages

Description The Reverse Log Proxy Egress Dropped Messages indicator shows the number of messages dropped when consumers of the Reverse Log Proxy (RLP), such as monitoring tool nozzles, ingest the exiting stream of logs and metrics too slowly. Within TAS for VMs, logs and metrics enter Loggregator for transport and then egress through the RLP.
Source ID rlp
Metrics dropped
Label direction: egress
Recommended thresholds Scale indicator: Scale when the rate of rlp.dropped, direction: egress metrics is continuously increasing.
How to scale Scale up the number of nozzle instances. The number of nozzle instances should match the number of Traffic Controller instances. You can scale a nozzle using the subscription ID specified when the nozzle connects to the RLP. If you use the same subscription ID on each nozzle instance, the RLP evenly distributes data across all instances of the nozzle. For example, if you have two nozzle instances with the same subscription ID, the RLP sends half of the data to one nozzle instance and half to the other. Similarly, if you have three nozzle instances with the same subscription ID, the RLP sends one-third of the data to each instance.
Additional details Type: Counter
Frequency: Emitted every 5 s
Applies to: cf:loggregator_trafficcontroller

Doppler Egress Dropped Messages

Description The Doppler Egress Dropped Messages indicator shows the number of messages that the Dopplers drop when consumers of the RLP, such as monitoring tool nozzles, ingest the exiting stream of logs and metrics too slowly. For more information about how the Dopplers transport logs and metrics through Loggregator, see Loggregator Architecture.

The doppler.dropped metric includes both ingress and egress directions. To differentiate between ingress and egress, refer to the direction tag on the metric.

Source ID doppler
Metrics dropped
egress
Label direction: egress
Recommended thresholds Scale indicator: Scale when the rate of doppler.dropped, direction: egress metrics is continuously increasing.
How to scale Scale up the number of nozzle instances. The number of nozzle instances should match the number of Traffic Controller instances. You can scale a nozzle using the subscription ID specified when the nozzle connects to the RLP. If you use the same subscription ID on each nozzle instance, the RLP evenly distributes data across all instances of the nozzle. For example, if you have two nozzle instances with the same subscription ID, the RLP sends half of the data to one nozzle instance and half to the other. Similarly, if you have three nozzle instances with the same subscription ID, the RLP sends one-third of the data to each instance.
Additional details Type: Counter
Frequency: Emitted every 5 s
Applies to: cf:doppler
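
For both the RLP and Doppler egress dropped-message indicators above, the scaling trigger is a drop rate that keeps increasing. One illustrative way to express that check, assuming cumulative counter samples taken at a fixed interval, is sketched below.

    # Minimal sketch: return True when the per-window increase of a dropped
    # counter (direction: egress) grows from one window to the next.

    def drop_rate_increasing(counter_samples: list[int]) -> bool:
        deltas = [b - a for a, b in zip(counter_samples, counter_samples[1:])]
        return len(deltas) >= 2 and all(y > x for x, y in zip(deltas, deltas[1:]))

    # Hypothetical cumulative dropped counts sampled once per minute:
    print(drop_rate_increasing([100, 130, 180, 260, 380]))  # True: scale nozzle instances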

Syslog drain performance scaling indicators

There is a single key capacity scaling indicator VMware recommends for Syslog Drain performance.

This Syslog Drain scaling indicator is only relevant if your deployment contains apps that use the syslog drain binding feature.


Syslog Agent Loss Rate

Description Divide the loggregator.syslog_agent.dropped{direction:egress} metric by the loggregator.syslog_agent.ingress{scope:all_drains} metric to get the rate of messages dropped as a percentage of total message traffic through Syslog Agents. The message traffic through Syslog Agents includes logs for bound apps. A high Syslog Agent loss rate indicates that the Syslog Drain consumer is ingesting logs from a Syslog Drain-bound app too slowly.
The Syslog Agent loss rate does not affect the Firehose loss rate. Syslog Agents can drop messages even when the Firehose is not dropping messages.
Source ID syslog_agent
Metrics dropped
ingress
Label direction:egress
scope:all_drains
Recommended thresholds The scaling indicator VMware recommends is the minimum Syslog Agent loss rate per minute within a five-minute window. You should scale up if the loss rate is greater than 0.1 for five minutes or longer.

Scale indicator: ≥ 0.1
If alerting:
Yellow warning: ≥ 0.01
Red critical: ≥ 0.1
How to scale Review the logs of the syslog server for intake issues and other performance issues. Scale the syslog server if necessary.
Additional details Type: Counter (Integer)
Frequency: Emitted every 60 s
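
As an illustration of the five-minute rule above, the sketch below takes five hypothetical one-minute loss rates (each computed as dropped{direction:egress} divided by ingress{scope:all_drains} for that minute) and scales only when even the smallest of them is at or above 0.1.

    # Minimal sketch: scale when the minimum per-minute Syslog Agent loss rate
    # within a five-minute window is at or above 0.1.

    def should_scale_syslog(per_minute_loss_rates: list[float]) -> bool:
        window = per_minute_loss_rates[-5:]
        return len(window) == 5 and min(window) >= 0.1

    rates = [0.12, 0.15, 0.11, 0.14, 0.13]   # hypothetical one-minute loss rates
    print(should_scale_syslog(rates))        # True: review and scale the syslog server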

Log cache scaling indicator

VMware recommends the following scaling indicator for monitoring the performance of log cache.


Log Cache Caching Duration

Description The Log Cache Caching Duration indicator shows the age in milliseconds of the oldest data point stored in Log Cache.
Log Cache stores all messages that are passed through the Firehose in an ephemeral in-memory store. The size of this store and the cache duration are dependent on the amount of memory available on the VM where Log Cache runs. Typically, Log Cache runs on the Doppler VM.
Source ID log_cache
Metrics log_cache_cache_period
Recommended thresholds Scale indicator: Scale the VM on which Log Cache runs when the log_cache_cache_period metric drops below 900000 milliseconds (15 minutes).
How to scale Scale up the number of Doppler VMs or choose a VM type for Doppler that provides more memory.
Additional details Type: Gauge
Frequency: Emitted every 15 s
Applies to: cf:log-cache
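
A minimal check of this indicator, shown below with a hypothetical reading, converts log_cache_cache_period to minutes and flags when retention falls below the 15-minute (900,000 ms) target.

    # Minimal sketch: flag when Log Cache retains less than 15 minutes of data.

    def log_cache_needs_scaling(cache_period_ms: float) -> bool:
        return cache_period_ms < 900_000

    period_ms = 600_000   # hypothetical reading: only 10 minutes of retention
    if log_cache_needs_scaling(period_ms):
        print(f"Log Cache holds {period_ms / 60_000:.0f} min of data: "
              "add Doppler VMs or use a larger VM type")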

Gorouter performance scaling indicator

There is one key capacity scaling indicator VMware recommends for Gorouter performance.

The following metric appears in the Firehose in two different formats. The table below lists both formats.


Gorouter VM CPU Utilization

Description The Gorouter VM CPU Utilization indicator shows how much of a Gorouter VM's CPU is being used. High CPU utilization of the Gorouter VMs can increase latency and cause requests per second to decrease.
Source ID cpu
Metrics user
Recommended thresholds Scale indicator: ≥ 60%
If alerting:
Yellow warning: ≥ 60%
Red critical: ≥ 70%
How to scale Scale the Gorouters horizontally or vertically by editing the Router VM in the Resource Config pane of the TAS for VMs tile.
Additional details Type: Gauge (float)
Frequency: Emitted every 60 s
Applies to: cf:router

UAA performance scaling indicator

There is one key capacity scaling indicator VMware recommends for UAA performance.

The following metric appears in the Firehose in two different formats. The table below lists both formats.


UAA VM CPU Utilization

Description The UAA VM CPU Utilization indicator shows how much of the UAA VM's CPU is used. High CPU utilization of the UAA VMs can cause requests per second to decrease.
Source ID cpu
Metrics user
Recommended thresholds Scale indicator: ≥ 80%
If alerting:
Yellow warning: ≥ 80%
Red critical: ≥ 90%
How to scale Scale UAA horizontally or vertically. To scale UAA, navigate to the Resource Config pane of the TAS for VMs tile and edit the number of your UAA VM instances or change the VM type to a type that utilizes more CPU cores.
Additional details Type: Gauge (float)
Frequency: Emitted every 60 s
Applies to: cf:uaa

NFS/WebDAV backed blobstore

There is one key capacity scaling indicator for external S3 external storage.

This metric is relevant only if your deployment uses an internal NFS/WebDAV-backed blobstore rather than an external S3 repository with no capacity constraints.

The following metric appears in the Firehose in two different formats. The table below lists both formats.


External S3 External Storage

Description The External S3 External Storage indicator shows the percentage of persistent disk used. If applicable, monitor the percentage of persistent disk used on the VM for the NFS Server job.
If you do not use an external S3 repository for external storage with no capacity constraints, you must monitor the TAS for VMs object store to continue to push new apps and buildpacks.
Source ID disk
Metrics persistent.percent
Recommended thresholds ≥ 75%
How to scale Give your NFS Server additional persistent disk resources. If you use an internal NFS/WebDAV backed blobstore, consider scaling the persistent disk when it reaches 75% capacity.
Additional details Type: Gauge (%)
Applies to: cf:nfs_server