VMware provides these indicators to operators as general guidance for capacity scaling. Each indicator is based on platform metrics emitted by individual TAS for VMs components.
This guidance is applicable to most TAS for VMs deployments. VMware recommends that operators fine-tune the suggested alert thresholds by observing historical trends for their deployments.
For more information about accessing metrics used in these key capacity scaling indicators, see Overview of Logging and Metrics.
There are three key capacity scaling indicators VMware recommends for a Diego Cell.
Diego Cell Memory Capacity | |
---|---|
Description | The Diego Cell Memory Capacity indicator is the percentage of remaining memory that your Diego Cells can allocate to containers. Divide the CapacityRemainingMemory metric by the CapacityTotalMemory metric to get this percentage. The CapacityRemainingMemory metric is the remaining memory, in MiB, available to a Diego Cell. The CapacityTotalMemory metric is the total memory, in MiB, available to a Diego Cell. |
Source ID | rep |
Metrics | CapacityRemainingMemory CapacityTotalMemory |
Recommended thresholds | < average (35%) This threshold assumes you have three AZs. |
How to scale | Deploy additional Diego Cells until the average free memory is at least 35%. This threshold assumes you have three AZs. |
Additional details | Type: Gauge (%) Frequency: Emitted every 60 s Applies to: cf:diego_cells |
Diego Cell Disk Capacity | |
---|---|
Description | The Diego Cell Disk Capacity indicator is the percentage of remaining disk capacity that a given Diego Cell can allocate to containers. Divide the CapacityRemainingDisk metric by the CapacityTotalDisk metric to get this percentage. The CapacityRemainingDisk metric is the remaining disk, in MiB, available to a Diego Cell. The CapacityTotalDisk metric is the total disk, in MiB, available to a Diego Cell. |
Source ID | rep |
Metrics | CapacityRemainingDisk CapacityTotalDisk |
Recommended thresholds | < average (35%) This threshold assumes you have three AZs. |
How to scale | Deploy additional Diego Cells until the average free disk capacity is at least 35%. This threshold assumes you have three AZs. |
Additional details | Type: Gauge (%) Frequency: Emitted every 60 s Applies to: cf:diego_cells |
Diego Cell Container Capacity | |
---|---|
Description | The Diego Cell Container Capacity indicator is the percentage of containers remaining that a given Diego Cell can host. Divide the CapacityRemainingContainers metric by the CapacityTotalContainers metric to get this percentage. The CapacityRemainingContainers metric is the remaining number of containers that the Diego Cell can host. The CapacityTotalContainers metric is the total number of containers that the Diego Cell can host. |
Source ID | rep |
Metrics | CapacityRemainingContainers CapacityTotalContainers |
Recommended thresholds | < average (35%) This threshold assumes you have three AZs. |
How to scale | Deploy additional Diego Cells until the average remaining container capacity is at least 35%. This threshold assumes you have three AZs. |
Additional details | Type: Gauge (%) Frequency: Emitted every 60 s Applies to: cf:diego_cells |
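The three Diego Cell indicators above share the same calculation: divide a remaining-capacity gauge by its total-capacity gauge and compare the fleet-wide average to 35%. The sketch below illustrates that calculation in Python; the sample values and the way you collect the rep metrics (for example, through the Firehose or Log Cache) are assumptions, not part of the platform.

```python
# Sketch only: average Diego Cell capacity percentages against the 35% threshold.
# Metric names match the rep metrics above; the sample values are illustrative.

def remaining_pct(remaining: float, total: float) -> float:
    """Remaining capacity as a percentage of total capacity."""
    return 100.0 * remaining / total if total else 0.0

# One sample per Diego Cell, keyed by the rep metric names.
cells = [
    {"CapacityRemainingMemory": 4096, "CapacityTotalMemory": 16384,
     "CapacityRemainingDisk": 30720, "CapacityTotalDisk": 65536,
     "CapacityRemainingContainers": 90, "CapacityTotalContainers": 250},
    {"CapacityRemainingMemory": 6144, "CapacityTotalMemory": 16384,
     "CapacityRemainingDisk": 40960, "CapacityTotalDisk": 65536,
     "CapacityRemainingContainers": 120, "CapacityTotalContainers": 250},
]

for kind in ("Memory", "Disk", "Containers"):
    pcts = [remaining_pct(c[f"CapacityRemaining{kind}"], c[f"CapacityTotal{kind}"])
            for c in cells]
    average = sum(pcts) / len(pcts)
    # The 35% threshold assumes three AZs; tune it for your deployment.
    if average < 35.0:
        print(f"Scale Diego Cells: average remaining {kind.lower()} is {average:.1f}%")
```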
VMware recommends three key capacity scaling indicators for monitoring Firehose performance.
Log Transport Loss Rate | |
---|---|
Description | The Log Transport Loss Rate indicator is the rate of messages dropped between the Dopplers and the Firehose. Divide the dropped{direction=ingress} metric by the ingress metric to get the loss rate. The ingress metric is the number of messages entering the Dopplers, and the dropped metric is the number of messages never delivered to the Firehose. For more information about Loggregator components, see Loggregator Architecture. |
Source ID | doppler |
Metrics | dropped ingress |
Label | {direction=ingress} Dopplers emit two separate dropped metrics, one for ingress and one for egress. The envelopes have a direction label. For this indicator, use the metric with a direction label with a value of ingress. |
Recommended thresholds | Scale indicator: ≥ 0.01 If alerting: Yellow warning: ≥ 0.005 Red critical: ≥ 0.01 Excessive dropped messages can indicate the Dopplers or Traffic Controllers are not processing messages quickly enough. |
How to scale | Scale up the number of Traffic Controller and Doppler instances. At approximately 40 Doppler instances and 20 Traffic Controller instances, horizontal scaling is no longer useful for improving Firehose performance. To improve performance, add vertical scale to the existing Doppler and Traffic Controller instances by increasing CPU resources. |
Additional details | Type: Gauge (float) Frequency: Base metrics are emitted every 5 s Applies to: cf:doppler |
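Because dropped and ingress are counters, the loss rate is computed from the change in each counter over the same window. The following sketch shows that calculation and the suggested alert levels; the readings and the 60-second window are illustrative assumptions.

```python
# Sketch only: Firehose log transport loss rate from two Doppler counter readings.
# dropped{direction=ingress} and ingress are counters, so use deltas over a window.

def counter_delta(earlier: int, later: int) -> int:
    """Counters only increase; a smaller later value implies the emitter restarted."""
    return later - earlier if later >= earlier else later

# Example readings taken 60 seconds apart (illustrative values).
ingress_t0, ingress_t1 = 1_200_000, 1_260_000
dropped_t0, dropped_t1 = 300, 1_000  # dropped with direction=ingress

ingress_delta = counter_delta(ingress_t0, ingress_t1)
dropped_delta = counter_delta(dropped_t0, dropped_t1)
loss_rate = dropped_delta / ingress_delta if ingress_delta else 0.0

if loss_rate >= 0.01:
    print(f"Red: loss rate {loss_rate:.4f}; scale Dopplers and Traffic Controllers")
elif loss_rate >= 0.005:
    print(f"Yellow: loss rate {loss_rate:.4f}; watch Doppler and Traffic Controller load")
```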
Doppler Message Ingress Capacity | |
---|---|
Description | The Doppler ingress counter is the number of messages ingressed by each Doppler instance. Divide the sum of the rates of the ingress metric across instances by the current number of Doppler instances to get this average. |
Source ID | doppler |
Metrics | ingress |
Recommended thresholds | Scale indicator: ≥ 16,000 envelopes per second (or 1 million envelopes per minute) |
How to scale | Increase the number of Doppler VMs in the Resource Config pane of the TAS for VMs tile. |
Additional details | Type: Counter (float) Frequency: Emitted every 5 s Applies to: cf:doppler |
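A minimal sketch of the per-Doppler ingress rate described above: sum the ingress counter rates across instances and divide by the instance count, then compare to the 16,000 envelopes-per-second scale indicator. The 5-second deltas below are assumed sample data.

```python
# Sketch only: average message ingress rate per Doppler instance.
# Inputs are assumed to be 5-second ingress counter deltas, one per Doppler.

ingress_deltas = [70_000, 82_000, 78_000, 94_000]  # envelopes per 5 s, illustrative
window_seconds = 5

total_rate = sum(ingress_deltas) / window_seconds      # envelopes per second
avg_rate_per_doppler = total_rate / len(ingress_deltas)

# Scale indicator: >= 16,000 envelopes per second (1 million per minute) per Doppler.
if avg_rate_per_doppler >= 16_000:
    print(f"Add Doppler VMs: average ingress is {avg_rate_per_doppler:,.0f} envelopes/s")
```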
Reverse Log Proxy Loss Rate | |
---|---|
Description | The Reverse Log Proxy Loss Rate indicator is the rate of bound app logs dropped from the Reverse Log Proxies (RLP). Divide the dropped metric by the ingress metric to get this indicator. This loss rate is specific to the RLP and does not impact the Firehose loss rate. |
Source ID | rlp |
Metrics | ingress dropped |
Recommended thresholds | Scale indicator: ≥ 0.1 If alerting: Yellow warning: ≥ 0.01 Red critical: ≥ 0.1 Excessive dropped messages can indicate that the RLP is overloaded and that the Traffic Controllers need to be scaled. |
How to scale | Scale up the number of Traffic Controller instances to further balance log load. |
Additional details | Type: Counter (Integer) Frequency: Emitted every 60 s Applies to: cf:loggregator |
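The RLP loss rate is the same dropped-over-ingress ratio, with higher thresholds. As a sketch, and assuming you already have counter deltas for a measurement window, the alert levels above map to a small helper like this:

```python
# Sketch only: map an RLP loss rate (dropped / ingress counter deltas) to the
# suggested alert levels. Inputs are assumed per-window counter deltas.

def rlp_alert(dropped_delta: int, ingress_delta: int) -> str:
    loss_rate = dropped_delta / ingress_delta if ingress_delta else 0.0
    if loss_rate >= 0.1:
        return "red: scale up Traffic Controller instances"
    if loss_rate >= 0.01:
        return "yellow: watch RLP load"
    return "ok"

print(rlp_alert(dropped_delta=5_000, ingress_delta=40_000))  # 0.125 -> red
```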
VMware recommends the following scaling indicator for monitoring the performance of consumers of the Firehose.
Slow Consumer Drops | |
---|---|
Description | The Slow Consumer Drops indicator is the slow_consumer metric, which is incremented for each connection that the Firehose closes because a consumer cannot keep up. This indicator shows whether a Firehose consumer, such as a monitoring tool nozzle, is ingesting data quickly enough. If this number is anomalous, the downstream monitoring tool might not receive all of the data it expects, even though that data was successfully transported through the Firehose. |
Source ID | doppler_proxy |
Metrics | slow_consumer |
Recommended thresholds | Scale indicator: VMware recommends scaling when the rate of Firehose Slow Consumer Drops is anomalous for a given environment. |
How to scale | Scale up the number of nozzle instances. You can scale a nozzle using the subscription ID specified when the nozzle connects to the Firehose. If you use the same subscription ID on each nozzle instance, the Firehose evenly distributes data across all instances of the nozzle. For example, if you have two nozzle instances with the same subscription ID, the Firehose sends half of the data to one nozzle instance and half to the other. Similarly, if you have three nozzle instances with the same subscription ID, the Firehose sends one-third of the data to each instance. If you want to scale a nozzle, the number of nozzle instances should match the number of Traffic Controller instances. |
Additional details | Type: Counter Frequency: Emitted every 5 s Applies to: cf:doppler |
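Because the recommended threshold is whatever counts as anomalous for your environment, one simple, assumed approach is to compare the most recent slow_consumer increment against a historical baseline, as in the sketch below; the history values and the three-standard-deviation rule are illustrative, not VMware guidance.

```python
# Sketch only: flag anomalous slow consumer drops by comparing the latest
# per-interval slow_consumer increment to a historical baseline.
from statistics import mean, pstdev

history = [0, 1, 0, 0, 2, 1, 0, 0, 1, 0]  # recent per-interval increments, illustrative
latest = 9                                 # most recent increment

baseline = mean(history)
spread = pstdev(history) or 1.0
if latest > baseline + 3 * spread:
    print(f"Anomalous slow consumer drops ({latest}); consider adding nozzle instances")
```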
Reverse Log Proxy Egress Dropped Messages | |
---|---|
Description | The Reverse Log Proxy Egress Dropped Messages indicator shows the number of messages dropped when consumers of the RLP, such as monitoring tool nozzles, ingest the exiting stream of logs and metrics too slowly. Within TAS for VMs, logs and metrics enter Loggregator for transport and then egress through the Reverse Log Proxy (RLP). |
Source ID | rlp |
Metrics | dropped |
Label | direction: egress |
Recommended thresholds | Scale indicator: Scale when the rate of rlp.dropped, direction: egress metrics is continuously increasing. |
How to scale | Scale up the number of nozzle instances. The number of nozzle instances should match the number of Traffic Controller instances. You can scale a nozzle using the subscription ID specified when the nozzle connects to the RLP. If you use the same subscription ID on each nozzle instance, the RLP evenly distributes data across all instances of the nozzle. For example, if you have two nozzle instances with the same subscription ID, the RLP sends half of the data to one nozzle instance and half to the other. Similarly, if you have three nozzle instances with the same subscription ID, the RLP sends one-third of the data to each instance. |
Additional details | Type: Counter Frequency: Emitted every 5 s Applies to: cf:loggregator_trafficcontroller |
Doppler Egress Dropped Messages | |
---|---|
Description | The Doppler Egress Dropped Messages indicator shows the number of messages that the Dopplers drop when consumers of the RLP, such as monitoring tool nozzles, ingest the exiting stream of logs and metrics too slowly. For more information about how the Dopplers transport logs and metrics through Loggregator, see Loggregator Architecture. |
Source ID | doppler |
Metrics | dropped egress |
Label | direction: egress |
Recommended thresholds | Scale indicator Scale when the rate of doppler.dropped, direction: egress metrics is continuously increasing. |
How to scale | Scale up the number of nozzle instances. The number of nozzle instances should match the number of Traffic Controller instances. You can scale a nozzle using the subscription ID specified when the nozzle connects to the RLP. If you use the same subscription ID on each nozzle instance, the RLP evenly distributes data across all instances of the nozzle. For example, if you have two nozzle instances with the same subscription ID, the RLP sends half of the data to one nozzle instance and half to the other. Similarly, if you have three nozzle instances with the same subscription ID, the RLP sends one-third of the data to each instance. |
Additional details | Type: Counter Frequency: Emitted every 5 s Applies to: cf:doppler |
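Both egress-drop indicators above (RLP and Doppler) scale on the same signal: a drop rate that keeps rising across consecutive windows. A minimal sketch of that check, with assumed per-window deltas of the dropped counter (direction: egress):

```python
# Sketch only: detect a continuously increasing egress drop rate across
# consecutive measurement windows (applies to rlp.dropped and doppler.dropped
# with direction: egress). The per-window deltas are assumed inputs.

def continuously_increasing(deltas: list[int]) -> bool:
    """True when every window drops more than the previous one."""
    return all(later > earlier for earlier, later in zip(deltas, deltas[1:]))

egress_drop_deltas = [120, 180, 260, 410, 640]  # drops per window, illustrative
if continuously_increasing(egress_drop_deltas):
    print("Egress drops are rising steadily; scale nozzle instances to match Traffic Controllers")
```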
There is a single key capacity scaling indicator VMware recommends for Syslog Drain performance.
This Syslog Drain scaling indicator is only relevant if your deployment contains apps that use the syslog drain binding feature.
Syslog Agent Loss Rate | |
---|---|
Description | Divide the syslog_agent.dropped{direction:egress} metric by the syslog_agent.ingress{scope:agent} metric to get the rate of messages dropped as a percentage of total message traffic through the Syslog Agents. A high Syslog Agent loss rate indicates that the syslog drain destination is accepting logs too slowly. The Syslog Agent loss rate does not affect the Firehose loss rate; message loss can occur in Syslog Agents without message loss occurring in the Firehose. |
Source ID | syslog_agent |
Metrics | dropped ingress |
Label | direction:egress scope:agent |
Recommended thresholds | The scaling indicator VMware recommends is the minimum Syslog Agent loss rate per minute within a five-minute window. You should scale up if the loss rate is greater than 0.1 for five minutes or longer. Scale indicator: ≥ 0.1 If alerting: Yellow warning: ≥ 0.01 Red critical: ≥ 0.1 |
How to scale | Review the logs of the syslog destinations for intake issues and other performance issues. Scale the syslog destinations if necessary. A high Syslog Agent loss rate can indicate that Log Cache is unable to keep up with the ingestion rate for logs. If Log Cache is using Syslog Ingress and is CPU bound, this is likely the source of the syslog drops. To scale Log Cache to have more CPU resources, add more instances or choose instance types with more CPU capacity. |
Additional details | Type: Counter (Integer) Frequency: Emitted every 60 s |
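The recommended Syslog Agent indicator is the minimum per-minute loss rate within a five-minute window, so the condition holds only when every one of the last five minutes lost at least 10% of messages. A sketch of that check, with assumed per-minute counter deltas:

```python
# Sketch only: minimum Syslog Agent loss rate per minute over a five-minute window.
# Each tuple is (dropped{direction:egress} delta, ingress{scope:agent} delta) per minute.

per_minute = [(1_200, 10_000), (1_500, 11_000), (1_300, 9_500),
              (1_800, 12_000), (1_100, 10_500)]  # illustrative values

loss_rates = [dropped / ingress if ingress else 0.0 for dropped, ingress in per_minute]
if min(loss_rates) >= 0.1:
    print("Every minute in the window lost >= 10%; check drain destinations and Log Cache")
```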
VMware recommends the following scaling indicator for monitoring the performance of Log Cache.
Log Cache Caching Duration | |
---|---|
Description | The Log Cache Caching Duration indicator shows the age in milliseconds of the oldest data point stored in Log Cache. Log Cache stores all messages that are passed through the Firehose in an ephemeral in-memory store. The size of this store and the cache duration are dependent on the amount of memory available on the Log Cache VM. |
Source ID | log_cache |
Metrics | log_cache_cache_period |
Recommended thresholds | Scale indicator: Scale the Log Cache VMs when the log_cache_cache_period metric drops below 900000 milliseconds (15 minutes). |
How to scale | Increase the number of Log Cache VMs in the Resource Config pane of the TAS for VMs tile, or choose a VM type that provides more memory. |
Additional details | Type: Gauge Frequency: Emitted every 15 s Applies to: cf:log-cache |
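As a small illustration of the threshold above, the sketch below compares the most recent log_cache_cache_period reading, in milliseconds, against the 900000 ms (15-minute) floor; the sample reading is assumed.

```python
# Sketch only: scale Log Cache when its oldest cached data is younger than 15 minutes.

CACHE_PERIOD_FLOOR_MS = 900_000          # 15 minutes

log_cache_cache_period = 480_000         # latest gauge reading in ms, illustrative
if log_cache_cache_period < CACHE_PERIOD_FLOOR_MS:
    minutes = log_cache_cache_period / 60_000
    print(f"Log Cache retains only {minutes:.0f} minutes of data; add or enlarge Log Cache VMs")
```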
There is one key capacity scaling indicator VMware recommends for Gorouter performance.
The following metric appears in the Firehose in two different formats. The following table lists both formats.
Gorouter VM CPU Utilization | |
---|---|
Description | The Gorouter VM CPU Utilization indicator shows how much of a Gorouter VM's CPU is being used. High CPU utilization of the Gorouter VMs can increase latency and cause requests per second to decrease. |
Source ID | cpu |
Metrics | user |
Recommended thresholds | Scale indicator: ≥ 60% If alerting: Yellow warning: ≥ 60% Red critical: ≥ 70% |
How to scale | Scale the Gorouters horizontally or vertically by editing the Router VM in the Resource Config pane of the TAS for VMs tile. |
Additional details | Type: Gauge (float) Frequency: Emitted every 60 s Applies to: cf:router |
There is one key capacity scaling indicator VMware recommends for UAA performance.
The following metric appears in the Firehose in two different formats. The following table lists both formats.
UAA VM CPU Utilization | |
---|---|
Description | The UAA VM CPU Utilization indicator shows how much of the UAA VM's CPU is used. High CPU utilization of the UAA VMs can cause requests per second to decrease. |
Source ID | cpu |
Metrics | user |
Recommended thresholds | Scale indicator: ≥ 80% If alerting: Yellow warning: ≥ 80% Red critical: ≥ 90% |
How to scale | Scale UAA horizontally or vertically. To scale UAA, navigate to the Resource Config pane of the TAS for VMs tile and edit the number of your UAA VM instances or change the VM type to a type that uses more CPU cores. |
Additional details | Type: Gauge (float) Frequency: Emitted every 60 s Applies to: cf:uaa |
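The Gorouter and UAA indicators above both watch the cpu user gauge, only with different thresholds. A sketch of the alert mapping, with assumed sample readings:

```python
# Sketch only: map VM CPU utilization (the cpu "user" gauge, as a percentage)
# to the alert levels suggested for the Gorouter and UAA. Readings are illustrative.

THRESHOLDS = {          # component: (yellow warning, red critical), in percent
    "router": (60.0, 70.0),
    "uaa": (80.0, 90.0),
}

readings = {"router": 65.0, "uaa": 92.0}  # assumed cpu.user readings

for component, cpu_user in readings.items():
    yellow, red = THRESHOLDS[component]
    if cpu_user >= red:
        print(f"{component}: red ({cpu_user:.0f}%); scale horizontally or use larger VMs")
    elif cpu_user >= yellow:
        print(f"{component}: yellow ({cpu_user:.0f}%); plan to scale")
```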
There is one key capacity scaling indicator for external S3 external storage.
This metric is only relevant if your deployment does not use an external S3 repository, which has no capacity constraints, for external storage.
The following metric appears in the Firehose in two different formats. The following table lists both formats.
External S3 External Storage | |
---|---|
Description | The External S3 External Storage indicator shows the percentage of persistent disk used. If applicable, monitor the percentage of persistent disk used on the VM for the NFS Server job. If you do not use an external S3 repository with no capacity constraints for external storage, you must monitor the TAS for VMs object store to ensure that you can continue to push new apps and buildpacks. |
Source ID | disk |
Metrics | persistent.percent |
Recommended thresholds | ≥ 75% |
How to scale | Give your NFS Server additional persistent disk resources. If you use an internal NFS/WebDAV backed blobstore, consider scaling the persistent disk when it reaches 75% capacity. |
Additional details | Type: Gauge (%) Applies to: cf:nfs_server |