Use HTTP Request Latency as a scaling metric with App Autoscaler

You can configure App Autoscaler to use the HTTP Request Latency metric to scale apps in your VMware Tanzu Application Service for VMs (TAS for VMs) deployment.

HTTP Request Latency Overview

When an HTTP request is made to an app, the Gorouter in TAS for VMs generates several metrics. One of these metrics is gorouter.latency, or HTTP Request Latency. The HTTP Request Latency metric measures the length of time required to process an HTTP request, starting when the Gorouter receives a request and ending when the Gorouter completes processing the response from the app. This metric includes the length of time required for all back-end endpoints to respond, including other apps and TAS for VMs components such as Cloud Controller and UAA. Long uploads, downloads, or app responses increase the time.

For example, you might have a Service Level Agreement (SLA) specifying that 95% of requests for an app must be processed in less than 300 milliseconds. To help achieve this, you can configure an autoscaling rule for Autoscaler to create additional instances of the app when the HTTP Request Latency metric reaches 250 milliseconds.
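
As a rough illustration of the SLA in this example, the following Python sketch checks what share of a set of latency samples finishes in under 300 milliseconds. The sample values and the `fraction_within` helper are hypothetical and are not part of Autoscaler or the cf CLI.

```python
def fraction_within(latencies_ms, limit_ms):
    """Return the fraction of requests that completed in under limit_ms."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    within = sum(1 for value in latencies_ms if value < limit_ms)
    return within / len(latencies_ms)


# Hypothetical latency samples, in milliseconds, for 20 requests.
samples = [120, 140, 95, 180, 210, 160, 130, 150, 175, 190,
           110, 145, 165, 155, 135, 125, 205, 185, 170, 450]

share = fraction_within(samples, 300)
print(f"{share:.0%} of requests finished in under 300 ms")
print("SLA met" if share >= 0.95 else "SLA missed")
```

Setting the scale-up threshold at 250 milliseconds, below the 300 millisecond SLA limit, leaves some headroom for new instances to start before the SLA is at risk.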

You can configure Autoscaler to use HTTP Request Latency as the scaling metric for an app in the following ways:

- Through the Cloud Foundry Command-Line Interface (cf CLI). For more information, see Configure HTTP Request Latency as the scaling metric for an app through the cf CLI.
- Through Apps Manager. For more information, see Configure HTTP Request Latency as the scaling metric for an app through Apps Manager.

To monitor when Autoscaler scales an app based on changes in HTTP Request Latency, see Reviewing autoscaling events for changes in HTTP Request Latency.

For information about use cases that might complicate or prevent you from configuring HTTP Request Latency as the scaling metric for an app, see Special considerations for using HTTP Request Latency as a scaling metric.

VMware recommends that you load-test your app to verify that the autoscaling rules you configured are effective. For more information, see Load-testing your app in Using Autoscaler in production.

For more information about the HTTP Request Latency metric, see Router handling latency in Key Performance Indicators. For more information about how TAS for VMs routes HTTP requests, see TAS for VMs Routing Architecture.

Configure HTTP Request Latency as the scaling metric for an app through the cf CLI

The procedures in this section describe how to configure Autoscaler to use HTTP Request Latency as the scaling metric for an app through the cf CLI.

You can configure Autoscaler to use HTTP Request Latency as the scaling metric for an app in the following ways:

- Using a manifest file. For more information, see Configure an autoscaling rule using a manifest file.
- Using CLI commands. For more information, see Configure an autoscaling rule using CLI commands.

For the procedures in this section, you must use the App Autoscaler CLI plug-in. To download and install the App Autoscaler CLI plug-in, see Install the App Autoscaler CLI plug-in in Using the App Autoscaler CLI.

Configure an autoscaling rule using a manifest file

You can configure autoscaling rules declaratively through a manifest file. This manifest file configures only Autoscaler and does not interfere with any other existing app manifest files in your TAS for VMs deployment.

To configure an autoscaling rule that defines HTTP Request Latency as its scaling metric using a manifest file:

In a terminal window, target the space in which the app you want to scale is deployed by running:
```
cf target -o ORG-NAME -s SPACE-NAME
```
Where:
- ORG-NAME is the name of the org containing the space in which the app you want to scale is deployed.
- SPACE-NAME is the name of the space in which the app you want to scale is deployed.
If the space in which the app you want to scale is deployed does not already have a service instance of Autoscaler deployed in it, create an Autoscaler service instance by running:
```
cf create-service app-autoscaler PLAN-NAME SERVICE-INSTANCE-NAME
```
Where:
- PLAN-NAME is the name of the service plan you want to use for the Autoscaler service instance.
- SERVICE-INSTANCE-NAME is the name you want to give the Autoscaler service instance. For example, autoscaler.
If there is already an Autoscaler service instance in the space in which the app you want to scale is deployed, skip this step.
Bind the Autoscaler service instance you created in the previous step to the app you want to scale by running:
```
cf bind-service APP-NAME SERVICE-INSTANCE-NAME
```
Where:
- APP-NAME is the name of the app you want to scale.
- SERVICE-INSTANCE-NAME is the name of the Autoscaler service instance in the previous step.
To create a manifest file for Autoscaler that configures an autoscaling rule with HTTP Request Latency as its scaling metric, create a YAML file that includes the following configuration parameters:
```
---
instance_limits:
  min: LOWER-SCALING-LIMIT
  max: UPPER-SCALING-LIMIT
rules:
- rule_type: http_latency
  rule_sub_type: PERCENTILE
  threshold:
    min: MINIMUM-LATENCY-THRESHOLD
    max: MAXIMUM-LATENCY-THRESHOLD
scheduled_limit_changes: []
```
Where:
- LOWER-SCALING-LIMIT is the minimum number of instances you want Autoscaler to create for the app.
- UPPER-SCALING-LIMIT is the maximum number of instances you want Autoscaler to create for the app.
- PERCENTILE is the percentile that Autoscaler uses in scaling decisions. Valid values are avg_95th or avg_99th. This value configures Autoscaler to ignore HTTP requests that fall outside either the 95th or 99th percentile and average the latency of the remaining 95% or 99% of HTTP requests.
- MINIMUM-LATENCY-THRESHOLD is the minimum HTTP Request Latency threshold in milliseconds. If the average latency of HTTP requests falls below this number, Autoscaler scales the number of app instances down.
- MAXIMUM-LATENCY-THRESHOLD is the maximum HTTP Request Latency threshold in milliseconds. If the average latency of HTTP requests rises above this number, Autoscaler scales the number of app instances up. To avoid excessive cycling, VMware recommends that you configure a maximum threshold that is at least twice the value of the minimum threshold.
The following example shows an Autoscaler manifest file with a percentile of 95%, a minimum HTTP Request Latency threshold of 125 milliseconds, and a maximum HTTP Request Latency threshold of 250 milliseconds:
```
---
instance_limits:
  min: 10
  max: 100
rules:
- rule_type: http_latency
  rule_sub_type: avg_95th
  threshold:
    min: 125
    max: 250
scheduled_limit_changes: []
```
Apply the autoscaling rule you configured in the previous step to the app you want to scale by running:
```
cf configure-autoscaling APP-NAME MANIFEST-FILENAME
```
Where:
- APP-NAME is the name of the app.
- MANIFEST-FILENAME is the filename of the manifest file you created in the previous step. For example, autoscaler.yml.
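
Before applying a manifest, it can help to sanity-check its values. The following Python sketch represents the manifest as a plain dictionary with the same keys as the YAML shown in this procedure and reports anything that conflicts with the guidance given here. The `check_latency_manifest` helper is hypothetical, not part of the App Autoscaler CLI plug-in, and the instance limit comparison is an assumption rather than a documented validation rule.

```python
VALID_SUBTYPES = {"avg_95th", "avg_99th"}


def check_latency_manifest(manifest):
    """Return a list of problems found in an Autoscaler manifest dict."""
    problems = []
    limits = manifest.get("instance_limits", {})
    # Assumption: a lower limit above the upper limit is never intended.
    if limits.get("min", 0) > limits.get("max", 0):
        problems.append("instance_limits.min exceeds instance_limits.max")
    for rule in manifest.get("rules", []):
        if rule.get("rule_type") != "http_latency":
            continue
        if rule.get("rule_sub_type") not in VALID_SUBTYPES:
            problems.append("rule_sub_type must be avg_95th or avg_99th")
        threshold = rule.get("threshold", {})
        low, high = threshold.get("min"), threshold.get("max")
        if low is None or high is None:
            problems.append("threshold needs both min and max")
        elif high < 2 * low:
            # A recommendation from this topic, not a hard requirement.
            problems.append("threshold.max is less than twice threshold.min")
    return problems


# The same values as the example manifest in this procedure.
example_manifest = {
    "instance_limits": {"min": 10, "max": 100},
    "rules": [
        {
            "rule_type": "http_latency",
            "rule_sub_type": "avg_95th",
            "threshold": {"min": 125, "max": 250},
        }
    ],
    "scheduled_limit_changes": [],
}

print(check_latency_manifest(example_manifest))
```

An empty list means the sketch found nothing to flag. The example manifest passes because its maximum threshold of 250 is exactly twice its minimum threshold of 125.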

Configure an autoscaling rule using CLI commands

To configure an autoscaling rule that defines HTTP Request Latency as its scaling metric using CLI commands:

In a terminal window, target the space in which the app you want to scale is deployed by running:
```
cf target -o ORG-NAME -s SPACE-NAME
```
Where:
- ORG-NAME is the name of the org containing the space in which the app you want to scale is deployed.
- SPACE-NAME is the name of the space in which the app you want to scale is deployed.
If the space in which the app you want to scale is deployed does not already have a service instance of Autoscaler deployed in it, create an Autoscaler service instance by running:
```
cf create-service app-autoscaler PLAN-NAME SERVICE-INSTANCE-NAME
```
Where:
- PLAN-NAME is the name of the service plan you want to use for the Autoscaler service instance.
- SERVICE-INSTANCE-NAME is the name you want to give the Autoscaler service instance. For example, autoscaler.
If there is already an Autoscaler service instance in the space in which the app you want to scale is deployed, skip this step.
Bind the Autoscaler service instance you created in the previous step to the app you want to scale by running:
```
cf bind-service APP-NAME SERVICE-INSTANCE-NAME
```
Where:
- APP-NAME is the name of the app you want to scale.
- SERVICE-INSTANCE-NAME is the name of the Autoscaler service instance in the previous step.
Configure upper and lower scaling limits for the app by running:
```
cf update-autoscaling-limits APP-NAME LOWER-SCALING-LIMIT UPPER-SCALING-LIMIT
```
Where:
- APP-NAME is the name of the app.
- LOWER-SCALING-LIMIT is the minimum number of instances you want Autoscaler to create for the app.
- UPPER-SCALING-LIMIT is the maximum number of instances you want Autoscaler to create for the app.
Allow Autoscaler to begin making scaling decisions for the app by running:
```
cf enable-autoscaling APP-NAME
```
Where APP-NAME is the name of the app.
Create an http_latency autoscaling rule by running:
```
cf create-autoscaling-rule APP-NAME http_latency MINIMUM-LATENCY-THRESHOLD MAXIMUM-LATENCY-THRESHOLD --subtype PERCENTILE
```
Where:
- APP-NAME is the name of the app for which you want to create an autoscaling rule.
- MINIMUM-LATENCY-THRESHOLD is the minimum HTTP Request Latency threshold in milliseconds. If the average latency of HTTP requests falls below this number, Autoscaler scales the number of app instances down.
- MAXIMUM-LATENCY-THRESHOLD is the maximum HTTP Request Latency threshold in milliseconds. If the average latency of HTTP requests rises above this number, Autoscaler scales the number of app instances up. To avoid excessive cycling, VMware recommends that you configure a maximum threshold that is at least twice the value of the minimum threshold.
- PERCENTILE is the percentile that Autoscaler uses in scaling decisions. Valid values are avg_95th or avg_99th. This value configures Autoscaler to ignore HTTP requests that fall outside either the 95th or 99th percentile and average the latency of the remaining 95% or 99% of HTTP requests.
The following example command configures an http_latency autoscaling rule for the example-app app, with a minimum HTTP Request Latency threshold of 125 milliseconds, a maximum HTTP Request Latency threshold of 250 milliseconds, and a percentile of 95%:
```
cf create-autoscaling-rule example-app http_latency 125 250 --subtype avg_95th
```
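
To make the effect of the percentile and threshold settings more concrete, the following Python sketch averages the fastest 95% of a set of latency samples and compares the result with the thresholds from the example rule. It only approximates the behavior described in this topic: the exact calculation that Autoscaler performs is not documented here, and the `percentile_average` and `scaling_decision` helpers and the sample values are hypothetical.

```python
def percentile_average(latencies_ms, percent):
    """Average the fastest `percent` percent of samples, ignoring the rest."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Keep at least one sample so that the average is always defined.
    keep = max(1, len(ordered) * percent // 100)
    kept = ordered[:keep]
    return sum(kept) / len(kept)


def scaling_decision(latency_ms, min_threshold_ms, max_threshold_ms):
    """Compare a latency value with the rule thresholds."""
    if latency_ms > max_threshold_ms:
        return "scale up"
    if latency_ms < min_threshold_ms:
        return "scale down"
    return "hold"


# Hypothetical samples: 19 requests at 200 ms and one 5000 ms outlier.
samples = [200] * 19 + [5000]

plain_mean = sum(samples) / len(samples)
trimmed = percentile_average(samples, 95)

print(plain_mean, scaling_decision(plain_mean, 125, 250))
print(trimmed, scaling_decision(trimmed, 125, 250))
```

In this sketch, the single slow request pushes the plain mean above the maximum threshold, while the 95% average stays between the two thresholds, so no scaling occurs.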

Configure HTTP Request Latency as the scaling metric for an app through Apps Manager

To configure Autoscaler to use HTTP Request Latency as the scaling metric for an app through Apps Manager:

Log in to Apps Manager. For more information, see Logging in to Apps Manager.
Select the org that contains the space in which the app you want to scale is deployed.
Select the space in which the app you want to scale is deployed.
Under Processes and Instances, click Manage Autoscaling. The Manage Autoscaling window appears.
Next to Scaling Rules, click Edit. The Edit Scaling Rules window appears.
Click Add rule. The Select type drop-down menu appears.
From the Select type drop-down menu, select HTTP Request Latency.
1. For Scale down if less than, enter in milliseconds the minimum HTTP Request Latency threshold you want to configure. If the average latency of HTTP requests falls below this number, Autoscaler scales the number of app instances down.
2. For Scale up if more than, enter in milliseconds the maximum HTTP Request Latency threshold you want to configure. If the average latency of HTTP requests rises above this number, Autoscaler scales the number of app instances up. To avoid excessive cycling, VMware recommends that you configure a maximum threshold that is at least twice the value of the minimum threshold.
3. Under Percent of traffic to apply, select either 95% or 99%. This configuration setting is the percentile that Autoscaler uses in scaling decisions. Depending on which option you select, Autoscaler ignores HTTP requests that fall outside either the 95th or 99th percentile and averages the latency of the remaining 95% or 99% of HTTP requests.
Click Save.

Reviewing autoscaling events for changes in HTTP Request Latency

When Autoscaler scales the number of app instances up after the HTTP Request Latency metric increases above the maximum HTTP Request Latency threshold, Autoscaler records an autoscaling event.

You can monitor the autoscaling events that Autoscaler records for changes in HTTP Request Latency in the following ways:

- Through the cf CLI. See Review autoscaling events for changes in HTTP Request Latency through the cf CLI.
- Through Apps Manager. See Review autoscaling events for changes in HTTP Request Latency through Apps Manager.

Review autoscaling events for changes in HTTP Request Latency through the cf CLI

To review the autoscaling events that Autoscaler records for changes in HTTP Request Latency through the cf CLI:

In a terminal window, run:
```
cf autoscaling-events APP-NAME
```
Where APP-NAME is the name of the app for which you want to review autoscaling events.

If Autoscaler has scaled the number of app instances up due to increases in the HTTP Request Latency metric, the above command returns output that contains autoscaling events similar to the following example:
```
Time                   Description
2022-05-23T21:47:45Z   Scaled up from 10 to 11 instances. Current HTTP Latency of 1010.96ms is above upper threshold of 250.00ms.
```
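
If you want to process these events in a script, the description text can be parsed. The following Python sketch extracts the instance counts, the measured latency, and the threshold from a line in the format of the example output. It matches only the scale-up wording shown in that example. The `parse_latency_event` helper is hypothetical, and the wording of other event types is not covered here.

```python
import re

# Matches only the scale-up description shown in the example output.
EVENT_PATTERN = re.compile(
    r"Scaled up from (?P<before>\d+) to (?P<after>\d+) instances\. "
    r"Current HTTP Latency of (?P<latency>[\d.]+)ms "
    r"is above upper threshold of (?P<threshold>[\d.]+)ms\."
)


def parse_latency_event(line):
    """Return the numbers in a latency scale-up event, or None if no match."""
    match = EVENT_PATTERN.search(line)
    if match is None:
        return None
    return {
        "before": int(match["before"]),
        "after": int(match["after"]),
        "latency_ms": float(match["latency"]),
        "threshold_ms": float(match["threshold"]),
    }


line = (
    "2022-05-23T21:47:45Z   Scaled up from 10 to 11 instances. "
    "Current HTTP Latency of 1010.96ms is above upper threshold of 250.00ms."
)
print(parse_latency_event(line))
```

Because the pattern uses `search`, it works whether or not the line begins with a timestamp column.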

Review autoscaling events for changes in HTTP Request Latency through Apps Manager

To review the autoscaling events that Autoscaler records for changes in HTTP Request Latency through Apps Manager:

Log in to Apps Manager. For more information, see Logging in to Apps Manager.
Select the org that contains the space in which the app you want to scale is deployed.
Select the space in which the app you want to scale is deployed.
Under Processes and Instances, click Manage Autoscaling.
Under Event History, click View More. A list of autoscaling events appears. If Autoscaler has scaled the number of app instances up due to increases in the HTTP Request Latency metric, the list of autoscaling events includes events similar to the following example:
```
Scaled up from 10 to 11 instances. Current HTTP Latency of 1010.96ms is above upper threshold of 250.00ms.
```

Special considerations for using HTTP Request Latency as a scaling metric

This section describes use cases that might complicate or prevent you from configuring HTTP Request Latency as the scaling metric for an app.

Multiple endpoints

In an app that exposes multiple endpoints, the value of the HTTP Request Latency metric is the average HTTP Request Latency across all app endpoints. If one or more endpoints process requests at a slower rate than the others, HTTP Request Latency might not be an ideal scaling metric to use. Even a fast endpoint might cause the average HTTP Request Latency to increase if it receives a large number of requests.
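
The following Python sketch works through a small example of this effect. It assumes that the overall metric behaves like a request-weighted average across endpoints. The endpoint names, request counts, and latencies are hypothetical.

```python
def overall_average(endpoints):
    """Request-weighted average latency across all endpoints.

    `endpoints` maps an endpoint name to (request_count, average_latency_ms).
    """
    total_requests = sum(count for count, _ in endpoints.values())
    total_time = sum(count * latency for count, latency in endpoints.values())
    return total_time / total_requests


# Hypothetical app with one fast, busy endpoint and one slow, quiet endpoint.
endpoints = {
    "/search": (900, 100),   # 900 requests averaging 100 ms
    "/report": (100, 2000),  # 100 requests averaging 2000 ms
}

print(overall_average(endpoints))
```

In this example, the combined average is above a 250 millisecond maximum threshold even though 90% of requests go to an endpoint that responds in 100 milliseconds, so Autoscaler would scale the whole app up because of one slow endpoint.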

External factors

Components or services that receive data from an app are known as downstream components. If any downstream components respond slowly to requests from an app, they might cause HTTP Request Latency to increase. In this case, scaling the app up does not improve its performance. In fact, scaling the app up might increase HTTP Request Latency, because requests from the additional app instances add a greater burden on the downstream component. Before attempting to improve the performance of the app by scaling it up, determine whether the downstream component can be scaled up or improved.

Other external factors, such as network congestion or database performance, can also cause HTTP Request Latency to increase. In this case, scaling the app does not decrease HTTP Request Latency that results from these external factors.

Container-to-Container Networking

Autoscaler can only use HTTP Request Latency as a scaling metric for apps that receive requests directly through the Gorouter. Autoscaler does not support using HTTP Request Latency as a scaling metric for apps that receive requests from other apps through container-to-container (C2C) networking or TCP routers.

If your app relies on back-end HTTP services that apps in your TAS for VMs deployment must access through C2C networking, the Gorouter does not generate HTTP events for those requests. As a result, Autoscaler cannot scale those HTTP services. For Autoscaler to scale them, you must either use a different default scaling metric or create a custom scaling metric for them.

Log cache ejection

Autoscaler retrieves HTTP metrics from Log Cache, which can hold a maximum of 100,000 envelopes per app by default. If your app receives a large number of HTTP requests or is configured to create very verbose logs, Log Cache might drop some of the timer envelopes that it holds. If Autoscaler can retrieve only some of the timer envelopes it requires to calculate accurate metrics, then the HTTP Request Latency metric might not accurately reflect the actual HTTP Request Latency of the app or the causes of decreased app performance. However, in most cases, the HTTP Request Latency metric still approximates the actual HTTP Request Latency of the app.

For more information, see Log Cache in Operating App Autoscaler.
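
The following Python sketch illustrates how verbose logging can crowd timer envelopes out of a bounded cache. It assumes that log and timer envelopes share one per-app capacity and that the oldest envelopes are dropped first. Both are simplifying assumptions, not a description of how Log Cache is implemented. The capacities are scaled down from the 100,000 default to keep the example small.

```python
from collections import deque


def retained_timers(envelopes, capacity):
    """Count timer envelopes left after a bounded cache evicts the oldest."""
    # A deque with maxlen keeps only the most recent `capacity` items.
    cache = deque(envelopes, maxlen=capacity)
    return sum(1 for kind in cache if kind == "timer")


# Hypothetical traffic: each request emits one timer envelope and four logs.
envelopes = ["timer", "log", "log", "log", "log"] * 1000  # 5,000 envelopes

print(retained_timers(envelopes, capacity=5000))  # every timer envelope fits
print(retained_timers(envelopes, capacity=1000))  # most timers were evicted
```

With the smaller capacity, only a fifth of the timer envelopes remain, so any latency average computed from the cache covers only the most recent requests.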

Infrequent requests

If an app receives requests infrequently and responds slowly, Autoscaler might continue scaling the app up because no other requests produce latency measurements that bring the average back down. In this case, Autoscaler usually stops scaling the app up after the original request falls outside of the metric collection interval.
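
The following Python sketch shows why a single slow request can dominate the metric until it ages out. It averages only the samples that fall inside a sliding time window. The 120 second window and the `window_average` helper are hypothetical, and the actual metric collection interval that Autoscaler uses is not stated here.

```python
def window_average(samples, now_s, window_s):
    """Average the latency of samples recorded in the last window_s seconds.

    `samples` is a list of (timestamp_s, latency_ms) pairs.
    Returns None when no sample falls inside the window.
    """
    recent = [latency for timestamp, latency in samples
              if now_s - window_s < timestamp <= now_s]
    if not recent:
        return None
    return sum(recent) / len(recent)


# Hypothetical traffic: one slow request at t=10 s that took 3000 ms.
samples = [(10, 3000)]
WINDOW_S = 120  # hypothetical collection window, in seconds

print(window_average(samples, now_s=60, window_s=WINDOW_S))
print(window_average(samples, now_s=200, window_s=WINDOW_S))
```

While the slow request is inside the window, it is the only sample, so the average equals its latency and stays above any reasonable maximum threshold. After the request falls outside the window, no samples remain and the sketch returns `None`.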

For more information about how Autoscaler’s metric collection interval affects its scaling decisions, see How App Autoscaler decides when to scale in About App Autoscaler.