Find details about some of the Tanzu Service Mesh Service Autoscaler configuration properties, including recommendations on use of specific configuration values.

spec.scaleTargetRef.kubernetes

The three entries that identify a Kubernetes target reference are apiVersion, kind, and name. Tanzu Service Mesh Service Autoscaler acts on a scalable Kubernetes object. You can optionally set the Kubernetes API version of the kind in the apiVersion property; an example apiVersion is apps/v1. The valid scalable Kubernetes object types are Deployment, ReplicaSet, and StatefulSet, one of which is set as scaleTargetRef.kubernetes.kind. scaleTargetRef.kubernetes.name is the name of the Deployment, ReplicaSet, or StatefulSet that the autoscaler targets. This scalable object must be in the same namespace as the autoscaler definition.

Important:

Ensure that there is only one autoscaling definition per Kubernetes object with the name set in scaleTargetRef.kubernetes.name.
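
For example, a spec fragment that targets a Deployment might look like the following sketch. The field paths follow the properties described above; the Deployment name shopping-cart is only illustrative.

spec:
  scaleTargetRef:
    kubernetes:
      apiVersion: apps/v1    # optional API version of the target kind
      kind: Deployment       # Deployment, ReplicaSet, or StatefulSet
      name: shopping-cart    # target object in the same namespace as the autoscaler definition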

spec.scaleRule.enabled

Scale Rule has several configurable properties for autoscaling. To enable scaling in the namespace in which the application operates, set enabled to true. To disable scaling, set enabled to false. Even when scaling is disabled, scaling calculations are still made and can be seen in the status output of the autoscaling custom resource, which lets you run the autoscaler in a dry-run or advisory mode.
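
For example, the following spec fragment keeps scaling disabled for an advisory-style dry run; the field path follows the property described above:

spec:
  scaleRule:
    enabled: false   # calculations still appear in the resource status, but no scaling action is taken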

spec.scaleRule.mode

Another configurable property is mode, which you can set to EFFICIENCY or PERFORMANCE. PERFORMANCE mode scales up only when resource usage exceeds the trigger.metric.scaleUp target. It does not scale down even if resources are being underused. This mode attempts to ensure that the system is always ready for spikes in user demand.

EFFICIENCY mode attempts to prevent overprovisioning of resources by reducing the number of running service instances when resource usage is sufficiently under the trigger.metric.scaleDown target.
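
For example, a spec fragment that selects efficiency mode (the value shown is illustrative; use PERFORMANCE for the other behavior):

spec:
  scaleRule:
    mode: EFFICIENCY   # scales down when usage falls sufficiently below the scaleDown target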

spec.scaleRule.trigger.gracePeriodSeconds

To ensure stable operations, implement a grace period by setting a positive integer value for trigger.gracePeriodSeconds. If a grace period is set and a scale-down action is required, no scale-down occurs during the grace period that follows any scaling event. Because user experience is very important and a delayed response to increasing demand is not desired, the grace period does not apply to scale-up situations. If the grace period is not set, it defaults to 300 seconds.

If no grace period is desired, set it to 0 seconds. A low grace period value allows for more frequent changes in scaling because a scale-down happens after a shorter wait time. On the other hand, a system has more time to stabilize with a longer grace period. The grace period value should depend on how long it takes a service instance to come online and become effective. It should give the system enough time after a scale-up or scale-down event to be ready for the next scale-down action. Generally, front-end web services, which are often stateless, can start accepting traffic quickly (within 30 seconds). However, a database replica node can take approximately 15 minutes to sync and prepare to store more data.
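
For example, a spec fragment with a longer grace period, which might suit the database-replica case described above (the 900-second value is illustrative):

spec:
  scaleRule:
    trigger:
      gracePeriodSeconds: 900   # no scale-down for 15 minutes after any scaling event; scale-up is not delayed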

spec.scaleRule.instances.min, spec.scaleRule.instances.max

Under scaleRule.instances, there are five configurable properties: min, max, stepsDown, stepsUp, and default. The scaleRule.instances.min and scaleRule.instances.max limit how many service instances the autoscaler can set. The min must be at least one service instance and less than what is set for scaleRule.instances.max.

To ensure availability, a minimum of at least two is recommended, but the minimum number of service instances should be whatever is required to maintain a positive user experience. A scaleRule.instances.min of zero service instances is not allowed because it would shut the service down entirely. The largest allowed value for max is 1000, and max must be greater than the value of scaleRule.instances.min.

You can determine a value that reflects the app’s needs with benchmark testing along with knowledge of the cluster’s capacity. You can use a benchmark test to show how many of a cluster’s resources are used by various service instances. Based on that, you can estimate how many service instances can run on a cluster. Be sure to test all the apps that share the cluster. Other factors to consider are dependencies that create bottlenecks in the service. Set max to the point beyond which additional instances no longer help because of the dependency bottleneck.
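
For example, a spec fragment that keeps at least two instances for availability and caps the service at an illustrative ceiling of ten:

spec:
  scaleRule:
    instances:
      min: 2    # must be at least 1; at least 2 is recommended for availability
      max: 10   # must be greater than min; the largest allowed value is 1000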

spec.scaleRule.instances.stepsUp, spec.scaleRule.instances.stepsDown

The next properties under scaleRule.instances are stepsUp and stepsDown. Normally, if these are not set, the autoscaler attempts to scale up and down roughly in proportion to how the resources are being used. The resource usage proportion is approximately the average usage of each service instance divided by the resources that have been requested for each service instance (that is, the requested resources). When these properties are set, whenever scaling occurs, the instance count always changes by the stepsUp or stepsDown value, as appropriate, as long as the min and max limits are respected.

Generally, for stateless services, use proportional control. However, ensure a slower scale-up by setting stepsUp to a low number for services that have long start-up times. Other reasons to limit scaling with stepsUp might be high license costs or having to provision additional, expensive storage for each scale-up event. As for stepsDown, it might be useful to limit how fast an app scales down. Another reason for using stepsUp and stepsDown is when a metric such as latency is used. Because latency can change drastically over time, it is difficult to calculate an accurate proportion to scale to.
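
For example, a spec fragment that limits each scaling action to one instance at a time (illustrative values), which might suit a service with long start-up times or a volatile metric such as latency:

spec:
  scaleRule:
    instances:
      stepsUp: 1     # add at most one instance per scale-up action
      stepsDown: 1   # remove at most one instance per scale-down action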

spec.scaleRule.instances.default

The final property under scaleRule.instances is default. This value must be greater than or equal to the scaleRule.instances.min and less than or equal to scaleRule.instances.max. Understanding default means understanding how metrics are used to determine scaling.

Using CPU usage as an example, the desired number of service instances is calculated from how much CPU each service instance is using. Thus, if the average CPU usage of all the service instances exceeds the scaleUp target, or is below the scaleDown target, a desired number of service instances is determined. However, if fewer than one third of the service instances’ CPU metrics can be read, the autoscaler has insufficient information and cannot accurately calculate the desired service instance count.

In this situation, and if the current number of service instances is less than default, a scale-up occurs to meet the default number. If the number of service instances is greater than default, a scale-down does not occur to prevent disruption to user experience. If default is not set, in the case of insufficient metrics information, no scaling action is taken. Consider default as the minimum number of service instances to have if there is insufficient metrics information.
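
For example, a spec fragment that falls back to four instances (an illustrative value between the min and max shown earlier) when metrics are insufficient:

spec:
  scaleRule:
    instances:
      min: 2
      max: 10
      default: 4   # scale up to 4 if metrics are insufficient and fewer than 4 instances are running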

spec.scaleRule.trigger.metric.name

There are four configurable properties that are related to metrics under scaleRule.trigger.metric: name, scaleUp, scaleDown, and windowSeconds. Metrics are the measurements of resource usage or quality of service that autoscaling can be based on. The most useful metrics to determine the number of needed service instances are CPU and memory usage. These metrics correspond to the spec.scaleRule.trigger.metric.name values of CPUUsagePercent, CPUUsageMillicores, MemoryUsagePercent, and MemoryUsageBytes. The metrics ending in “Percent” are derived from the resources.requests for every container specified in the Kubernetes deployment manifest for the application being scaled. The reason for supporting both percentage and absolute values is to have a more stable (less “jittery”) number of service instances during scaling.

CPU and memory are the preferred metrics on which to base scaling because they are direct measures of the resources needed to run services. However, other metrics are also available for basing scaling on. They are RequestsPS (requests per second), Requests, p50Latency, p90Latency, and p99Latency.

Requests is the number of requests that have occurred in the metrics window time frame (see the spec.scaleRule.trigger.metric.windowSeconds section). Theoretically, as the number of requests increases, scaling up needs to occur to keep up with demand. If developers are familiar with requests and the number of service instances required to serve them, this can be the metric to base scaling on. The latency metrics are measured in milliseconds at different percentiles, where the first three characters of p##Latency represent the percentile. Fifty percent of requests have lower latencies (faster response times) than the value of p50Latency. Ninety percent of requests have lower latencies than the p90Latency value. Ninety-nine percent of requests have lower latencies than the p99Latency value.

For scaling, as latencies increase, consider increasing the number of service instances serving requests. Be aware that high latencies may occur not because of insufficient service instances running, but rather because of dependencies external to the cluster. In such a case, the autoscaler attempts to scale up without success, eventually reaching the maximum number of service instances and wasting resources. Only apply latency metrics to services without such dependencies. Because latencies and requests per time frame can change drastically from moment to moment, it can be difficult for the autoscaler to calculate an accurate proportion with which to scale. If these volatile metrics are used, use stepsUp and stepsDown instead of proportional scaling.
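
For example, a spec fragment that bases scaling on CPU usage as a percentage of each container's requested CPU (the metric names are those documented above; the choice of CPUUsagePercent here is illustrative):

spec:
  scaleRule:
    trigger:
      metric:
        name: CPUUsagePercent   # percentage of resources.requests; alternatives include MemoryUsagePercent, RequestsPS, p90Latency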

spec.scaleRule.trigger.metric.scaleUp, spec.scaleRule.trigger.metric.scaleDown

The next two configurations in metrics are scaleUp and scaleDown. Unlike other autoscalers that have a single target resource level that services aim to achieve, Tanzu Service Mesh Service Autoscaler has two thresholds on the service instance level at which scaling up and down are triggered. You can set the scaleUp and scaleDown values close together to define a narrow band within which service instances are expected to run. In such a case, depending on the stability of the measured metrics, the autoscaler might need to perform frequent scaling actions to maintain the small range of desired operation levels. In contrast, you can set the scaleUp and scaleDown values far apart to have a wide range of operation levels in which scaling is not required.

The units of the scaleUp and scaleDown values must match the units of the metric set in metric.name. For example, if MemoryUsagePercent is used, the scaleUp and scaleDown values must also be percentages. Percent values are in relation to the resources.requests for every container specified in the Kubernetes deployment manifest for the application being scaled. Continuing the example, if there are 10 current service instances, scaleUp is set to 80, and each of the service instances is using 90% on average, a scale-up occurs (if the max number of service instances is not exceeded). Similarly, if scaleDown is set to 30 and the service instances are using 25% on average, a scale-down occurs, provided that the mode is set to EFFICIENCY and the desired number of replicas is at or greater than the min number of service instances.
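
For example, a spec fragment that expresses the thresholds from the example above (the 80 and 30 values are percentages because the metric name ends in Percent):

spec:
  scaleRule:
    trigger:
      metric:
        name: MemoryUsagePercent
        scaleUp: 80     # scale up when average usage exceeds 80% of requested memory
        scaleDown: 30   # scale down (in EFFICIENCY mode) when average usage falls below 30%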

spec.scaleRule.trigger.metric.windowSeconds

The last configuration is windowSeconds. The default setting is 600 seconds (10 minutes), but you can set it to between 60 and 3600 seconds. This setting is the length of time over which metrics are gathered before they are used to determine the average metric level that the service instances are consuming. The average is compared to the scaleUp and scaleDown threshold values to determine whether scaling should occur. Systems with high volatility in resource usage might benefit from longer metric window times to avoid excessive up and down scaling, while more stable systems can use shorter time frames.
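
For example, a spec fragment that lengthens the metrics window for a volatile workload (the 1800-second value is illustrative and must stay within the 60 to 3600 second range):

spec:
  scaleRule:
    trigger:
      metric:
        windowSeconds: 1800   # average metrics over 30 minutes before comparing to the scaleUp and scaleDown thresholds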