Find details about some of the Tanzu Service Mesh Service Autoscaler configuration properties, including recommendations for specific configuration values.
spec.scaleTargetRef.kubernetes
The three entries that identify a Kubernetes target reference are apiVersion, kind, and name. Tanzu Service Mesh Service Autoscaler acts on a scalable Kubernetes object. You can optionally set the Kubernetes API version of the kind in the apiVersion property; an example apiVersion is apps/v1. The valid scalable Kubernetes object types are Deployment, ReplicaSet, and StatefulSet, one of which is set as scaleTargetRef.kubernetes.kind. scaleTargetRef.kubernetes.name is the name of the Deployment, ReplicaSet, or StatefulSet that the autoscaler will target. This scalable object must be in the same namespace as the autoscaler definition.
Ensure that there is only one autoscaling definition per Kubernetes object with the name set in scaleTargetRef.kubernetes.name.
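For example, a definition that targets a Deployment might look like the following sketch. Only the scaleTargetRef block comes from the properties described here; the surrounding apiVersion, kind, and metadata of the autoscaler custom resource, and the shopping-cart names, are illustrative placeholders:

  apiVersion: autoscaling.tsm.tanzu.vmware.com/v1alpha1  # placeholder; use the version provided by your installation
  kind: Definition                                       # placeholder name for the autoscaler custom resource
  metadata:
    name: shopping-cart-autoscaler
    namespace: shop                                      # same namespace as the target object
  spec:
    scaleTargetRef:
      kubernetes:
        apiVersion: apps/v1    # optional API version of the target kind
        kind: Deployment       # Deployment, ReplicaSet, or StatefulSet
        name: shopping-cart    # name of the object that the autoscaler acts on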
spec.scaleRule.enabled
Scale Rule has several configurable properties for autoscaling. To enable scaling on a namespace in which the application operates, set enabled to true. To disable scaling, set enabled to false. Even if scaling is disabled, scaling calculations are made and can be seen in the status output of the autoscaling custom resource, allowing for a dry run of the autoscaler, or advisory mode.
spec.scaleRule.mode
Another configurable property is mode. You can set mode to EFFICIENCY or PERFORMANCE. Performance mode scales up only when resource usage exceeds the trigger.metric.scaleUp target. It does not scale down even if resources are being underused. This mode attempts to ensure the system is always ready for spikes in user demand. Efficiency mode attempts to prevent overprovisioning of resources by reducing the number of running service instances when resource usage is sufficiently below the trigger.metric.scaleDown target.
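A sketch of the two modes (set one value or the other):

  spec:
    scaleRule:
      enabled: true
      mode: PERFORMANCE    # scale up on demand spikes; never scale down
      # mode: EFFICIENCY   # also scale down when usage falls below the scaleDown target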
spec.scaleRule.trigger.gracePeriodSeconds
To ensure stable operations, implement a grace period by setting a positive integer value for trigger.gracePeriodSeconds. If a grace period is set and a scale-down action is required, no scale-down occurs during the grace period that follows any scaling event. Because user experience is very important, and a delayed response to increasing demand is not desired, the grace period does not apply to scale-up situations. If the grace period is not set, it defaults to 300 seconds.
If no grace period is desired, set it to 0 seconds. A low grace period value allows for more frequent changes in scaling because a scale-down happens after a shorter wait time. On the other hand, a system has more time to stabilize with a longer grace period. The grace period should depend on how long it takes a service instance to come online and become effective: it should give enough time after a scale-up or scale-down event for the system to be ready for a scale-down action. Generally, front-end web services, which are often stateless, can start accepting traffic quickly (within 30 seconds), whereas a database replica node can take approximately 15 minutes to sync and prepare to store more data.
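For example, a database replica that needs roughly 15 minutes to sync might use a longer grace period than the 300-second default, while a stateless front end could use a shorter one. A sketch with an illustrative value:

  spec:
    scaleRule:
      trigger:
        gracePeriodSeconds: 900   # no scale-down for 15 minutes after any scaling event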
spec.scaleRule.instances.min, spec.scaleRule.instances.max
Under scaleRule.instances, there are five configurable properties: min, max, stepsDown, stepsUp, and default. The scaleRule.instances.min and scaleRule.instances.max values limit how many service instances the autoscaler can set. The min must be at least one service instance and less than what is set for scaleRule.instances.max.
To ensure availability, at least two is recommended, but the minimum number of service instances should be whatever is required to maintain a positive user experience. A scaleRule.instances.min of zero service instances is not allowed because it would shut that service down entirely. The largest allowed value for max is 1000, and max must be greater than the value of scaleRule.instances.min.
You can determine a value that reflects the app's needs through benchmark testing along with knowledge of the cluster's capacity. A benchmark test can show how much of a cluster's resources is used by various service instances. Based on that, you can estimate how many service instances can run on a cluster. Be sure to test all the apps that share the cluster. Other factors to consider are dependencies that create bottlenecks in the service. The max should be a number beyond which additional instances will not be helpful because of the dependency bottleneck.
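A sketch with illustrative limits for a service where benchmarking suggests that more than 20 instances no longer helps:

  spec:
    scaleRule:
      instances:
        min: 2    # at least two for availability; zero is not allowed
        max: 20   # must be greater than min; the largest allowed value is 1000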
spec.scaleRule.instances.stepsUp, spec.scaleRule.instances.stepsDown
The next properties under scaleRule.instances are stepsUp and stepsDown. Normally, if these are not set, the autoscaler attempts to scale up and down roughly in the proportion that the resources are being used. The resource usage proportion is approximately the average usage of each service instance divided by the resources that have been requested for each service instance (that is, the requested resources). When these properties are set, whenever scaling occurs, the instance count always changes by the stepsUp or stepsDown value, as appropriate, as long as the min and max limits are respected.
Generally, for stateless services, use proportional control. However, ensure slower scale-up by setting stepsUp to a low number for services that have long start-up times. Another reason to limit scaling with stepsUp might be high license costs or having to provision additional, expensive storage for each scale-up event. As for stepsDown, it might be useful for limiting how fast an app scales down. Another reason for using stepsUp and stepsDown is when a metric like latency is used: because latency can change drastically over time, it is difficult to calculate an accurate proportion to scale to.
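For example, a service with long start-up times or per-instance license costs might use fixed step sizes instead of proportional scaling. A sketch with illustrative values:

  spec:
    scaleRule:
      instances:
        min: 2
        max: 20
        stepsUp: 1     # add one instance per scale-up event
        stepsDown: 2   # remove two instances per scale-down event, down to min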
spec.scaleRule.instances.default
The final property under scaleRule.instances is default. This value must be greater than or equal to scaleRule.instances.min and less than or equal to scaleRule.instances.max. Understanding default means understanding how metrics are used to determine scaling.
Using CPU usage as an example, the desired number of service instances is calculated from how much CPU each service instance is using. Thus, if the average CPU usage of all the service instances exceeds the scaleUp target, or is below the scaleDown target, a desired number of service instances is determined. However, if fewer than one third of the service instances’ CPU metrics can be read, scaling has insufficient information and cannot accurately calculate the desired service instance count.
In this situation, if the current number of service instances is less than default, a scale-up occurs to meet the default number. If the number of service instances is greater than default, a scale-down does not occur, to prevent disruption to the user experience. If default is not set, no scaling action is taken when metrics information is insufficient. Consider default as the minimum number of service instances to have if there is insufficient metrics information.
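As a sketch, with the illustrative values below, if metrics can be read from fewer than one third of the instances and only 2 instances are running, the autoscaler scales up to 4; if 6 instances are running, no action is taken:

  spec:
    scaleRule:
      instances:
        min: 2
        max: 20
        default: 4   # fallback instance count when metrics information is insufficient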
spec.scaleRule.trigger.metric.name
There are four configurable properties related to metrics under scaleRule.trigger.metric: name, scaleUp, scaleDown, and windowSeconds. Metrics are the measurements of resource usage or quality of service that autoscaling can be based on. The most useful metrics for determining the number of needed service instances are CPU and memory usage. These metrics correspond to the spec.scaleRule.trigger.metric.name values of CPUUsagePercent, CPUUsageMillicores, MemoryUsagePercent, and MemoryUsageBytes. The metrics ending in “Percent” are derived from the resources.requests for every container specified in the Kubernetes deployment manifest for the application being scaled. The reason for supporting both percentage and absolute values is to have a more stable (less “jittery”) number of service instances during scaling.
CPU and memory are the preferred metrics on which to base scaling because they are direct measures of the resources needed to run services. However, other metrics are also available for basing scaling targets on: RequestsPS (requests per second), Requests, p50Latency, p90Latency, and p99Latency.
Requests is the number of requests that have occurred in the metrics window time frame (see the spec.scaleRule.trigger.metric.windowSeconds section). Theoretically, as the number of requests increases, scaling up needs to occur to keep up with demand. If developers are familiar with requests and the number of service instances required to serve them, this can be the metric to base scaling on. The latency metrics are measured in milliseconds at different percentiles, where the number in p##Latency represents the percentile. Fifty percent of requests have lower latencies (faster response times) than the value of p50Latency, ninety percent of requests have lower latencies than the p90Latency value, and ninety-nine percent of requests have lower latencies than the p99Latency value.
For scaling, as latencies increase, consider increasing the number of service instances serving requests. Be aware that high latencies may occur not because too few service instances are running, but because of dependencies external to the cluster. In such a case, the autoscaler attempts to scale up without success, eventually reaching the maximum number of service instances and risking wasted resources. Only apply latency metrics to services without such external dependencies. Because latencies and request rates can change drastically from moment to moment, it can be difficult for the autoscaler to calculate an accurate proportion with which to scale. If these volatile metrics are used, use stepsUp and stepsDown instead of proportional scaling.
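For instance, a latency-based rule might pair p90Latency with fixed steps rather than proportional scaling. A sketch with illustrative thresholds in milliseconds:

  spec:
    scaleRule:
      instances:
        stepsUp: 2
        stepsDown: 1
      trigger:
        metric:
          name: p90Latency   # or CPUUsagePercent, CPUUsageMillicores, MemoryUsagePercent, MemoryUsageBytes, RequestsPS, Requests, p50Latency, p99Latency
          scaleUp: 250       # scale up when 90th-percentile latency exceeds 250 ms
          scaleDown: 100     # scale down when it falls below 100 ms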
spec.scaleRule.trigger.metric.scaleUp, spec.scaleRule.trigger.metric.scaleDown
The next two configurations in metrics are scaleUp and scaleDown. Unlike other autoscalers that have a single target resource level that services aim to achieve, Tanzu Service Mesh Service Autoscaler has two thresholds on the service instance level at which scaling up and down are triggered. You can set the scaleUp and scaleDown values close together to define a narrow band within which service instances are expected to run. In such a case, depending on the stability of the measured metrics, the autoscaler might need to perform frequent scaling actions to maintain the small range of desired operation levels. In contrast, you can set the scaleUp and scaleDown values far apart to have a wide range of operation levels in which scaling is not required.
The units of the scaleUp and scaleDown values must match the units of the metric set in metric.name. For example, if MemoryUsagePercent is used, the scaleUp and scaleDown values must also be percentages. Percent values are in relation to the resources.requests for every container specified in the Kubernetes deployment manifest for the application being scaled. Continuing the example, if there are 10 current service instances, scaleUp is set to 80, and on average each of the service instances is using 90%, a scale-up occurs (if the max number of service instances is not exceeded). Similarly, if scaleDown is set to 30 and the service instances are on average using 25%, a scale-down occurs, provided the mode is set to EFFICIENCY and the desired number of replicas is at or greater than the min number of service instances.
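The percentage example above corresponds to a configuration like the following sketch (values illustrative):

  spec:
    scaleRule:
      mode: EFFICIENCY
      trigger:
        metric:
          name: MemoryUsagePercent
          scaleUp: 80     # scale up when average usage exceeds 80% of resources.requests
          scaleDown: 30   # scale down when average usage falls below 30%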
spec.scaleRule.trigger.metric.windowSeconds
The last configuration is windowSeconds. The default setting is 600 seconds (10 minutes), but you can set it between 60 and 3600 seconds. This setting is the length of time over which metrics are gathered before they are used to determine the average metric level that the service instances are consuming. The average is compared to the scaleUp and scaleDown threshold values to determine whether scaling should occur. Systems with high volatility in resource usage might benefit from longer metric window times to avoid excessive up and down scaling, while more stable systems can use shorter time frames.
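A sketch for a system with volatile resource usage that averages metrics over a longer window (illustrative values):

  spec:
    scaleRule:
      trigger:
        metric:
          name: CPUUsagePercent
          scaleUp: 80
          scaleDown: 30
          windowSeconds: 1200   # average over 20 minutes; allowed range 60-3600, default 600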