Find reference information about SLO configuration in Tanzu Service Mesh, including latency percentile metrics, error rate, and error budget.
Service level objectives are defined by service level indicator (SLI) thresholds and the percent of time that a service must perform under those threshold limits. If the SLIs start meeting or exceeding the threshold levels, the error budget starts to become depleted, or if already depleted, goes into the negative. You can determine the error budget by setting a percent of the time that SLIs are allowed to meet or exceed threshold levels. When the error budget is in the negative, the SLO is violated.
Metric | Units | Value Type | Notes |
---|---|---|---|
p99 latency | millisecond | positive whole number | Measures the latencies of all requests, including requests that have resulted in errors (returns of 500-level or 400-level HTTP status codes). |
p90 latency | millisecond | positive whole number | Same as for p99 latency. |
p50 latency | millisecond | positive whole number | Same as for p99 latency. |
error rate | percent | positive number | Percent of requests that returned 500-level or 400-level HTTP status codes |
It is much faster to send back a response that an error has occurred compared to responding with data from a large file. Therefore, it can be misleading to base quality of user experience on latencies alone. Be sure to include error rates as SLIs in SLOs. That way, if both latencies and error rates are low, it is a good indication that the service is healthy.
However, having both latencies and error rate as SLIs in an SLO is not perfect. Responses that occur for client (4xx) errors are included in the error rate metrics. For example, if a user attempts to go to a page that doesn’t exist by manually editing the URL path, the application responds with a 400-level HTTP status code. Such user errors are not failures of the application to provide good service, but are included in the error rate and are therefore counted against the SLO.
Latency Percentile Measurements and Thresholds
There are three different latency percentiles available for use: the p99, p90, and p50 latencies. Each latency value is the time in milliseconds within which the given percent of the requests complete. As an example, consider a p90 latency threshold of 800 ms. The threshold has been set such that ninety percent of the requests must complete faster than 800 ms. That means that up to ten percent of the requests can take 800 ms or more.
To continue the example, consider the following 10 data points. Each data point represents the time it took for one request to complete in milliseconds.
170, 81, 68, 67, 703, 77, 810, 84, 91, 90
To determine the p90 latency for this data set, first sort the data set. Here is the data set in ascending order:
67, 68, 77, 81, 84, 90, 91, 170, 703, 810
Then take the fastest 90 percent of the requests, which are the first nine values:
67, 68, 77, 81, 84, 90, 91, 170, 703
The largest of these values, 703 ms, is the p90 latency measured for the dataset. That means that 90 percent of the data set falls at or below 703 ms.
If a threshold of 800 ms was set for a p90 latency SLI, in the above example, this dataset would meet the Service Level Objective for the period in which the data was collected.
Now change the dataset by one value. This dataset is considered a violation of the SLO:
67, 68, 77, 81, 84, 90, 91, 170, 805, 810
The above dataset violates the SLO because its p90 latency is 805 ms, so 90 percent of the requests do not complete in under 800 ms. Only 80 percent of the requests fall under the threshold. If a p50 latency SLI with a threshold of 800 ms had been defined instead of a p90 latency SLI, this dataset would not have violated the SLO, because half of the requests complete in 84 ms or less.
You can define multiple SLIs for a single SLO. For example, you can have both a p90 latency SLI and a p50 latency SLI. For the second dataset, if both the p90 and p50 latency thresholds are at 800 ms, the SLO would be considered in violation because at least one of the SLIs (the p90 latency) exceeded its threshold.
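The worked example above can be reproduced with a short sketch. It assumes a nearest-rank percentile (the slowest request among the fastest N percent), which matches the hand calculation in this section but is not necessarily how Tanzu Service Mesh computes percentiles internally. The function names `percentile_latency` and `sli_violated` are hypothetical.

```python
def percentile_latency(latencies_ms, percentile):
    """Nearest-rank percentile for a whole-number percentile (for example 50, 90, 99)."""
    ordered = sorted(latencies_ms)
    # Integer ceiling of percentile * n / 100 gives the 1-based rank.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def sli_violated(latencies_ms, percentile, threshold_ms):
    # Assumption: a measured latency that meets or exceeds the threshold
    # counts as a violation, as described at the top of this page.
    return percentile_latency(latencies_ms, percentile) >= threshold_ms


first = [170, 81, 68, 67, 703, 77, 810, 84, 91, 90]
second = [170, 81, 68, 67, 805, 77, 810, 84, 91, 90]

print(percentile_latency(first, 90))    # 703
print(sli_violated(first, 90, 800))     # False
print(percentile_latency(second, 90))   # 805
print(sli_violated(second, 90, 800))    # True
print(sli_violated(second, 50, 800))    # False (p50 is 84 ms)

# An SLO with several SLIs is violated if any one SLI is violated.
print(any(sli_violated(second, p, 800) for p in (90, 50)))  # True
```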
For more examples and an explanation of the latency percentiles, see Best Practice 1, Best Practice 2, Best Practice 3, and Best Practice 4.
The Error Rate
Users interact with Tanzu Service Mesh through the HTTP protocol. Requests to the application result in responses that are grouped into five categories: informational (100 level), successful (200 level), redirects (300 level), client errors (400 level), and server errors (500 level).
Tanzu Service Mesh considers any responses in the client or server error categories to be errors. These error responses correspond to an HTTP status code range of 400–599.
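As a rough illustration of the definition above, the following sketch computes an error rate from a list of HTTP status codes. The helper names and the sample data are hypothetical; only the rule that 400-level and 500-level responses count as errors comes from this page.

```python
def is_error(status_code):
    # Client (4xx) and server (5xx) responses both count as errors.
    return 400 <= status_code <= 599


def error_rate_percent(status_codes):
    """Percent of responses that fall in the 4xx or 5xx categories."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if is_error(code))
    return 100.0 * errors / len(status_codes)


# Hypothetical sample: one client error (404) and one server error (500).
sample = [200, 200, 404, 301, 500, 200, 200, 204, 200, 200]
print(error_rate_percent(sample))  # 20.0
```

Note that the 404 in the sample raises the error rate even though it is a user error, which is the limitation described earlier on this page.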
Error Budget and Availability
You can define the error budget by setting the amount of time that SLI thresholds can be met or exceeded. If you expect that 99.999 percent of the time the service will be under the set SLI thresholds, that gives an error budget of 0.001 percent of the time. That means that in a month (assuming 30 days in a month) a service would still be considered healthy if a user experiences latency and/or error rates exceeding their SLI thresholds for nearly 26 seconds. 26 seconds is the error budget for the month. Here is the calculation that is used to determine this value:
0.001% ÷ 100% ✕ 30 days ✕ 24 hours/day ✕ 60 minutes/hour ✕ 60 seconds/minute = 25.92 seconds
To continue the example, let’s say we set an SLI with a p99 latency of 100 ms. For 15 seconds, a user gets responses that took over 100 ms to return. That means that the error budget has been reduced by 15 seconds and only 11 seconds of the error budget remain.
26 seconds – 15 seconds = 11 seconds
If the high latencies continue and there is no longer an error budget (0 seconds remaining), the error budget starts to go into the negative. If an error budget ends up being -5 seconds for a window of a month, that means that in the next window, there are only 21 seconds in the error budget (26 seconds minus the 5-second deficit). The greater the negative value of the error budget, the longer it will take to recover. If a service cannot be healthy long enough, it might not be possible to recover the error budget.
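The arithmetic in this example can be sketched as follows. The function names are hypothetical, and the carry-over rule in `next_window_budget` is only a reading of the paragraph above (a negative balance is subtracted from the next window), not a documented Tanzu Service Mesh algorithm.

```python
SECONDS_IN_30_DAYS = 30 * 24 * 60 * 60  # 2,592,000 seconds


def error_budget_seconds(budget_percent, window_seconds=SECONDS_IN_30_DAYS):
    """Time the SLIs may meet or exceed their thresholds within the window."""
    return budget_percent / 100.0 * window_seconds


def remaining_budget(budget_seconds, violating_seconds):
    # A negative result means the SLO is violated.
    return budget_seconds - violating_seconds


def next_window_budget(budget_seconds, remaining_seconds):
    # Assumption: only a deficit carries over; unused budget does not.
    return budget_seconds + min(remaining_seconds, 0)


budget = error_budget_seconds(0.001)
print(round(budget, 2))                 # 25.92

# The text rounds the budget to 26 seconds for the rest of the example.
print(remaining_budget(26, 15))         # 11
print(next_window_budget(26, -5))       # 21
```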
What is the purpose of error budgets? Your SLOs don’t have to be perfect. They don’t need to meet SLIs 100 percent of the time for users to have a good experience with a service. Instead, define a budget where some error or delay in the service is allowed. With an error budget, teams can balance the priorities between developers and operations, between building new features and maintaining existing ones.
An error budget that becomes depleted indicates to a team that they should focus on making existing features more robust and resistant to failure. On the other hand, an error budget that doesn’t get depleted indicates that the team can focus on new feature development. Having an error budget and an SLO provides an objective measure that can signal where the focus should be in the development and maintenance of applications.
Burn Rate Calculation
Burn Rate = (CurrentRemainingErrorBudget - RemainingErrorBudget_before_15min) / 15
The default time range for the burn rate calculation across the dashboard (in the pie chart and table grid) is 15 minutes. The table below shows an example of how to calculate the burn rate for two SLOs:
S.No | Components | SLO1 | SLO2 | Total | Remarks |
---|---|---|---|---|---|
1 | Total minutes in SLO period | 43800 | 43800 | 43800 | Total minutes in a month |
2 | Actual permissible error budget in SLO period in minutes | 180 | 250 | 430 | Total Error budget |
3 | User-selected time range (in minutes) | 15 | 15 | 15 | Implicit |
4 | Actual depletion in minutes within the user-selected time range | 1 | 10 | 11 | |
5 | Permissible error budget depletion rate per minute | 0.004109589 | 0.00570776 | 0.00981735 | R2/R1 |
6 | Actual Burn rate within user-selected time range | 0.066666667 | 0.66666667 | 0.73333333 | R4/R3 |
7 | Is actual burn rate healthy? | N | N | N | |
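The rows of the table can be reproduced with a small sketch. The function names are hypothetical. The health check is an assumption inferred from the table, where every column has an actual burn rate above the permissible rate and is marked N; the exact rule used by the dashboard is not stated on this page.

```python
def permissible_rate(budget_minutes, period_minutes):
    """Row 5: permissible error budget depletion per minute (R2/R1)."""
    return budget_minutes / period_minutes


def actual_burn_rate(depleted_minutes, range_minutes=15):
    """Row 6: budget depleted per minute in the selected time range (R4/R3)."""
    return depleted_minutes / range_minutes


def is_burn_rate_healthy(actual, permissible):
    # Assumption: healthy means burning no faster than the permissible rate.
    return actual <= permissible


PERIOD_MINUTES = 43800  # minutes in a month, as used in row 1

for name, budget, depleted in (("SLO1", 180, 1), ("SLO2", 250, 10)):
    allowed = permissible_rate(budget, PERIOD_MINUTES)
    actual = actual_burn_rate(depleted)
    healthy = "Y" if is_burn_rate_healthy(actual, allowed) else "N"
    print(name, round(allowed, 9), round(actual, 9), healthy)
```

For SLO1 this yields a permissible rate of about 0.0041 and an actual burn rate of about 0.0667 per minute, so the budget is being consumed roughly 16 times faster than the period allows.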