Service level objectives (SLOs) provide a formalized way to describe, measure, and monitor the performance, quality, and reliability of microservice applications. They give application and platform teams a shared quality benchmark for gauging service level agreement (SLA) compliance and driving continuous improvement.

An SLO describes the high-level objective for acceptable operation and health of one or more services over a period of time (for example, a week or a month). Operators can specify, for example, that a service or application should be healthy 99 percent of the time. An SLO of 99 percent gives the service an error budget of 1 percent: the service can be “unhealthy” for 1 percent of the time, which allows for realistic downtime, error cases, planned maintenance windows, and service upgrades. Teams can specify which performance characteristics and thresholds are key to the health of their applications. Multiple SLOs can be defined for a single service, reflecting the reality of Quality of Service (QoS) contracts between different classes of end users.
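To make the error budget concrete, the following minimal sketch converts an SLO percentage and a time window into the amount of time a service may be unhealthy. The function name `error_budget` and its parameters are illustrative assumptions, not part of any product API.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return how long a service may be unhealthy within the window.

    Hypothetical helper for illustration only.
    """
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    # The error budget is the share of the window not covered by the SLO.
    budget_fraction = (100 - slo_percent) / 100
    return window * budget_fraction


# A 99 percent SLO over a week and over a 30-day month.
weekly_budget = error_budget(99, timedelta(days=7))
monthly_budget = error_budget(99, timedelta(days=30))
print(weekly_budget)
print(monthly_budget)
```

Under these assumptions, a 99 percent SLO leaves roughly 1 hour 41 minutes of budget per week and about 7 hours 12 minutes per 30-day month.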

An SLO consists of one or more service level indicators (SLIs). SLIs capture important low-level performance characteristics for a particular service. Defining an SLO as a combination of SLIs allows teams to describe service health in a more precise and relevant way. VMware Secure App IX collects SLI metrics at 10-second intervals for every service instance that is part of the mesh. One example of an SLI is that 99 percent of successful requests respond with latencies under 350 ms (99th percentile latency < 350 ms). Another example is that a service responds with error codes for fewer than 0.1 percent of requests (error rate < 0.1%).
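The two example SLIs can be sketched as simple checks over a batch of request records. This is an illustrative sketch only: the `Request` record, the function names, and the choice to treat an empty sample as passing are assumptions, and it does not describe how VMware Secure App IX evaluates SLIs internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """Hypothetical record of one observed request."""
    latency_ms: float
    is_error: bool


def latency_sli_met(requests: List[Request],
                    threshold_ms: float = 350.0,
                    target: float = 0.99) -> bool:
    """True if at least `target` of successful requests beat the threshold."""
    successful = [r for r in requests if not r.is_error]
    if not successful:
        return True  # assumption: no successful traffic, nothing to violate
    fast = sum(1 for r in successful if r.latency_ms < threshold_ms)
    return fast / len(successful) >= target


def error_rate_sli_met(requests: List[Request],
                       max_error_rate: float = 0.001) -> bool:
    """True if the share of error responses is below `max_error_rate`."""
    if not requests:
        return True  # assumption: no traffic, nothing to violate
    errors = sum(1 for r in requests if r.is_error)
    return errors / len(requests) < max_error_rate


def slo_healthy(requests: List[Request]) -> bool:
    """An SLO combining both SLIs is healthy only if every SLI is met."""
    return latency_sli_met(requests) and error_rate_sli_met(requests)


# Sample batch: 1990 fast successes, 9 slow successes, 1 error.
sample = ([Request(120.0, False)] * 1990
          + [Request(400.0, False)] * 9
          + [Request(50.0, True)])
print(slo_healthy(sample))
```

In this sketch the latency SLI considers only successful requests, matching the wording of the example above, while the error rate SLI is computed over all requests.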

VMware Secure App IX displays SLO and SLI measurements in real time through its user interface.

Attention:

An SLO goal of 100% is unrealistic. Even with automated health checks and fast failover, there is a nonzero probability that one or more components will fail, possibly simultaneously, resulting in a service with less than 100% uptime.