Site Reliability Engineering ‘Key Concepts’ – SLO, Error Budget, TOIL and Observability
“What happens when a software engineer is tasked with what used to be called operations” – Ben Treynor, Google.
Around 2003, well before the DevOps movement took shape, Google created the role of Site Reliability Engineer.
SRE is a discipline in which software engineering principles are applied to infrastructure and operations problems, so that systems become more stable and reliable and can scale as the business requires.
The goal of Site Reliability Engineering is to create highly scalable and highly reliable distributed software systems.
SREs aim to spend no more than about 50% of their time on “ops” work such as issue resolution, on-call, and manual interventions, and the rest of their time on development tasks such as new features, scaling, or automation. Monitoring, alerting, and automation are a large part of SRE work.
The following are the SRE Principles:
- Operations is a software problem.
- SRE services are managed with Service Level Objectives.
- SRE practices aim at removing TOIL through automation.
- Automate as much as possible.
- SRE helps in reducing the cost of failure.
- SREs have both Dev and Ops skill sets and share the “Wisdom of Production” with the development team.
What is SLO?
An SLO, or Service Level Objective, is a measurable target for how well a product or service should operate, for example its availability over a given period. SLOs are strongly tied to the user experience: they are chosen so that, when they are met, users are likely to be satisfied with the service.
Setting SLOs and monitoring them regularly is a key objective of SRE. Each product and service should have its own SLOs, and they should always be defined from the customer's point of view.
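As a rough illustration of what monitoring an SLO can look like, the sketch below compares a measured success ratio against a target. The names Slo, sli and is_met, the "checkout availability" service and the event counts are all hypothetical, and the term SLI (Service Level Indicator, the measured value) is not used elsewhere in this article.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    name: str
    target: float  # e.g. 0.999 means 99.9% of events must be good


def sli(good_events: int, total_events: int) -> float:
    """Measured fraction of events that met the user's expectation."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as compliant
    return good_events / total_events


def is_met(slo: Slo, good_events: int, total_events: int) -> bool:
    """True when the measured ratio reaches the SLO target."""
    return sli(good_events, total_events) >= slo.target


# Hypothetical numbers for one measurement window.
availability = Slo(name="checkout availability", target=0.999)
print(is_met(availability, good_events=999_500, total_events=1_000_000))  # True
print(is_met(availability, good_events=998_000, total_events=1_000_000))  # False
```

Counting good events against total events from the user's side, rather than counting server uptime, is one way to keep the objective tied to the customer's point of view.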
What is TOIL?
“TOIL is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows”. – Vivek Rau, Google.
Examples of TOIL include manual releases, physically connecting to infrastructure to check something, routine password resets, running the same tests over and over, acknowledging the same alerts every day, creating users, manual resets, repetitive on-call response, extracting data by hand, and manual scaling of infrastructure.
TOIL is bad because:
- It slows down progress.
- Manual work reduces quality.
- It slows down career progression.
- It creates a never-ending list of manual tasks.
- It leads to burnout.
Returning to SLOs: according to the Catchpoint SRE Survey Report 2019, the following are the most commonly reported SLOs among respondents:
- Availability – 72%
- Response Time – 47%
- Latency – 46%
- We do not have SLOs – 27%
What is Error Budget?
“100% is the wrong reliability target for basically everything” – Ben Treynor
The Error Budget is the amount of unreliability a service is allowed within a given period, that is, the gap between 100% and the SLO, usually expressed as time during which the service can be affected. This budget is what gets spent when new features are brought in or architectural changes are made. If more than the budget is spent, there has to be a consequence. One such consequence is to stop releasing new features and get the system stable, so all the post-mortem related backlog items are prioritized over new features. SRE encourages teams to use the Error Budget fully, down to zero, but to spend it strategically to balance velocity (speed) and availability (stability).
We need to stay lean and release in small batches, because big changes carry higher risk and can burn through the error budget quickly.
SLO – 99.9% availability of the system.
Error Budget – about 43 minutes per month (0.1% of the 43,200 minutes in a 30-day month). All downtime, whether it comes from new feature releases, patches, or planned and unplanned outages, has to fit within these 43 minutes.
Consequence – If the Error Budget is used up, the release of new features has to stop, and user stories from the post-mortem related backlog need to be prioritized.
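The arithmetic behind the 43 minutes, and the consequence attached to it, can be sketched as below. This is a minimal illustration assuming a 30-day window; the function names error_budget_minutes and releases_allowed and the downtime figures are made up for the example.

```python
def error_budget_minutes(slo_target: float, days_in_window: int = 30) -> float:
    """Minutes of unavailability the SLO permits within the window."""
    total_minutes = days_in_window * 24 * 60  # 43,200 for 30 days
    return total_minutes * (1.0 - slo_target)


def releases_allowed(slo_target: float, downtime_minutes: float,
                     days_in_window: int = 30) -> bool:
    """Feature releases continue only while some budget remains."""
    return downtime_minutes < error_budget_minutes(slo_target, days_in_window)


budget = error_budget_minutes(0.999)
print(round(budget, 1))              # 43.2
print(releases_allowed(0.999, 30))   # True: 30 minutes used, budget remains
print(releases_allowed(0.999, 50))   # False: budget exhausted, stabilize first
```

A real policy would likely also look at how fast the budget is being consumed, not only whether it has run out, but that is beyond this sketch.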
What is Observability?
“Observability, as a noun, is a property of a system, it’s a measure of how well internal states of a system can be inferred from knowledge of its external outputs. Therefore, if our IT systems don’t adequately externalize their state, then even the best monitoring can fall short” – Peter Waterhouse, CA
Observability is about having enough data to answer questions that were not anticipated in advance. It requires architecting the system in such a way that it exposes the information needed to understand its health.
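One common way for a system to externalize its state is to emit a structured event for each unit of work. The sketch below assumes JSON events built with the Python standard library; make_event, the field names and the sample values (checkout, charge_card, payments-api) are hypothetical and not taken from any particular tool.

```python
import json
import time


def make_event(service: str, operation: str, status: str,
               duration_ms: float, **context) -> str:
    """Serialize one unit of work as a structured, queryable JSON event."""
    event = {
        "timestamp": time.time(),
        "service": service,
        "operation": operation,
        "status": status,
        "duration_ms": duration_ms,
    }
    # Extra key/value context is what lets unanticipated questions
    # be answered later, e.g. "which region?" or "which dependency?"
    event.update(context)
    return json.dumps(event, sort_keys=True)


line = make_event("checkout", "charge_card", "error", 812.5,
                  region="eu-west", dependency="payments-api")
print(line)
```

Because every field is a named key rather than free text, events like this can later be filtered and grouped by dimensions nobody planned for when the code was written.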
Observability is important because:
- Service growth is happening at a very rapid pace.
- Architectures are dynamic in nature.
- Container workloads are increasing.
- Services depend on other services.
- A high-quality customer experience is very important.
There are many other concepts that need to be looked at in order to deliver the best possible service to the customer.
Check out the Site Reliability Engineering Foundation Course from DevOps Institute (USA), offered through Xellentro.
~ Dr. Niladri Choudhuri