According to the book Site Reliability Engineering, ‘Site Reliability Engineering is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.’
The goal of Site Reliability Engineering is to create ultra-scalable and highly reliable distributed software systems.
The Principles of Site Reliability Engineering are:
Operations is a Software Problem – Software engineering practices such as designing and building are used to solve operations problems, rather than only maintaining and operating systems
Service Level Objectives – Services are managed to the SLO (Service Level Objective). The SLO is defined from the customer’s point of view and exists to make the user experience better. SLOs need consequences: some action has to be taken when an SLO is breached
Toil – Any repetitive, mundane operational task is bad. It should be automated. The ‘Wisdom of Production’ is used to design better systems. SREs must have time to make tomorrow better than today
Automation – Automate whatever can be automated to help remove toil. Infrastructure as Code and Configuration as Code should be adopted. We need to be careful to fix bad processes before automating them. SREs also have the ability to regulate the work
Reduce Cost of Failure – The later a problem is detected, the higher its cost. SRE tries to reduce MTTR (Mean Time To Repair). Canary testing and smaller pieces of work help in faster detection and recovery
Shared Ownership – SREs share a skill set with the development team and also have operations-related skills. Hence, the silo between Dev and Ops is broken. This requires organizational changes in structure, a shift in performance appraisal from individual to team-based, and at least T-shaped skills.
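To make the MTTR figure from the Reduce Cost of Failure principle concrete, here is a minimal Python sketch. It assumes each incident is recorded as a pair of timestamps (when it was detected and when it was resolved); the function name and the sample incidents are hypothetical and only serve as an illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average repair time over (detected_at, resolved_at) datetime pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the individual repair durations, starting from a zero timedelta.
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents: one repaired in 30 minutes, one in 90 minutes.
sample_incidents = [
    (datetime(2021, 1, 4, 10, 0), datetime(2021, 1, 4, 10, 30)),
    (datetime(2021, 1, 9, 22, 0), datetime(2021, 1, 9, 23, 30)),
]
mttr = mean_time_to_repair(sample_incidents)
```

Tracking this value over time is one way to see whether practices such as canary testing and smaller pieces of work are actually shortening recovery.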
Another concept that SRE uses is the ‘Error Budget’. If an SLO is breached, there has to be a consequence. For example, if there are 1 million transactions per month and we have a 99.9% SLO, up to 1,000 transactions can fail in a month. This is the error budget. It means that new releases, patches, modifications, etc. may together cause at most 1,000 failed transactions. If there are any more failures, we may need to stop new releases until the system is stable.
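The error budget arithmetic can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the SLO is expressed as a fraction (0.999 for 99.9%), every transaction counts equally, and the function names are hypothetical rather than part of any SRE tool.

```python
def error_budget(total_transactions: int, slo: float) -> int:
    """Number of transactions that may fail in the period without breaching the SLO."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    # round() absorbs floating-point noise in (1 - slo).
    return round(total_transactions * (1.0 - slo))


def should_freeze_releases(total_transactions: int, slo: float, failed: int) -> bool:
    """True once failures exceed the budget, i.e. the SLO has been breached."""
    return failed > error_budget(total_transactions, slo)


# The example from the text: 1 million transactions per month, 99.9% SLO.
budget = error_budget(1_000_000, 0.999)
freeze_at_900 = should_freeze_releases(1_000_000, 0.999, failed=900)
freeze_at_1001 = should_freeze_releases(1_000_000, 0.999, failed=1001)
```

With these numbers the budget is 1,000 failed transactions: 900 failures still leave room for releases, while 1,001 failures signal that releases should pause until the system is stable.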
According to the SRE Survey 2019 of Catchpoint – the most popular SLOs are:
We don’t have SLOs
SREs spend 50% of their time on operations work and 50% on development work. Google also states that no one should spend more than 25% of their time on call. Monitoring is important in SRE, but observability is more important. Externalizing all the outputs of a service allows us to infer the internal state of that service, which makes it observable. Being observable means being proactive, whereas monitoring only reacts after an event has occurred.
SRE requires automation. The following can be areas of automation driven by SRE:
Infrastructure as Code/Configuration as Code – Tools like Terraform, AWS CloudFormation, Puppet, Chef, Ansible, SaltStack, Docker, etc.
Automated Functional and Non-Functional testing in production – Tools like Selenium, Cucumber, Jasmine, Mocha, Zephyr, Mockito, JMeter, Sonatype Nexus Lifecycle, SoapUI, WhiteSource, Veracode, Nagios, etc., can be used
Only Versioned and Signed artifacts are deployed – Tools used are Nexus, Artifactory
Better observability – Tools used are OpsGenie, Nagios, Dynatrace, AppDynamics, Prometheus, Splunk, Logstash
Easier planning for future growth – Tools that can be of help are Amazon Cloud Auto Scaling, Kubernetes Pod Scaling, Amazon Cloud RDS, NoSQL-type databases like MongoDB, Couchbase, Cloud APIs
Antifragility and Chaos Engineering – Tools like Chaos Monkey, PagerDuty, VictorOps, Squadcast. Fire Drills also need to be done.
Automation helps make things consistent, testable, and production ready. It makes systems more secure and auditable, and makes errors easier to recreate. The cost of change is lower and regression risk is reduced. Automated deployment makes releases more secure and less vulnerable. It reduces dependency errors and helps identify vulnerabilities faster and more easily. Automation helps reduce MTTR and supports protective monitoring. It also reduces toil and thus the Total Cost of Ownership. Various risks, such as those to availability and integrity, are mitigated.
SRE does not stand alone. It works with DevOps and Lean, IT Service Management and Agile.