
In 2022, we are in the middle of Industry 4.0. It is a time of great opportunities for those who can adapt to the change and great peril for those who cannot.
Today's systems must satisfy the needs of varied market segments across the globe. For instance, the same banking software may serve one customer as a salary account, another as a corporate account, another as an investment vehicle (fixed deposit, recurring deposit), and a retired pensioner as a source of livelihood. Bank accounts must satisfy the needs of all these segments.
Businesses should also help customers decide on the final product or service by discussing their requirements and demands. This is done by collaborating with prospective customers to define the requirement: the customer presents a problem statement, and hypothesis-driven collaborative development follows.
In the DevOps approach, we have loosely coupled, distributed, large-scale systems. Several changes go to production every day. Canary testing, blue/green testing and feature-toggle releases are common ways of testing these changes.

Reliability is a first-class citizen. Highly reliable systems bring in more business. Chief Information Officers (CIOs) are evaluated on the amount of business generated owing to the reliability of the system.

IT systems are complex, which means there is a high possibility of things going wrong. IT teams therefore need to improve service reliability and system resiliency. Observability and the right automation are the key factors for increased efficiency and resiliency. SREs further build anti-fragility through Chaos Engineering.

The role of Site Reliability Engineer (SRE) has become one of the fastest-growing roles in the job market. SRE practices are a set of operational practices for managing services at scale. As Rinku Sachdeva of DevOps Institute mentions, “To support the growing need for SRE professionals with advanced skills, DevOps Institute is excited to release the SRE Practitioner certification as a follow-on to the popular SRE Foundation certification to validate knowledge and understanding of advanced SRE Practices, methods and tools for those focused on large-scale service scalability and reliability.”

This course will dive deep into the various aspects covered in the SREP Blueprint to give a clear understanding of the actual role of an SRE.

The following is the SREP Blueprint from DevOps Institute:

In this 3-day course, we will be looking at the following modules:

Module 1 – SRE Anti-patterns  

We will start with understanding the role of SRE as detailed below:


Module 2 – SLO is a proxy for Customer Happiness 

In this module, we will take a deep dive into SLOs, SLIs, Error Budgets and Error Budget Policies: their benefits, how to define them and how they impact architecture and design.
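To make the relationship between an SLO, an SLI and an error budget concrete, here is a minimal sketch in Python. It assumes a simple request-based availability SLI; the function names and the numbers are illustrative only and are not taken from the course material.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """SLI: fraction of requests served successfully in the window."""
    return (total_requests - failed_requests) / total_requests


def error_budget(slo_target: float, total_requests: int) -> int:
    """Error budget: number of failed requests the SLO tolerates in the window."""
    return round((1 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once it is overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no budget to spend.
        return 0.0
    return 1 - failed_requests / budget


# Illustrative numbers: a 99.9% SLO over 1,000,000 requests with 250 failures.
sli = availability_sli(1_000_000, 250)
budget = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"SLI={sli:.5f}, budget={budget} requests, remaining={remaining:.0%}")
```

With these assumed numbers, the service may fail 1,000 requests in the window and has spent a quarter of that allowance. An error budget policy would then state what the team does as the remaining budget approaches zero, for example slowing down releases.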


Module 3 – Building secure and reliable systems  

According to Google's latest SRE book, released in January 2020, a system is not reliable if it is not secure. DevSecOps likewise holds that security is everybody's responsibility, and SREs play a big role in making systems secure. Topics of discussion include large-scale distributed design, designing for change in architecture and distributed systems, design for fault tolerance, and the SRE's role in designing for security and resiliency.

Module 4 – Full Stack Observability  

Observability is a key aspect of the SRE approach. In this module, we will take a deep dive into what Observability is, how it differs from Monitoring and how to implement full-stack observability. We will discuss Google's Golden Signals, the 3 Pillars of Observability, and Synthetic and End User Monitoring.

Module 5 – Using Platform Engineering and AIOps  

We will discuss taking a platform-centric view to understand how AIOps helps improve resiliency. Other topics of discussion include how DataOps can help in the SRE journey, how to implement AIOps and how to measure the success of an AIOps implementation.

Module 6 – SRE and Incident Response Management 

The SRE approach to incident management is different from the traditional one. We will discuss the SRE's key responsibilities in incident response and compare the new SRE way with the old ITSM way of incident management. We will also discuss Google's SRE Management framework and its roles, and how to improve incident management using AI/ML.

Module 7 – Chaos Engineering 

In this module, we will see how to navigate the complexities of present-day systems and how SREs can practice Chaos Engineering using modern tools. Security Chaos Engineering will also be discussed.

Module 8 – SRE is the purest form of DevOps 

Key principles of SRE, SRE execution models and key metrics will be discussed. Other topics of discussion include cultural aspects, psychological safety, how to conduct a proper blameless postmortem, how to document a postmortem using a Google example and how to transform after implementing SRE.

What you need to know about the SREP examination:

Duration of the exam – 90 minutes

Bloom Level – 3 

Difficulty percentage – slightly more difficult than the Foundation exams

Type of exam – Open book 

Medium – online 

Pass percentage – 65% 

Pre-requisite – Candidates should hold the SRE Foundation certification

If you have any more questions, feel free to reach out to us at neetas@xellentro.com or anjalaim@xellentro.com