The Catchpoint SRE Report 2020 is out. It was supposed to be released at SREcon, but the pandemic delayed it. The delay gave the report the opportunity to include data from both before the pandemic and after the shift to the new normal way of working. As the report was released yesterday, I would like to take a quick look at it and share my views. This year there were a total of 600 respondents.

The key takeaways mentioned are:

Key takeaway 1 – Observability Components Exist; Observability does not – The following is one of the results from the survey:

The above result shows monitoring and alerting as the largest area of focus (93%), which suggests that organizations are still managing IT Services the traditional way. 53% say Observability, but it would be worth understanding whether the respondents all meant the same thing by implementing “Observability”.

The second figure shows that the metrics are still oriented towards components rather than the customer. The most frequently cited metric is Error Rate, rather than MTTR or End User Response Time.

The next point is about extending monitoring end-to-end rather than looking only at internal systems. Although most of us rely on various vendors and service providers across the complete value stream, there is little evidence of end-user monitoring. Only 11% of respondents said that their automated workflows extend to 3rd parties, while 37% said that 3rd parties are the cause of increased incidents.

The next figure shows that only 39% of respondents use multi-step transactions for Synthetic Monitoring. Only multi-step transactions emulate the complete user journey; the other approaches look at parts of it from the component point of view.
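
To make the idea of a multi-step transaction concrete, here is a minimal sketch in Python. It is not taken from the report or from any monitoring product: the names `StepResult`, `run_journey`, and `journey_passed` are hypothetical, and the steps are simulated with plain functions where a real check would call the application (log in, search, check out, and so on).

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float


def run_journey(steps: List[Tuple[str, Callable[[], bool]]]) -> List[StepResult]:
    """Run (name, action) steps in order and stop at the first failing step."""
    results = []
    for name, action in steps:
        start = time.perf_counter()
        try:
            ok = bool(action())
        except Exception:
            ok = False  # an exception counts as a failed step
        results.append(StepResult(name, ok, time.perf_counter() - start))
        if not ok:
            break  # later steps depend on this one, so they are skipped
    return results


def journey_passed(results: List[StepResult], expected_steps: int) -> bool:
    """The journey passes only if every expected step ran and succeeded."""
    return len(results) == expected_steps and all(r.ok for r in results)


# Simulated user journey (assumption: real steps would issue requests).
demo_steps = [
    ("login", lambda: True),
    ("search", lambda: True),
    ("checkout", lambda: False),  # simulated failure in the last step
]
demo_results = run_journey(demo_steps)
```

In this sketch every individual component check before checkout succeeds, yet `journey_passed(demo_results, len(demo_steps))` is false, which is exactly the kind of failure a single-step check would miss.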

On the question of how far SREs have automated comprehensive infrastructure and application performance monitoring, only 44% said that they have. The rest are still at the traditional level of infrastructure monitoring and have not reached the application level.

On the question of health monitoring at the Service Level, only 43% said that they have it, while the remaining 57% do not. Again, this shows that the focus is still on deliverables and not on outcomes.

Key takeaway 2 – Heavy Ops Work Load Comes at a Cost – The more “Wisdom of Production” we bring in, the more reliable a system we will be able to deliver. This will reduce the costs of rework, scrap, and extensive support.

On the question of the percentage of SRE time spent on Development work, only 14% said that they spend 50% or more of their time on it. This means that SREs are still mostly doing Toil.

An interesting question was asked because of the current #WFH scenario.

#WFH has not really made much difference to the shift towards more Dev-related work. 83% still do the same amount of Ops-related work or more.

The classical question of who does SRE work for the organization shows the following:

While 46% say there is a dedicated SRE, the figures further down show more focus on work other than SRE work. Thus, most implementations are not really SRE.

The following figures show that the focus is still mostly on Ops and does not really cover the other aspects that SRE is supposed to handle.

The SRE and DevOps Team Relationship and the SRE Metrics relating to Business Impact are as follows:

We see that only 41% are part of the same team, while the remaining 59% are not. We also see that 82% measure Customer Satisfaction, yet the earlier responses show that the focus is more internal, on deliverables rather than outcomes.

There is a lot of scope for SRE to provide the Wisdom of Production and act as a consultant right from the Design stage. This requires a common motivation/incentive and a shared outcome.

Key takeaway 3 – Shift to Remote Creates Opportunities and Challenges – An interesting set of new challenges and opportunities has emerged from the #WFH scenario. These have more to do with the “People” aspect of SRE.

The following graphs show the percentage of work that is toil. This year's addition is toil from the point of view of work-life balance, shown in the graph on the left below.

60% of respondents said that work-life balance is the biggest source of Toil to be handled in the #WFH situation.

The typical sources of Toil, as in the 2019 report, are shown for this year:

Another interesting question is about the automation of Incidents. SREs need to focus on this area, as Incidents are one of the biggest sources of Toil. The figures are not very encouraging; there is a long way to go.

One would have expected the cloud and its features to be used to do SRE work better. However, the question and the responses below show the contrary.

Training and certification are still not adequate, and the budget for them needs to be increased.

Key takeaway 4 – Future of SRE is Remote and Bright – The responses on what percentage of SRE work will be done from home and what percentage from the office after the pandemic suggest that distributed working is here to stay.

The split between proactive and reactive work in the #WFH situation shows that work is still mostly reactive, with very little of it proactive.

The report also captures whether incidents have increased and what the biggest reason for incidents is. Surprisingly, 7% said they do not know whether incidents have increased. Higher traffic is the biggest source of incidents in the new scenario. 3rd party issues remain significant.

Another important finding is that 66% said that MTTR was the same and another 14% said it was more effective. So work can go on well from home too, with further fine-tuning of processes and use of technology, drawing on what has been learned from the situation.
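
For readers who want to compare MTTR before and after a change such as the move to #WFH, a minimal sketch follows. It assumes MTTR means mean time to restore, measured from detection to restoration; the report may define it differently. The function name `mean_time_to_restore` and the incident timestamps are made up for illustration and are not survey data.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def mean_time_to_restore(
    incidents: Iterable[Tuple[datetime, datetime]]
) -> timedelta:
    """Average of (restored - detected) over all incidents."""
    durations = [restored - detected for detected, restored in incidents]
    if not durations:
        raise ValueError("need at least one incident")
    return sum(durations, timedelta()) / len(durations)


# Illustrative, invented incidents: (detected, restored) pairs.
office_incidents = [
    (datetime(2020, 2, 3, 10, 0), datetime(2020, 2, 3, 10, 30)),
    (datetime(2020, 2, 10, 14, 0), datetime(2020, 2, 10, 15, 30)),
]
wfh_incidents = [
    (datetime(2020, 4, 6, 9, 0), datetime(2020, 4, 6, 10, 0)),
    (datetime(2020, 4, 13, 16, 0), datetime(2020, 4, 13, 17, 0)),
]

office_mttr = mean_time_to_restore(office_incidents)
wfh_mttr = mean_time_to_restore(wfh_incidents)
```

Computing the figure separately for each period makes a statement like "MTTR was the same" something a team can check against its own incident log rather than rely on impressions.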

The report looks at Chaos Engineering with the following question:

It shows that there are very few instances of DR being practiced, let alone Chaos Engineering, even though this is an important aspect of SRE.

Conclusion

My own observation from the report is that SRE is still emerging. There is a lot of confusion about the role of SRE, and so the figures show that organizations are still doing what Operations used to do.

The other area that gets highlighted is automation. Automation has increased, but no correlation is shown with the outcome of making systems more stable and reliable.

This year the report focuses a lot on the pandemic situation and the people aspect.

Observability is an important aspect of SRE. However, this report uses the term Observability to cover many different things.

The data points show the need for a shift in culture and mindset, which the report does not address. Blameless postmortems as a culture change could have been explored, as they affect SRE work.

The skills related to SRE needed to be addressed. There is just one mention of them in the report, although SRE-related skills are an area where more coverage would have helped practitioners.

To me, the report needs to be more exhaustive. It would have been more helpful to practitioners if it had projected the maturity of SRE across organizations, from those just starting out to more mature ones that are scaling.

~ Dr. Niladri Choudhuri
