
We are used to developers building software without concerning themselves with monitoring, which was considered Operations work. With DevOps, this view has changed: developers also need to consider operational aspects, starting from design, continuing down the CI/CD pipeline, and extending to running the service after deployment.

The Second Way of DevOps is about “shifting left”, which is also about shorter feedback loops. This means we want to gather as much information as we can and apply the “Wisdom of Production” as early in the lifecycle as possible. To achieve this, we need to plan for monitoring the environment and for getting that data as early as possible in the lifecycle.

The more data we collect, the greater the chance of noise. We need to collect as much data as we can and then use the parts most relevant to our decisions. Monitoring today is not just about collecting data, metrics and event traces; it should also help us understand the health of the system. That is what makes systems observable.

Let us now look at what Monitoring is and what Observability is.

As per Peter Waterhouse of CA, “Monitoring is a verb; something we perform against our applications and systems to determine their state. From basic fitness tests and whether they are up or down, to more proactive performance health checks. We monitor applications to detect problems and anomalies. As troubleshooters, we use it to find the root cause of problems and gain insights into capacity requirements and performance trends over time.”

As per a blog by Riverbed, “Monitoring aims to provide a broad view of anything that can be generically measured based on what you can specify in external configuration, whether it is time-series metrics or call stack execution.”

However, in the DevOps world, we want delivery that is faster, more stable and safer, and that leaves people happier. Today we build and deploy a large number of changes. A large number of apps are running, and we need to keep the mean time to recovery (MTTR) as short as possible. Every minute of downtime means lost business transactions and lost money. Monitoring alone does not help us achieve this; we need something more. What we need is Observability.

Observability is the ability to infer internal states of a system based on the system’s external outputs.

Charity Majors explained observability in a tweet as follows:

Observability, short and sweet:

– can you understand whatever internal state the system has gotten itself into?

…just by inspecting and interrogating its output?

…even if (especially if) you have never seen it happen before?

— Charity Majors (@mipsytipsy) November 26, 2019

This means that Observability is a measure of how well we can understand the internal states of a system from knowledge of its external outputs. What matters most is that it lets us understand things that have never happened before: the unknown unknowns.

As Peter Waterhouse states, “Observability is about how well internal states of a system can be inferred from knowledge of external outputs. So, in contrast to monitoring which is something we do, observability (as a noun), is more a property of a system.”

With cloud-native platforms, containers, and microservice architectures, the old ways of monitoring do not scale. We need tools that help us better understand an application’s inner workings and performance across distributed systems and the CI/CD pipeline.

As mentioned by Charity Majors, “Observability requires methodical, iterative exploration of the evidence. You can’t just use your gut and a dashboard and leap to a conclusion. The system is too complicated, too messy, too unpredictable. Your gut remembers what caused yesterday’s outages, it cannot predict the cause of tomorrow’s.”

Tooling is an important aspect of Observability. Many tools help us externalize the key events of an application through logs, metrics, and events; tracing is one example. We can use Kubernetes to enable metrics capture and analysis during a containerized application deployment to gain observability.
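To make "externalizing key events" concrete, here is a minimal sketch in Python using only the standard library. It is not how any particular tool works: the `emit` sink, the `span` helper, and the `handle_payment` function are all hypothetical names invented for illustration. The idea is simply that each unit of work emits one structured event carrying a shared trace ID, a duration, and a status.

```python
import json
import time
import uuid
from contextlib import contextmanager

# In-memory sink standing in for a real log or trace backend (hypothetical).
EVENTS = []

def emit(event: dict) -> None:
    """Serialize one structured event; a real system would ship it to a collector."""
    EVENTS.append(json.dumps(event, sort_keys=True))

@contextmanager
def span(name: str, trace_id: str, **fields):
    """Time a unit of work and emit one event describing it, even on failure."""
    start = time.perf_counter()
    event = {"name": name, "trace_id": trace_id, "status": "ok", **fields}
    try:
        yield event  # callers may attach more context while the work runs
    except Exception as exc:
        event["status"] = "error"
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        emit(event)

def handle_payment(user_id: str, amount: float) -> str:
    """Hypothetical request handler instrumented with two nested spans."""
    trace_id = uuid.uuid4().hex  # shared by every span in this request
    with span("handle_payment", trace_id, user_id=user_id) as root:
        with span("charge_card", trace_id, amount=amount):
            if amount <= 0:
                raise ValueError("amount must be positive")
        root["charged"] = True
    return trace_id

trace = handle_payment("user-42", 19.99)
for line in EVENTS:
    print(line)
```

Because every event carries the same `trace_id`, the events for one request can be pulled together later and questioned in ways nobody planned for in advance, which is the point of the definition of observability given earlier.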

Many open-source tools are available for implementing observability, such as collectd, StatsD, Fluentd, Zipkin, Jaeger, OpenTracing, OpenTelemetry, and Semantic Logging. Observability means keeping watch on all application components, from mobile and web front-ends to infrastructure. This information used to come from separate data sources. Today’s IT systems are more complicated, so applications and code need to be architected to expose this information themselves.

It is also important to look at the human aspect. People need to use this information when designing, developing and testing their applications, and they need to use it in the right context. Modern monitoring methods need to be built into the deployment pipeline with minimum complexity. An example from Peter Waterhouse’s blog: we could increase observability by combining server-side application performance visibility with client-side response time analysis during load testing, a neat way of pinpointing the root cause of problems as we test at scale.
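The load-testing idea can be sketched roughly as follows. This is only an illustration under assumptions: the `Timing` record, the `slow_requests` function, the threshold, and the sample numbers are all made up, and a real setup would read these values from a load generator and from the application’s own telemetry. The sketch pairs client-side and server-side timings for each request and gives a rough hint of where a slow request spent its time.

```python
from dataclasses import dataclass

@dataclass
class Timing:
    request_id: str
    client_ms: float  # response time measured by the load generator
    server_ms: float  # processing time reported by the application

def slow_requests(timings, threshold_ms):
    """Return (request_id, where) for requests slower than the threshold.

    'where' is "server" when server processing accounts for at least as much
    of the client-observed time as everything else, else "network/client".
    """
    findings = []
    for t in timings:
        if t.client_ms <= threshold_ms:
            continue
        overhead = t.client_ms - t.server_ms  # time spent outside the server
        where = "server" if t.server_ms >= overhead else "network/client"
        findings.append((t.request_id, where))
    return findings

# Made-up sample data for illustration only.
sample = [
    Timing("r1", client_ms=120.0, server_ms=100.0),
    Timing("r2", client_ms=900.0, server_ms=850.0),
    Timing("r3", client_ms=950.0, server_ms=60.0),
]
print(slow_requests(sample, threshold_ms=500.0))
# → [('r2', 'server'), ('r3', 'network/client')]
```

Neither measurement alone would separate `r2` from `r3`: both look equally slow to the user, and only the server-side view shows that one is an application problem and the other is not.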

We need to understand the objective, have the team implement it with the relevant tools, and use the resulting information to deliver the service in the best possible way.

Observability is important to the organization because:

  • Services are growing rapidly.
  • Architectures used today are more dynamic.
  • Dependencies between services are complex.
  • Better customer experience is a key consideration.

We need to use service level objectives (SLOs), service level indicators (SLIs), and Observability together to deliver the best service. An example from the SRE Foundation course by DevOps Institute is as follows:

  • SLOs are written from the user’s perspective and help identify what is important.
  • E.g. 90% of users should complete the full payment transaction in less than one elapsed minute.
  • SLIs give detail on how we are currently performing.
  • E.g. 98% of users in a month complete a payment transaction in less than one minute.
  • Observability gives us the normal state of the service.
  • E.g. 38 seconds is the “normal” time it takes users to complete a payment transaction when all monitors are healthy.
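The arithmetic behind the example above can be sketched in a few lines of Python. The 90% target and the one-minute threshold come from the example; the function names and the list of durations are made up for illustration, and using the median as the “normal” time is an assumption rather than something the course prescribes.

```python
from statistics import median

SLO_TARGET = 0.90       # 90% of users, from the example above
SLO_THRESHOLD_S = 60.0  # "less than one elapsed minute"

def sli(durations_s, threshold_s=SLO_THRESHOLD_S):
    """Fraction of transactions that finished under the threshold."""
    if not durations_s:
        raise ValueError("no transactions to measure")
    good = sum(1 for d in durations_s if d < threshold_s)
    return good / len(durations_s)

def report(durations_s):
    """Compare the measured SLI with the SLO and summarize the typical duration."""
    measured = sli(durations_s)
    return {
        "sli": measured,
        "slo_met": measured >= SLO_TARGET,
        "typical_s": median(durations_s),  # assumption: median stands in for "normal"
    }

# Made-up payment durations in seconds, for illustration only.
durations = [30, 35, 36, 38, 38, 38, 40, 42, 45, 75]
print(report(durations))
```

With this sample, nine of the ten transactions finish in under a minute, so the SLI is 0.9 and the SLO is just met, while the median of 38 seconds describes what a typical transaction looks like.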

To know more about the subject of Observability and other SRE topics, connect with us and get your SRE – Foundation Certification.
