
Definition of Security Chaos Engineering – “The identification of security control failures through proactive experimentation to build confidence in the system’s ability to defend against malicious conditions in production”.

With Industry 4.0, the growing use of digitalization, AI/ML, IoT, and data science gives attackers more opportunities. Data is the most valuable asset and also the most vulnerable. The first paragraph of the report opens with the statement “Information Security is broken”. As the report notes, more and more people are entrusting us with their lives, and we are failing them. Under DevSecOps, security is everyone’s responsibility, yet year after year we see the same types of attacks. In the OWASP Top 10 alone, the top two entries have been there since 2013. As per the report, “Security Industry keeps chasing after the shiny new Tech and maybe incremental improvement in the process”. It goes on: “Fundamental shift in both philosophy and practice is necessary. Information Security must embrace the reality that failure will happen. People will click on wrong thing. Security implications of simple code changes won’t be clear. Mitigations will accidentally be disabled. Things will break”.

“Hope isn’t a strategy. Likewise, perfection isn’t a plan”.

1st Chapter

What is Security Chaos Engineering all about?

“Security chaos engineering is about increasing confidence that our security mechanisms are effective at performing under the conditions for which we designed them. Through continuous security experimentation, we become better prepared as an organization and reduce the likelihood of being caught off-guard by unforeseen disruptions”.

Two core guiding principles mentioned in the report are:

  • Expect Security Controls to fail and prepare accordingly
  • Do not attempt to completely avoid incidents but instead embrace the ability to quickly and effectively respond to them

Benefits of Security Chaos Engineering

  • Reduced remediation cost
  • Reduced disruption to end users
  • Reduced stress levels during incidents
  • Improved confidence in production systems
  • Improved understanding of systemic risk and feedback loops

The key takeaways mentioned are:

  • Resilience is the foundation of chaos engineering
  • Only focusing on robustness leads to a false sense of security
  • Security Chaos Engineering accepts that failure is inevitable and uses it as a learning opportunity

The 2nd chapter of the book makes an interesting point: understanding how attackers make choices during their operations is essential for informing security strategy at all stages of the SDLC. It supports better decision making, avoids excess engineering effort spent on stopping a niche threat, and keeps the focus on the low-hanging fruit in the system that attackers will compromise first. Attackers also have to think about ROI, and this attacker ROI is commonly known as “attacker math”. It helps in determining the right set of security controls. In Security Chaos Engineering, attacker math provides a blueprint for the type of game-day scenarios to be conducted.

The chapter also covers decision trees for threat modeling. A threat model helps in identifying the characteristics of a system, product, or feature that could be abused by an adversary, and it sets up systems for security success when implemented during the design phase. A threat model covers all issues relevant to the system under discussion, including ones that already have mitigations, because the context may change over time. Tactically, a threat model describes the system architecture in detail, the current processes used to secure the architecture, and any security gaps that exist. It should include an asset inventory and an inventory of sensitive data, as well as information about connections to upstream and downstream services. A decision tree is a form of threat model that incorporates attacker ROI, and the report gives an example of a practical one. A decision tree helps to outthink the adversary. The following are the key questions to ask, as mentioned in the report:

  • Which of our organizational assets will attackers want? Enticing assets to attackers could include user data, compute power, intellectual property, money, etc.
  • How does the attacker decide upon their delivery method, and how do they formulate their campaign?
  • What countermeasures does an attacker anticipate encountering in our environment?
  • How would an attacker bypass each of our security controls they encounter on their path?
  • How would an attacker respond to our security controls? Would they change course and take a different action?
  • What resources would be required for an attacker to conduct a particular action in their attack campaign?
  • What is the likelihood that an attacker will conduct a particular action in their attack campaign? How would that change based on public knowledge versus internal knowledge?
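To make the idea of attacker math on a decision tree more concrete, here is a minimal sketch in Python. It is not taken from the report: the `AttackStep` class, the `cheapest_path` function, the step names, and the cost numbers are all hypothetical and only illustrate how a tree with relative costs can show which path an attacker is likely to prefer, and how a control that raises the cost of one step changes that path.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AttackStep:
    """One attacker action; cost is a made-up, relative effort score."""
    name: str
    cost: float
    children: list[AttackStep] = field(default_factory=list)


def cheapest_path(step: AttackStep) -> tuple[float, list[str]]:
    """Return (total cost, step names) of the lowest-cost route to any leaf."""
    if not step.children:
        return step.cost, [step.name]
    best_cost, best_path = min(cheapest_path(child) for child in step.children)
    return step.cost + best_cost, [step.name] + best_path


def build_tree(bucket_cost: float) -> AttackStep:
    """Hypothetical tree: every leaf is assumed to end with access to user data."""
    return AttackStep("start: internet access", 0.0, [
        AttackStep("phish an employee", 2.0, [
            AttackStep("reuse stolen session", 1.0),
        ]),
        AttackStep("read misconfigured storage bucket", bucket_cost),
        AttackStep("develop exploit for API gateway", 9.0, [
            AttackStep("pivot to database", 3.0),
        ]),
    ])


# Before any mitigation the misconfigured bucket is the cheapest route.
before = cheapest_path(build_tree(bucket_cost=1.0))
# Assume a new control raises the bucket's cost; the attacker changes course.
after = cheapest_path(build_tree(bucket_cost=6.0))

print(before)
print(after)
```

In this toy tree the attacker first prefers the misconfigured bucket, and once that step becomes more expensive the phishing route becomes the cheapest one. That mirrors the questions above about how an attacker would respond to a control and whether they would change course.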

The report also suggests a simple prioritization matrix as a starting point for threat modeling.

The report also talks about revisiting the decision tree during incident retrospectives. Decision trees are also useful during testing and experimentation.

Key takeaways in this chapter are:

  • Attackers will choose the easiest, low-cost path when possible
  • Decision trees allow us to conceptualize and visualize attack paths
  • Raising the cost of attack is the key to security success
  • Leverage feedback loops for continuous refinement

The 3rd chapter makes an important point: security programs should enable the business rather than slow it down. It contrasts Security Chaos Engineering (SCE) with Security Theater, the traditional way of doing security activities.

SCE | Security Theater
Failure is a learning opportunity | Increases stringent policies and technology to reduce bandwidth for human failure
Accept that failure is inevitable, and blame is counterproductive | Blames humans as the source of the problem
Uses experimentation and transparent feedback loops to minimize failure impact | Uses policies and controls to prevent failure from happening
Security team operates collaboratively and openly | Security team operates in silo
Incentivizes collaboration, experimentation, and system-level improvements | Incentivizes saying “no” and narrow optimization
Creates a culture of continuous learning and experimentation | Creates a culture of fear and mistrust
Is principle-based and defaults to adaptation | Is rule-based and defaults to the status quo
Fast, transparent, security testing | Manual security reviews and assessments

The key takeaways are:

  • Security theater optimizes for gatekeeping, not continuous improvement
  • Stability and speed aren’t at odds, and heavy security approvals don’t promote stability
  • SCE approval patterns favor localized decision-making and light-weight advisory

The 4th chapter talks about democratizing security. One example is having a former hacker on the team who first attacks the system and then switches hats to harden it against the attacks that proved possible. “Democratization of Security” means that security efforts are explicitly neither isolated nor exclusive. Security must serve all stakeholders and involve the participation of all stakeholders. The chapter also explains “Alternative Analysis”: fundamentally, red teams should be used to challenge the assumptions held by organizational defenders (blue teams), and SCE can make use of automated red teaming. It suggests a Security Champions program as part of SCE. Such a program promotes the continuous exchange of information about each product, its direction, and systems interconnectivity. It also spreads knowledge of security practices to product teams, and the security champion ensures that these practices are understood and followed.

The key takeaways from this chapter are:

  • Considering alternative perspectives supports stronger security outcomes
  • A rotational red team program can democratize attacker math
  • A Security Champions program empowers engineering teams

Chapter 5 talks about “Building Security in SCE”. “Security failure starts within the design and development of software and services. Failure is a result of interrelated components behaving in unexpected ways, which can – and almost always do! – start much further back in system design, development processes, and other policies that inform how our systems ultimately look and operate”. The chapter explains in detail how failure happens in containers and image repositories and in the build pipeline, and it lists the chaos tests relevant to each question it answers.

The key takeaways are:

  • Pre-production security isn’t just about vulnerabilities
  • Conduct experiments early in microservices environments and in build pipelines

Chapter 6 talks about production security in SCE. “Production Security” starts at the software deployment phase and includes the ongoing operation of software and services. The discussion covers some of the core infrastructure characteristics that create security by design, examples of failure in production systems, and how failure can inform both better infrastructure design and a strong detection and response program. Chaos engineering has to be done in the production environment to understand the actual vulnerabilities. This chapter, too, takes a deep dive into which chaos tests to perform.

The key takeaways are:

  • Production systems are the new engine of business
  • Uncovering unknown relationships

Chapter 7 talks about the journey into SCE. It covers the following:

  • Validate known assumptions
  • Crafting Security Chaos Experiments
  • Experiment Design Process
  • Document Steady State
  • Design Hypothesis
    • What do we expect to happen in the experiment?
    • What are the specific criteria and characteristics that support the hypothesis?
    • What is the scope of this hypothesis?
    • What are the variables?
    • What are our assumptions?
  • Contain the blast radius
  • Fallback Plans
  • Notify the organization
  • Plan your game days and execute the experiment
  • Measure the impact of each failure
  • Validate the feedback expected from your security and Visibility Tools
  • Automate the experiment for Continual use
  • Game Days
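The experiment design steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the report or from any SCE tool: the `SecurityChaosExperiment` class, the `run_experiment` function, and the in-memory `env` dictionary that stands in for a real environment and its monitoring are all hypothetical. The scenario is a misconfigured port that the organization expects its tooling to detect.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class SecurityChaosExperiment:
    name: str
    hypothesis: str                          # what we expect to happen
    check_steady_state: Callable[[], bool]   # documented steady state
    inject_failure: Callable[[], None]       # the failure, kept small in scope
    verify_hypothesis: Callable[[], bool]    # did the tools give the expected feedback?
    rollback: Callable[[], None]             # fallback plan


def run_experiment(exp: SecurityChaosExperiment) -> str:
    """Run one experiment; always roll back, even if injection fails."""
    if not exp.check_steady_state():
        return "aborted: system not in steady state"
    try:
        exp.inject_failure()
        held = exp.verify_hypothesis()
    finally:
        exp.rollback()
    return "hypothesis held" if held else "hypothesis refuted"


# Simulated environment (assumption): one allowed port and an alert log.
ALLOWED_PORTS = {443}
env = {"open_ports": {443}, "alerts": []}


def detect_drift() -> None:
    """Stand-in for a monitoring tool: alert on ports outside the allow-list."""
    for port in sorted(env["open_ports"] - ALLOWED_PORTS):
        env["alerts"].append(f"unexpected open port {port}")


def open_unauthorized_port() -> None:
    env["open_ports"].add(22)
    detect_drift()


def close_unauthorized_port() -> None:
    env["open_ports"].discard(22)


experiment = SecurityChaosExperiment(
    name="misconfigured port",
    hypothesis="An unauthorized open port raises an alert",
    check_steady_state=lambda: env["open_ports"] == ALLOWED_PORTS,
    inject_failure=open_unauthorized_port,
    verify_hypothesis=lambda: "unexpected open port 22" in env["alerts"],
    rollback=close_unauthorized_port,
)

result = run_experiment(experiment)
print(result)
```

Keeping the rollback in a `finally` block is one simple way to honor the fallback-plan step. In a real setting the injection and verification callables would talk to actual infrastructure and alerting, and the blast radius would be limited to a small, agreed scope.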

The chapter also gives some use cases and shows how SCE helps in gaining new insights. It talks about ChaoSlingr, the first such tool, which was developed by Aaron Rinehart.

It also talks about tools like Cloudstrike. The key takeaways are:

  • Focus on validating known assumptions
  • Get started with SCE game days
  • A valuable tool for validating and bolstering security

The last chapter presents a few case studies.

My Take:

The authors have done an excellent job with this report. It gives a very clear understanding of Security Chaos Engineering, and the level of detail is enough to help you start baking security into the whole lifecycle. The chaos tests are quite helpful to implement. The report also gives leaders a perspective on security. Today, security is the biggest concern on the minds of CEOs and CIOs, yet there is not enough spending on security as a regular activity, and this report stresses that point. It also calls on security SMEs to change their way of working and act as enablers rather than gatekeepers.

I recommend this report to every developer, tester, security SME, and leader.

Thanks again to Aaron Rinehart and Kelly Shortridge for this wonderful book from O’Reilly – Security Chaos Engineering – Gaining Confidence in Resilience and Safety at Speed and Scale.

~ Dr. Niladri Choudhuri
