20 Years Since Y2K: The Future of IT Operations with Chaos Engineering and Resilience

Using Chaos Engineering and resilience alongside DevOps to manage change in our complex systems

Taylor Barnett · Jan 22, 2020


This is the third and final post in the series on IT Operations and DevOps since Y2K. Check out the first and second parts of the series.

I’ve spent a lot of time in API design communities, and one thing I’ve noticed is that people often fear changes to API design. The reality is, you likely aren’t going to get it right the first time, and you should plan for future changes instead of fearing them. In a talk at the API Specifications Conference last year, Claire Knight said that “change is important because your API, if successful, needs to grow and evolve.” I believe the same is true for all systems.

In one of my conversations about Y2K, Matt Stratton, DevOps Advocate at PagerDuty, and I talked about this idea of change. We talked about how people stress out about figuring “every single thing that could happen before [they] make a change.” The reality is that “you’re still going to miss something. So you may as well own the fact that you’re going to miss something and find out what you missed as soon as possible.”

For this last post of the Y2K series, I want to talk about change through the lens of Chaos Engineering and resilience. Part of DevOps is about dealing with change, and I believe we can learn from other disciplines about change and its effects on our complex systems.

What is “Change?”

First, I want to define what I mean by “change.” Change in a system is when the system becomes different in some way (e.g., a change in people, teams, specification, code, or architecture). It can be both expected and unexpected. It can cross the spectrum of known knowns, known unknowns, unknown knowns, and unknown unknowns, which is sometimes used in reference to the Johari window.

As Matt McLarty, Global Leader of API Strategy at MuleSoft, says, since Y2K “there has been a big mind shift away from, ‘we have to be afraid of change, we can’t make any changes,’ to, ‘we know things are going to fail all the time, so how do we deal with that failure?’”

Chaos Engineering

What should your system do when something fails? And what is supposed to happen when that failure goes away? As described in the Principles of Chaos Engineering, Chaos Engineering is a “discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” With systems more complex today than ever before, it is harder to know if and how different interactions within a system will cause it to fail in some way. (We discussed the growing complexity of our systems in the first part of this series.)

Chaos Engineering allows us to introduce change into a system through controlled experiments and, by observing the results, uncover systemic weaknesses. By fixing these weaknesses, we can potentially reduce the number of incidents, the on-call burden, and the severity of the incidents that do occur. These experiments allow us to see how the effects of failure, or impacts, are mitigated by our systems.
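To make the idea of a controlled experiment more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the `FlakyDependency` and `Service` classes, the `success_rate` steady-state metric, and the 99% threshold are assumptions, not part of any real Chaos Engineering tool. It simply measures a baseline, injects a failure, checks whether the hypothesis ("requests still succeed") holds, and then checks what happens when the failure goes away.

```python
from dataclasses import dataclass


class FlakyDependency:
    """Hypothetical downstream dependency whose failure can be switched on and off."""

    def __init__(self):
        self.failing = False

    def fetch(self, key):
        if self.failing:
            raise ConnectionError("injected failure")
        return f"fresh:{key}"


class Service:
    """Hypothetical service that falls back to a cached value when the dependency fails."""

    def __init__(self, dependency):
        self.dependency = dependency
        self.cache = {}  # last good value per key

    def handle(self, key):
        try:
            value = self.dependency.fetch(key)
            self.cache[key] = value
            return value
        except ConnectionError:
            # Mitigation under test: serve the cached value instead of failing.
            if key in self.cache:
                return self.cache[key]
            raise


def success_rate(service, keys):
    """Steady-state metric (an assumption): fraction of requests that succeed."""
    ok = 0
    for key in keys:
        try:
            service.handle(key)
            ok += 1
        except ConnectionError:
            pass
    return ok / len(keys)


@dataclass
class ExperimentResult:
    baseline: float
    during_fault: float
    after_recovery: float
    hypothesis_held: bool


def run_experiment(service, dependency, keys, threshold=0.99):
    """Hypothesis: the success rate stays at or above `threshold` during and after the fault."""
    baseline = success_rate(service, keys)
    dependency.failing = True  # inject the failure
    try:
        during = success_rate(service, keys)
    finally:
        dependency.failing = False  # always roll the fault back
    after = success_rate(service, keys)
    held = during >= threshold and after >= threshold
    return ExperimentResult(baseline, during, after, held)


dependency = FlakyDependency()
service = Service(dependency)
result = run_experiment(service, dependency, ["a", "b", "c"])
print(result)
```

In this sketch the hypothesis holds only because every key was requested, and therefore cached, during the baseline step. A request for a key the service has never seen would still fail while the fault is active, which is the kind of weakness an experiment like this is meant to surface.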

Matt McLarty mentioned that during Y2K, most organizations “weren’t doing any engineering around how we would isolate [failures] to minimize the impact as things did go wrong, which was more an SRE (Site Reliability Engineering) approach.” Now, 20 years after Y2K, teams practicing both traditional IT operations and DevOps have started moving to new approaches.

As John Allspaw has said in his talk, Amplifying Sources of Resilience: What Research Says, Chaos Engineering is an “implicit acknowledgment that we cannot understand what our system’s behaviors are by simply pulling the parts out, looking at the parts and the components, and putting them back together.”

He also said that “resilience is about funding the teams that develop and perform chaos experiments.” Chaos Engineering itself is not resilience. So, what is resilience?

Resilience

Resilience is not an idea that is unique to the field of Software Engineering. The word was first used to describe a property of timber. Then, in his 1973 paper, Resilience and Stability of Ecological Systems, C.S. Holling described it as the “measure of the persistence of systems and of their ability to absorb change and disturbance and still maintain the same relationships between populations or state variable.” Since then, it has been picked up by many different communities, mostly academic, including medicine, power generation, construction, space, and others. Software engineering is only a recent addition to this list.

Resilience Engineering is a field that emerged from Cognitive Systems Engineering. Resilience is the adaptive capacity for the cognitive work that happens during failures in complex socio-technical systems. In the context of software, resilience enables teams to restore service and minimize impact to users.

Adaptive capacity is sustaining or maintaining the potential to adapt, especially when you cannot justify it economically. As David Woods has said, having adaptive capacity means a system “has some readiness or potential to change how it currently works.” This is what sets it apart from a system being robust, redundant, or fault-tolerant. As John Allspaw says in several of his talks, it’s related to incidents or changes that are unforeseen, unanticipated, unexpected, and in some cases, fundamentally surprising.

It is often hard to know where to look for resilience. The STELLA: Report from the SNAFUcatchers Workshop on Coping With Complexity introduces the above-the-line/below-the-line framework. Often a “system” in software is thought of as all of the code, infrastructure (including hardware), tests, and tooling. In this framework, all of this is below “the line of representation.” If you are talking about software and its behavior below-the-line, you are not talking about resilience. Above-the-line is where resilience and cognitive work happen. It is what makes the business work and is all about people and culture. As the diagram says, above-the-line people are “getting stuff ready to be part of the running of the system,” “adding stuff to the running system,” “architectural & structural framing,” “keeping track of what ‘the system’ is doing,” and everything that goes into doing this work. It’s a fascinating framework of “The System” that is important to think about when talking about resilience.

I highly recommend checking out the above-the-line/below-the-line framework diagram in the report here.

While this work will take time, it does appear to be a productive approach to coping with the increasing complexity of our systems. The field of Resilience Engineering is quite new, and the conversations about it within software communities have only popped up towards the end of the last decade.

I highly recommend continuing to explore resilience through:

I’m also hoping to explore resilience in Software Engineering more in future posts. Stay tuned!

But what about DevOps?

If we want the next decade to focus on managing the complexity of our systems, this will require a cultural shift, like moving to DevOps. It isn’t just about the code itself and the tools we use within our systems. Even with the best tools, DevOps is just a buzzword if you don’t have the right culture. In this way it is very similar to resilience: both are found above the line of representation. (But to be clear, DevOps is not resilience.)

In Richard I. Cook, MD’s paper, How Complex Systems Fail, he says that “catastrophe is always just around the corner.” It is impossible to eliminate this, which is part of what makes it a complex system. People are continually responding to incidents that are near catastrophes. And incidents can always be worse. Cook says, “that successful outcomes are also the result of gambles; is not widely appreciated.” It’s something Matt Stratton also feels: “as a lifetime sysadmin, that’s the shitty part of our job. Nobody knows when you do your job well, they only know when you screw up.”

I believe that moving towards a more modern IT operations philosophy like DevOps will allow us to collaboratively manage the complexity of our systems, practice Chaos Engineering, and think more broadly about what resilience means in Software Engineering. Development teams throwing complex systems over the wall to operations teams does not end well for anyone. That’s why shorter feedback loops and higher-quality software are needed to handle the complexity.

It’s humans that have adaptive capacity, not the systems they run. It’s humans making a hypothesis and conducting chaos experiments, not the technology itself. It’s humans that adopt the DevOps philosophy, not the software. It’s all about humans.
