Whenever you're developing and operating software, incidents are going to happen. As Paul Hammond said in the well-known talk “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr,” “it’s not a question of if, it’s a question of when.” This is because change is constant in our engineering environments.
Businesses and technology innovation require change to survive. In the late 1980s, when IT Service Management (ITSM) was created, change was slow and heavily controlled, and formal change management processes were put in place to govern it. When software shipped in boxes, a change could not easily be rolled out in a few days or even a few months. Today, whether your team is practicing DevOps, SRE, or other more modern operations practices, change happens in our code, infrastructure, and teams much more frequently. Therefore, incidents are not going anywhere. They are here to stay, but we can choose to focus on making them less painful, less time-consuming, and less of something many have grown to hate. This is where improving incident management and response comes in.
In writing about how to improve incident management and incident response, the two phrases are often used interchangeably. In this blog post, I want to explore three questions: Are they really the same thing? What’s their relationship? And why is this important for on-call teams?
IT incident management was popularized by ITSM. Incident management is the process of managing unplanned disruptions to a service that affect customers and restoring the service to its operational state. Incident management often begins with someone identifying an incident and logging it. It also includes the classification of an incident, alerting, communication, incident response, and post-incident steps. Since the concept of incident management was popularized, the velocity of software development has accelerated. In web operations and cloud services, most teams aren’t using the same change management processes used with ITSM. Incident management is now in a new era.
Typically, incident response is seen as a subset of incident management. Incident response’s goal is to contain and mitigate an incident once it is discovered. Often this includes triaging, coordinating within engineering teams, mitigating, and resolving an incident. In 1988, the first Internet worm, the Morris worm, was released. The Morris worm led to the creation of the Computer Emergency Response Team Coordination Center (CERT/CC) by DARPA. This was one of the first formalized centers for incident response on the internet. But conceptually, incident response very much predates the internet. It has been practiced and studied for decades in safety-critical systems such as infrastructure, medicine, nuclear engineering, and transportation.
I often see incident management and response used interchangeably by marketers. While it’s fairly harmless, I think it is essential to distinguish between the two for a few reasons. Incident management and response should work together, but it is useful to recognize that they can involve different processes and tools.
First, on-call doesn’t have to be so painful, but sadly today, stress, overload, and feeling under pressure are far too common. Often this is actually a symptom of more significant socio-technical problems within our teams. There are two parts to this pain: how we build our systems and how we respond to incidents in our systems. I want to focus on the latter. To respond to incidents in less painful ways, we need to improve not only how we manage an incident but also how we respond to it, including investigating, remediating, and learning from incidents. When we group incident response in with incident management processes, we are often limiting how we could improve incident response. While process and tooling can’t fix all of these problems, they can make specific tasks less time-consuming. As we often say, we aren’t going to automate away all our incident response, and change isn’t going anywhere. So we must work to figure out how to make it less painful for the human operator.
Second, when I’m on-call responding to an incident, I rarely feel like incident management tools help me do my job of incident response. They might help other stakeholders or leadership gain understanding and metrics around an incident. Still, they don’t help me take the actions needed to respond, like searching through metrics and logs. They don’t give me the actionability I need to drive resolution. There’s a disconnect between the incident management tools and processes and the actual infrastructure, code, and services where an incident occurs. I often see incident management as a “business” term, while incident response is an “engineering” term that an on-call engineer is a part of. As an on-call engineer in an incident, I need to triage, mitigate, and resolve the incident. This is what incident response looks like.
Third, it might seem silly, but words matter. There is power and control associated with “management,” but often in incident response, it is hard to feel like we have control. It’s more like chasing down a problem, racing against time while trying to gain control. This is why incident response can often feel different than higher-level incident management.
You want actionability in your incident response tools and processes so that they drive resolution and help you gain control of incidents. There are some great blog posts out there that walk through the elements of incident response, but what should be in your incident response toolbox to drive actionability? This might look a few different ways: source-controlled and versioned scripts, high-quality runbooks, and human-in-the-loop automation.
A familiar experience to many is pulling a script to run out of a wiki. For example, you are in the middle of an incident, and there is a wiki with documentation from a fellow engineer on a useful script that could mitigate the incident. You click the link to the script, and a 404 appears. It’s nowhere to be found. Or you have the script in a code block, but you have to dig around to find out when it was last updated. Or how many times is the script only stored on a random engineer’s computer?
Shared scripts are a great example of actionability in incident response because they can be particularly useful for accomplishing a single operational task. Still, they need to be easily findable, whether that’s in a centralized repository or a runbook. To keep improving incident response, when a script needs to change based on insights from an incident or a routine maintenance task, we need a way to propose changes, create a new version, and distribute it to the rest of the team.
A runbook is almost useless if it isn’t up-to-date and accurate. We’ve talked about this in previous blog posts, but we often refer to the five A’s of runbooks. Besides being actionable and accurate, they need to be accessible, authoritative, and my favorite: adaptable. For a runbook to stay accurate, which allows it to be actionable during an incident, engineers need a way to quickly adapt it to changes in the system and to the different contexts that exist in the system. For example, hard-coding specific host or cluster names that should be variables in a runbook is a bad practice, since the runbook then can’t easily handle different cases of a similar problem.
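To make the point about hard-coded names concrete, here is a minimal sketch of a runbook step written with variables instead of a fixed host. The command text, the host name, and the `render_step` helper are all hypothetical and only for illustration; they are not taken from any particular runbook tool.

```python
from string import Template

# Hypothetical runbook step. The host and service are placeholders
# rather than hard-coded values, so the same step works for any node.
RESTART_STEP = Template("ssh $host 'sudo systemctl restart $service'")


def render_step(step: Template, **params: str) -> str:
    """Fill in the placeholders of a runbook step.

    Template.substitute raises KeyError when a placeholder has no value,
    so a half-filled command is never handed to the operator.
    """
    return step.substitute(**params)


# The same step, rendered for one (made-up) host and service.
print(render_step(RESTART_STEP, host="web-01.example.com", service="nginx"))
```

Failing loudly on a missing variable is a deliberate choice in this sketch: during an incident, a command with an unfilled placeholder is worse than no command at all.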
At Transposit, we believe that runbooks should be interactive, not written in stone. Teams need an easy way to adapt them from past incidents, use them in different contexts without extra effort, and quickly update them. Also, actionability in runbooks means more than just plain text, and not all interactivity is created equal. Right now, most interactivity in runbooks is done through surface-level integrations that are either not bi-directional or only cover a small subset of what a service can do. Sure, webhooks are a great starting point, but often they only support outbound data from another service, not direct action in other systems.
That’s where Transposit’s direct action commands come in. They make interactive runbooks truly actionable, for example with one-click remediation actions in the systems where your infrastructure, monitoring, logs, and communication live. It is important that integrations work together with the tools, services, and cloud infrastructure you are already using. This opens up the possibilities for the next tool in our toolbox: human-in-the-loop automation.
If you have ever read the Transposit blog before, you probably know we are fans of human-in-the-loop automation. Human-in-the-loop automation is when humans step in at critical decision points while progressively automating their systems. It aims to find the right balance between end-to-end automation and human involvement. If you rely only on end-to-end automation, you might find your attempts at automation to be unsuccessful or even harmful in an incident.
In incident response, human-in-the-loop automation augments part of the work of incident response while keeping the human operator in control. For example, we might include an integration in a runbook that fetches information from another system based on an alert, but leave the human operator in control of the next steps based on their observations. I wrote a more extensive example of a real-life incident in this blog post using human-in-the-loop automation.
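As a rough sketch of that pattern, the code below automates the information gathering and then stops at a decision point until a human approves the next step. Everything here is an assumption made for illustration: the `Alert` type, the `gather_context` and `restart_service` stand-ins, and the returned values are hypothetical and do not represent Transposit's API or any real monitoring system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Alert:
    service: str
    summary: str


def gather_context(alert: Alert) -> dict:
    """Automated step. Hypothetical stand-in for querying metrics and logs."""
    return {"service": alert.service, "error_rate": 0.42}


def restart_service(service: str) -> str:
    """Hypothetical remediation. A real one would call your infrastructure."""
    return f"restarted {service}"


def handle_alert(alert: Alert, approve: Callable[[str], bool]) -> str:
    """Gather context automatically, then let a human decide what happens."""
    context = gather_context(alert)
    prompt = f"{alert.summary}\ncontext: {context}\nRestart {alert.service}? [y/N] "
    if approve(prompt):  # the human decision point
        return restart_service(alert.service)
    return "no action taken; operator declined"


def ask_operator(prompt: str) -> bool:
    """Interactive approval for use in a terminal."""
    return input(prompt).strip().lower() == "y"


# Demo with a stubbed operator who declines, so the sketch runs unattended.
alert = Alert(service="checkout-api", summary="Error rate above threshold")
print(handle_alert(alert, approve=lambda prompt: False))
```

In a terminal you would pass `ask_operator` as `approve`. Keeping the approval step as a plain function argument means the same flow could later be wired to a chat button or a runbook action without changing the automated parts.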
Having an alerting tool is not enough. If you are focusing on incident management but not on the actionability of your incident response, you are missing tooling and processes that promote actionability and level up your on-call engineers. Alongside a culture that supports on-call engineers, actionability can alleviate some of the pains engineers face while on-call. This creates happier engineers as everyone walks the path to a more sustainable on-call practice.