The Lifecycle of an Incident: How We've Neglected the Before and After

A narrow view of the incident management lifecycle leaves organizations vulnerable.

Jessica Abelson · Sep 24th, 2020

Light loop under the stars, photo by freddie marriage from Unsplash

When a page goes off, on-call engineers and SREs jump into action, whipping off civilian clothes to expose their incident response cape beneath. Whether being pulled away from a project or jerked awake from a deep slumber, they begin a hero's journey to diagnose and remediate an issue. As the adrenaline is pumping, all that matters is the here and now—save the day. That's totally normal, especially when typical incident management products and processes often emphasize the ensuing battle. But what happens when the villainous incident has been defeated, and it's time to look at the damage? Of course we want to look away. It's even easier to look away before an incident has even happened. Who wants to focus time and energy on some potential issue when skies are blue and there's nothing screaming to be saved?

But the truth is, successful incident management is highly dependent on preparation before and evaluation after an incident. Neglecting these two critical parts of the incident management lifecycle leaves organizations vulnerable to more incidents that last longer and hurt more. And while on-call engineers sure look good in those capes, it's in their best interest, and in the best interest of your product and your customers, to understand how this narrow view of an incident is the real kryptonite.

Blinded by the Complexity of Modern Stacks

This over-focus on the incident itself is a symptom of how modern management took shape with the rise of agile. Agile has brought amazing opportunities for improvement, like being able to respond to change more easily, deploy continually, and move faster with less friction. But agile has also fundamentally changed how support functions within organizations, setting the expectation that on-call engineers and SREs debug code that isn't theirs.

Incident management has become extremely difficult with a modern stack. From delivery, to tracking, to alerting, to communication and project management, on-call engineers are constantly switching between tools. Within minutes (or seconds), signals could be firing on all fronts—an alert from PagerDuty sets off the alarm that Datadog is showing high latency on your website, teammates begin pinging in Slack in various channels, and Jira tickets are already piling in from customer support. Data is spread between dozens of tools, and critical context is lost when it's needed most.

Teams are too often scrambling to bring together the right data from the right tools along with the right people—and fast. But maybe even more troublesome for incident management is the loss of potential learning because of how complex modern stacks have become. Creating a post-mortem can sometimes be even more laborious than fixing the issue itself. It's frustrating, cumbersome, and downright annoying.

Post-mortems often miss the full context of how a team resolved an issue. How often does a post-mortem include the full interaction between the customer support team and the end-users suffering from an outage? Or at which stage the fire raged big enough for new collaborators to be pulled into the huddle? How much time was spent getting them up to speed before they could jump in and help, and what about all the steps that didn't result in resolution? Are all of those captured? When we lose this full context, we're destined to repeat the same mistakes over and over (and relearn over and over).

The True Entirety of an Incident "Timeline"

Talking about a "timeline" of incident response can actually be misleading because the before is intrinsically linked to the after. Incidents speak to a myriad of systemic issues that could be happening before code is ever pushed to production (from lack of visibility into deploys to missed opportunities in code review) and after an incident is resolved (not properly sharing learnings, updating documentation, and providing transparency to the entire organization).

When we reduce the incident timeline to the span from alert to remediation, we're doing a disservice to the products we create and to our customers. Without zooming out to the entirety of the incident management process and seeing it as one infinite loop, we lose important context about what created the issue and how to prevent similar issues in the future (or at least how to diagnose and treat them more quickly and painlessly).

Finding Clarity and Context During the Storm

Incident response can often feel like a black hole where data and human communication are lost. When we lose this information, we also lose the opportunity to learn and create a more streamlined and less painful incident response process going forward. Fostering a culture of transparency and using tools that intelligently track the incident management process (from what triggered an alert, to who was involved, to what exact steps were taken to find a fix) means less service downtime and less burnout for on-call engineers, and it creates a system that can learn and progress with each incident.
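To make that kind of tracking concrete, here is a minimal sketch in Python of what a unified incident record might look like. It is only an illustration, not a description of how Transposit or any other product works: the `TimelineEvent` fields, the helper functions, the source labels, and all of the sample events are made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    """One thing that happened during an incident (hypothetical schema)."""
    timestamp: datetime
    source: str  # label for the tool the event came from, e.g. "slack"
    actor: str   # person or system responsible for the event
    action: str  # free-text description of what happened


def build_timeline(*streams):
    """Merge per-tool event lists into one chronologically ordered list."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda event: event.timestamp)


def participants(timeline):
    """Return everyone who appears in the timeline, in order of first appearance."""
    seen = []
    for event in timeline:
        if event.actor not in seen:
            seen.append(event.actor)
    return seen


def time_to_resolve(timeline):
    """Elapsed time between the first and last recorded events."""
    return timeline[-1].timestamp - timeline[0].timestamp


def at(minute):
    """Build a timestamp on an arbitrary, made-up day for the sample data."""
    return datetime(2020, 1, 1, 3, minute, tzinfo=timezone.utc)


# Made-up sample events, grouped by the tool they would have come from.
alerts = [
    TimelineEvent(at(0), "pagerduty", "monitor", "high latency alert fired"),
]
chat = [
    TimelineEvent(at(4), "slack", "alice", "acknowledged page, started investigating"),
    TimelineEvent(at(19), "slack", "bob", "pulled in to help with the database"),
    TimelineEvent(at(42), "slack", "alice", "rolled back deploy, latency recovered"),
]
tickets = [
    TimelineEvent(at(7), "jira", "support", "first customer ticket filed"),
]

timeline = build_timeline(chat, tickets, alerts)

for event in timeline:
    print(f"{event.timestamp:%H:%M} [{event.source}] {event.actor}: {event.action}")
print("Participants:", ", ".join(participants(timeline)))
print("Time to resolve:", time_to_resolve(timeline))
```

Even a record this simple answers the questions that post-mortems tend to miss: what triggered the alert, when new collaborators joined, and how long the whole thing took.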

We have plenty of tools to build products, send alerts, and get things up and running again. What modern-day agile teams are missing is the ability to easily aggregate the data used, the actions taken, and the human interactions that corresponded to each step. Transposit fills this gap, providing teams with the power to unify and automate human and machine data.

At Transposit, we see the incident management lifecycle as one living, breathing entity. Our mission is to give your organization the clarity and context needed to prepare for, remediate, and learn from an incident throughout the entire lifecycle.

Sure, on-call engineers will always wear capes and be the heroes when we need them most, but it's time to start approaching incident management in a holistic way that allows our super-people to get some much-deserved shut-eye and ensures our services are healthy and reliable.
