Pre-Mortem: How SREs Can Prepare for Incidents

When incidents inevitably hit, does your team have a game plan? Prepare for any situation with communication, data, and process.

Jessica Abelson, Director of Product Marketing
Oct 28th, 2020

On Sunday, PG&E (Northern California’s power and gas supplier) was set to turn off power for over a million people in preparation for high winds predicted to sweep in. The city of Berkeley even issued a pre-emptive evacuation order. Power outages, people fleeing their homes: all this and no fire had even started. Californians were prepared for the worst.

We can apply the same approach to incident management. How can we better predict and prepare for incidents before they even ignite? Creating a well-defined game plan means SREs and on-call engineers can spend less time orchestrating the incident management process and more time resolving the actual issue.

Define Communication Guidelines

When incidents hit, it’s a scramble to bring together the right people in the right place. It can take upwards of 20 minutes from alert to orchestration. Then there’s the issue of all the peripheral people (stakeholders) who need consistent updates and answers, distracting SREs and on-call engineers from focusing on the critical task at hand.

You need communication guidelines. Set up a checklist that outlines the “who,” “when,” and “where”:

  • Who on your SRE team can help if you need to escalate the issue?
  • Who can you reach out to on various engineering teams to account for any and all incident possibilities?
  • Which stakeholders need to be involved (legal, marketing, customer service, C-level execs, etc.), and how do you contact them?
  • When and how often should you be updating stakeholders?
  • Where does communication with these parties take place?

Clearly defining the “who,” “when,” and “where” will save SREs and on-call engineers a lot of stress and scrambling, and ultimately mean faster resolution.
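One way to keep this checklist from going stale is to store it as data and check it for gaps before an incident happens. The following is only a minimal sketch: the `Contact` and `CommunicationPlan` names, the field names, the default 30-minute update interval, and the example channel names are all hypothetical and would need to match your own team's setup.

```python
from dataclasses import dataclass, field


@dataclass
class Contact:
    name: str
    role: str
    channel: str  # where to reach them (hypothetical examples below)


@dataclass
class CommunicationPlan:
    # "Who": escalation contacts, engineering contacts by team, stakeholders
    escalation: list[Contact] = field(default_factory=list)
    engineering: dict[str, Contact] = field(default_factory=dict)
    stakeholders: list[Contact] = field(default_factory=list)
    # "When": how often stakeholders get an update (assumed default)
    update_interval_minutes: int = 30
    # "Where": the single place incident communication happens
    war_room: str = "#incident-war-room"

    def missing_items(self) -> list[str]:
        """Return the checklist questions that still have no answer."""
        gaps = []
        if not self.escalation:
            gaps.append("who: no SRE escalation contact listed")
        if not self.engineering:
            gaps.append("who: no engineering team contacts listed")
        if not self.stakeholders:
            gaps.append("who: no stakeholders listed")
        if self.update_interval_minutes <= 0:
            gaps.append("when: no stakeholder update interval set")
        if not self.war_room:
            gaps.append("where: no communication location set")
        return gaps


plan = CommunicationPlan(
    escalation=[Contact("Primary on-call SRE", "sre", "#sre-oncall")],
    stakeholders=[Contact("Customer service lead", "customer service", "#cs-incidents")],
)
print(plan.missing_items())
```

In this example the plan names an escalation contact and a stakeholder but no engineering team contacts, so that is the one gap reported.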

Centralize Data From Various Tooling

Data is the key to triaging and investigating an incident. But just having data doesn’t mean it’s useful. If three engineers are looking at the same data but at different times and in different places, they may come to different conclusions.

For effective collaboration and decision-making, data from various tooling should be centralized in one place. Slack is a great example. If you can pull all your data together into one “war room” channel in Slack with all necessary parties, you can be confident everyone is working from the same information.

Use these steps to centralize your data:

  1. Make an accurate inventory of the tools you use to access data
  2. Choose one location where you’ll be looking at data and collaborating as a team (Slack, Microsoft Teams, Transposit, etc.)
  3. Connect your tooling to that chosen location so data can be grabbed instantly (even automatically)
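As an illustration of step 3, here is a minimal sketch of pushing data from a tool into a Slack channel through an incoming webhook, which accepts a JSON body with a `text` field. The function names, the example alert, and the example link are hypothetical, and the webhook URL is something you would create in your own Slack workspace. The sketch only builds the message; the network call is defined but not executed.

```python
import json
import urllib.request


def build_war_room_message(source: str, summary: str, link: str) -> dict:
    """Format data from one tool as a Slack incoming-webhook payload."""
    return {"text": f"[{source}] {summary}\n{link}"}


def post_to_war_room(webhook_url: str, message: dict) -> int:
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical alert from a monitoring tool.
message = build_war_room_message(
    source="monitoring",
    summary="p99 latency above threshold on checkout",
    link="https://example.com/dashboards/checkout",
)
print(json.dumps(message))
# To actually send it, call post_to_war_room(<your webhook URL>, message).
```

Tagging each message with its source tool keeps the channel readable when several tools post to the same place.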

Prepare Your Response Plan With Runbooks

The last essential part of your incident preparation plan is process. Using runbooks turns chaos into clarity. But if your runbooks are out of date or hard to access, you may spend more time trying to use them than you would working without any at all.

So what makes runbooks effective? Here are the 5 attributes of a good runbook:

  • Actionable: It’s nice to know the big picture and architecture of a system, but when you are looking for a runbook, you’re looking to take action based on a particular situation.
  • Accessible: If you can’t find the runbook, it doesn’t matter how well it is written.
  • Accurate: If it doesn’t contain truthful information, it’s worse than nothing at all.
  • Authoritative: It is confusing to have more than one runbook for any given process.
  • Adaptable: Systems evolve over time, and if you can’t change your runbook, the drift will make it unusable.

The last thing you want is for your runbooks to sit on a shelf (or wiki) collecting dust, only to prove too rigid when you finally try to use them. Runbooks should reflect accurate information, allow you to take action, and be easily accessible in the heat of the moment.
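Two of these attributes, accurate and authoritative, lend themselves to a simple periodic check. The sketch below flags runbooks that have not been reviewed recently and processes that have more than one runbook. The `Runbook` fields, the 90-day review window, and the example entries are all assumptions made for illustration, not a prescribed standard.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Runbook:
    title: str
    process: str          # the process this runbook covers
    last_reviewed: date   # when someone last confirmed it is accurate


def audit_runbooks(runbooks, today, max_age_days=90):
    """Flag stale runbooks and processes covered by more than one runbook."""
    cutoff = today - timedelta(days=max_age_days)
    stale = [rb.title for rb in runbooks if rb.last_reviewed < cutoff]
    counts = Counter(rb.process for rb in runbooks)
    duplicated = sorted(p for p, n in counts.items() if n > 1)
    return {"stale": stale, "duplicated_processes": duplicated}


# Hypothetical inventory.
runbooks = [
    Runbook("Restart API gateway", "api-restart", date(2020, 10, 1)),
    Runbook("API restart (old wiki)", "api-restart", date(2019, 6, 15)),
    Runbook("Database failover", "db-failover", date(2020, 9, 10)),
]
report = audit_runbooks(runbooks, today=date(2020, 10, 28))
print(report)
```

Here the old wiki page is flagged as stale, and `api-restart` is flagged because two runbooks claim to cover it.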

Evaluate Your Incident Management Plan

At the end of the day, metrics will tell you how successful your incident management is. Has improving your communication, ensuring access to data, and using effective runbooks made incident resolution faster and less painful? Have you cut down your MTTR? Have fewer customers been impacted?
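To make the MTTR question concrete, here is a minimal sketch that averages the time from detection to resolution across a set of incidents and compares two periods. It takes MTTR to mean "mean time to resolve," which is an assumption since the acronym has several common expansions, and the incident timestamps are invented for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average (resolved - detected) over a list of (detected, resolved) pairs."""
    durations = [resolved - detected for detected, resolved in incidents]
    if not durations:
        raise ValueError("need at least one incident to compute MTTR")
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents before and after introducing the game plan.
before = [
    (datetime(2020, 9, 1, 10, 0), datetime(2020, 9, 1, 11, 30)),  # 90 minutes
    (datetime(2020, 9, 14, 2, 0), datetime(2020, 9, 14, 2, 30)),  # 30 minutes
]
after = [
    (datetime(2020, 10, 5, 16, 0), datetime(2020, 10, 5, 16, 40)),  # 40 minutes
    (datetime(2020, 10, 20, 8, 0), datetime(2020, 10, 20, 8, 20)),  # 20 minutes
]

print("MTTR before:", mean_time_to_resolve(before))
print("MTTR after: ", mean_time_to_resolve(after))
```

With these made-up numbers the average drops from 60 minutes to 30 minutes. A real comparison needs enough incidents in each period for the average to be meaningful.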

Just as you should with your runbooks, evaluate your incident management plan frequently so that it doesn’t go stale and continues to reflect new experiences.

Create an Incident Game Plan With Transposit

If you’re looking to create an incident game plan, Transposit can help:

  • Centralized communication in Slack means engineers can collaborate while looking at the same data, pulled from all your various tools.
  • The incident dashboard gives transparent updates to stakeholders so they’re not distracting engineers from resolving the issue.
  • Interactive runbooks create set processes and human-in-the-loop automation that are simple to update and provide actionability in the context of data.