Pre-Mortem: How SREs Can Prepare for Incidents

When incidents inevitably hit, does your team have a game plan? Prepare for any situation with communication, data, and process.

Jessica Abelson · Oct 28th, 2020


On Sunday, PG&E (Northern California’s power & gas supplier) was set to turn power off for over a million people in preparation for large winds predicted to sweep in. The city of Berkeley even issued a pre-emptive evacuation order. Power outages, people fleeing their homes—all this and no fire had even started. Californians were prepared for the worst.

We can apply the same thinking to incident management. How can we better predict and prepare for incidents before they even ignite? Creating a well-defined game plan means SREs and on-call engineers can spend less time orchestrating the incident management process and more time resolving the actual issue.

Define Communication Guidelines

When incidents hit, it’s a scramble to bring together the right people in the right place. It can take upwards of 20 minutes from alert to orchestration. Then there’s the issue of all the peripheral people (stakeholders) who need consistent updates and answers, distracting SREs and on-call engineers from focusing on the critical task at hand.

You need communication guidelines. Set up a checklist that outlines the “who,” “when,” and “where.”

Clearly defining the “who,” “when,” and “where” will save SREs and on-call engineers lots of stress and scrambling, and ultimately mean faster resolution.

Centralize Data From Various Tooling

Data is the key to triaging and investigating an incident. But just having data doesn’t mean it’s useful. If three engineers look at the same data at different times and in different places, they may come to different conclusions.

For effective collaboration and decision-making, data from various tooling should be centralized in one place. Slack is a great example. If you can pull all your data together into one “war room” channel in Slack with all necessary parties, you can be confident everyone is working from the same information.

Use these steps to centralize your data:

  1. Make an accurate inventory of the tools you use to access data
  2. Choose one location where you’ll be looking at data and collaborating as a team (Slack, Microsoft Teams, Transposit, etc.)
  3. Connect your tooling to that chosen location so data can be grabbed instantly (even automatically)
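As a rough illustration of step 3, the sketch below forwards a piece of incident data to a Slack channel through an incoming webhook, which accepts a JSON body with a "text" field. The function names, the example source label, and the URLs are assumptions made for this example; they are not part of any particular product, and the webhook URL would come from your own Slack workspace.

```python
import json
import urllib.request


def build_incident_message(source: str, summary: str, link: str) -> dict:
    """Build a Slack incoming-webhook payload for one piece of incident data.

    `source` is the tool the data came from (hypothetical label),
    `summary` is a one-line description, and `link` points back to the tool.
    """
    return {"text": f"[{source}] {summary}\n{link}"}


def post_to_war_room(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook; return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Only builds and prints the message; no network call is made here.
    message = build_incident_message(
        "monitoring", "CPU at 95% on api-1", "https://example.com/dash/1"
    )
    print(message["text"])
```

Keeping message construction separate from delivery means each tool in your inventory can reuse the same formatter, and the destination can change without touching the tools.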

Prepare Your Response Plan With Runbooks

The last essential part of your incident preparation plan is process. Runbooks turn chaos into clarity. But if your runbooks are out of date or hard to access, you may spend more time trying to use them than you would working without any.

So what makes runbooks effective? Here are the 5 attributes of a good runbook:

The last thing you want is for your runbooks to sit on a shelf (or wiki) collecting dust, only to prove rigid once you try to use them. Runbooks should reflect accurate information, allow you to take action, and be easily accessible in the heat of the moment.
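One simple way to keep runbooks from collecting dust is to flag any that have not been reviewed recently. The sketch below assumes each runbook records a last-reviewed date; the `Runbook` class, the sample titles, and the 90-day threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Runbook:
    title: str
    last_reviewed: date


def stale_runbooks(runbooks, today, max_age_days=90):
    """Return titles of runbooks last reviewed more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return [rb.title for rb in runbooks if rb.last_reviewed < cutoff]


if __name__ == "__main__":
    # Hypothetical sample data.
    books = [
        Runbook("Database failover", date(2020, 9, 15)),
        Runbook("Cache flush", date(2020, 3, 1)),
    ]
    print(stale_runbooks(books, today=date(2020, 10, 28)))
```

A check like this could run on a schedule and post its results wherever your team already collaborates.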

Evaluate Your Incident Management Plan

At the end of the day, metrics will tell you how successful your incident management is. Has improving your communication, ensuring access to data, and using effective runbooks made incident resolution faster and less painful? Have you cut down your MTTR? Have fewer customers been impacted?
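To make the MTTR question concrete, here is a minimal sketch that averages resolution time across a set of incidents. It assumes MTTR means mean time to resolve, measured from detection to resolution, and that you can export those two timestamps per incident; the function name and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


if __name__ == "__main__":
    # Hypothetical incidents lasting 45, 90, and 15 minutes.
    sample = [
        (datetime(2020, 10, 1, 10, 0), datetime(2020, 10, 1, 10, 45)),
        (datetime(2020, 10, 8, 14, 0), datetime(2020, 10, 8, 15, 30)),
        (datetime(2020, 10, 20, 9, 0), datetime(2020, 10, 20, 9, 15)),
    ]
    print(mean_time_to_resolve(sample))  # 0:50:00
```

Computing the same figure for the periods before and after a process change gives you a like-for-like comparison.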

Just as you should with your runbooks, make sure you’re evaluating your incident management plan frequently so it doesn’t go stale and so it reflects new experiences.

Create an Incident Game Plan With Transposit

If you’re looking to create an incident game plan, Transposit can help:

Try intelligent runbooks and simplified incident resolution