Previously, I wrote about checklists and runbooks and how they make complicated processes easier. Having written about some of the reasons for creating these documents, I wanted to cover what goes into a good one. In particular, what makes a good runbook (sometimes known as a playbook)?
In this post, I'm focusing on runbooks around software systems, not airplanes or surgery. Trust me, you don't want my advice on either of those! If you're looking for public examples of runbooks, GitLab has made their runbooks available under an MIT license.
There are different types of software runbooks, but in general, they are a set of directions for an engineer unfamiliar with a system. An engineer might be unfamiliar because they are new to the system, or because it's three in the morning and they got paged, or because they wrote it but haven't touched the system in a few months or years.
If the runbook is focused on incident management, it typically explains how to fix a system well enough for a few hours until an owner of a system can provide deeper diagnostics, root cause analysis (RCA), and resolution. Other runbooks might describe a business process like how to run a monthly report or a development process like how to set up a test environment.
But shouldn't everything be automated? While it is optimal to have major tasks automated, humans often need to be involved. The task might not be worth automating due to edge cases, complexity or frequency of change. Human judgement or approval could be required due to compliance or other factors outside the system. Or the process could be manual because there just aren't enough developers to fully automate it.
There are five attributes of any good runbook: the five As. It must be actionable, accessible, accurate, authoritative and adaptable.
Let us examine each of these in turn.
It should be clear what each runbook is trying to accomplish. All the tasks should work toward that goal. A runbook is, at its core, a list of tasks. Each task should be discrete and completable. Long explanations of how systems work should be offered as an aside (perhaps a link), not mentioned in the task list. Tasks don't need to affect the system to be actionable. "ssh into the system and run a tail command" is a useful task, but won't typically affect the server. However, tasks can affect the system, for example running cp /dev/null /path/to/logfile.
The actions should build toward the end goal. Tasks that don't help that goal should be removed. It is also important that you write the runbook with the end user and their understanding in mind. Of course you want to write for a wide audience, but at the end of the day a runbook for someone who is unfamiliar with your systems will be different than a runbook for a senior SRE intimately familiar with its nooks and crannies.
Each task should be one bullet point or line. Here are some good task definitions: "log in to the web server" or "view this graph in AWS CloudWatch". In contrast, a poor task definition might be something like "log in to the web server, navigate to the nginx config directory, edit it to increase the number of workers (the worker_processes value) and restart the server". There are too many things to do in this task. Make sure the team is on the same page regarding the level of detail in runbooks.
Sometimes a bit of context can be helpful. This is especially true if the task is scary or permanent; who among us hasn't checked twice when running a delete statement on a production table or restarting a database? However, much beyond a sentence is clutter. Adding a link to a URL or reference doc is a great way to provide more context that can be accessed by the human working through the runbook as needed.
Some runbooks are for typical tasks: pulling a production database down, scrubbing PII, and installing it to a development environment. Others are for incidents or outages: how to go about troubleshooting the failure of a service. In the latter case, an RCA should take place. Closing the loop and ensuring that this learning process takes place is important. You should make sure items to be remediated are added into both the software development roadmap and operational procedures, including runbooks, as applicable.
If the end user can't find the runbook when it is needed, well, it isn't of much use. The best way to find a runbook is to have it delivered to you right when you need it. This is especially useful when the runbook is helping you manage or investigate an outage. A good practice is to have every alert have an associated runbook. Even if this runbook is a stub with only general guidance, it's a start, and can be updated when it is used.
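One way to picture "every alert has an associated runbook" is a simple lookup with a stub as the fallback. This is a minimal sketch, not a feature of any particular alerting tool: the alert names, file paths, and function names are all hypothetical.

```python
# Hypothetical alert names and runbook locations, for illustration only.
RUNBOOKS_BY_ALERT = {
    "disk_usage_high": "runbooks/disk-usage-high.md",
    "api_latency_high": "runbooks/api-latency-high.md",
}

# Stub with general guidance, used when an alert has no dedicated runbook yet.
STUB_RUNBOOK = "runbooks/general-triage.md"


def runbook_for_alert(alert_name: str) -> str:
    """Return the runbook path to include in an alert notification."""
    return RUNBOOKS_BY_ALERT.get(alert_name, STUB_RUNBOOK)


def alerts_missing_runbooks(alert_names: list[str]) -> list[str]:
    """List alerts that would fall back to the stub, so someone can write them up."""
    return [name for name in alert_names if name not in RUNBOOKS_BY_ALERT]
```

Running something like alerts_missing_runbooks over your full alert list gives you a to-do list of stubs to flesh out the next time one of those alerts fires.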
To ease discovery, you can also attach metadata to your runbooks such as:
Another way you can make your runbooks accessible is to make them searchable. Wherever they are stored, you'll want to be able to search both the content and the metadata. A normal search interface may suffice, but you may also want to consider having them be searchable from the command line or your chat system (Slack, MS Teams, etc) depending on who the end user is.
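To make "search both the content and the metadata" concrete, here is a small sketch of a case-insensitive substring search. The Runbook class, its fields, and the sample metadata keys are assumptions made for the example, not a recommended schema; a real setup would more likely lean on the search built into your wiki or documentation system.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    title: str
    body: str
    # Metadata keys and values here are illustrative only.
    metadata: dict[str, str] = field(default_factory=dict)


def search(runbooks: list[Runbook], query: str) -> list[Runbook]:
    """Return runbooks whose title, body, or metadata values contain the query."""
    needle = query.lower()
    matches = []
    for rb in runbooks:
        haystack = " ".join([rb.title, rb.body, *rb.metadata.values()]).lower()
        if needle in haystack:
            matches.append(rb)
    return matches


runbooks = [
    Runbook("Restart nginx", "Log in to the web server.", {"service": "web"}),
    Runbook("Run monthly report", "Open the reporting tool.", {"team": "finance"}),
]
print([rb.title for rb in search(runbooks, "finance")])  # ['Run monthly report']
```

The same function could sit behind a command line tool or a chat bot command, which is where the choice of interface depends on who the end user is.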
Depending on the size and maturity of your organization, you'll want to consider who has permission to access which runbooks. While you should never store credentials in a runbook, they can still contain information about your systems which could be useful to an attacker. So you may want to have some way to tie the user looking for a runbook to an organizational directory. That will ensure users have access to what they need, but not more.
An inaccurate runbook will cause multiple problems:
So how can you make sure your runbook is accurate?
The first and best thing you can do is make sure that everyone who runs it has the ability to give feedback on its accuracy. This may include leaving a note, filing a PR, or just changing it. Depending on the size and maturity of your organization, and exactly which system the runbook is associated with, you may want to have a review policy. There's a tension between making it easy to change runbooks, which leads to group ownership and flexibility as the systems the runbook references change, and making sure that any changes to the runbook are accurate. What you want to do is lower the friction of changing a runbook as much as possible, while maintaining accuracy. This of course depends on the system in question: runbook changes concerning a production database should be vetted more than changes to a runbook documenting how to get a development environment set up.
A tactical suggestion to increase accuracy is to cut and paste commands, rather than re-type them. This makes sure that such tasks are accurate. When writing a runbook for the first time, perform the tasks at least a couple of times. I've done this before, and though it is tedious, it is surprising what my brain misremembers.
Keep track of when a runbook was last updated (this is typically pretty easy) but also, if possible, when it was last run. Links to incident alerts or Slack conversations can be helpful in tracking this. You can also have a final task to update the 'last run' field as part of running a runbook. Unfortunately, I don't know of any automated way for a typical document-based runbook to update its 'last run' information. Having the last updated and last run dates around will be a valuable signal to the end user about how far to trust this runbook.
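If your runbooks live as plain text files, the final "update the 'last run' field" task could be a small helper that the person running the runbook invokes by hand. This is a sketch under that assumption: the "Last run:" line format and the function name are made up for the example, and it still relies on a human remembering to run it.

```python
import re
from datetime import date

# Assumed convention: the runbook contains a line such as "Last run: 2024-05-01".
LAST_RUN_PATTERN = re.compile(r"^Last run: .*$", flags=re.MULTILINE)


def stamp_last_run(runbook_text: str, run_date: date) -> str:
    """Set the 'Last run:' line to run_date, adding the line at the top if missing."""
    stamp = f"Last run: {run_date.isoformat()}"
    if LAST_RUN_PATTERN.search(runbook_text):
        return LAST_RUN_PATTERN.sub(stamp, runbook_text, count=1)
    return f"{stamp}\n{runbook_text}"
```

For example, stamp_last_run(text, date.today()) returns the updated text, which you would then save back wherever the runbook is stored.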
Runbooks can overflow as they are edited. Remember to keep them focused on one goal. Don't be afraid to split runbooks and then insert references if required.
Make sure there is one runbook, and only one, for each process. If need be, reference other runbooks via links or any other supported mechanism. If there are multiple runbooks for a given scenario, you'll want to combine them into one and make sure the other is archived.
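If each runbook records which process it covers, checking for "one runbook, and only one" can be a quick script. This is a rough sketch: the mapping from runbook path to process name is an assumed input that you would have to build from your own metadata, and the paths shown in the test data are hypothetical.

```python
from collections import defaultdict


def find_duplicates(runbook_processes: dict[str, str]) -> dict[str, list[str]]:
    """Map each process covered by more than one runbook to its runbook paths."""
    by_process = defaultdict(list)
    for path, process in runbook_processes.items():
        # Normalize so "Restart Nginx" and "restart nginx " count as the same process.
        by_process[process.strip().lower()].append(path)
    return {p: sorted(paths) for p, paths in by_process.items() if len(paths) > 1}
```

Anything this returns is a candidate for merging, with the leftover runbook archived.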
Not much is more frustrating than finding a great set of instructions, following them faithfully and then realizing they are out of date or incorrect, and that the right runbook was the second search result. To help with this, depending on the scale of your organization, you can implement a feedback system so that people who find an out of date or duplicative runbook can let the owners know about the issue.
The unfortunate truth is as soon as you write down a set of tasks in a runbook, you've accumulated technical debt. You can fight it by keeping runbooks up to date as a system changes. Almost all software systems evolve over time. Runbooks therefore need to evolve as well. If they don't, they'll be neither accurate nor actionable. Ways to encourage adaptability include:
Encouraging adaptability will make sure your runbooks remain relevant.
What should you do when a runbook is stale? First, how do you know when a runbook is stale? Unfortunately, there's no easy way. If a runbook was last updated four years ago, that could be an indication of staleness, or it could be that the system the runbook applies to is stable. If the runbook has been updated in the last week, it could be because it's well maintained or because the corresponding system is changing relatively quickly.
The best way to see if a runbook is stale is to perform the actions in the runbook. You can also ask if anyone has used it lately, or see if it has been recently referenced in your incident management system.
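Dates can't tell you a runbook is stale, but they can help you decide which ones to walk through first. The sketch below orders runbooks for review by when they were last run. The record type, the field names, and the 180 day threshold are all assumptions for the example; the last updated date is deliberately left out because, as noted above, it is an ambiguous signal.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class RunbookRecord:
    name: str
    last_run: Optional[date]  # None if nobody has recorded a run


def review_candidates(
    records: list[RunbookRecord],
    today: date,
    max_idle: timedelta = timedelta(days=180),  # arbitrary threshold, tune to taste
) -> list[str]:
    """Return names of runbooks not run within max_idle: never run first, then oldest run."""
    idle = [r for r in records if r.last_run is None or today - r.last_run > max_idle]
    idle.sort(key=lambda r: (r.last_run is not None, r.last_run or date.min))
    return [r.name for r in idle]
```

The output is only a list of candidates. Deciding whether each one is actually stale still means performing the actions or asking the people who last used it.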
At the point where you've determined a runbook is stale, you have two choices. You can correct the inaccurate or missing tasks and bring the runbook up to speed. If the system is still in use, this is the best choice.
Or you can archive it. If a runbook no longer applies, mark that fact clearly. I am afraid of throwing away knowledge, so I'll often have an archive folder and tag old runbooks with "ARCHIVED" in their title and in the first header area. But if you're more confident, just delete it. Most modern documentation systems have a way to resurrect deleted content.
Runbooks are great ways to systematize knowledge. Having all important tasks automated is a worthy goal, but humans are often in the loop. This could be because the process isn't worth automating due to complexity, usage patterns or change frequency. It could be because human judgement is needed. Or it could be because the resources to fully automate simply are not available.
Even with automation, runbooks document processes and procedures in an accessible, durable manner. You just have to make sure they are actionable, accessible, accurate, authoritative and adaptable.