This blog post is part three of a four-part series covering the DevOps Spectrum of Automation. If you want to learn more about the series, see the inaugural post on scripts and last week's post on orchestration.
What do you think about when you hear “runbooks”? As our recent survey results from Failover Conf confirmed, the biggest challenges for engineers using runbooks are that they are “very outdated,” “manual,” and “hard to know which ones exist,” and that they have a “lack of standardization and automation.”
For many, runbooks are an unlikely candidate for the DevOps Spectrum of Automation. If anything, they can feel like one of the most manual things we do. Compared to the previous part of this series, runbooks are a far cry from using orchestration in complex systems as a form of automation. In this post, let's take a step away from the direction of increased automation and look at traditional runbooks at the beginning of the spectrum.
While some believe that runbooks should be automated away entirely, this is often not possible when the process:
A core part of DevOps is continuous improvement of how we develop and deliver software. At the same time, the technology we build our software on top of is always evolving. APIs change, new versions are released, and architecture improves. All of this makes end-to-end automation quite difficult.
We often work through how our systems behave when we write runbooks. This can be useful when developing and operating complex systems, but as Dave Nunez from Stripe says in “Why it’s worth it to invest in internal docs,” there’s value to the whole company too:
“Proprietary data is precious to technology companies, yet some of the most valuable data lives solely in the minds of engineers. These engineers change projects, move teams, get sick, go on vacation, leave the company. They take this knowledge with them—knowledge sometimes accrued over years of learning the intricacies of a company’s technology stack and development processes. Without documentation, these lessons are only passed on to those lucky coworkers who ask(ed) the right questions. Documenting this knowledge, on the other hand, makes it available to everyone, forever.”
So, how do we gain the value of documentation while still automating what we can?
Before we talk more about runbooks, we need to talk about checklists. Checklists are a cognitive net. As Atul Gawande describes them in The Checklist Manifesto, "they catch mental flaws inherent in all of us – flaws of memory and attention and thoroughness. And because they do, they raise wide, unexpected possibilities."
In the book, Gawande says that we often fail at completing a process for two different reasons. It is either because of ignorance, like when we only have a partial understanding of something, or ineptitude, like when the understanding exists, yet we fail to apply it correctly. Checklists can help with both. They can inform us of things we don't know or guide our existing knowledge to ensure it is correctly applied.
Constantly moving pieces are hard for one person to track. Typically, we expect a seasoned engineer who has been on-call for a system for many years to have the expertise to handle a wide variety of situations. After repeatedly performing similar steps, it is easy for them to think, "I got this!" But what if there’s been a change in how the servers are monitored based on a past incident? This can lead to misapplying knowledge, potentially causing more problems. Meanwhile, if the engineer who made the change never updated a shared team checklist, knowledge siloing could prevent other team members from effectively operating their systems. Checklists help ensure that we take the right steps in the right order every time.
At their simplest, runbooks can be very similar to checklists, but in complex environments, they can be so much more. Typically, a runbook is a list of procedures and operations for people who are on-call to follow. In teams that practice DevOps, runbooks are used beyond on-call situations too, for example, when infrastructure needs to be set up or requires maintenance.
Runbooks aim to be helpful "how-to" guides in stressful situations. While no runbook will be "a substitute for smart engineers able to think on the fly,” as the Site Reliability Engineering: How Google Runs Production Systems book says, "clear and thorough troubleshooting steps and tips are valuable when responding to a high-stakes or time-sensitive page."
There are five characteristics of any good runbook: the five As. It must be:
You can explore each of the five As more here.
As I mentioned in a previous post, documentation is the first step towards automation in organizations wanting to adopt DevOps Automation. How can you know what to automate if the current steps are not written down? Often, this knowledge lives within just one person, which creates an operational knowledge silo. Knowledge silos are not good for incident management or operating complex systems. But the act of just writing the steps down does not suddenly turn runbooks into automation; it only sets you up for it.
It is common today that many runbooks include scripts and executable actions. Some runbooks even include things like Slack commands to run. This partial automation allows runbooks to sit on the edge of the DevOps Spectrum of Automation. They are replacing human interactions to complete common, predictable tasks, yet they are not further on the spectrum of automation because they still require significant human involvement.
Image from my Failover Conf talk on "Human-in-the-Loop DevOps"
In her paper, Ironies of Automation, Lisanne Bainbridge said that “the more advanced a control system is, so the more crucial may be the contribution of the human operator.” The human involvement in operating our systems is not going to go away, but Bainbridge did have some ideas on how the human operator could have some automated support. So, it isn’t unreasonable to ask: Is there a way for a runbook to have more automation while still including valuable human involvement?
The answer to this question is: Interactive runbooks. Teams are starting to explore executable runbooks at companies such as GitLab with Jupyter Notebooks and Braintree with Runbook, a Ruby DSL for gradual system automation. Another example of a similar idea is when you use your monitoring or alerting systems, like Datadog or PagerDuty, to kick off specific automated tasks based on alerts.
Some have even used do-nothing scripts as runbooks. A do-nothing script encodes the instructions of a step into code. When you run a do-nothing script, it walks a user through a set of steps. These types of scripts do not automate the steps of a runbook themselves, but instead, guide a script user in what to do.
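To make the do-nothing idea concrete, here is a minimal sketch in Python. The step names, the `context` keys, and the `run` helper are all hypothetical and chosen only for illustration; the point is that each step is a function that prints an instruction and then waits for the human, without performing the work itself.

```python
"""A do-nothing script: each runbook step is a function that prints instructions."""


def check_dashboard(context):
    print(f"Open the dashboard for {context['service']} and note any failing checks.")


def reboot_instance(context):
    print(f"Reboot instance {context['instance_id']} from your cloud console.")


def confirm_recovery(context):
    print(f"Confirm that {context['service']} is reporting metrics again.")


# The order of this list is the order of the runbook.
STEPS = [check_dashboard, reboot_instance, confirm_recovery]


def run(steps, context, wait=input):
    """Walk the user through each step and return the names of completed steps."""
    completed = []
    for number, step in enumerate(steps, start=1):
        print(f"\nStep {number} of {len(steps)}: {step.__name__}")
        step(context)  # prints what the human should do; does nothing itself
        wait("Press Enter when this step is done: ")
        completed.append(step.__name__)
    return completed


# Interactive use (hypothetical values):
# run(STEPS, {"service": "billing-api", "instance_id": "i-0123456789abcdef0"})
```

Because every step is already a function, any one of them can later be swapped for real automation without changing the rest of the script, which is what makes this a gradual path rather than an all-or-nothing one.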
All of these runbook ideas are executing scripts; therefore, they contain automation. At Transposit, we believe that this level of automation is just the beginning of a greater metamorphosis.
When we look at runbooks, we see that most of the actions in a runbook are either:
At Transposit, we recognize the value of runbook documentation as a vehicle for automation. So, we focused on a vision rooted in reality: Let the machines do what they are good at while operating a system (gathering investigative information, initiating coordination, and taking remedial actions) and let humans use their judgment at crucial decision points. The result was interactive runbooks, driven by human-in-the-loop automation. Human-in-the-loop automation means humans step in at critical decision points while they progressively automate their systems. I'll dive deeper into human-in-the-loop automation later in this series.
Rather than having a human follow a manually written checklist and automate a few scripts that must be updated with every infrastructure change, interactive runbooks use underlying APIs to drive automation and to power a new way for engineers to interact with their systems. This augments the team's knowledge instead of entirely replacing it with classic end-to-end automation.
Let's say you’re paged… (Don’t worry, at a regular work hour!) Think about how many different windows you have open when you start your triaging. At a minimum, you might have:
But, what if the moment you received the page, an appropriate runbook was paired with the alert to guide you? What if all you had to do to start the triage was select actions in the runbook to begin the suggested investigative steps instead of logging in to multiple SaaS providers?
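One way to picture pairing a runbook with an alert is a simple lookup from alert type to runbook. The sketch below is only an illustration under assumptions: the `Runbook` class, the alert `type` field, and the action names are hypothetical, not how any particular alerting tool or Transposit itself represents them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runbook:
    title: str
    actions: tuple  # actions the operator can choose from, in suggested order


# Hypothetical registry: alert type -> runbook to show when the page arrives.
RUNBOOKS = {
    "instance-unresponsive": Runbook(
        title="Unresponsive instance",
        actions=("Fetch recent metrics", "Show recent deploys", "Reboot instance"),
    ),
}

# Shown when no specific runbook matches, so the operator always has a start.
FALLBACK = Runbook(
    title="General triage",
    actions=("Open incident channel", "Check service dashboard"),
)


def runbook_for(alert: dict) -> Runbook:
    """Return the runbook paired with this alert, or the general fallback."""
    return RUNBOOKS.get(alert.get("type"), FALLBACK)


page = {"type": "instance-unresponsive", "instance_id": "i-0123456789abcdef0"}
runbook = runbook_for(page)
print(runbook.title)
for action in runbook.actions:
    print(" -", action)
```

The actions are only listed, not run automatically, so the operator still sees and chooses which lever to pull.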
You can think of an interactive runbook as a “choose your own adventure.” In incidents, the operator must be able to see what levers (or actions) they can pull. Masking the actions away harms the human operator’s ability to do their job. While your runbook is advising what you do, it is laid out in a way that you can see the levers.
Say you realize you need to reboot an AWS EC2 instance because it has stopped responding and its metrics are no longer being reported. In an interactive runbook, this is an action you can select; the action is connected to AWS through its API with proper authorization. You can then monitor the results of your mitigation steps from other actions in the runbook and confirm that the incident has been resolved.
During an incident, juggling multiple tools while coordinating with others creates an overwhelming amount of cognitive load. Streamlining the steps while still using the same tools reduces that load. This makes resolving the incident less stressful, faster, and easier for on-call engineers.
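A reboot action with a human decision point could be sketched roughly like this. The `reboot_action` function, the `confirm` callback, and the `RecordingEC2Client` stand-in are assumptions made for illustration. The only real API assumed is the `reboot_instances(InstanceIds=[...])` call on a boto3 EC2 client, which the stand-in imitates so the sketch runs without AWS access.

```python
class RecordingEC2Client:
    """Stand-in for a real EC2 client so the sketch runs without AWS access."""

    def __init__(self):
        self.rebooted = []

    def reboot_instances(self, InstanceIds):
        # Mirrors the keyword argument of boto3's EC2 client method.
        self.rebooted.extend(InstanceIds)


def reboot_action(ec2_client, instance_id, confirm):
    """Request a reboot only if the human operator approves it."""
    if not confirm(f"Reboot {instance_id}?"):
        return "skipped"
    ec2_client.reboot_instances(InstanceIds=[instance_id])
    return "reboot requested"


client = RecordingEC2Client()
# The lambdas stand in for an operator clicking "yes" or "no" in the runbook.
print(reboot_action(client, "i-0123456789abcdef0", confirm=lambda question: True))
print(reboot_action(client, "i-0fedcba9876543210", confirm=lambda question: False))
print(client.rebooted)
```

With boto3 installed and credentials configured, `boto3.client("ec2")` could be passed in place of the stand-in. The machine makes the API call, while the judgment about whether to reboot stays with the operator.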
In order to manage the complexity of our systems, static runbooks should be a thing of the past. We need runbooks to better adapt to the evolving systems we work in. Like the spectrum of automation itself, runbooks can range from totally manual wikis to powerful automation mechanisms that enable faster action and simplified troubleshooting. At Transposit, our focus is on using interactive runbooks to improve the lives of on-call engineers and the stability of the systems that their companies and users depend on.
In the final part of this series, we will dive deeper into the concept of human-in-the-loop automation, which spans the spectrum. We will discuss examples of pulling humans into the loop at critical junctures, which lets them add maximal value while the tedium is automated. As Dr. Richard I. Cook says, “human practitioners are the adaptable element of complex systems.”
We’ll finally conclude the series with a practical pathway forward for operating complex systems while embracing human adaptability.
As always, I would love to hear your thoughts on automation as you practice DevOps and modern IT operations. Where have you seen success or failure? Tweet at @taylor_atx.