While each type of documentation helps minimize operating time by streamlining workflows and processes, there are key differences to be aware of.
SRE and TechOps teams use documentation to enhance visibility, drive consistency, and ensure teams are equipped with the knowledge to efficiently tackle tasks. Various types of documentation play a part, but what are the differences and how can teams use them more effectively?
While occasionally used interchangeably, the software industry typically defines the terms this way:
A runbook is a set of instructions for completing a routine task. For example, the runbook might be a how-to for setting up a server, deploying software to production, database backup and restore tasks, regularly generating reports for customers, or responding to an incident. Sometimes runbooks are automated or have automated elements.
A playbook is a more general document, outlining the organization’s approach and worker responsibilities. A playbook might provide more in-depth information about the tasks’ cultural and compliance aspects, like user experience and privacy.
Typically, engineers define their own scripts to perform regular tasks or debug applications. Playbooks and runbooks help share the scripts among the team so that they can audit, review, and improve the documents as needed.
A standard operating procedure (SOP) describes a procedure to follow, including how to adhere to industry regulations. These procedures aim to maintain a consistent approach, no matter the task.
SOP is a broad term used in almost every industry and organization. A colleague wrote previously about his experience using SOPs within the Navy to respond to incidents. DevOps, SREs, and the rest of the software industry prefer the terms playbook and runbook.
While you can use the broader term SOP as meaning higher-order guidance, playbooks and runbooks lay out the response to particular incidents or specify how to perform any routine duty, such as deploying a new container instance on the cloud or running an infrastructure backup.
Although they’re different, runbooks, playbooks, and SOPs have one thing in common: they reduce operating time by increasing efficiency and improving success rates. This post describes runbooks and playbooks and how they can help close the skill gap between teams and ensure success at work.
A playbook is a unique overarching set of guides that an organization has prepared and compiled for its teams. In contrast, a runbook is a specific outline for helping with a task, bridging the differences in staff skill sets. For example, a senior SRE can create and publish the runbook so that the rest of the team can use it to perform their duties efficiently.
In other words, a playbook is to a runbook what a car manual is to a tire repair guide. A playbook might contain higher-level objectives and routine tasks that a company might not use daily. A runbook, however, helps outline how to perform specific tasks. If automated, the runbook might remove or reduce redundant tasks performed by the SREs.
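To make the idea of a runbook as a specific, step-by-step outline more concrete, here is a minimal sketch in Python. The `Step` and `Runbook` classes and the backup example are illustrative assumptions, not a standard format; the sketch only shows that a runbook can mix manual steps with steps that code can execute.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One instruction in a runbook (hypothetical structure)."""
    description: str
    # If set, code can execute the step; if None, a person performs it.
    action: Optional[Callable[[], str]] = None


@dataclass
class Runbook:
    """An ordered list of steps for one routine task."""
    title: str
    steps: List[Step] = field(default_factory=list)

    def render(self) -> str:
        """Return the runbook as a numbered checklist a person can follow."""
        lines = [self.title]
        for number, step in enumerate(self.steps, start=1):
            kind = "automated" if step.action else "manual"
            lines.append(f"{number}. [{kind}] {step.description}")
        return "\n".join(lines)


# Example content is made up for illustration.
backup_runbook = Runbook(
    title="Database backup",
    steps=[
        Step("Confirm no migration is in progress"),
        Step("Take a snapshot", action=lambda: "snapshot taken"),
        Step("Verify the snapshot can be restored"),
    ],
)
print(backup_runbook.render())
```

Keeping the steps as data rather than free text means the same runbook can be rendered as a checklist today and have individual steps automated later.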
You can find an example of Ansible Playbooks in Ansible’s open-source repository. These playbooks organize routine tasks by topic, and a single document contains multiple operations with the proper verifications. Following software principles, a playbook or runbook can be versioned and audited, although since playbooks are higher-level, reviews may only happen every few years. Runbooks, however, change as the infrastructure changes so that they always describe the current best way to perform a task. Every infrastructure version should have runbooks to help DevOps and SREs quickly respond to incidents.
Runbooks often document routine tasks for technical operations teams, and that documentation can be audited regularly and updated as needed. You can also use the SOPs in DevOps or SRE cookbooks, which help SREs perform their routine backup and recovery tasks.
When incidents occur, a TechOps or SRE team’s primary focus is to identify what caused an undesirable behavior or incident and restore service as quickly as possible, often through a temporary fix. Every organization has some recovery operations to restore the system to a functional state, including clearing the cache, draining the containers from an orchestrator, and resetting configurations. The action that the company needs to take depends on what kind of incident occurred. Then, in the longer term, organizations use problem management to identify the ultimate root cause and help the system become more resilient and reliable.
Incident management involves a complex web of information. To better assess the incident and find a solution, teams must fetch logs from various monitoring servers, check the application stats from the production environment, and then perform steps to bring service back online. Then, software teams must often use multiple products and services to create an incident report. For example, teams might use Jira for a ticket, GitLab or GitHub for version control, and Slack or Microsoft Teams for communication.
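Because the recovery action depends on the kind of incident, one simple way to encode that choice is a lookup table from incident type to action. The sketch below is illustrative only: the incident type names and the three placeholder functions are assumptions, and a real version would call your own cache, orchestrator, and configuration tooling.

```python
from typing import Callable, Dict


def clear_cache() -> str:
    # Placeholder: a real implementation would call your cache service.
    return "cache cleared"


def drain_containers() -> str:
    # Placeholder: a real implementation would talk to your orchestrator.
    return "containers drained"


def reset_configuration() -> str:
    # Placeholder: a real implementation would restore known-good settings.
    return "configuration reset"


# Hypothetical incident types mapped to the recovery operation to run.
RECOVERY_ACTIONS: Dict[str, Callable[[], str]] = {
    "stale_data": clear_cache,
    "unhealthy_node": drain_containers,
    "bad_config": reset_configuration,
}


def recover(incident_type: str) -> str:
    """Run the recovery action for an incident type, or ask for escalation."""
    action = RECOVERY_ACTIONS.get(incident_type)
    if action is None:
        return "no runbook found: escalate to the on-call engineer"
    return action()


print(recover("stale_data"))
print(recover("disk_full"))
```

The explicit fallback matters: an incident type without a runbook should lead to escalation rather than a guess.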
Here’s where runbooks help: They contain the historical knowledge gained from past incidents and retrospectives. They become a living record of best practices learned over time.
A runbook helps share the knowledge of subject-matter experts, senior engineers, and those who have continually handled this specific task with the broader community, even if they are unavailable or no longer part of the organization. This turns institutional knowledge into documentation that everyone can use.
You can create new runbooks based on previous problem-solving experiences. A runbook can be a manual for the team, a semi-automated operation with a team member kept in the loop, or a fully automated procedure that performs routine duties.
Start with finding the repetitive tasks in your routine. Adding a runbook can help avoid human error while increasing operations speed and efficiency. For example, if your SRE must reset an application’s configurations every day, creating a human-in-the-loop automated runbook helps. Subject-matter experts and senior employees often draft the documentation for the steps needed. These experts ensure that the workflows and processes are executed with industry compliance and regulations in mind.
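The configuration-reset example above can be sketched as a human-in-the-loop step: the code works out what would change, a person approves or declines, and only then is the change applied. This is a minimal sketch under stated assumptions. The function name `reset_config`, the `approve` callback, and the dictionary-based configuration are all hypothetical; in practice the approval might come from a terminal prompt or a chat message.

```python
from typing import Callable, Dict


def reset_config(
    config: Dict[str, str],
    defaults: Dict[str, str],
    approve: Callable[[str], bool],
) -> Dict[str, str]:
    """Reset drifted keys to their defaults, but only after a human approves.

    Returns a new dictionary; the input configuration is never modified.
    """
    # Find every default key whose current value has drifted.
    changed = sorted(key for key in defaults if config.get(key) != defaults[key])
    if not changed:
        return dict(config)  # nothing to reset

    # Show the human exactly what will change before doing anything.
    summary = "Reset keys: " + ", ".join(changed)
    if not approve(summary):
        return dict(config)  # the human declined, so leave everything as-is

    updated = dict(config)
    updated.update({key: defaults[key] for key in changed})
    return updated


# Example values are made up for illustration.
defaults = {"log_level": "info", "pool_size": "10"}
current = {"log_level": "debug", "pool_size": "10", "region": "eu"}

# Here the "human" always approves; swap in a real prompt as needed.
print(reset_config(current, defaults, approve=lambda summary: True))
```

Passing the approval in as a callback keeps the reset logic testable and lets the same step run fully automated once the team trusts it.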
A good runbook is:
A set of runbooks should also cover a complete software development lifecycle and be an active part of that cycle. Be sure to keep the document updated with all the latest changes in the infrastructure or solution. Also, update the runbook with the changes used to solve the problem and whenever the organization’s practices shift. A runbook template can help you get started.
Documentation is never finished. Organizations must nurture and update the information over time to ensure it’s current and usable when needed. Assign an owner to the documentation to be responsible for its upkeep. The owner should continually seek feedback from the runbook’s users, using their experience to update and improve it.
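As a small illustration of ownership and upkeep, the sketch below flags runbooks that have not been reviewed recently so their owners can be reminded. The `RunbookRecord` fields and the 90-day threshold are assumptions chosen for the example, not a recommendation from this post.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class RunbookRecord:
    """Metadata about one runbook (hypothetical fields)."""
    title: str
    owner: str
    last_reviewed: date


def needs_review(
    records: List[RunbookRecord],
    today: date,
    max_age_days: int = 90,  # assumed threshold; pick what suits your team
) -> List[RunbookRecord]:
    """Return the runbooks whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [record for record in records if record.last_reviewed < cutoff]


# Example records are made up for illustration.
records = [
    RunbookRecord("Deploy to production", "alice", date(2024, 1, 10)),
    RunbookRecord("Database backup", "bob", date(2024, 5, 1)),
]
for record in needs_review(records, today=date(2024, 6, 1)):
    print(f"{record.title}: ask {record.owner} to review")
```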
A great way to get immediate feedback is through dynamic documentation. As opposed to static documentation with simple formatted text, dynamic documentation consists of rich, real-time information, functioning as a live, managed-as-code, data-centric, and integrated center of knowledge. Dynamic documentation encompasses human-in-the-loop automated runbooks with automatic documentation of all human and machine actions.
Dynamic documentation ensures that runbook owners know who used runbooks and how. This historical knowledge provides real-time feedback for the runbook owner to improve the runbook over time.
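One way to picture the automatic record of human and machine actions is a runbook wrapper that logs every step it runs. This is a generic sketch, not how any particular product implements dynamic documentation; the `AuditedRunbook` and `AuditEntry` names and fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass
class AuditEntry:
    """A record of one executed step."""
    runbook: str
    step: str
    actor: str  # a person's name, or "automation" for machine actions
    outcome: str
    timestamp: datetime


class AuditedRunbook:
    """Runs steps and records who ran each one and what happened."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.log: List[AuditEntry] = []

    def run_step(self, step: str, actor: str, action: Callable[[], str]) -> str:
        """Execute one step and log the outcome, including failures."""
        try:
            outcome = action()
        except Exception as error:
            outcome = f"failed: {error}"
        self.log.append(
            AuditEntry(self.name, step, actor, outcome, datetime.now(timezone.utc))
        )
        return outcome

    def usage_by_actor(self) -> Dict[str, int]:
        """Count how many steps each actor has run."""
        counts: Dict[str, int] = {}
        for entry in self.log:
            counts[entry.actor] = counts.get(entry.actor, 0) + 1
        return counts


runbook = AuditedRunbook("restart-service")
runbook.run_step("check health", "alice", lambda: "healthy")
runbook.run_step("restart service", "automation", lambda: "restarted")
print(runbook.usage_by_actor())
```

A runbook owner could review a log like this to see which steps fail often or are rarely used, and revise the runbook accordingly.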
Although many DevOps practitioners use the three words interchangeably, runbooks differ from playbooks and SOPs. To learn more about runbooks and the seven sections that comprise a good runbook, read Tom Limoncelli’s Operations Report Card. It’s also helpful to read Transposit’s guide to runbook concepts and best practices.
A runbook should be self-sufficient: it should contain enough for anyone across development and operations to complete the task. Runbooks for incident management must be easily understandable in the context of an incident and incident response.
Transposit’s connected workflow platform can help team members create an automated task that anyone on the team can execute to begin the process toward resolution. Human-in-the-loop automated runbooks can trigger typical repeatable tasks, such as creating a Slack channel or starting a ticket, each time an incident occurs. This type of runbook serves as dynamic documentation, enabling teams to automate tasks, accelerate response, reduce manual toil, and collaborate using real-time data from any system of engagement.
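The idea of triggering the same repeatable tasks each time an incident occurs can be sketched generically as a list of handlers run against every new incident. This is not Transposit's API; `create_channel`, `open_ticket`, and the incident fields are hypothetical stand-ins that return strings instead of calling a real chat or ticketing service.

```python
from typing import Callable, Dict, List, Optional

Incident = Dict[str, str]
Task = Callable[[Incident], str]


def create_channel(incident: Incident) -> str:
    # Hypothetical stand-in: a real version would call your chat tool's API.
    return f"channel #inc-{incident['id']} created"


def open_ticket(incident: Incident) -> str:
    # Hypothetical stand-in: a real version would call your ticketing system.
    return f"ticket opened: {incident['title']}"


# Tasks that should run every time an incident is declared.
DEFAULT_TASKS: List[Task] = [create_channel, open_ticket]


def handle_incident(
    incident: Incident, tasks: Optional[List[Task]] = None
) -> List[str]:
    """Run each task in order and collect a description of what was done."""
    selected = DEFAULT_TASKS if tasks is None else tasks
    return [task(incident) for task in selected]


# Example incident is made up for illustration.
for result in handle_incident({"id": "42", "title": "API latency"}):
    print(result)
```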