Deployment time and frequency, test coverage, service level objectives (SLOs), and mean time to recovery (MTTR) are some of the metrics that engineering organizations practicing DevOps care about, but what about metrics around documentation?
When a pager goes off, documentation can quickly become the most useful tool an engineer has on hand, yet we don’t reward documentation work equally. It’s the lowest priority in many development budgets. Engineers see it as unrewarded drudge work that often goes overlooked, even though most engineering and operations teams use it every day. Managers see it as a checkbox requirement.
So, how do we make sure we produce good functional DevOps documentation? In this post, I make a case for why it matters, what makes it hard, and how it can become more common through rewards, including some metrics to consider.
Note: In this post, I am focused on the types of documentation that can’t be automatically generated, although automatically generated documentation can be useful for specific purposes.
As I talked about in one of my blog posts in December, the complexity of our software systems has dramatically increased in the last 20 years, which includes the introduction of DevOps. As complexity increases, documentation has become even more important. Documentation is the primary source of knowledge sharing within an organization and is critical to on-call success.
It is common in organizations practicing DevOps for software teams to take on the bulk of the on-call responsibilities. This makes documentation within these teams critical to incident response and to the reliability and availability of the software they build. With DevOps also comes expanded participation in operations-related tasks. One problem is that engineers who have been on the on-call rotation for a long time, or who are more familiar with a specific technology, often have tribal knowledge that isn’t documented. It is often passed down verbally or in chat messages. This makes it harder to onboard new engineers into an organization and its on-call rotations.
A shift to DevOps is a monumental cultural shift for many organizations. Documentation can help build a common language across the organization that encourages DevOps success. It can help contextualize and motivate people within an organization to move towards best practices.
Lastly, documentation can be a path to more automation. You cannot automate toil away if you haven’t laid out the steps first, and checklists in documentation are perfect for that.
I’ve often seen that developers have no desire to write documentation. It requires time and effort. When there’s pressure to complete tasks or deliver software on time, an engineer would rather use the time elsewhere.
As Riona MacNamara from Google says, “documentation is engineering work.” Sadly, though, it isn’t always seen that way. Engineers are heavily evaluated on their ability to write code, not documentation, so there is no formal incentive structure to encourage writing documentation. For the same reason, they are taught problem solving and algorithms, not how to write technical design documentation and checklist instructions clearly.
So when a team practicing DevOps needs runbooks for their on-call engineers to investigate an incident, it is already an uphill battle.
This challenge is not unique. There are parallels to other areas within software, such as testing. A few decades ago, you might not have seen “ability to write well-tested code” in a job description, which is now more commonplace. Testing used to be done more ad hoc, but now it is professionalized and a core part of the engineering lifecycle. In the same way that automated testing encouraged the creation of tests and testable code, DevOps could do the same for documentation through the pain of maintaining services and being on-call.
In Heidi Waterhouse’s talk, “The power of positive transformation,” she talks about how to make change happen in your organization through positive training. She says that not allowing people to check in code because it lacks documentation is a negative training moment. It doesn’t create lasting change and causes a negative sentiment around documentation.
So how do you get your team to write good functional documentation through positive transformation? It’s not useful to tell people to “just write the docs.” It all comes down to rewards. You need to create reward structures that appeal to the engineers who need to write the docs. It’s important to ask: What rewards this person? This is the time to figure out rewards together.
Often engineers don’t understand how far their documentation can reach. But what if an engineer (and even their manager) finds out that writing a piece of documentation would contribute to faster resolution times, improve the team’s mean time to recovery, and help meet the team’s SLOs? Or maybe it will help them have a better performance review and build a case for a promotion to senior engineer? Or a raise? As Heidi says, “We always want more money.”
These expectations around documentation need to be written into engineering levels too. You can’t expect someone to magically start doing something they had no idea was expected of them. For example, GitLab’s Site Reliability Engineering levels clearly state what is expected:
The junior and mid-level roles include:
"Improves documentation all around, either in application documentation, or in runbooks, explaining the why, not stopping with the what."
While the staff level roles include:
"Writes in-depth documentation that shares knowledge and radiates GitLab technical strengths."
Even though I believe it is a manager’s job to show the value of documentation to teams and organizations, individuals on a team can contribute too. For example, it is helpful for another engineer to mention that they are glad a specific piece of documentation was available during an incident or routine maintenance task. This is one data point that can be collected.
Managers should also make sure to set realistic standards for quality. Engineers were not trained to be technical writers, so they should aim for better quality, instead of the best quality. It is no fun for anyone when a pull request doesn’t get merged because you’ve spent weeks going back and forth over grammar changes.
Riona MacNamara from Google has a great talk on how to do docs better. It includes some useful insight into how to judge documentation quality. According to Riona, there are two ways to measure a piece of documentation: structural quality and functional quality.
Structural quality is related to style and adherence to usage guidelines, including spelling, grammar, voice, tone, and whether the document is well organized and easy to navigate. I talked about some of this in my last blog post on writing runbooks. For functional quality, you have to ask a few questions of the document: Is this document effective? Does it do what it is supposed to do? For example, does a specific runbook reduce the time to troubleshoot during an outage?
Riona proposes that:
If a piece of documentation is well written but functionally useless, the document is not helpful during an incident. (This isn’t to say it can’t improve and become useful!) Functional quality must be the overall goal of documentation. See this part of Riona’s talk if you want an example of how one piece of the documentation might meet all of its core functions.
When you have lots of quantitative metrics in your DevOps practice, it is easy to forget that it is okay for some of your organization’s metrics to be qualitative rather than quantitative. Sentiment data is one source of qualitative metrics. For example, if an engineer shares that they found specific documents written by another engineer useful, this data could be included in your documentation metrics.
You can also pull metrics from user behavior data. For example, if an engineer was able to follow a checklist in the documentation successfully, this is data to collect and a sign of high functional quality. Another example metric is whether an alert was successfully handled using documentation and no documentation-related issues appeared in the postmortem. It can be hard to compare multiple documents with this metric because not all alerts fire, and have their corresponding runbooks used, at the same frequency.
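As a rough illustration of the alert-handling metric above, here is a minimal Python sketch. It assumes you can export one record per alert that notes which runbook was used, whether the alert was resolved by following it, and whether the postmortem flagged a documentation issue. The `AlertRecord` fields, the `runbook_success` helper, and the sample runbook names are all hypothetical, not taken from any particular incident tool. The sketch reports successes alongside the number of uses instead of a bare percentage, since runbooks are not all used at the same frequency.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AlertRecord:
    # All fields are hypothetical; map them to whatever your tooling exports.
    runbook: str                   # identifier of the runbook that was used
    resolved_with_docs: bool       # alert was handled by following the runbook
    doc_issue_in_postmortem: bool  # postmortem flagged a documentation problem


def runbook_success(records):
    """Return {runbook: (clean_successes, total_uses)}."""
    counts = defaultdict(lambda: [0, 0])
    for record in records:
        counts[record.runbook][1] += 1
        if record.resolved_with_docs and not record.doc_issue_in_postmortem:
            counts[record.runbook][0] += 1
    return {name: (ok, total) for name, (ok, total) in counts.items()}


# Made-up sample data, for illustration only.
records = [
    AlertRecord("disk-full", True, False),
    AlertRecord("disk-full", True, True),
    AlertRecord("disk-full", True, False),
    AlertRecord("cert-expiry", True, False),
]

for name, (ok, total) in sorted(runbook_success(records).items()):
    print(f"{name}: {ok} of {total} alerts handled cleanly")
```

Keeping the raw counts visible makes it harder to over-read a perfect score on a runbook that has only been used once.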
Here’s an example of a documentation task that an engineer may work on:
Decrease the time it takes for a new engineer to go on-call with an improved runbook for a frequent alert.
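One way to put a number on a task like this is to compare how long new engineers took to join the rotation before and after the runbook was improved. The Python sketch below assumes you record each engineer’s start date and the date of their first on-call shift. The `median_ramp_up_days` helper and all of the dates are made up for illustration, and a drop in the number could have other causes besides the runbook.

```python
from datetime import date
from statistics import median


def median_ramp_up_days(engineers):
    """Median days between an engineer's start date and first on-call shift.

    `engineers` is a list of (start_date, first_on_call_date) pairs.
    """
    return median((first_shift - start).days for start, first_shift in engineers)


# Made-up dates, for illustration only.
before_improvement = [
    (date(2020, 1, 1), date(2020, 1, 31)),
    (date(2020, 2, 1), date(2020, 3, 2)),
]
after_improvement = [
    (date(2020, 4, 1), date(2020, 4, 21)),
    (date(2020, 5, 1), date(2020, 5, 19)),
]

print("before:", median_ramp_up_days(before_improvement), "days")
print("after:", median_ramp_up_days(after_improvement), "days")
```

The median is used here instead of the mean so that one unusually slow or fast onboarding does not dominate a small sample.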
Whether there are plenty or just a few engineers in your organization who contribute to documentation, it is useful for everyone to think about how to reward and incentivize writing documentation. Positive transformations rarely happen by chance, and having potential metrics to consider going into the transformation can make the process less onerous. If you have other metrics for when engineers are writing documentation, I’d love to hear them. You can find me at @taylor_atx on Twitter.
P.S. I did a super scientific Twitter poll on this topic a few weeks ago when I was thinking about it for only Site Reliability Engineers. 14 out of 63 respondents said documentation was considered “To a Great Extent” in performance reviews. If this is true for your organization, I’d love to find out more about it!