This blog post is part four of a five-part blog post series. This series has covered the spectrum of automation that we often face while practicing DevOps and other modern IT operations. If you want to learn more about the series, see our past blog posts on scripts, orchestration, and runbooks.
In many cases, end-to-end automation is not going to make operating software easier. I often get the question: “Why don’t you just automate it?” But not everything can be turned into end-to-end automation. This doesn’t mean the value of automation stops there, though. In this blog post, I want to explore what human-in-the-loop automation is and how it allows us to embrace humans while progressively automating. I also want to look at how it can surface useful data and information more easily and improve the on-call experience, with less cognitive overload and faster resolution times.
Human-in-the-loop is typically defined as a model that requires human interaction. In a human-in-the-loop model, humans can interact with and influence the direction of the model or simulation they are training.
Human-in-the-loop automation is when humans step in at critical decision points while progressively automating their systems. Human-in-the-loop automation aims to find the right balance between end-to-end automation and human involvement. If you rely only on end-to-end automation, you might find your attempts at automation to be unsuccessful or even harmful to operating your systems.
Human-in-the-loop automation is different from the end-to-end automation that you might use in your CI/CD pipeline, backups, database management, orchestration of a multi-tier architecture, or other processes that have no human involvement from start to finish. This isn’t to say that there aren’t different levels of automation while practicing human-in-the-loop. The level varies depending on what you are automating. As you can see in the spectrum of automation, human-in-the-loop automation spans a wide range of the spectrum, from runbooks to orchestration. The key indicator is human involvement, where humans control the decision-making process, not machines.
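To make the distinction concrete, here is a minimal sketch of a runbook in which most steps run automatically but a step marked as a critical decision point waits for a person. Everything in it (`Step`, `RunResult`, `run_runbook`, `ask_on_terminal`) is a hypothetical name made up for illustration, not the API of any particular tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One step in a hypothetical runbook."""
    name: str
    action: Callable[[], str]      # the automated work; returns a short result
    needs_approval: bool = False   # True marks a critical decision point


@dataclass
class RunResult:
    completed: List[str] = field(default_factory=list)
    stopped_at: Optional[str] = None  # name of the step a human declined, if any


def run_runbook(steps: List[Step], approve: Callable[[Step], bool]) -> RunResult:
    """Run steps in order, pausing for a human at each decision point."""
    result = RunResult()
    for step in steps:
        # Automation proposes the step; the human decides whether it runs.
        if step.needs_approval and not approve(step):
            result.stopped_at = step.name
            break
        step.action()
        result.completed.append(step.name)
    return result


def ask_on_terminal(step: Step) -> bool:
    """Simplest possible approval channel: a yes/no prompt (an assumption;
    a real setup might ask in chat or in an incident tool instead)."""
    return input(f"Run '{step.name}'? [y/N] ").strip().lower() == "y"
```

Calling `run_runbook(steps, approve=ask_on_terminal)` would run the unmarked steps without interruption and stop to ask before any step with `needs_approval=True`. Passing the approval function in as a parameter is a design choice in this sketch: the same runbook can move toward more automation later by changing who, or what, answers the question.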
We often need humans to make decisions at critical decision points for a few reasons. Humans have the ability to adapt to change. In Dr. Richard Cook’s paper, How Complex Systems Fail, he says, “practitioners and first line management actively adapt the system to maximize production and minimize accidents. These adaptations often occur on a moment by moment basis.”
These adaptations happen on our teams all the time. To adapt means to adjust or modify fittingly. Many team members carry specialized knowledge, “islands of knowledge,” that help them adapt and make complex decisions considering unexpected factors. Have you ever asked a subject matter expert about a problem and their answer is something like “it depends...?” I feel like it has happened to me many times over the years. How is automation supposed to handle this?
It is hard to automate the unknown knowns and unknown unknowns, which humans are uniquely equipped to handle based on their past experience and expertise. Past events provide humans with the intuition that is beneficial in the middle of an incident. We often see that on-call engineers can confirm or rule out a diagnosis by intuitively matching the signals and symptoms of past incidents and applying them to a current investigation.
End-to-end automation can’t do coordination. You can’t redirect or tell end-to-end automation to focus on something specific once it has started. Redirection is essential to coordination in socio-technical systems. The Interaction Design Foundation defines a socio-technical system as “one that considers requirements spanning hardware, software, personal, and community aspects.” An incident in a socio-technical system is not just a technical problem but also a social problem between humans that requires coordination.
How would you coordinate with auto-scaling that is running to handle different load patterns in your applications? You can’t. It happens automatically, and in the moment, you have little control over it. While we’ve accepted these trade-offs for auto-scaling in some situations, there are situations where we choose not to let a task automatically happen because we need to coordinate with different systems, teams, or both.
In the Ironies of Automation, Lisanne Bainbridge says that “the classic aim of automation is to replace human manual control, planning, and problem solving by computers,” but that “it is ironic to train operators to follow instructions, and then put them in the system to provide intelligence to it.” Automation can take away human operators’ ability to be context-aware. It is essential to focus on automation that embraces humans’ ability to be context-aware, rather than automation that takes it away in favor of a rigid set of instructions.
A vast array of tasks need human subjectivity or guidance. As developers, we cannot program all of the paths that automation may need to take. If we tried, the automation would either take a long time to build or end up very fragile because we didn’t program enough of the paths.
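One way to live with this limit is to automate only the paths you know and hand everything else to a person. The sketch below assumes a small table of hypothetical remediations and a stand-in `page_on_call` function; none of these names come from a real paging or monitoring API.

```python
from typing import Callable, Dict

# Hypothetical remediations for symptoms the team has seen before.
KNOWN_REMEDIATIONS: Dict[str, Callable[[], str]] = {
    "disk_full": lambda: "rotated logs",
    "stale_cache": lambda: "flushed cache",
}


def handle_alert(symptom: str, escalate: Callable[[str], str]) -> str:
    """Automate the paths we know; hand everything else to a person."""
    remediation = KNOWN_REMEDIATIONS.get(symptom)
    if remediation is None:
        # No programmed path: defer to human judgment instead of guessing.
        return escalate(symptom)
    return remediation()


def page_on_call(symptom: str) -> str:
    """Stand-in for a real paging integration (assumed for illustration)."""
    return f"escalated to on-call: {symptom}"
```

With this shape, `handle_alert("disk_full", page_on_call)` takes the automated path, while an unrecognized symptom goes to the on-call engineer. The table can grow as the team learns, without anyone pretending it covers every case.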
Have you ever been paged during an incident while automation was running, without really understanding the automation or how its decision tree was formed? Engineers will say that even auto-scaling can cause these issues for an on-call engineer responding to an incident.
Also, automation cannot see the socio-technical system that we work in. Dr. Richard Cook developed a model of what is above and below the line of representation. Often we see “the system” as the things below the line, like APIs, hardware, tests, orchestration, databases and more, but humans never actually interact with code. Instead, all we have is the representation of the code on our computers. Automation exists below the line. Automation cannot see the cognitive work, goals, or risks that are being considered above the line; therefore, it does not have the context. Only humans hold the context when operating systems. The idea of human-in-the-loop readily allows for identifying goals, purposes, and risks that are not easily identified by automation itself.
In the next part of the blog series, let’s explore more examples of using human intuition at critical decision points with human-in-the-loop automation. We’ll demonstrate how it can make incident management less of a headache, help communication, and aid incident response while leading to faster resolution times.