A deeper look at human-in-the-loop automation in incident response with real-life examples
In the previous post of this series, DevOps Spectrum of Automation, we talked about what human-in-the-loop automation is and why it is important to keep humans in the loop at crucial decision points for increased reliability and faster resolution. Now that we understand the role of humans in automation, let’s dive into an example of pulling in humans at critical decision points during an incident using human-in-the-loop automation.
You’re on-call, asleep, and it’s 2:30am on the US west coast. You are woken up by a pager alert on your phone because of an increase in HTTP 502 error responses, which has caused the service to breach your team’s service level objective (SLO) of a 97% success rate for API requests.
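To make the alert condition concrete, here is a minimal sketch of the check behind it. The 97% target comes from the scenario above; the `RequestWindow` type, the function names, and the sample numbers are assumptions made up for illustration, not any particular monitoring tool's API.

```python
from dataclasses import dataclass

SLO_TARGET = 0.97  # 97% success rate for API requests, from the scenario


@dataclass
class RequestWindow:
    """Hypothetical summary of API traffic over one evaluation window."""
    total: int
    errors_502: int


def success_rate(window: RequestWindow) -> float:
    # An empty window has no failures, so treat it as fully successful.
    if window.total == 0:
        return 1.0
    return (window.total - window.errors_502) / window.total


def breaches_slo(window: RequestWindow, target: float = SLO_TARGET) -> bool:
    # The pager should fire when the success rate drops below the target.
    return success_rate(window) < target


# Made-up numbers: 450 of 10,000 requests returned HTTP 502.
window = RequestWindow(total=10_000, errors_502=450)
print(f"{success_rate(window):.2%}")  # 95.50%
print(breaches_slo(window))           # True
```

A check like this only decides when to wake someone up. What happens next is still the human's call.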
End-to-end automation isn’t an option for resolving this alert because of many of the reasons we talked about in the last blog post. It needs a human investigator. Since the incident isn’t going to investigate and resolve itself with automation, you are now crawling out of bed to grab your laptop.
Before we go any further, what if you were using human-in-the-loop automation in this incident? What if we augmented part of the work of incident response while still leaving the human operator in control?
First, alongside your alert, you will receive metrics snapshots from your monitoring service and a recommended runbook that focuses on an increase in HTTP 5xx errors. In Lisanne Bainbridge’s Ironies of Automation, she offers some suggestions for human-computer collaboration. One of them is “software-generated displays.” While these looked a lot different in 1983 than they do today, the concept isn’t far off from dashboards and graphs embedded into tooling we already use. It’s important, though, that what is displayed doesn’t jump too far ahead in the investigation process, because the human operator can easily lose track of what is happening, which leads to a slower resolution.
Second, this investigation is much like a “choose your own adventure” experience. There are many different paths you can take. With human-in-the-loop automation, it is helpful to have runbooks that lay out the different potential paths with possible actions based on past incidents or experiences.
For example, you will go through steps that involve first looking at metrics to understand the errors better. Often HTTP 502 errors indicate a problem between a proxy service and its target. With the metrics step, you can confirm that the errors are coming from the load balancer. Next, you will retrieve the logs to see if there’s anything fishy going on. If nothing jumps out at you in the logs, you will check on the status of the group of servers and/or cloud providers that the load balancer is sitting in front of. This might lead you down another path. Retrieving health checks, metrics, logs, and service statuses are all possible actions that you may take with human-in-the-loop automation in an incident. Updating a status page when the incident is degrading performance could be an action too.
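The investigation steps above can be sketched as a small loop in which automation runs one action at a time and then hands control back to the person. Everything here is hypothetical: the action names, the canned results, and the `choose` callback are stand-ins for illustration, and real actions would call your monitoring, logging, and cloud provider APIs.

```python
from typing import Callable, Dict, List


# Hypothetical read-only actions that return canned findings.
def fetch_metrics() -> str:
    return "502s originate at the load balancer"


def fetch_logs() -> str:
    return "nothing unusual in the proxy logs"


def check_backend_status() -> str:
    return "2 of 6 backend servers failing health checks"


ACTIONS: Dict[str, Callable[[], str]] = {
    "metrics": fetch_metrics,
    "logs": fetch_logs,
    "backend_status": check_backend_status,
}

# A chooser sees the available actions and the findings so far,
# and returns the name of the next action (or "done").
Chooser = Callable[[List[str], List[str]], str]


def run_investigation(choose: Chooser) -> List[str]:
    """Run one action at a time, returning to the human after each."""
    findings: List[str] = []
    while True:
        choice = choose(sorted(ACTIONS), findings)
        if choice == "done":
            return findings
        if choice not in ACTIONS:
            findings.append(f"unknown action: {choice}")
            continue
        findings.append(f"{choice}: {ACTIONS[choice]()}")


def ask_human(available: List[str], findings: List[str]) -> str:
    """Interactive chooser: show the findings so far, then prompt."""
    for line in findings:
        print("  " + line)
    return input(f"Next action {available} or 'done': ").strip()


def scripted(choices: List[str]) -> Chooser:
    """Non-interactive chooser that replays a fixed list of decisions."""
    remaining = iter(choices)
    return lambda available, findings: next(remaining, "done")


# Replay the path from the example: metrics, then logs, then backends.
for finding in run_investigation(scripted(["metrics", "logs", "backend_status"])):
    print(finding)
```

In a real incident you would pass `ask_human` instead of `scripted(...)`. The point of the design is that the loop never picks the next step on its own.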
Human-in-the-loop automation pragmatically includes some automation without pushing the human operator out of the decision. It always loops back to the human to use their judgment and intuition to take the next action or choose a different path. In your incident, one of the paths you go down finally leads to you spotting what might be the problem. You then try to resolve the issue by going down a resolution path with actions like rolling back a recent commit that started causing the problem. Once you roll back, you wait to confirm that the problem is mitigated for now. It always comes back to you, the human operator. You are pretty sleepy, but luckily the runbook lays it out clearly, which helps prevent errors, saves time, and gets you back to sleep sooner.
It would be bad practice to build automation that automatically rolled back a commit based on this specific alert. This incident needed human judgment and context, which is why we need a human in the loop for incident response. Even if you could automate handling this alert, automation generally requires a speed versus correctness tradeoff. It is hard to validate automation’s correctness during an incident because it moves so quickly. The human operator loses the context of what is going on, which can cause more problems if a human then needs to intervene. A human has the intuition and emotional intelligence to decide whether an action is the best decision for customers.
Human-in-the-loop automation can draw on historical information from past incidents. It can surface helpful documentation like runbooks and lessons from previous incidents. Have you ever tried to search Confluence or other wikis under pressure for useful information? It isn’t easy. Connecting alerts with runbooks and additional context can guide on-call engineers to a faster resolution. As Ian Miell’s blog post about runbooks says, "if you have a wide-ranging set of contexts for your problem space, then a runbook provides the flexibility to [be] applied in any of these contexts when paired with a human mind. For example: your shell script solution will need to reliably cater for all these contexts to be useful; not every org can use your Ansible recipe; not every network can access the internet."
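Connecting alerts with runbooks can start out very simply. The sketch below matches an alert title against a few patterns and returns a recommended runbook. The patterns, the runbook paths, and the alert titles are all invented for illustration; a real setup would more likely key off structured alert labels than titles.

```python
import re
from typing import Optional

# Hypothetical mapping from alert patterns to runbook locations.
RUNBOOKS = [
    (re.compile(r"^HTTP 5\d\d"), "runbooks/http-5xx-errors"),
    (re.compile(r"latency", re.IGNORECASE), "runbooks/high-latency"),
]


def recommend_runbook(alert_title: str) -> Optional[str]:
    """Return the first runbook whose pattern matches the alert title."""
    for pattern, runbook in RUNBOOKS:
        if pattern.search(alert_title):
            return runbook
    # No match: the on-call engineer starts without a recommendation.
    return None


print(recommend_runbook("HTTP 502 error rate above SLO"))  # runbooks/http-5xx-errors
print(recommend_runbook("Disk almost full on db-1"))       # None
```

The recommendation is a starting point only. The engineer can still pick a different runbook once they see what is actually going on.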
In another incident, you might see a specific percentage of requests to an API having high latency over a window defined in an SLO. You could set up human-in-the-loop automation to automatically get all the logs from different systems related to that particular API request. As Niall Murphy said in his recent fireside chat with our CTO and co-founder Tina Huang, “no one cares about the single machine case anymore. Single machine is dead.” Our call logs are much more distributed today. We run analytics and monitoring in one place, web services in another, and virtual machines somewhere else. This type of automation gathers the context the human operator needs in one place, so they can apply the runbook based on what they observe in the logs while triaging, coordinating, mitigating, and resolving the incident.
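Here is a rough sketch of that log-gathering step. It assumes every system tags its log entries with a shared request ID, which is an assumption on my part and not something every stack provides. The source names, fields, and log lines are made up, and in practice each source would be an API call to a different service instead of an in-memory list.

```python
from typing import Dict, List

# Hypothetical log sources, keyed by the system they live in.
LOG_SOURCES: Dict[str, List[dict]] = {
    "load_balancer": [
        {"ts": 3, "request_id": "req-42", "msg": "upstream timed out"},
        {"ts": 1, "request_id": "req-7", "msg": "200 OK"},
    ],
    "web_service": [
        {"ts": 2, "request_id": "req-42", "msg": "slow query: 4.8s"},
    ],
    "virtual_machines": [
        {"ts": 1, "request_id": "req-42", "msg": "cpu steal at 35%"},
    ],
}


def collect_logs(request_id: str,
                 sources: Dict[str, List[dict]] = LOG_SOURCES) -> List[dict]:
    """Gather entries for one request from every source, oldest first."""
    matches = [
        {**entry, "source": name}  # remember where each line came from
        for name, entries in sources.items()
        for entry in entries
        if entry["request_id"] == request_id
    ]
    return sorted(matches, key=lambda entry: entry["ts"])


for entry in collect_logs("req-42"):
    print(entry["ts"], entry["source"], entry["msg"])
```

The automation only collects and orders the evidence. Reading the timeline and deciding what it means stays with the person on call.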
Manual workflows often contain many steps and are error-prone, like typing the wrong command in a command line. Think of all the incidents that were caused by running the incorrect command or because a step was skipped. While the fallout from these missteps might be evidence of more significant issues, human-in-the-loop automation can save time and mitigate these problems by defining a common set of actions that can be taken. Often these manual processes are used for communication during an incident, both internally and externally. Think about every time you believe an API or service you use is having an outage, yet you see green across the board on their status page. It’s pretty frustrating.
Imagine being on-call again, luckily not at 2:30am, but this time you are dealing with an outage that affects customers and systems that are owned by other teams. Even if you are working with an incident commander, this requires a lot more coordination than something that only affects your team.
Human-in-the-loop automation can also automate coordination with other stakeholders while the human operator stays in charge. For example, specific actions might need approval from, or coordination with, other team members or teams. You can build this coordination into your human-in-the-loop automation, like reaching out to specific stakeholders on Slack, without taking up the valuable time of an on-call engineer. In a single incident, we often see six or more communication platforms being used: a phone, internal ticketing, a wiki for documentation, customer ticketing, chat for team collaboration, email to update other relevant stakeholders, and possibly even more. Streamlining communication can save valuable time and focus in the middle of an incident and after an incident is over.
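One way to picture built-in coordination is an action that refuses to run until the right people have signed off. This is a minimal sketch under assumptions: `GatedAction`, the approver names, and the rollback function are hypothetical, and the `outbox` list stands in for whatever chat integration would actually notify stakeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

outbox: List[str] = []  # stand-in for messages sent to a chat tool


@dataclass
class GatedAction:
    """A runbook action that only runs once every approver has signed off."""
    name: str
    run: Callable[[], str]
    required_approvers: Set[str]
    approvals: Set[str] = field(default_factory=set)

    def request_approval(self) -> None:
        # The automation does the chasing, not the on-call engineer.
        for person in sorted(self.pending()):
            outbox.append(f"@{person}: please approve '{self.name}'")

    def approve(self, who: str) -> None:
        # Approvals from people outside the required set are ignored.
        if who in self.required_approvers:
            self.approvals.add(who)

    def pending(self) -> Set[str]:
        return self.required_approvers - self.approvals

    def execute(self) -> str:
        if self.pending():
            raise PermissionError(
                f"'{self.name}' still needs approval from {sorted(self.pending())}"
            )
        return self.run()


def roll_back_release() -> str:
    return "rolled back to previous release"  # placeholder for a real rollback


action = GatedAction("roll back payments service", roll_back_release,
                     required_approvers={"incident_commander", "payments_owner"})
action.request_approval()
action.approve("incident_commander")
action.approve("payments_owner")
print(action.execute())
```

The operator still decides when to call `execute`. The gate only makes sure the coordination has happened before the action runs.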
There’s often a lot to consider in an incident. We believe that you should let the human be the detective while the system is the trigger. Transposit wants to provide tooling to help you investigate while making it easy for you to take action.
You can piece together your own human-in-the-loop automation, but the reality is that you need integrations with many different services to provide the context, data, actionability, and tooling that make human-in-the-loop automation and interactive runbooks useful during an incident.
Transposit's interactive runbooks that learn can guide engineers with human-in-the-loop automation that incorporates past experiences and lessons learned from incidents. Actionability via integrations makes it easier to build out human-in-the-loop automation for incidents that can range across the spectrum of automation. It allows any team to implement a more pragmatic approach to automation than to “just automate” the whole process. This approach embraces humans while augmenting the parts of the human operator’s job that often feel like toil and slow down incident response.
Hopefully, these examples give you a better idea of how to include more automation in your incident response and how to make the on-call experience something that your team doesn’t dread or feel overwhelmed by. If you want to chat about the idea of human-in-the-loop automation or how APIs and humans can work better together, I’d love to hear from you. You can tweet at me @taylor_atx on Twitter.