Supercharging Your Stack with Interactive Automation - Part Three

How to decide what to automate first

Ashley Roof
Aug 21st, 2020

“Automate Everything” is a popular mantra in modern DevOps and SRE, yet human judgment is still a key factor in every stack’s management, from daily operations to incidents.

A few weeks ago, Niall Murphy, instigator and co-author of the Google SRE books and Azure SRE lead, and Tina Huang, Transposit co-founder and CTO, met up via Zoom from California and Ireland for a fireside chat about how to make the most of automation in today’s complex engineering environments across DevOps, SRE, and beyond. They touched on a number of topics, from how the move from on-prem to multi-cloud has changed the role of automation in a stack to the constraints and considerations around using machine learning to replace human judgment.

In the third installment of this series, we continue with our favorite highlights of what Niall and Tina had to say about choosing what to automate and what to leave in the hands of engineers.

{% youtube "tqfWFOAQlqc" "" "start=2324" %}

Stephen Trusheim (moderator): So, there’s always this need to have humans in the loop. If we had perfect systems in some abstract notion, automation could fix everything. But there are always these exceptional cases. Downtime is an exceptional situation that we think needs humans to be part of that exceptional solution. If it wasn’t exceptional, we would already have it fixed by code somehow. And so a lot of struggles in the SRE organization, I’ve heard, are around defining the right boundaries between automation, scripting, and those kinds of things, and having humans have the right information to step in and not have the system fighting against them. There have been a couple of outages that were prolonged by the automated systems actually doing the wrong thing or doing something unexpected that humans couldn’t reason about in that exact 2am page. So how do you decide what things should be automated or could be automated, and what could be done or should be done by a human, and what are your criteria for those kinds of things?

Niall Murphy: So I guess there’s a couple of frameworks that would allow you to make those decisions with the key point being that anything that you’re doing should be something you can reverse relatively quickly. My overall intuition in the area is that automation should be coupled to activities you find yourself doing a lot. But, you could probably survey people and stack rank stuff and look at commands executed and so on. You say, “Hmm, we seem to move data from here to here a lot,” and therefore that’s some class of primitive operation which could be effectively automated. In many cases, you can get a significant benefit from that, not just from the removal of human time, but you can also make it more efficient and have it run in the background and have it run at night when people aren’t around and so on and so forth.
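As a rough illustration of the “look at commands executed” idea, the sketch below counts how often each command appears in a list of shell history lines and ranks them. The function name `rank_commands` and the sample history are hypothetical and not something described in the talk; a real survey would also weigh time spent, risk, and how easily the action can be reversed.

```python
from collections import Counter


def rank_commands(history_lines, top_n=3):
    """Return the most frequently run commands (first word of each line)."""
    counts = Counter()
    for line in history_lines:
        parts = line.split()
        if parts:  # skip blank lines
            counts[parts[0]] += 1
    return counts.most_common(top_n)


# Hypothetical sample of commands executed by a team
history = [
    "rsync -a /data/a host:/data/b",
    "kubectl get pods",
    "rsync -a /data/c host:/data/d",
    "rsync -a /data/e host:/data/f",
    "kubectl logs api-1",
    "vim notes.txt",
]

print(rank_commands(history))
```

In this made-up sample the data-moving command comes out on top, which is the kind of “we seem to move data from here to here a lot” signal described above.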

So there is, typically speaking, some pretty good return to those kinds of activities. There’s a famous xkcd cartoon which attempts to show some graph or table of the trade-off between how much time it takes versus how much you should automate, which I do not think captures the full value of automation – I’ve said that before. But that’s one framework. Another framework revolves around: For any sufficiently complicated software system, you can draw arbitrary boundaries at very different places and have things that move from being automation to being part of the product and maybe even back again, etc. So many people think of automation as a kind of meta code that’s less important and in the background of the product, particularly from a product engineer point of view. But actually from an operational point of view, from an operability of the system point of view, that automation, if it’s treated entirely separately, often doesn’t have the same kind of code review qualities, like Tina was talking about earlier. That’s also a misfeature. That’s also kind of a cultural misfire of attitudes towards automation.

“In some sense, the highest quality product is one which is autonomous.”

In some sense, the highest quality product is one which is autonomous – where all of the concerns that you would automate have either been brought back into the product and are now product features and don’t have to be maintained separately, which introduces the possibility to drift and decay and so on – or they’ve been designed so that the problem doesn’t actually occur again as a feature of architecture. I think that a lot of software developers today find themselves on the “microservices train,” for want of a better word. And so the idea of taking a complicated problem domain, decomposing it, having intercommunications… all of these things allow the separate treating of problems, which are hard to do in the monolith case. But they end up creating a whole bunch of other problems around observability, externalizing your call stack and so on. Automation is harder in the microservice case than the monolith case. But the kind of problems that you hit against in the monolith case are often things which are insoluble in that paradigm because there is a limit to how you scale.

Stephen: That definitely makes sense. And Tina, I wanted to ask you the same question. Do you have any thoughts on how to keep humans in the loop effectively and how to make sure that human-in-the-loop involvement goes well in these exceptional cases?

Tina Huang: Yeah. I’d like to start with that second framework that Niall’s mentioned of this notion of automation being this back and forth between something that lives externally and then gets pulled back into the product. Part of where you started this conversation and this question was, “How do we think about what humans should be involved in and what should be automated?” And I think a very simple way of thinking about it is: We can only automate what we can understand. So oftentimes I talk to people and then they fixate or they say, “What about the ideal world, Tina? Isn’t the goal to have everything automatically remediated and taken care of for you?” And in some ways that would eventually lead to pulling that all into your product and therefore you have no outages.

“The reason we have outages is not because of the alerts that we are afraid to shut off. Instead, why you’re paged is fundamentally because something new has gone wrong in your system and you have yet to understand it.”

However, I think unless your company has been purchased by a private equity firm and has halted all future feature developments, the reason we have outages is not because of, as Niall’s mentioned earlier, the alerts that we are afraid to shut off. Instead, why you’re paged is fundamentally because something new has gone wrong in your system and you have yet to understand it. And so step one of that is to have very intelligent humans walk through that process. They have to be able to pull in all of the right context and the right information that they need to debug the situation, and on the fly, figure out what’s going on. Then only when a very, very intelligent human can figure that out and you have some understanding can you start to do that next layer, which is: “I’m gonna write some documentation so that maybe someone who has less familiarity with the system can also follow these rules and move into it.”

Then, only when that happens can we start looking at that 80/20 split and say, “Okay, what are the things that humans are reliably doing every time? What actions are consistently done by a human? Let’s take those chunks and automate those.” And that’s sort of the journey that I like to think of as we talk about the path towards humans and automation. And so, I often say that the goal should really be more, “How can we make humans that are on-call less error-prone and more efficient?” Because the idea that the goal is to replace those humans with automation can basically never happen as long as the product is continuing to evolve and you have these new areas of uncertainty that you have yet to understand.
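One way to picture the “what are humans reliably doing every time” question is to look at the steps responders recorded across past incidents and keep the ones that show up in most of them. The sketch below is only an illustration under stated assumptions: the `consistent_steps` function, the step names, and the 0.8 threshold are all hypothetical and are not taken from the conversation.

```python
from collections import Counter


def consistent_steps(incident_logs, threshold=0.8):
    """Return steps performed in at least `threshold` of the incidents."""
    if not incident_logs:
        return []
    counts = Counter()
    for steps in incident_logs:
        counts.update(set(steps))  # count each step once per incident
    total = len(incident_logs)
    return sorted(step for step, n in counts.items() if n / total >= threshold)


# Hypothetical record of the actions responders took in five incidents
incidents = [
    ["check dashboard", "pull recent deploys", "restart worker"],
    ["check dashboard", "pull recent deploys", "roll back"],
    ["check dashboard", "pull recent deploys", "scale up"],
    ["check dashboard", "pull recent deploys"],
    ["check dashboard", "page database team"],
]

print(consistent_steps(incidents))
```

Here the context-gathering steps are the consistent ones, while the remediation differs from incident to incident and stays with the human.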

Stephen: How do you characterize that level of maturity of automation starting from zero? Like humans have to do every single action to serve the product versus going up from there? And then how can teams move up that maturity level?

“A team in an organization that has a fully documented human process is potentially more mature than the team that has small bits of automation strewn about with completely undocumented human process.”

Tina: I actually think that the better way of thinking about it is, “What’s the maturity level of the team’s process?” That way you don’t have this concrete line between human process and machine process. So that includes human process, which is the amount that you’ve documented repeatable steps that someone can take, mixed with machine process which is code that automatically runs to do those things. And so, honestly, I think that a team in an organization that has a fully documented human process is potentially more mature than the team that has small bits of automation strewn about with completely undocumented human process. And so, I think the first framework is, “How can I talk about maturity level of my team in terms of repeatable process?” And then it’s like, “How efficient and how error-proof is that process?” So the amount of automation that you need is really only a question of, “How much is this costing to have a human do this?” Because it’s time-consuming, it’s keeping you up at night, etc. And how much is the human injecting errors that actually cause even further incidents and more outages?

Stephen: That makes sense. Niall, do you have any thoughts on that, that journey from humans only to automation?

Niall: Yeah, again, there’s a couple of different frameworks you could use. One of them, in classic SRE book fashion, would be to eliminate toil. So the question is, “What are you doing today across the folks in your org who are doing toil? What kind of proportion of time are they spending doing this?” And stack ranking by that in order to find out what to automate. But I think the question is about more than characterizing levels of maturity of automation. It is also a question about the capability of the team, the capability of what they can author from a software perspective, as well as an observation about where they spend their time today. And so I have various issues with maturity models. I think maturity models are most useful from a perspective of, “Okay, we’re at zero, we’re literally not doing this at all. What does one or two look like?” But I think there’s specifically a problem with if you’re at maturity level three in axis X, what does that mean? As opposed to maturity level four in this other axis? Like, would you rather have that level four or that level three? Like there’s no consistent framework for understanding that today.

Stephen: How have you seen teams and organizations successfully build up more deterministic alerting? So alerts only fire when they had an SLO? Going back to the conversation of very high CPU use, it might mean everything, it might mean nothing, but you don’t wanna get paged in the middle of the night in all those cases.

Niall: Actually, that’s the good news about some of the pivoting to SLO-based approaches I’ve seen across the industry in the past while. I won’t say it’s a deterministic process necessarily, but it’s eminently tractable. One way of doing it is you build what’s called an Outage Footprint Mapping which is a mapping between incidents and the alerts that fired and the severity of the incidents in particular. So you can say, “Oh yes, this terrible thing happened and it flowed from these five alerts, and these not terribly terrible things happened, and they flowed from these 17 alerts.” So you, over time, end up tagging or coming to realize that certain kinds of alerts actually just aren’t that important in terms of real business impact, and so you can end up de-prioritizing them in various ways. Now, the nuances of this depend on the monitoring product you’re using and the political state of your team and so on and so forth, but it is 100% tractable to phase out less important alerting. The real trick here is to start the SLO alerting and maintain the symptom-based alerting (CPU high) in the background, and then successfully wean people off the noisy stuff and onto the real stuff that matters, which again is tractable, but it takes some time.
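The mapping between incidents, the alerts that fired, and incident severity can be sketched as a small data structure. This is a minimal illustration, not a description of any particular monitoring product: the incident records, alert names, severity scale (1 is most severe), and the `severe_at_most` cutoff are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical incident records; severity 1 is the most severe
incidents = [
    {"id": "INC-1", "severity": 1, "alerts": ["slo_burn_checkout", "cpu_high"]},
    {"id": "INC-2", "severity": 4, "alerts": ["cpu_high", "disk_70pct"]},
    {"id": "INC-3", "severity": 4, "alerts": ["disk_70pct"]},
]


def footprint(incidents):
    """Map each alert name to the severities of the incidents it fired in."""
    mapping = defaultdict(list)
    for incident in incidents:
        for alert in incident["alerts"]:
            mapping[alert].append(incident["severity"])
    return dict(mapping)


def deprioritize_candidates(mapping, severe_at_most=2):
    """Return alerts that never fired during a severe incident."""
    return sorted(
        alert for alert, severities in mapping.items()
        if min(severities) > severe_at_most
    )


mapping = footprint(incidents)
print(deprioritize_candidates(mapping))
```

With this sample data only `disk_70pct` is flagged, because `cpu_high` also fired during a severe incident. In practice the output would be a starting point for discussion with the team rather than an automatic decision to silence an alert.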

Stay tuned for part four, as Tina and Niall dive into more advice on progressively automating your stack, including how to address the changing role of ML and AI in the DevOps automation space.