"Automate Everything" is a popular mantra in modern DevOps and SRE, yet human judgment is still a key factor in every stack's management, from daily operations to incidents.
Last week, Niall Murphy - instigator and co-author of the Google SRE books, and SRE lead at Azure - and Tina Huang, Transposit co-founder and CTO, met up via Zoom from California and Ireland for a fireside chat about how to make the most of automation in today’s complex engineering environments across DevOps, SRE, and beyond. They touched on a number of interesting topics, from how the move from on-prem to multi-cloud has changed the role of automation in a stack, to the constraints and considerations around using machine learning to replace human judgment.
In this second installment of the series, we’ll continue with our favorite highlights of what Niall and Tina had to say about adding progressive automation to your stack while reducing the risk of automations gone awry.
Stephen (moderator): In the old world, IT set up the machines, engineers wrote the code that went on the machines, and there was more of a separation of church and state than we see today with things like Infrastructure as Code, where engineers are making the operational decisions about what kinds of machines to run for what kinds of workloads. Can you speak to this ecosystem - how do people have to work and blend together across the organization to achieve uptime and SRE goals?
Niall: I’ll quote myself from earlier in the conversation - what can you successfully ignore? What the last couple of decades have shown us is that as complexity grows, abstractions get leakier and generate more errors. Complexity is harder for people to work with, and error rates go up in various ways... The journey we’re still on across the industry is that this looks like a technical configuration, and those concerns are in the technical domain, but behind the scenes there’s a social/organizational component to how we address it. The split between development and operations originates around the 1950s, when mainframes became popular after WW2, and it persists for decades on the assumption that you can successfully split, or successfully ignore, a given set of questions and address them to different groups. But it turns out that you can’t do that - or you can do it to a certain point, but the abstraction that the split represents itself becomes too leaky for a modern organization to successfully make progress. This is all about feedback loops, all about deployment time shrinking. Stability is ensured not by batching up changes, carefully controlling things, having a change control board, and all the wonderful stuff that Nicole Forsgren writes about in Accelerate (I thoroughly recommend that book if you haven’t read it already), but rather by having frequent smaller changes and feedback loops that aren’t broken.
"I strongly believe that a more holistic view towards the construction of software is hugely important."
If you’re still in a position where your development teams and operations teams are viewing the world in very different ways - where the operational consequences of the constructed software are essentially an externality (in the economic sense) imposed by the team that produces it, and the operations team suffers - then what you have is a product that is less good than it could be, and less agile. I strongly believe that a more holistic view towards the construction of software is hugely important, because in the industry feature development is hugely important, and then there’s everything else: security, reliability, performance, and so on. It turns out that if you treat those separately - and there are social components here, around the extent to which different groups can influence different stages of the software development lifecycle - all of those things are, in the DevOps world, hopelessly - or hopefully - intertwined. We should accept that we’re searching for simplicity by drawing hard boundaries in the organization, socially and technically, but what we actually need to do is unify more.
"We should accept that we’re searching for simplicity by drawing hard boundaries in the organization, socially and technically, but what we actually need to do is unify more."
Stephen: What are the main challenges that organizations face and how can they get closer to this world of everything’s automated, everything’s perfect, and the on-call team never has to wake up at 2am again?
Niall: I’m not sure I’d couple “everything’s automated” with “everything’s perfect” - they aren’t necessarily the same thing. On the problem-generating side we have everything to do with highly dynamic environments - running an internet service, say, where as soon as you open a TCP port on the internet you get stuck - that’s just what happens. So there’s a lot of complexity and a lot of leaky abstractions involved in running services in general, and I’d choose to highlight two things specifically:
"We actively do damage to ourselves in the organizational/socio-cultural end of things when in the act of deciding to page we are insufficiently discriminating about what we’ll choose to page on."
The first is that we actively do damage to ourselves in the organizational/socio-cultural end of things when, in the act of deciding to page, we are insufficiently discriminating about what we’ll choose to page on. Often in organizations and teams, there are pages which happen because three and a half years ago someone said something might be a problem - “CPU high” - and now in the middle of the night the CPU is high and you get a page, and you just file it and move on, because “CPU high” hasn’t been correlated with any individual incident since then, but because it happened once we have to preserve it forever. So there are a lot of mental frameworks around the management of alerts and the generation of pages that flow from inappropriate conservatism about what is actually contributing to alerts. I was in this camp as well. I was on a team with “CPU high”, “CPU very high”, “CPU really too high” style alerts - and these were viewed as providing valuable intelligence for the product development team. They might have provided it, but there was no need to page anyone about it. That’s my advertisement for SLO-based alerting, which you can read about in the upcoming SLO book by Alex Hidalgo.
"There are a lot of mental frameworks around the management of alerts and the generation of pages that flow from inappropriate conservatism about what is actually contributing to alerts."
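To make the contrast with “CPU high” alerting concrete, here is a minimal sketch of SLO-based alerting: page on how fast the error budget is burning, not on raw resource metrics. The 99.9% target and the 14.4x fast-burn threshold are common illustrative conventions, not figures from Niall or the SLO book.

```python
# A sketch of SLO-based alerting: page on error-budget burn rate,
# not on "CPU high". All thresholds here are illustrative.

def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """How fast the error budget is burning relative to the SLO.

    A burn rate of 1.0 would exactly exhaust the budget over the SLO
    period; a much higher rate over a short window means the budget
    is burning fast enough to justify paging a human.
    """
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target           # e.g. 0.001 for 99.9%
    observed_error_rate = errors / total
    return observed_error_rate / error_budget

def should_page(errors: int, total: int,
                slo_target: float = 0.999,
                fast_burn_threshold: float = 14.4) -> bool:
    # Page only when waiting for business hours would blow the SLO.
    return burn_rate(errors, total, slo_target) >= fast_burn_threshold

# 20 errors in 1,000 requests = 2% error rate = 20x burn at 99.9%.
print(should_page(20, 1000))    # True: worth waking someone for
print(should_page(1, 10000))    # False: handle in business hours
```

Note that CPU never appears here: the decision to page is driven entirely by user-visible error rate against the objective, which is the discrimination Niall is arguing for.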
The second thing I’d choose to surface is that a lot of the time people are scared of automation, especially senior decision makers, because there’s some intuition (that isn’t wholly wrong) that the environment is complicated and automation can end up doing something bad, as per our previous discussion. So they think, “you are paying people to be on-call, so they should be on-call, damn it!” They are addressing the complexity problem and the dynamic-environment problem by throwing people at it. And the difficulty is that human attention is a finite thing, a non-renewable resource.
"...the difficulty is that human attention is a finite thing, a non-renewable resource."
If you just treat your teams this way, they eventually leave. It’s not a great problem domain for human cognition to be applied to. That’s where automation comes in, as a support for human activity - in particular, the simpler things you can do to offset certain kinds of problems. There are 80-20 relationships there: a terrible thing happens to a data center, but your automation handles the redirection of the traffic automatically; you don’t need to do anything about it, and no one needs to be paged. It’s ward medicine instead of emergency medicine, and you can look at the problem in the relative safety of business hours.
"There are 80-20 relationships there... It's ward medicine instead of emergency medicine..."
Tina: To start with what Niall said about this 80-20 relationship between human activity and automation - I think that’s the core of it. It’s counterintuitive, but the path to trusting and building more automation is focusing on smaller chunks of automation. One of the problems is that if you try to automate everything, it tends to be very fragile. It’s hard to create that script, and the worst case is for an automation to try to remediate and fail, leaving the on-call person stuck not understanding why it didn’t work out. So you’re much better off with that 80-20 relationship, where you leave the parts that are changing quickly and are product-dependent to the human, and automate things like making it faster to restart servers or move traffic to a different data center.
"The path to trusting and building more automation is focusing on smaller chunks of automation."
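One way to picture these “smaller chunks”: automate only the mechanical step, and keep the judgment call with the human. A minimal sketch, where the host names, the `ssh`/`systemctl` command, and the dry-run convention are all hypothetical choices of mine, not anything Tina prescribed:

```python
# The mechanical step (restarting a service across hosts) is automated;
# the judgment - whether and where to run it - stays with on-call.
import subprocess

def restart_service(hosts: list[str], service: str,
                    dry_run: bool = True) -> list[str]:
    """Restart `service` on each host, printing each command first.

    With dry_run=True (the default), nothing executes: the on-call
    engineer reviews the plan, then re-runs with dry_run=False.
    """
    commands = [f"ssh {host} sudo systemctl restart {service}"
                for host in hosts]
    for cmd in commands:
        print(("DRY RUN: " if dry_run else "RUN: ") + cmd)
        if not dry_run:
            subprocess.run(cmd, shell=True, check=True)
    return commands

# The human decides which hosts are actually misbehaving.
restart_service(["web-1.example.com", "web-2.example.com"], "nginx")
```

A small, reviewable helper like this is much harder to get wrong than an end-to-end auto-remediation pipeline, which is exactly the fragility trade-off Tina describes.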
But part of where we started this conversation was with the question of what the pathway to that dream of automation is. Start by asking: what is automation? It’s a codified process for machines to execute. And what does the same thing look like for humans? That’s documentation. Documentation is fundamentally a codified process for humans to do things. So how do you get to the point where a machine can do it? Step one is to document the process and test it with humans, have it reliably consumed, and then over time introduce small chunks of automation into that documentation - that 80-20 split. The simplest example of this is a Confluence doc pointing to a script that humans can run, and slowly increasing the amount of scripting that happens.

But then the other aspect of the fear of automation is that oftentimes, as these live as scripts, they don’t get the collaborative environment for building automation that we have in other tooling. Sometimes people have their own copies of scripts on their machines, which may or may not be up to date. There isn’t an easy way for them to have documentation and commentary and code review for that automation. So the other piece is to have a SaaS service - a hosted environment - for the automation to run in. It’s what we see with Terraform Cloud or managed Terraform solutions, or CircleCI for deploys: consistent environments to actually run this automation in, so that it feels more trustworthy. You don’t have to worry that because you upgraded your Python this time, on-call gets paged when the script fails with some Python error.
"The other aspect of the fear of automation is that oftentimes, as these live as scripts, they don’t get the collaborative environment for building automation that we have in other tooling."
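Tina’s “documentation first, then small chunks of automation” path can be sketched as a runbook that mixes human steps with scripted ones. Everything here - the step texts, the script paths, the structure itself - is an illustrative assumption, not any particular product’s format:

```python
# A runbook as data: some steps are still human-executed (automation
# is None), some are backed by a script. Over time, more steps gain
# an automation entry - the progressive 80-20 split in practice.

RUNBOOK = [
    {"step": "Confirm the alert is a real user-facing problem",
     "automation": None},                        # human judgment
    {"step": "Drain traffic from the affected data center",
     "automation": "scripts/drain_traffic.sh"},  # hypothetical script
    {"step": "Restart the frontend fleet",
     "automation": "scripts/rolling_restart.sh"},
    {"step": "Write up what happened for the postmortem",
     "automation": None},
]

def automation_coverage(runbook) -> float:
    """Fraction of runbook steps that have been automated so far."""
    automated = sum(1 for s in runbook if s["automation"])
    return automated / len(runbook)

print(f"{automation_coverage(RUNBOOK):.0%} automated")  # 50% automated
```

Tracking coverage like this makes the progression visible: the Confluence-doc-plus-script stage is just a runbook where most `automation` fields are still `None`.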
Stay tuned for part three, as Tina and Niall dive into more advice on progressively automating your stack, socio-cultural tips for your eng org, and how to address the changing role of ML and AI in the DevOps automation space.