How to orchestrate complex systems with automation
This blog post is part two of a five-part blog post series covering the DevOps Spectrum of Automation. If you want to learn more about the series, see the inaugural post here.
Last week, we talked about our longtime, trusted friend: scripts. Now, let's take a step further towards more automation in the DevOps Spectrum of Automation. In this second post, I want to focus on orchestration, a way to manage complex systems at scale more easily.
Terry Cox says it nicely:
"Think of these [orchestration] tools as an orchestra of musicians. The orchestration tool is akin to the conductor. The conductor ensures the right number of instruments are there and that all of them are playing correctly. If there is an issue, an orchestration tool will generally remove the misbehaving instrument and replace it with another. Orchestration tools are usually focused on the end result and help to ensure the environment is always in that 'state.'"
As our infrastructure and cloud environments have scaled, the need for orchestration has increased. In a previous post, I talked about how our dependencies have significantly increased in the last couple of decades of development. As Deloitte's Chief Cloud Strategy Officer, David Linthicum, says, complexity is "cloud computing's Achilles heel."
It's tricky to get everything set up in the right order and playing nicely with each other. Very few teams deploy a single service on a single machine anymore, and there are whole clusters of applications, multiple data centers, various platforms, clouds, and more pieces of the infrastructure puzzle. Therefore, in many of our use cases, we've outgrown scripts because they do not scale.
When we want to maintain a healthy state and optimize processes to ensure coordination between different parts of a system's architecture, how do we help our teams deploy more quickly?
Orchestration attempts to make running complex systems at scale easier. It automates the execution of numerous processes in the proper order to minimize production issues and shorten time to market. A process may involve several tasks that need to be performed.
A common example of using orchestration is managing a multi-tier architecture, sometimes called n-tier architecture. For example, you have a 2-tier architecture with a pool of web servers that use a database tier. Orchestration can ensure the setup processes for the database tier have run, and the database is available before the web server processes are started. Other examples might include orchestrating load balancers, DNS, monitoring, firewalls, routing, and more.
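To make the ordering idea concrete, here is a minimal Python sketch of the database-before-web-servers example. It is not how any particular orchestration tool works internally. The component names, the `TOPOLOGY` mapping, and the `start` callback are all made up for illustration; the only real library call is `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical topology: each component maps to the components it depends on.
TOPOLOGY = {
    "database": set(),
    "web-1": {"database"},
    "web-2": {"database"},
    "load-balancer": {"web-1", "web-2"},
}


def startup_order(topology):
    """Return component names in an order where dependencies come first."""
    return list(TopologicalSorter(topology).static_order())


def orchestrate(topology, start):
    """Call start(name) for every component, dependencies first.

    `start` is a stand-in for whatever actually brings a component up.
    Returns the list of components in the order they were started.
    """
    started = []
    for name in startup_order(topology):
        start(name)
        started.append(name)
    return started


if __name__ == "__main__":
    orchestrate(TOPOLOGY, lambda name: print(f"starting {name}"))
```

The database always starts before either web server, and the load balancer starts last. The relative order of the two web servers is left to the sorter, since neither depends on the other.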
Orchestration usually uses some policy, playbook, recipe, or configuration, which is either declarative or procedural. We will cover this in a bit. I often feel people miss that this is an excellent way to represent complex systems through a topology model. A blog post from Chef in 2014 describes this as "a description of the order-of-operations across a group of machines. A common example is provisioning a database, cache layer, multiple application servers, web servers, and load balancer(s). This model will include distinct technology components that must interact, are interdependent, and more often than not the provisioning is accomplished through a very specific order." Many orchestration tools use a domain-specific language (DSL) to do this.
It can get tricky to talk about orchestration tools because many tools like Red Hat Ansible, Terraform by HashiCorp, Chef, Puppet, AWS CloudFormation, and others blur the lines between provisioning, configuration management, and orchestration.
Many provisioning and configuration management tools also come with orchestration tooling, which makes it even more confusing. One of the most significant differences is that orchestration tools are capable of information gathering and systems intelligence. They each have their place within your toolset, but a configuration management tool won't do everything orchestration can. If you want to see more comparisons, check out this post on What's Deployment versus Provisioning versus Orchestration. Peter Souter also has a great talk from FOSDEM 2018 on Provisioning vs Configuration Management Deployment vs Orchestration.
While some may not agree that orchestration is automation, when you view automation more as a spectrum with different levels, it makes sense that orchestration has a place in the spectrum.
Automation is often thought of as a single task, whereas orchestration spans many processes across multiple systems. Still, each of these processes usually involves automating tasks without human involvement. While orchestration might be a layer on top of automation, at its core it is automating automation.
Red Hat describes automation as "reducing or replacing human interaction with IT systems and instead using software to perform tasks in order to reduce cost, complexity, and errors." Does orchestration reduce or replace human interaction with our systems? Yes, we use it to coordinate systems that would be extremely difficult to manage manually. Is it using software to perform (many) tasks to reduce cost, complexity, and errors? Mostly yes, it is ensuring tasks happen in the correct order to reduce costly production errors.
With the idea of a spectrum of automation in your mind, where does orchestration sit?
First, we need to understand that many orchestration tools are declarative. Declarative orchestration declares what is needed, not the process of getting there. With declarative orchestration, we cannot fully see what is going on inside; it is almost like a black box. We are given promises by the tooling provider based on what we have declared, but we don't fully know the steps it will use to get there. The thermostat in most of our houses is declarative: we say we want the temperature to be 75°F, and it regulates it for us. Not all orchestration tools are declarative, though. Some are procedural, which requires you to lay out the exact steps in the code.
For example, if I declared 5 EC2 instances in a declarative tool, the tool would try to reach and maintain that state each time it runs. In a procedural tool, if you wanted 5 EC2 instances and there were 8, you would need to specify that 3 need to be removed.
Terraform is strictly declarative and can be used for orchestration. We use it at Transposit as a reproducible and reliable way to orchestrate our systems! Another example is Ansible, which supports both procedural and declarative styles. Ansible is also a good example of a configuration management tool that includes tooling for orchestration.
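The difference in the EC2 example can be sketched in a few lines of Python. This is an illustration only and makes no cloud API calls: the instance IDs are made up, and `reconcile` and `procedural_remove` are hypothetical helpers, not functions from Terraform, Ansible, or any AWS SDK.

```python
def reconcile(desired_count, running):
    """Declarative style: work out what is needed to reach desired_count.

    `running` is a list of currently running instance IDs. Returns a tuple
    (to_launch, to_terminate): how many instances to add, and which IDs to
    remove. The caller only states the target; the diff is computed here.
    """
    diff = desired_count - len(running)
    if diff > 0:
        return diff, []
    surplus = -diff
    # Terminating from the end of the list is an arbitrary choice for this sketch.
    return 0, running[len(running) - surplus:]


def procedural_remove(running, count):
    """Procedural style: the caller has already decided `count` must be removed."""
    if count < 0 or count > len(running):
        raise ValueError("count must be between 0 and the number of running instances")
    return running[:len(running) - count]


if __name__ == "__main__":
    running = [f"i-{n}" for n in range(8)]  # made-up instance IDs

    # Declarative: "I want 5" and the tool figures out the rest.
    to_launch, to_terminate = reconcile(5, running)
    print("launch:", to_launch, "terminate:", to_terminate)

    # Procedural: "remove 3", because I did the arithmetic myself.
    print("remaining:", procedural_remove(running, 3))
```

With 8 instances running and 5 desired, `reconcile` reports nothing to launch and three IDs to terminate. Running it again against the resulting 5 instances yields no actions at all, which is the "maintain that state" behavior. The procedural version only does what it is told, so calling it twice removes 6.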
Image from my Failover Conf talk on "Human-in-the-Loop DevOps"
Orchestration sits to the right of scripts. The more declarative a tool is, the more automation it can include, but even procedural orchestration is more advanced than scripts. It is built on top of tooling that offers more automation than most scripts would. Also, no matter if orchestration is run from the command line or triggered in a more advanced, automated fashion (e.g., pull requests, continuous deployment), orchestration does not incorporate machine learning and artificial intelligence. Hence, it isn't to the far right in the spectrum. It is still the responsibility of the user to decide what processes are needed.
As you increase automation and move to the right in the spectrum, what is the effect of orchestration on incident response? This is an important question to ask as you increase the amount of automation in your system. The more that is abstracted away, the harder it is to quickly get the background information on what is happening when you are paged. This black box can occasionally do more harm than good.
Have you ever had to fight against auto-scaling that you didn't fully understand in the middle of an incident? It can be hard to trace through the decision trees of how your system got to a particular state in the middle of an incident when orchestration is automating specific processes. This can make it harder in the post-incident review too.
When responding to an incident with orchestration, there's also a difference between organizations that are practicing DevOps versus a more traditional, siloed IT operations structure. In organizations that practice DevOps, developers might have more of a grasp on how the orchestration is operating, while in a siloed IT operations structure, the on-call product developers might have less visibility into the underlying orchestration and infrastructure, making incidents harder to resolve.
Either way, to reduce the impact and frequency of incidents, you have to find the right balance between automation and an understanding of the orchestration design. Then, you should ask yourself during each incident post-mortem, "How did this automation help or hurt our response to the incident?" The answer should feed into your incremental plans for system improvement.
A move towards orchestration requires a cultural change for how we operate our systems, just like with DevOps. This change requires us to codify our infrastructure in ways that are more advanced than the scripting we've done in the past.
For example, you might work in a large organization practicing DevOps that wants to allow teams to self-service their clusters. Without in-depth infrastructure knowledge, this can be tricky for developers to do well. Centralized orchestration tools allow you to codify the knowledge of how to build and scale your infrastructure, augmenting the knowledge of product teams. A move to this model can be a significant change for some.
It's essential to make this change gradually. Waking up one day and expecting all product teams to start operating their infrastructure with the help of orchestration tools is impossible. Try to start orchestration with projects that will have significant business value. If the only goal you are measuring is to speed up how quickly tasks get completed, you will not receive the true business value that orchestration provides.
As we move along the spectrum of automation, the potential to improve how we operate our systems increases, as does the potential for more harm. It's important to consider the tradeoffs we might experience as we bring in more automation while also improving the manual processes that slow us down.
Let me know what you think about orchestration’s place on the DevOps Spectrum of Automation. Tweet me at @taylor_atx. Watch out for the next post of the series on an unexpected form of automation! Meanwhile, check out the first post in the series on our longtime, trusted friend - scripts.