Implementing effective SRE practices can improve product development, operations, and service levels, delivering more value to customers
This article was originally published on Forbes.com.
While some may look at companies entrenched in legacy software as having to play a game of catch-up, they’re actually set up for success in a way that cloud-born companies are not when it comes to building Site Reliability Engineering (SRE) teams. By moving to the cloud under the guidance of a cloud services team, companies can create a clear understanding of what’s in an SRE’s purview and solidify a consistent set of processes for the operations and remediation of incidents.
Cloud has been the talk of the industry for years. It is now safe to say the shift to the cloud is in full swing as we’ve witnessed more and more legacy software companies moving their systems to the cloud. The pandemic has also accelerated cloud migration for many businesses. As Gartner reported, this trend will continue on an upward trajectory. By 2024, more than 45% of IT spending on system infrastructure, infrastructure software, application software, and business process outsourcing will shift from traditional, on-premises solutions to the cloud. Cloud system infrastructure services are expected to grow from $44 billion in 2019 to $63 billion in 2020, reaching $81 billion by 2022.
Pre-cloud-era legacy software companies have a unique opportunity that goes hand in hand with developing a cloud adoption strategy: shaping their site reliability engineering team, processes, and best practices. A function created at Google, SRE is about driving shifts in how teams operate across a company. SRE teams are responsible for building automated solutions for operational aspects such as incident response, on-call duties, and performance monitoring.
There are two contrasting approaches that legacy software companies can take with their cloud strategies, and they produce drastically different outcomes for SRE:
The first approach is scattered: it leaves each software service (i.e., software that performs automated tasks or responds to hardware events) and team to adopt cloud services and practices at its own rate and with its own resources. While this gives each service flexibility and freedom, it lacks consistency in process and knowledge sharing, essentially creating islands of SRE teams throughout an organization. Because these siloed teams work from different rules and best practices, the remediation for an incident such as an AWS outage can look dramatically different for each team.
If you have one product or just a few, this approach can work well. However, it is inefficient and expensive when dealing with tens or hundreds of products and services. It requires each team to spend precious time solving the same problems: evaluating infrastructure, setting up processes, and integrating the two. Governance is also costly: siloed teams make knowledge sharing difficult, and, perhaps most problematic, each team does not necessarily have the skill sets to set up the infrastructure properly.
The second approach is to create a cloud platform team responsible for cloud adoption at an organizational level. This team acts as an advocate for moving to the cloud and is responsible for creating documentation, tooling, and best practices that can be used across an entire organization. The cloud platform team becomes an internal service to all other services within the organization, setting up ticketing systems and resolving disruptions or outages on the cloud services. Essentially, the individual teams and services are like the cloud platform team’s “customers.” Hence, the team must have its own set of SREs that respond to incidents and operational needs.
According to a 2019 study, 66% of enterprises already have a central cloud team or cloud center of excellence. Within the enterprise, most of the responsibility for governing and optimizing cloud costs falls on the central cloud team and the infrastructure and operations team. In creating a cloud platform team, organizations have a tremendous opportunity to operationalize cloud services effectively across business functions by establishing powerful SRE practices that cloud-born companies often lack. The platform team serves as a consolidated place to make decisions about the entire cloud stack, which creates the perfect opportunity to introduce good SRE protocols and runbooks. A cloud platform team provides consolidation and modern infrastructure operations out of the box, which helps develop an SRE team that knows how to manage that shared infrastructure.
Well-defined, functional SRE practices also help optimize cloud costs. Because teams working under the cloud platform team approach use similar tooling, they can consolidate costs.
Implementing effective SRE practices can improve product development, operations, and service levels, and deliver more value to customers. Cloud services teams relieve individual service teams of the burden of managing reliability so they can focus on building product and innovating faster.
SRE is about driving shifts in how teams operate across a company, not only in engineering. As more companies move to the cloud, they have an opportunity to set up their SREs for success, which in turn helps ensure their software and services better meet customers’ needs and expectations in a world where digital transformation is now the norm.