Why a ‘Reliability Mindset’ Must Be Adopted Beyond SRE

6 SRE principles that IT Operations should embrace to develop a fully mature operational engine

Laurel Frazier
Nov 4th, 2021

As digital transformation efforts escalate, Site Reliability Engineering (SRE) has become an increasingly important function for many organizations. In fact, our 2021 State of DevOps Automation report found that 86% of organizations were planning to hire SREs within the year.

Created at Google, SRE is about driving shifts in how teams operate across an organization. SRE teams are often viewed as the connectors between development and operations, as they are responsible for building automated solutions for operational tasks like incident response and performance monitoring. They typically possess traditional software engineering expertise alongside the ability to look at systems holistically.

It’s clear why the unique skill set of an SRE is widely in demand. What is often less understood is that the mindset of an SRE is an equally important asset. While not every operator has the technical ability of an SRE, any operator can begin to adopt the mindset and practices that are core to SRE work, moving systems toward reliability, resilience, and extensibility. With the right mindset at the helm, your entire operations organization can make a bigger impact.

Reliability — the quality of performing consistently — becomes incredibly important when we view it from an operational perspective. The reliability of a product cannot be separated from the system with which it was created and in which it is maintained; the processes, tools, culture, and mindset of the team and/or organization that built it are integral to ongoing success. Since customers pay for uptime, and companies pay (sometimes significantly) for downtime, good operations may not win you customers, but bad operations certainly may lose you some! This makes reliability a core component of customer satisfaction and ties operations work directly to broader business goals and metrics.

A reliability mindset can be described as expecting the unexpected: planning and accounting for not only the best-case scenario but also anything that could force a deviation from that ideal path. It requires viewing operations as a responsibility, rather than as a mere team or function. What processes can and should be put in place to respond when a deviation from the ideal scenario occurs? What can be done beforehand to prevent incidents or surprises in the first place?

Here are 6 ways non-SREs can begin to adopt a reliability mindset:

  1. Be prepared: Even an all-star SRE team knows incidents are inevitable. Put in place systems and processes that make it easy to detect, mitigate, and prevent problems. Through the continuous refinement of these measures, detection can become proactive and a part of everyday operations.
  2. Embrace automation: Be mindful of what scaling can mean for the tools you are building or maintaining. By embracing automation at key points, you can help ensure that reliability continues as your usage or platform expands while also mitigating toil across the team.
  3. Let the data do the talking: Monitoring and analytics systems provide a continuous, holistic view of infrastructure health and supply data to support detection. Be deliberate about the metrics you track and put them front and center in your decision-making and task prioritization. Use this data to identify customer needs, communicate with stakeholders, and discover gaps that should be addressed.
  4. Debrief without blame: Cultivate an environment that is devoid of blame but filled with trust, transparency, growth, and accountability. Recognize that post-incident reviews or retrospectives provide valuable insight and an opportunity for growth. These conversations should be embraced, not feared, because they lead to learnings that prepare your team for future events or surprises.
  5. Close the feedback loop: If feedback is shared in a vacuum and/or not acknowledged or addressed, it might as well not have been shared at all. Even worse, it will also discourage team members from sharing feedback in the future. Develop a process for evaluating feedback and ensuring the team has a clear understanding of the decisions that have been made as a result. As our VP of Product and experienced operations leader Ryan Taylor says, “feedback shouldn’t fail quietly, it should fail thoughtfully.”
  6. Be customer-centric: As mentioned previously, operations work directly impacts customer experience and trust in your product and brand. Seasoned technology executive Bill Scott shared the importance of building roads to the customer in a keynote earlier this year. To paraphrase: it is easy and quite common to substitute other things for the customer instead of taking cues from them directly, which can lead teams down an unwieldy path. When teams truly understand and are immersed in the problems or issues their customers are facing, their desire to problem-solve helps unify them under a common purpose.

Instilling a reliability mindset across your organization will be critical to developing a fully mature operational engine and working with complex distributed systems at scale. If you are curious about how Transposit’s process automation solution can support implementing effective site reliability engineering practices across your organization, request a demo.