Digital transformation initiatives and hybrid work models have increased complexity, placing the burden on humans to collaborate in a new way and tackle manual toil.
We’re excited to release the second annual State of DevOps Automation Report, an industry survey of 1,046 IT Operations, DevOps and Site Reliability Engineering (SRE) professionals with the role of VP, Director, Manager and individual contributor at U.S. organizations with over 300 employees.
Two trends, digital transformation and hybrid work models, have taken deep root in the world of Technical Operations this year, embedding themselves throughout the organization and impacting operations as a whole.
These trends have added complexity to technical operations, increasing the number of incidents and the time it takes to resolve them. Burdened by inefficient processes and longer remediation times, teams have been driven to expand their tech stack; however, the vast majority still lack full integration through one tool or platform. Moreover, teams are burdened with manual processes, especially when it comes to collaboration and data around human actions, with a majority manually updating information in tickets or their system of record.
Humans are still bearing the brunt of complexity, reporting that the top challenges to resolve incidents are reaching the right people with specialized knowledge, too many manual processes, and a lack of unified communication with teammates (people are collaborating using disparate tools). Teams are struggling to collaborate and communicate efficiently, with half of respondents saying it takes 15-30 minutes to bring the right team members together to solve an incident.
The results make clear that these shifts towards digital transformation and hybrid work models are not temporary, and organizations are actively making moves to counteract complexity and manual toil. Organizations showing the most successful transition into this new era have leaned into both SRE practices and automation as a way to increase operational efficiency, enhance reliability, and reduce MTTR.
While SRE and automation both show clear benefits to operations, many organizations are crippled by the resources they consume, whether that’s hiring SREs or building custom automation in-house. In fact, 39% report having at least one full-time engineer for automating DevOps workflows, and 26% have two or more.
Digital transformation continues to be top of mind for businesses, with 90.2% of organizations reporting an increased focus on digital transformation during the last year, just a 3% decrease from the 2021 study. This continued emphasis on digital transformation initiatives, coupled with the increase in organizations adopting a hybrid work model (which jumped to 73.5% from 50.4% in 2021), has driven 73.4% of companies to expand their tech stack, and 98.4% of those reported they will continue to use the added tools for the foreseeable future.
Despite the addition of new tools, organizations still lack full integration of the platforms and services used during incident response, making it more challenging to resolve incidents. In fact, only 24.7% of respondents said all of their tools are integrated through one tool or platform.
DevOps, SRE and IT teams are encumbered by the growing frequency of customer-impacting service incidents and face challenges in resolving them. 62.9% of respondents reported an increase in the frequency of service incidents that have affected their customers. Of those who reported an increase in service incidents, respondents cited the top contributing factors as digital transformation (60.7%), the rollout of new products or product updates (55.1%), and collaboration methods and tools that did not adequately support their team working remotely (49.3%).
58.2% of respondents reported that downtime cost their organization up to $499,999 per hour on average, with 39.7% reporting that the cost of downtime has increased during the last year.
While the tech stack has increased in size, humans are still trying to manage complexity, reporting that the top challenges to resolve incidents are reaching the right people with specialized knowledge (52.9%) and too many manual processes (49.3%). In fact, 52.3% reported an increase in the amount of time it takes to resolve incidents over the course of the last year.
Of those who reported an increase in the amount of time it takes to resolve incidents, the top challenge was a lack of unified communication with teammates (people are collaborating using disparate tools) (45.2%). In fact, merely bringing the right people together to solve an incident is a huge challenge, with 48.6% of respondents saying it takes 15-30 minutes.
Over three-quarters (75.6%) of respondents said there has been an increased focus on site reliability engineering practices in their organization in the past year, and the benefits are clear from the numbers. While 45.7% of all respondents said their team encountered between six and 19 major incidents over the past year, of organizations that have increased their focus on site reliability engineering practices and plan to expand SRE efforts in 2022, 39.8% said their team encountered fewer than five major incidents over the course of the last year.
Automation coupled with SRE practices is proving to be the best combination to enhance reliability. In fact, 100% of VP/Director/Manager SREs who cited a decrease or no change in service incidents said it was because their organization implemented automation technology.
Despite this growing demand for SRE practices and automation, SREs are still doing manual, time-consuming tasks, especially where human data is concerned. Over half of SREs (56.5%) said they manually enter data into an ITSM system or other system of record to keep track of actions that were taken by humans during the resolution of an incident. This fact is even more troubling when coupled with the clear need for human data to drive improvement, with 87% saying they believe that systematically mining insights from human data could improve future incident response and operational excellence.
SRE and automation are clear winners in the strategy to overcome complexity and enhance reliability. But organizations are experiencing resource strain, with 38.6% of organizations saying they require one or more full-time engineers to build in-house platforms or tools for automation, and 26% requiring two or more.
Organizations continue to confront barriers to automation, with 42.3% of SREs saying the current level of automation is not meeting their organization’s needs and that they are actively pursuing a new solution to address it. The top three barriers to automation are inadequate documentation (56.4%), lack of clarity about what to automate (55%), and insufficient knowledge sharing (51.8%).
Again we see how the human factors — the difficulty bringing human knowledge and data into tools and systems — are holding teams back. While 80.4% of respondents reported that automation should let humans use their judgment at critical decision points, capturing the data around human actions and decisions (as well as using it to improve processes) is still a struggle.
A human-centric automation solution that brings humans in at critical decision points while automatically documenting human and machine data is the most promising way to decrease manual toil and drive more collaborative workflows.
In an effort to streamline incident management and technical operations, organizations are turning to automation and SRE practices, which are shown to have a significant impact on decreasing the rate of incidents. However, humans are still grappling with complexity, from excessive manual toil to ineffective collaboration to a lack of the human data needed to drive improvement.
SREs are spending extensive time building automation in-house and still a majority are manually entering data into ITSM systems. The question then becomes, how can organizations expand automation in a scalable way that doesn’t overburden developers? The benefits of automation coupled with the resource commitment it takes to build in-house will lead more organizations to adopt off-the-shelf automation solutions that enable extensibility and deep customization. Moreover, automation platforms must do a better job bringing human data into the fold, without increasing manual toil.
Ultimately, automation that more successfully complements human processes, enhancing collaboration and intelligently capturing human data, will not only reduce the tedium of everyday work but also help teams both better manage incidents and learn and improve from them over time.