Retaining On-Call Engineers in a Pandemic

How to keep the heroes who keep the world running from running away

Ashley Roof · Sep 14, 2020


Slack: May 12, 2020 - At 4:30pm PST, Slack stopped sending messages. Workers were cut off from their primary source of communication with their teams. It didn’t come back online for almost three hours.

Gmail: August 20, 2020 - For more than four hours, Gmail users couldn’t send emails. Some waited almost six hours, and other G Suite apps were also inaccessible.

Zoom: August 24, 2020 - Schools were already struggling to start the school year online, and then Zoom went down on Monday morning, just as students were diving into the strange new world of remote learning. It didn’t come back online for almost four hours.

Outages have always been painful and costly, but in today’s pandemic world, businesses and schools have never relied more heavily on technology to function. When a major service goes down, daily life grinds to a halt for millions of people. And yet, with unprecedented strain on systems, outages are par for the course. The next major outage isn’t a question of if, but when.

After taking a moment to celebrate the on-call heroes who work all night to resolve issues and restart services that millions of people rely on, I’d like to delve into an acute challenge facing many engineering orgs today: how to keep those heroes from fleeing for greener, pager-free pastures.

There are currently over 100,000 openings for DevOps and SRE positions in the US alone. Before the world moved online, there was already incredible pressure to hire the coveted skillset required to manage complex modern stacks. Now, with years of digital transformation stuffed into a few months, many orgs are desperate to keep the engineers they have, all while on-call life is more painful than it’s ever been.

So, how can engineering leaders keep their teams intact while still keeping the services that users rely on running?

A while back, I talked with Eric Mayers, a seasoned on-call veteran who has managed on-call engineering teams at Google, YouTube, YikYak, and beyond for twenty years. He and I met twelve years ago working for Google in Sydney, back when he was setting up one of Google’s first “follow the sun” global on-call engineering teams. In those conversations, we discussed his tips and tricks for managing and onboarding on-call remotely.

Today, we’ll dive deeper into the topic that is keeping many engineering leaders awake at night: how to retain the talented engineers who keep their stacks reliable.

Step One: Balance incidents with innovation

Most people don’t become developers because they enjoy being awakened at 2am to fix a problem.

While some people may enjoy the adrenaline rush of resolving an incident in the heat of battle, chances are that they went into software engineering to build interesting stuff. That is one reason why Google, the inventor of SRE and author of the SRE handbook, places a 50% cap on the amount of ops work any SRE is assigned to do.

According to the Google SRE handbook, “This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable… In practice, scale and new features keep SREs on their toes. Google’s rule of thumb is that an SRE team must spend the remaining 50% of its time actually doing development.”

That said, based on his own time managing on-call at Google, Eric warned that the way engineers go about doing development work is important both to the team’s ability to move forward in building product, and in the individual engineer’s satisfaction with their “fun” development time. If the engineer is off on a proverbial island building alone, then every ops distraction hinders progress towards the greater and more interesting goal of product development.

Not having enough time to build isn’t just a problem for the engineer, who may get frustrated by their lack of progress; it is also a problem for the team that is relying on that work to move forward. Instead, Eric suggested that at least three people work on a collaborative development project at once, so that on-call shifts, outages, and other ops work don’t hold back the momentum that keeps engineers engaged and looking forward. Additionally, in the common scenario where the on-call engineer isn’t an SRE but a production engineer who is on-call for their own product, the ops work can highlight important flaws that can and should be worked into a continuous improvement development cycle, making the ops work a natural and useful part of the overarching product strategy.
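To make the 50% cap from this step concrete, here is a minimal sketch of how a team might check it against logged hours. The names (`WeekLog`, `ops_fraction`, `over_cap`), the engineers, and the hours are hypothetical and not part of Google's guidance; the only figure taken from the article is the 50% threshold.

```python
from dataclasses import dataclass

OPS_CAP = 0.5  # the 50% cap on ops work quoted in this step


@dataclass
class WeekLog:
    engineer: str      # hypothetical engineer name
    ops_hours: float   # pages, tickets, incident work
    dev_hours: float   # project and development work


def ops_fraction(log: WeekLog) -> float:
    """Share of logged time spent on ops work (0.0 if nothing was logged)."""
    total = log.ops_hours + log.dev_hours
    return log.ops_hours / total if total else 0.0


def over_cap(logs):
    """Names of engineers whose ops share exceeds the cap."""
    return [log.engineer for log in logs if ops_fraction(log) > OPS_CAP]


# Made-up sample data for illustration only.
logs = [
    WeekLog("avery", ops_hours=26, dev_hours=14),
    WeekLog("blake", ops_hours=12, dev_hours=28),
    WeekLog("casey", ops_hours=20, dev_hours=20),
]
print(over_cap(logs))
```

A check like this is only as good as the time tracking behind it, so it is probably best treated as a conversation starter rather than a hard gate.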

Step Two: Share the load to avoid overload

Google devotes an entire chapter of their SRE workbook to identifying and overcoming ‘overload.’ They warn that it is “an occupational stress that can cripple productivity. Left unchecked, it can cause serious illness.” They then add that, “when frequent interruptions are paired with external stress factors, a large workload (or even a small workload) can easily turn into perceived overload. This stress might stem from the fear of disappointing other team members, job insecurity, work-related or personal conflicts, illness, or health-related issues like the lack of sleep or exercise.”

Therefore, it shouldn’t be surprising that on-call engineers, who frequently work under intense pressure to resolve incidents against the ticking time-bomb of severe business impact, are particularly at risk of feeling overloaded, or that the constant stress and unpredictability of the pandemic compounds the issue. But it isn’t just a matter of the team’s mental and physical health. Overload can lead to burn-out, which can lead to attrition, creating a downward cycle that is particularly dangerous for teams responsible for incident resolution, where losing even one or two people increases the overload and pressure on those who remain.

So, how can teams avoid overload and attrition in these tough times? One important step is to make the on-call engineer feel less alone. Rotating on-call shifts frequently enough that someone who was up all night isn’t left dreading one recurrence after another can help. Being conscious of how much time each individual spends on-call, and of who is on-call during particularly challenging incidents, can help spread the pain across the team and keep the burden off any one person who may break under too much pressure.
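One way to stay conscious of who has carried the most on-call pain is to keep a simple running tally. The sketch below is an illustration under stated assumptions: the scoring formula, the `night_page_weight` value, and the names are all hypothetical, and a real team would tune them to its own rotation.

```python
def load_scores(team, shifts, night_page_weight=4.0):
    """Total on-call load per engineer.

    shifts: iterable of (engineer, hours_on_call, night_pages).
    The weight is an assumed value: each night page counts as
    extra hours of load on top of the shift itself.
    """
    scores = {name: 0.0 for name in team}
    for name, hours, night_pages in shifts:
        scores[name] = scores.get(name, 0.0) + hours + night_page_weight * night_pages
    return scores


def next_on_call(team, shifts):
    """Pick the team member with the lightest recent load (ties broken by name)."""
    scores = load_scores(team, shifts)
    return min(team, key=lambda name: (scores[name], name))


# Made-up sample data for illustration only.
team = ["avery", "blake", "casey"]
shifts = [
    ("avery", 12, 3),  # one shift, three night pages
    ("blake", 12, 0),
    ("casey", 12, 1),
    ("blake", 12, 0),  # blake covered a second, quiet shift
]
print(load_scores(team, shifts))
print(next_on_call(team, shifts))
```

Here avery and blake end up with the same score for different reasons (a rough night versus two quiet shifts), which is the point of weighting night pages rather than counting shifts alone.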

Making sure that there are multiple people on-call at any one time can also help. Clearly defining each person’s role in that scenario becomes important, though, or else the benefits of collaboration can be outweighed by the confusion of too many cooks in the kitchen repeating parallel steps and causing greater complexity instead of relief. Defining the incident commander and communications lead is an easy start, taking the pressure off of the primary troubleshooter to share updates with stakeholders. Transposit's incident command dashboards make this process even easier by keeping track of the whole incident timeline automatically, while the on-call engineers are able to tag and share specific stakeholder updates without having to manually write out the details in a ticketing system like Jira. Transposit's automatic incident timelines also allow on-call engineers to go back to sleep after they’ve finally resolved an incident without losing any of the context of the many steps they completed in their triage. Instead of reconstructing the incident manually, the full record of everything they did, including chats in Slack, stats pulled, and remediation actions taken from runbooks, is ready for post-incident analysis... after everyone has gotten some sleep.

Step Three: Only page on issues that matter

Scrambling out of bed at 2am to an alert, only to discover that there wasn’t a real issue, is a painful reality faced by too many on-call engineers. While on occasion that result may be a reprieve from having to groggily troubleshoot, frequent disruptions cause alert fatigue. Alert fatigue, which is often studied in doctors and nurses, leads to burn-out and more alert overrides, even when an issue turns out to be serious. It turns out “The Boy Who Cried Wolf” had some meat to it.

In a recent conversation, Niall Murphy, co-author of the Google SRE handbook, and our CTO, Tina Huang, discussed their views of the top challenges facing engineering organizations. Alert fatigue was high on the list.

“We actively do damage to ourselves in the organizational/socio-cultural end of things when in the act of deciding to page we are insufficiently discriminating about what we’ll choose to page on,” Niall commented. “Often in organizations and in teams, there are pages which happen because three and a half years ago someone said something might be a problem - ‘CPU high’ - and now in the middle of the night the CPU is high and you get a page, and you just file it and move on because CPU high hasn’t been correlated with any individual incident since then, but because it happened once we have to preserve it forever. So there is a lot of mental frameworks around the management of alerts and generation of pages that flows from inappropriate conservatism about what is actually contributing to alerts.” He went on to suggest SLO-based alerting advocated by Alex Hidalgo in his upcoming book Implementing Service Level Objectives.
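One common form of SLO-based alerting is burn-rate alerting: page only when the error budget is being consumed fast enough to matter, rather than when a proxy metric like CPU crosses a line. The sketch below is a minimal illustration, not necessarily the approach taken in the book. The function names, the 99.9% target, and the sample counts are assumptions; the 14.4 threshold is the commonly cited figure at which a one-hour window uses about 2% of a 30-day budget.

```python
def burn_rate(errors, requests, slo_target):
    """How fast the error budget is being spent.

    1.0 means errors arrive at exactly the rate the SLO allows;
    higher values mean the budget will run out early.
    """
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def should_page(long_window, short_window, slo_target=0.999, threshold=14.4):
    """Page only if both windows burn fast.

    Each window is an (errors, requests) pair. The long window shows the
    problem is significant; the short window shows it is still happening.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )


# Made-up counts: a real outage (2% of requests failing) versus a quiet
# night where CPU may be high but users are barely affected.
print(should_page(long_window=(200, 10_000), short_window=(20, 1_000)))
print(should_page(long_window=(1, 10_000), short_window=(0, 1_000)))
```

The design choice is that a "CPU high" condition with no user-visible errors never reaches the pager, because nothing in the decision depends on CPU at all.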

While reducing false alarms requires strategic planning around what metrics matter most to the business and reliability of services, its impact on maintaining the work-life balance of the team is critical to retaining on-call engineers and keeping the support you already have intact.

Step Four: Offer the well-deserved heroes’ welcome

Too often on-call engineers bear the brunt of the pain when outages are causing user-facing issues. Even in a culture that focuses on objective measures like SLOs and MTTR to judge performance, the personal sacrifice that on-call engineers are making to their own quality of life to keep a service running for the greater good should not be overlooked.

Many organizations offer on-call bonuses, but in addition to monetary or vacation incentives that can be offered to balance out the challenges of being on-call, heartfelt appreciation and accolades can go a long way.

So, regardless of whatever else you are able to do from an organizational perspective, make sure to take a step back, reach out to the on-call engineers who are bracing for the next storm, and thank them. It just might make everyone’s lives a little bit better – something we can all use in the heat of 2020’s many challenges.
