Your phone dings with a page. Latency is too high, bordering on another outage. The data in the logs isn’t adding up and you don’t know why, but you’re sheltering-in-place alone with your cat. Suddenly, the silence is getting to you.
You rush to Slack, but you’re the only person on your team who’s online. "Is it that new microservice that just went into production loading the database? Of course there is no runbook for that thing, just the out-of-date design doc, still with unresolved comments. Argh!"
In normal times, this scenario might not be so bad. You’re in the office, there are other engineers around, and even with poor documentation and an archaeological mess of old systems to unearth, someone on the team has usually been around long enough to offer a helpful clue. But now, working from home with inconsistent access to the team, poor documentation, and unprecedented strain on your systems, what used to be a standard incident is now an SLA nightmare.
Why does it have to be like this? The short answer: It doesn’t.
I talked with Eric Mayers, a seasoned on-call veteran who has managed on-call engineering teams at Google, YouTube, YikYak, and beyond for twenty years. He and I met twelve years ago working for Google in Sydney, back when he was setting up one of Google’s first “follow the sun” global on-call engineering teams.
Eric offered his practical advice from the early days of building a successful remote on-call engineering organization.
The current coronavirus pandemic has forced everyone on the planet to acknowledge that what were considered “theoretical” disaster preparedness plans a couple of months ago are now influencing our day-to-day lives in ways we couldn’t imagine before.
The good news is that well-run on-call engineering organizations should have a leg up, because best practices for pandemic preparation are actually the same as for any on-call disaster scenario, which is what DevOps and SRE organizations were designed to handle.
In good times and bad, teams who are on-call for a service need consistent standards and practices, and they need to understand what makes them operate well as a team. Until recently, most teams relied on in-person socialization to establish their on-call culture - developing a sense of who-knows-what, reading body language, and even overhearing hallway conversations. These are all hard to do while sitting alone in your pajamas with your cat, even with the explosion of Zoom meetings.
That said, building strong process and on-call team culture isn’t foreign to managers who have built multi-site or remote on-call teams in the past, and the technology for connecting people remotely is a hell of a lot better now than it was twelve years ago, back when Blackberries were the height of sophistication and international data service was a golden luxury reserved for very few elites, even within a tech behemoth.
So, how did they make it work?
Use an on-call log: The on-call engineer should create a brief entry when they take over, note things they plan to do, make progress on, abandon, learn, complete, etc., and finally write an off-call entry. A record of their time on call and an estimate of how much of their effort was "on-call" work versus not can help the team understand the on-call load. These logs should be lightweight.
Structure the hand-off: Conduct a hand-off meeting with your on-call log document via a quick call or video conference. The incoming on-caller should create their "taking over from Bob" entry in the log and the exiting on-caller should write their "handing off to Alice" entry. Bob can share with Alice what happened, any active issues or FYIs, and together they should agree who will notify the next on-caller that their shift is coming up. Additional members of the team are encouraged to listen in on this call, but it shouldn't be required, except for any new members of the team who are currently in the onboarding/training process.
This process might seem heavy, but it ensures there is a good history recorded, and the "hand-off" ensures the baton is passed (and not just thrown into the air). If hand-offs are scheduled at a time when having a phone call is unreasonable (like midnight), improvise an alternative, or change the schedule so that the hand-offs are done when everyone is likely to be awake.
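For teams that want to keep the log in a structured form, here is a minimal sketch of what an entry and a shift could look like. The class names, the entry labels ("taking-over", "note", "handing-off"), and the idea of multiplying shift length by a self-estimated effort fraction are all assumptions made for illustration, not a format Eric prescribed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class LogEntry:
    timestamp: datetime
    author: str
    kind: str  # hypothetical labels: "taking-over", "note", "handing-off"
    text: str


@dataclass
class Shift:
    engineer: str
    # Self-estimated fraction of effort spent on on-call work (0.0 to 1.0).
    oncall_effort: float = 0.0
    entries: List[LogEntry] = field(default_factory=list)

    def add(self, timestamp: datetime, kind: str, text: str) -> None:
        """Append one lightweight entry to this shift's log."""
        self.entries.append(LogEntry(timestamp, self.engineer, kind, text))

    def duration(self) -> timedelta:
        """Time between the taking-over entry and the handing-off entry."""
        start = next(e.timestamp for e in self.entries if e.kind == "taking-over")
        end = next(e.timestamp for e in self.entries if e.kind == "handing-off")
        return end - start

    def oncall_hours(self) -> float:
        """Rough on-call load: shift length scaled by the effort estimate."""
        return self.duration().total_seconds() / 3600 * self.oncall_effort


# Example shift with made-up times and notes.
bob = Shift("Bob", oncall_effort=0.25)
bob.add(datetime(2020, 4, 6, 9, 0), "taking-over", "Taking over the pager")
bob.add(datetime(2020, 4, 6, 14, 30), "note", "Latency alert; investigating database load")
bob.add(datetime(2020, 4, 6, 17, 0), "handing-off", "Handing off to Alice; latency issue still open")

print(bob.duration())      # 8:00:00
print(bob.oncall_hours())  # 2.0
```

A shared document works just as well. The point is only that each shift has a clear start entry, a clear end entry, and enough in between for the next person to pick up the thread.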
Prevent knowledge siloing: Don't allow any singleton owners or experts. Zero critical things on your team should be doable by only one person. This is always good advice (not just during a pandemic), and is sometimes referred to (somewhat morbidly) as the "bus number" for a team - how many people can be hit by a bus before the team can't do its job. Teams will often fall into this pattern unknowingly as team members establish their own niches - “Everyone knows that Neil does the PagerDuty configuration, it’s faster if we just leave it up to him.” The common problem is that if Neil leaves the company or is even out sick and no one else knows exactly how he did it, the team can be left in a jam. The risk is even higher during a pandemic when there is a real chance that some of your team could become suddenly and seriously ill. Bottom line: make sure multiple people know all the important stuff.
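One rough way to spot singleton owners is to write down who can actually perform each critical task and look for tasks covered by one person. The sketch below assumes a hand-maintained mapping; apart from the PagerDuty example, the task names are hypothetical, and taking the smallest coverage count as the "bus number" is a simplification chosen for illustration.

```python
from typing import Dict, Set


def singleton_tasks(coverage: Dict[str, Set[str]]) -> Dict[str, str]:
    """Return each task that exactly one person can do, with that person's name."""
    return {
        task: next(iter(people))
        for task, people in coverage.items()
        if len(people) == 1
    }


def bus_number(coverage: Dict[str, Set[str]]) -> int:
    """Smallest number of absences that would leave some task with nobody to do it."""
    return min(len(people) for people in coverage.values())


# Hypothetical coverage map: task -> people who can do it today.
coverage = {
    "PagerDuty configuration": {"Neil"},
    "database failover": {"Alice", "Bob"},
    "deploy rollback": {"Alice", "Bob", "Neil"},
}

print(singleton_tasks(coverage))  # {'PagerDuty configuration': 'Neil'}
print(bus_number(coverage))       # 1
```

Any task that shows up in the singleton list is a candidate for pairing, a runbook, or a short training session.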
Keep good documentation: One of the easiest ways to avoid knowledge siloing is to have good documentation, including a library of runbooks/playbooks that are kept up to date and contain the key information necessary to troubleshoot common problems. Many teams don’t have the time or the energy to keep their documentation up to speed, but when an on-caller is in crisis mode, it is often the first place they turn. In prior posts, we’ve given tips for how to write good documentation as an SRE and how to reward it within an organization.
Yet, most engineers still drag their feet at the unpleasant task. That’s where Transposit’s automatic post-mortem timelines can really make a difference, especially when paired with a standardized process to keep the team up to speed on learnings after each incident. Using post-mortems as a learning tool can also help mitigate the problem of knowledge siloing and feed prioritization of fixes or features that will prevent recurrence, so it is a win all around, whether or not there is a pandemic going on.
Empower all engineers to "tune up" monitors, alerts and tests: This stuff can sometimes become a niche duty and it's good to have a "lead" person to own it and keep it from exploding into a haphazard mess, but make sure that everyone knows how to, and is encouraged to, improve the "problem identification and alerting" system, and why it's important. This will give the team more flexibility to act fast in a crisis, and again, will reduce the risk posed by losing an important individual within the team if they get sick or leave the company.
Capture knowledge with video conference trainings and post-mortems: Consider recording trainings and important de-briefs. The options today compared to twelve years ago are vast - Zoom has an easy setting for recordings, and apps like Otter.ai can convert audio into a decent transcript automatically. This can help you keep track of institutional knowledge along with practical details that don’t otherwise make it into written documentation. Searchable written transcripts may help on-callers who have reached the end of their “choose your own adventure” workflow checklist and still haven’t found a solution, especially if they are unable to reach the more senior members of the team whom they normally would have cornered in person for help. Transposit’s searchable incident timelines also help this process, since on-callers can search all the Slack messages associated with prior incidents, not just the bullet-points captured after the fact.
Process isn’t everything. Some of the most critical wins or losses come from having a cohesive team. Everyone should understand how the team operates, and each person should feel like a valuable player.
Here are some ideas for how to cultivate this remotely:
Work Together: If members of your team are feeling disconnected, try having a group sit on a conference call while working for a while. Hearing teammates typing can make the work feel more like it does in a shared office, and creates an opportunity for “hallway chat” or “throwing a question into the air,” like they normally would. While some might cringe at the idea of sitting on a mostly quiet call, having the opportunity to talk through questions in the moment in a casual conversation is otherwise hard to replicate, even with communication platforms like Slack.
Play Together: Many people are feeling the weight of the extreme social changes that have resulted from mass sheltering-in-place. While the fact that this is a shared global experience is unique, remote workers and people in satellite offices have felt this same strain for years. At Google, geographically remote SRE teams would sometimes bake the same cupcakes, trying their best to get the product to be identical across countries.
Having lunch or happy hour on a video conference provides a forum to simulate some aspects of the social experiences that people are currently missing, and can provide a unique opportunity for people to connect with coworkers in a way they wouldn’t in person. Consider doing a “show and tell” happy hour, in which people share fun items or pets that they wouldn’t normally bring to the office. This could be something reflective of the culture where they live, like antique snowshoes, a funny hat, or a local coffee or beer from a favorite brewer. Trivia hour or multi-player games can also create opportunities for people to build friendships that will help them work better together the next time they’re rolling up their sleeves to fix an incident.
A lot has changed in the world in the last twelve years, but humans using technology to interact and solve new problems has stood the test of time. We are more global now than we ever were, and with a pandemic or without one, we are better set up in 2020 to face the unknown crisis together, even if we are doing it in our pajamas.