Onboarding On-Call Engineers in a Pandemic

Setting Up Remote Teams for Success Across DevOps, SRE and Beyond

Ashley Roof
Jun 30th, 2020

“Welcome to the team, Neil! There’s your desk. Your mentor will be sitting next to you, and you’ll get to meet the rest of the team at lunch. After that, you’ll be shadowing Maya, who’s been on the team for four years. She really knows the ins and outs of the legacy systems, and then you’ll be shadowing Brian. He’s the one to ping if any of our automated scripts run amok…”

Just a few months ago, Neil was set up for a successful onboarding process as the newest SRE on the team. His manager had developed a well-oiled onboarding machine:

  1. Introduce him to his colleagues with some 1:1s and social time to help them build rapport and understand each other’s areas of expertise.
  2. Provide structured mentorship so he can observe his senior teammates managing their daily operations and incidents.
  3. Supervise his participation in their triage process through a series of incidents before he takes the reins of an on-call shift himself.

But in today’s pandemic world, many on-call managers and engineers across DevOps, SRE, and traditional IT are facing a brave new world of entirely remote onboarding. How do you build new relationships entirely over Zoom? And how do you make the shadowing that we rely on so heavily equally effective from afar?

While remote onboarding is uncharted territory for many teams that are used to the ease of in-person interactions, it isn’t foreign to global or distributed teams who have been building strong on-call onboarding processes for years.

A few weeks ago, I talked with Eric Mayers, a seasoned on-call veteran who has managed on-call engineering teams at Google, YouTube, YikYak, and beyond for twenty years. He and I met twelve years ago working for Google in Sydney, back when he was setting up one of Google’s first “follow the sun” global on-call engineering teams. In that conversation, we discussed his tips and tricks for managing on-call remotely. Today, we’ll dive deeper into a particularly impactful area: setting up new on-call engineers for success.

Step One: Before a New On-Call Engineer Starts

Good documentation has never been more vital. It helps the new engineer stay on track without a team sitting around them, and it takes the pressure off of the more senior members of the team who don’t have the bandwidth to sit on long Zoom calls answering a laundry list of questions.

First, make sure that your current on-call process is documented and up to date, including escalation paths, incident determination, and mitigation expectations. This is especially important if the new hire has never been on-call before. In that case, also make sure they understand the practical details of being on-call, including what software and hardware to have accessible at all times (e.g., they will need their laptop, not just their phone) and what the expectations are for responsiveness, especially for after-hours shifts.
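One way to keep an escalation path from going stale is to store it as structured data alongside the runbooks, so a new hire can read it and a script can check it. The sketch below is only an illustration in Python: the tier roles, the timeout values, and the `who_to_page` helper are all hypothetical and are not taken from any particular paging tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationTier:
    """One step in a hypothetical escalation path."""
    role: str                 # who gets paged at this tier
    ack_timeout_minutes: int  # how long to wait for an acknowledgement


# Illustrative values only; replace with your team's real policy.
ESCALATION_PATH = [
    EscalationTier("primary on-call", 15),
    EscalationTier("secondary on-call", 15),
    EscalationTier("engineering manager", 30),
]


def who_to_page(minutes_unacknowledged, path=ESCALATION_PATH):
    """Return the role that should currently hold an unacknowledged page."""
    elapsed = 0
    for tier in path:
        elapsed += tier.ack_timeout_minutes
        if minutes_unacknowledged < elapsed:
            return tier.role
    # Past every timeout: the page stays with the last tier.
    return path[-1].role
```

Under these assumed timeouts, a page that has gone unacknowledged for 20 minutes would sit with the secondary on-call. Writing the policy down this explicitly gives a first-time on-caller a concrete answer to “how fast do I need to respond?”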

Make sure that the team’s runbooks are up to date (see our prior post on writing good runbooks) and that all documents explaining your software environment are consolidated in one place, clearly labeled, and easy to navigate. The clearer and more robust the documentation library is for the new engineer, the less they will need to reach out to their remote colleagues for basic details. This will leave their precious Zoom mentorship time free for the questions and shadowing that require direct human communication.

Other steps to take before the on-call engineer starts:

  • Create a learning plan and cheat-sheet with the most important resources/links.
  • Create their new employee accounts and give them access to the systems they’ll be supporting.
  • Pair them with an experienced mentor who is a reliable teacher and team liaison.
  • Set up introductory Zoom/video conference calls with the main people they’ll be working with.
  • Make sure you have clearly documented criteria for how their performance will be evaluated.

Finally, review your backlog or task lists. Note small, contained projects with business value. These are good, low-risk projects for the new on-call engineer to work on as they become familiar with your systems. To ensure success, calibrate the tasks to the experience level of the hire. Some options to consider:

  • Exploring a new technology and developing a proof of concept
  • Upgrading a tooling library
  • Debugging and patching a non-critical intermittent bug
  • Documenting a deployment pipeline

Step Two: Making Remote Mentorship Work

Mentorship is key to helping any new employee learn the ropes and feel welcome, and in the world of DevOps and SRE, it can be especially important for helping the new hire ramp up. But how do you make that work remotely?

First, assign one primary mentor as the main point of contact for answering the new hire’s questions. Set up a plan with the mentor before the new on-caller starts, ensuring the mentor knows the scope of their responsibilities. Make sure that the mentor has clear expectations (and boundaries) about the frequency of Zoom calls with the new hire, their responsiveness on Slack, and guidelines for shadowing/screen sharing sessions. Aligning before the new hire starts will ensure that the new hire gets the resources they need to ramp up, while also supporting the mentor as they balance this role with their other responsibilities. The mentor should also feel empowered to connect the new hire with the right people on the team to answer specific questions, so that the new hire can start to build remotely the relationships that they would otherwise be building in person (see our prior post on managing on-call in a pandemic for tips on building team culture remotely).

In addition to clearly defined expectations for mentorship, choosing the right mentor can be the key to a new on-caller’s successful integration into the team. Whenever possible, mentors should be volunteers who enjoy helping others and are confident in handling their own workloads.

Additionally, ideal mentors:

  • Are knowledgeable about the systems, processes, and team culture
  • Foster a sense of community within the team
  • Demonstrate excellent judgment and a cool head in tough situations
  • Have shown that they are able to give constructive feedback
  • Are able to explain complex concepts to people who know less than they do
  • Can communicate well remotely over Zoom and Slack

Being a mentor for new employees can be an excellent career move for the mentor if they use the experience to develop management skills. Rewarding good mentors with positive feedback and a path to team leadership can help grow the pool of people who see value in stretching beyond their IC roles, making a mentorship program a win-win for everyone involved.

Step Three: Reinventing Shadowing From Afar

No matter how good on-call process documentation is, human judgment plays a part in any incident response. Shadowing real incidents allows on-call engineers to cultivate that judgment. Though you never want outages or issues to affect your customers, you do want new on-call engineers to experience such incidents in a supportive environment before they are primarily responsible for handling them.

Getting the new on-call engineer started with the on-call routine as early as possible helps them gain confidence and understanding. It’s a good idea to have the new on-call engineer start their shadowing process with their mentor early, so that they can understand how the documentation they are studying applies in the real world. To do that remotely, you need to create a clear plan for how they will be able to observe and shadow while their mentor is working to resolve an incident, without negatively impacting resolution times (MTTR) or service level objectives (SLOs).

The new hire should first shadow their mentor during common daily operations, so that they can get a sense of how the systems typically work before they learn how to troubleshoot them. After the new hire has had enough basic exposure, they should shadow their mentor through several incidents. To do this remotely without distracting the mentor from their task, include the new hire on the alert page and the incident alert channel, as if they were on the collaborating team. As soon as the mentor logs into their laptop, they should open up a Zoom or Slack call to screen share, and talk through their process just as they would if they were being shadowed in person. Before shadowing starts, set clear guidelines for the new hire about how to communicate with their mentor during an incident, so that they avoid distracting their mentor when downtime and customer impact should be the mentor’s primary focus.
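The idea of putting the new hire on the alert page without making them responsible for it can be sketched in a few lines of Python. This is a hypothetical model, not the API of any real paging system: the `Shift` class and its method names are invented for illustration, and the names Maya and Neil are borrowed from the opening scenario.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shift:
    """A hypothetical on-call shift with optional shadowing new hires."""
    responder: str
    shadows: List[str] = field(default_factory=list)

    def page_recipients(self) -> List[str]:
        """Everyone who receives the page: the responder first, then shadows."""
        return [self.responder, *self.shadows]

    def can_acknowledge(self, person: str) -> bool:
        """Only the responder can acknowledge the page; shadows just observe."""
        return person == self.responder


# Maya is on call; Neil is shadowing her.
shift = Shift(responder="Maya", shadows=["Neil"])
```

The design choice here is that a shadow receives every page but can never acknowledge one, so the page keeps escalating normally if the responder misses it and the new hire’s presence cannot mask a missed alert.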

After several shifts of shadowing their mentor, the new hire should branch out to shadowing other senior members of the team via the same process. Watching multiple on-call engineers will help them understand how different people approach similar problems, which will help them develop their own style and judgment.

Making Remote Onboarding Easier with Transposit

Transposit’s Mission Control incident command center can help facilitate collaboration between a new hire and their mentor by keeping a real-time record of all the mentor’s actions and communications in one unified interface. Via Mission Control, new hires can see exactly what interactive runbooks their mentor is using, what systems they’re accessing, how they’re interpreting the data, and what resolution actions they’re taking.

Because Mission Control keeps complete, searchable documentation across daily operations and incidents, the mentor and the new hire don’t have to worry about reconstructing their incident timelines afterwards for a post-mortem discussion. Instead, they can discuss the self-documented post-mortem provided by the Transposit system, and focus 100% of their limited videoconference time on how to learn from the incident.

Most importantly, when it is the new hire’s turn to take the reins, they will not be alone. They will have Transposit’s interactive runbooks to guide them through the “choose-your-own-adventure” triage process. The Transposit system augments their knowledge with the full context of prior incidents and enables them to run direct actions, like reverting recent code commits or restarting an EC2 instance, with the touch of a button.

With the help of Transposit, the on-call onboarding experience is easier, faster, and more successful for the new on-call engineer and their team. And in these pandemic times, we can all use a little extra help.