Are you sitting by the fire sipping cocoa, worrying secretly about whether you’ll have to put out an engineering fire any minute?
If only everyone could enjoy a well-earned break from on call life, the week would be more relaxing, but there is no rest for the technology that keeps our world running. This week certain industries are reaching a holiday high, and DevOps, SREs, and on call engineers everywhere will have to don their invisible superhero capes and keep us humans and our devices connected, just like they always do.
At Transposit, we know the pain of on call ourselves, and so we’ve banded together to come up with some of our top tips for making holiday on call shifts as painless as possible.
Whether on-callers are planning on having a Rockwellian Christmas, lighting the menorah with Bubbe, visiting family abroad, or taking a tropical trip with the two federal holidays padding their vacation request, there is one consistent truth about the holiday week between Christmas and New Year’s: mass exodus from the office. So, before we even get into the challenges of answering a page when no one is around to help you, how can you avoid that conundrum in the first place?
On call engineering managers should be careful to adjust on call schedules so that the same people are not on call during multiple peak times. Which times are most painful is up to the individual, so giving each person the opportunity to sign up for what works for them first is always a best practice. Perhaps someone doesn’t mind being on call Christmas morning, but absolutely can’t do it on New Year’s Eve. Start by letting people pick what is ideal for them, and then fill in the remaining gaps, being sure not to unduly burden any one member of the team.
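For teams that like to see the "preferences first, then fill the gaps" idea written down, here is a minimal Python sketch. It is only an illustration, not a description of any particular scheduling tool: the function name `assign_shifts`, the shift labels, and the engineer names are all hypothetical, and the fairness rule (give each open shift to whoever currently has the fewest shifts) is an assumption you may want to swap for your own.

```python
from collections import Counter


def assign_shifts(shifts, team, preferences):
    """Assign one engineer per shift: honor sign-ups first, then fill gaps.

    shifts:      ordered list of shift labels (hypothetical, e.g. "dec25")
    team:        ordered list of engineer names
    preferences: dict mapping an engineer to the set of shifts they signed up for
    """
    load = Counter({person: 0 for person in team})
    schedule = {}

    def least_loaded(candidates):
        # Ties are broken by position in the team list to keep results stable.
        return min(candidates, key=lambda p: (load[p], team.index(p)))

    # Pass 1: give each shift to someone who volunteered for it.
    for shift in shifts:
        volunteers = [p for p in team if shift in preferences.get(p, set())]
        if volunteers:
            chosen = least_loaded(volunteers)
            schedule[shift] = chosen
            load[chosen] += 1

    # Pass 2: fill the remaining gaps without piling onto any one person.
    for shift in shifts:
        if shift not in schedule:
            chosen = least_loaded(team)
            schedule[shift] = chosen
            load[chosen] += 1

    return schedule


if __name__ == "__main__":
    team = ["ana", "ben", "chi", "dev"]
    shifts = ["dec24", "dec25", "dec31", "jan1"]
    preferences = {"ana": {"dec25"}, "ben": {"dec24"}}
    for shift, person in assign_shifts(shifts, team, preferences).items():
        print(shift, "->", person)
```

In this made-up example, ana and ben get the shifts they asked for, and the two unclaimed shifts go to chi and dev, so nobody ends up covering more than one peak day.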
When making the on call schedule, make sure that there are primaries and secondaries who are committed to each shift, since they will likely be working together without quick access to the rest of the team when something goes wrong.
To reduce the pain further, consider upping on call bonuses for the entire week as an incentive for taking on undesirable shifts, and once everyone is back in the office in January, be sure to acknowledge the on-callers who responded quickly to keep the business running while everyone else was enjoying their time away from work.
Without easy access to the rest of the team, good playbook/runbook documentation during this period is even more important than it is the rest of the year. If your typical on call process involves opening up a collaborative team Slack thread or Jira ticket and then letting various experts or senior SREs weigh in, you might face a rude awakening when your reliable experts are MIA during the holiday week.
Make sure your runbooks are updated and everyone has easy access to the DevOps systems they need before everyone leaves on vacation, because if on-callers are left to search half-empty wikis by themselves, resolution is going to be slow and stressful for everyone - engineers, managers, and executives alike.
Additionally, make sure that primary and secondary on-callers know exactly who they are paired with for a particular shift. Make sure that both are committed to being fully available and sober during their shifts, so that in the absence of the whole team, they are secure in having at least one problem-solving partner.
While it may seem obvious to plan your travel around your on call shifts so that you aren’t in the air while you’re supposed to be available, these days, we sometimes have too much faith in our connectivity during travel. Airports and even some planes have wifi, and many airplane seats have built-in power outlets and jacks. What could possibly go wrong?
Remember, this week is one of the busiest travel weeks of the year in the US, so our already strained infrastructure will be at the edge of capacity in the best of circumstances. You won’t be very effective at troubleshooting an outage if you are standing in a two-hour-long airport security line or sitting in a traffic jam on the way up to the mountains. Add unpredictable winter weather to the mix, and we have on call disasters in the making. So, what should on-callers do?
First, you should plan wider time frames for travel around your on call shifts to avoid accidentally being unavailable. You also shouldn’t count on access to wifi on your flights (only some planes are equipped, and airlines often shift which plane is flying a certain route based on weather and mechanical issues). Don’t expect access to electricity in the airport to charge your computer or phone, as the high number of travelers may easily keep the charging stations fully occupied. Keep your computer and charger easily accessible as well, so that if you are forced to gate-check a bag on your crowded flight, these precious items stay on your person.
Traveling on-callers should also remember that time zones are a thing - your on call planning will need to be adjusted accordingly. If you are the secondary, your primary may be in a different time zone (or vice versa), and you should discuss your plans with them beforehand so that you are prepared for them to be asleep at different hours.
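If you want a quick sanity check before you and your partner head to opposite sides of the world, a few lines of Python can show what a page looks like on each person's local clock. This is a rough sketch under stated assumptions: the helper names `local_hour` and `likely_asleep` are hypothetical, and the 23:00 to 07:00 "asleep" window is just a guess you should replace with real answers from your shift partner. It uses the standard library `zoneinfo` module (Python 3.9+), which relies on time zone data being installed on the system.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_hour(page_time_utc, tz_name):
    """Return the hour of day (0-23) that a UTC timestamp falls on in tz_name."""
    return page_time_utc.astimezone(ZoneInfo(tz_name)).hour


def likely_asleep(page_time_utc, tz_name, sleep_start=23, sleep_end=7):
    """Guess whether someone in tz_name is asleep when the page fires.

    The default sleep window (23:00 to 07:00 local time) is an assumption.
    """
    hour = local_hour(page_time_utc, tz_name)
    return hour >= sleep_start or hour < sleep_end


if __name__ == "__main__":
    # Hypothetical page on Christmas Day at 14:00 UTC.
    page = datetime(2019, 12, 25, 14, 0, tzinfo=timezone.utc)
    for name, tz in [("primary", "America/Los_Angeles"), ("secondary", "Europe/Berlin")]:
        print(name, tz, local_hour(page, tz), likely_asleep(page, tz))
```

Running a check like this for a handful of hours across the shift makes it obvious where one partner is likely to be asleep, which is exactly the conversation to have before the shift starts.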
Finally, if you are traveling anywhere that has inclement weather or otherwise unreliable internet access, you should have a backup plan, such as tethering your computer to your phone’s data service to get around unreliable wifi. If you arrive at your destination and realize that you have bad data service, finicky wifi, or (gasp) the possibility of power outages, you should admit defeat early and find an understanding colleague to take your shift, rather than hoping for the best and leaving any alerts to your backups.
We’ve already talked about the importance of secondaries as problem-solving partners, but what about the other, more social challenges of being on call during this time? Let’s say you’re sitting down to a nice family dinner, and, just like you were dreading, there goes your phone dinging with an urgent alert. How are you going to explain this situation to your relatives, who, in many cases, don’t really understand what you do?
This is where a friendly ally can help serve as your secondary within your social situation. If you arm a sibling or supportive partner with talking points to explain why you have to get up from the festive table to crouch frantically over your laptop in the back room, you will be able to focus on troubleshooting without worrying about the familial fallout.
It should go without saying, and so often it does, that there is a reason the world functions so smoothly during this week, despite its unique circumstances. While a lot of the credit is due to excellent engineering and team planning throughout the rest of the year, there are always unanticipated incidents that simply can’t be avoided. That’s why on call shifts, DevOps, and SREs exist in the first place! And so, as we look back on the year, and the incidents resolved during this special last week of 2019, let’s remember to give a shout-out to all the engineering heroes who stepped forward to make it happen. Cheers to you!