How to address the complexity and stress of managing disjointed communications while the MTTR clock is ticking
It's 5 am PST and the pager goes off… again. Neil, an SRE at an up-and-coming ad tech startup, sees the alert and rolls out of bed. The first thing he does is check the details that came with the alert: logins are failing. But before he has time to create an incident Slack channel, he gets an email from a customer support manager. Tickets are piling up as customers on the East Coast wake up and notice the failure. As he rushes to post a Statuspage update, his boss and his boss's boss email and Slack him: they need an update. Customers are impacted. Why isn't the Jira ticket updated? When can legal get an update? Contracted SLAs are now at risk… Any idea when it will be fixed?
The stress of this situation is all too common, and Neil hasn't even started troubleshooting yet. What is holding Neil back isn't just the difficulty of triaging and collaborating with other engineers to fix a customer-facing issue; it's the complexity and stress of managing disjointed communications while the MTTR clock is ticking.
But the reality is that good communication is necessary during incidents. Stakeholders need to know what's going on, and on-call engineers need to be able to communicate without being distracted from their primary task: resolving the issue. Yet even teams fortunate enough to have a dedicated communications lead in their on-call rotation struggle with the sheer number of communication platforms involved in a typical incident.
As Neil's example demonstrates, there is no "incident response" without a human response.
But when tools are siloed and stakeholders are spread across different communications platforms, engineers are wasting precious mental energy in a disjointed response process. Within moments after an alert, on-call engineers can be ushered through more than a handful of different tools to communicate with teammates, stakeholders, and customers. It's a chaotic, unsustainable situation that leaves on-call engineers fatigued and services vulnerable longer.
Communication is a special gift that makes humans unique, but an overload of communication can be just as bad as none at all. The growing number of SaaS tools in our arsenal makes clear, consistent communication harder than ever when it's needed most. It takes a mental toll on the incident commander and their ability to resolve the issue. That 5 am page turned to 6 am and then 7 am, and not only was the service down longer, the engineers were spent.
So how do humans solve this issue of communication overload? Being proactive is a good place to start. Incident management is reactive by nature, but maintaining the status quo around incident communications can make many SREs and on-call engineers feel like they've been thrown to the wolves.
Use these questions as a guide to start a conversation about how to make incident response communication better:
Be clear about which communication tools are essential during incidents, and look for ways to centralize them. For instance, can your team and stakeholders agree to eliminate email entirely? If you're using multiple tools for the same purpose (like multiple chat tools), can you cut them down to one? Perhaps more importantly, can you bring all your services into one centralized tool that integrates with them all? Each tool in your arsenal serves a purpose, so it's often less about eliminating tools and more about bringing them all together.
It's important to create clear guidelines about how and when to use your communication tools. Can you set a threshold for when a chat thread should escalate to a video call? Can you choose a specific Slack channel for the incident? What about bringing operations into Slack, where you're already working, and piping in important data so engineers aren't playing hopscotch all over their stack?
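As a purely hypothetical illustration of such a guideline, a team could encode its escalation threshold directly in the tooling that tracks the incident channel. The class, channel names, and message limit below are invented for this sketch, not a real Transposit or Slack API:

```python
from dataclasses import dataclass, field

@dataclass
class IncidentChannel:
    """Tracks chat volume in a dedicated incident channel and flags
    when discussion should escalate to a video call (invented policy)."""
    name: str
    message_limit: int = 20  # assumed team policy: escalate after 20 messages
    messages: list = field(default_factory=list)

    def post(self, author: str, text: str) -> bool:
        """Record a message; return True once the escalation threshold is hit."""
        self.messages.append((author, text))
        return len(self.messages) >= self.message_limit

# A team with a low threshold for jumping on a call:
channel = IncidentChannel(name="inc-login-failures", message_limit=3)
channel.post("neil", "logins failing in us-east")
channel.post("ana", "auth service 5xx spiking")
escalate = channel.post("neil", "rolling back last deploy")
# escalate is now True: the guideline says move to a video call
```

The point isn't the code itself but that the rule is explicit and automatic, so no one has to decide mid-incident whether it's "time" for a call.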
On-call engineers in the midst of an incident need to spend as much time as possible focused on remediation. Are there rules you can implement to protect them from unnecessary chatter? For instance, should stakeholders be reaching out for answers, or are there other places they can go to find what they're looking for? Can customer support update customers without pinging the engineer? And where does your incident commander or communications lead fit into the process? Can they act as a buffer between on-call engineers and the chatter?
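One way to give stakeholders somewhere to go, sketched under assumed names (this is an illustration, not a real product API): a shared status record that the communications lead writes to and stakeholders poll, so nobody has to ping the on-call engineer.

```python
import time

class IncidentStatusBoard:
    """A single place the communications lead posts updates and
    stakeholders read them, instead of messaging the on-call engineer."""

    def __init__(self):
        self._updates = []

    def post_update(self, author: str, status: str, note: str) -> None:
        """Called by the communications lead, never the on-call engineer."""
        self._updates.append(
            {"ts": time.time(), "author": author, "status": status, "note": note}
        )

    def latest(self) -> dict:
        """What any stakeholder sees when they check in on their own."""
        return self._updates[-1] if self._updates else {"status": "no active incident"}

board = IncidentStatusBoard()
board.post_update("comms-lead", "investigating", "Login failures, cause unknown")
board.post_update("comms-lead", "identified", "Bad auth deploy; rollback in progress")
current = board.latest()["status"]  # "identified"
```

The design choice here is pull over push: stakeholders get answers on demand, and the engineer's attention stays on the fix.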
At Transposit, we believe communication during incidents should propel your team toward speedier resolution, not hinder it. Connecting data and humans in a single process reduces the noise and amplifies collaboration. Transposit brings siloed communications together where you're already working, like Slack, and also provides a full view of the incident timeline (with all the human interactions) in the incident command center. Stakeholders can easily access the information they need without being an interruption, and on-call engineers can collaborate in a more insulated environment that promotes focus and action.
We'd love to hear how your organization handles communication overload during incidents. We'll be at Chaos Conf and DevOps Enterprise Summit — message us on the conference Slack to strike up a conversation!