During an incident, every second matters. Enterprise downtime costs are measured in the thousands of dollars per minute. An organization’s ability to communicate effectively during an incident directly impacts its ability to recover and restore service. And with the sudden shift to remote work, communication has become a glaring problem. No longer able to gather in one meeting room, engineers collaborate in a fragmented way across multiple chat channels, email, phone, and ticketing systems, and updates to internal and external stakeholders are slow and difficult to orchestrate.
Given these challenges, how can organizations optimize their communications to accelerate service restoration? The three concepts below may help.
When incidents occur, there is pressure from both internal and external stakeholders to act and show results as quickly as possible. For a single action, the “fast” thing to do may be a manual operation by a single operator or a small group. On a longer timeline, such as one or multiple incidents, these individual steps tend to create entropy and may actually slow the recovery process.
To reliably increase the pace of a process, the key is to build a habit of consistent, precise updates. For instance, SRE teams can create a protocol for how often to update their status page, depending on the incident severity. This is not unlike how modern software engineers develop and test features and applications. By publishing information “early and often,” teams can establish a regular cadence for updates and set expectations for customers and stakeholders.
Keeping up a regular cadence can be challenging during the flurry of an incident, so automation is a key enabler of consistency and simplicity here. The more automated this process is, the more reliably it will flow, freeing up precious human resources to address other concerns.
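As a rough illustration of such a protocol, the sketch below maps incident severity to a status page update interval and checks whether an update is overdue. The severity names and intervals are hypothetical placeholders, not a recommended standard; each team would substitute its own.

```python
from datetime import datetime, timedelta

# Hypothetical severity levels and cadences; replace with your own protocol.
UPDATE_INTERVALS = {
    "sev1": timedelta(minutes=15),
    "sev2": timedelta(minutes=30),
    "sev3": timedelta(hours=2),
}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return the time by which the next status page update should go out."""
    return last_update + UPDATE_INTERVALS[severity]


def is_update_overdue(severity: str, last_update: datetime, now: datetime) -> bool:
    """True once the agreed interval for this severity has elapsed."""
    return now >= next_update_due(severity, last_update)
```

A scheduled job could call `is_update_overdue` and nudge the incident commander, or post a templated "still investigating" update, so that the cadence holds even when everyone is heads-down.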
The most detailed and contextual information will come from the people and systems closest to the incident. As much detail as possible should be preserved in the first (likely technical) system of record, then filtered as it moves downstream to communications channels with more specific purposes and audiences, like a status page update for customers or a Slack message to stakeholders.
It’s helpful to make liberal use of fields and tags or labels at this stage, like adding a label to a Jira ticket with the incident severity level. Categorized information can be easily passed to specific systems and reused without having to be manually copied or translated. For example, the title or summary of a Jira issue could make an appropriate status page update when separated from its more detailed commentary. Manually entering or copying data between systems in the middle of the process can slow progress and introduce opportunities for errors.
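To make the idea concrete, here is a minimal sketch of pulling a customer-facing update out of a labeled ticket. The `Ticket` class, its field names, and the `severity-` label prefix are assumptions made for illustration; they do not mirror Jira's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SEVERITY_PREFIX = "severity-"  # hypothetical label convention


@dataclass
class Ticket:
    """Simplified stand-in for a record in the technical system of record."""
    key: str
    summary: str
    labels: List[str] = field(default_factory=list)
    comments: List[str] = field(default_factory=list)


def severity_of(ticket: Ticket) -> Optional[str]:
    """Read the severity from the first label that follows the convention."""
    for label in ticket.labels:
        if label.startswith(SEVERITY_PREFIX):
            return label[len(SEVERITY_PREFIX):]
    return None


def to_status_update(ticket: Ticket) -> dict:
    """Keep the summary and severity; leave the detailed comments behind."""
    return {"title": ticket.summary, "severity": severity_of(ticket)}
```

Because the severity lives in a label rather than in free text, the downstream step never has to parse the technical commentary, and nobody retypes the summary by hand.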
A data-rich pipeline allows people throughout the organization to easily access the piece(s) of information they need to perform their role. By feeding the pipeline with data from the technical system of record, the following stages of the process “inherit” their data from a single source. The users and systems downstream can then play to their strengths without having to maintain a complete picture of the incident — for example when marketing wants to send out Tweets about the incident or customer success is responding to tickets. This allows different teams to use tools they are already familiar with rather than struggling with an unfamiliar application in a stressful situation.
Once information has been gathered during an incident, the question of how to distribute it follows close behind. Data can either be pulled from systems within the organization (e.g., a ticketing system) or pushed to those who need it.
Pulls require regular polling for updates, with no guarantee that any new information is available. This creates wasted operations and taxes the system(s) being polled (such as Jira or a database) unnecessarily. If the polling is not fully automated, human time and effort are also consumed.
Pushes can be triggered on events, such as creating or updating a ticket. This is generally more efficient than the “poll and pull” option. Developers and SREs can automate pushes from one system to another with familiar tools such as webhooks and APIs. This model has additional benefits when data is well categorized and labeled. Specific data can be pushed from one system to another based on metadata like field names, tags, or labels. An example of this process would be creating a status page update or tweet based on the title of a ticket. Reusing data in this way doesn’t require people to re-enter or modify data manually.
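The push model might be sketched as a small webhook handler that decides, from a ticket's labels, which channels should receive its title. The event shape, event type names, label names, and channel names below are all hypothetical; a real integration would follow the payload format of whatever ticketing system sends the webhook and would then call each channel's API.

```python
import json
from typing import List, Tuple

# Hypothetical routing table: ticket label -> downstream channels.
ROUTES = {
    "customer-facing": ["status_page", "twitter"],
    "internal": ["slack"],
}

# Hypothetical event types that should trigger a push.
PUSH_EVENTS = {"ticket_created", "ticket_updated"}


def plan_pushes(event: dict) -> List[Tuple[str, str]]:
    """Map a ticket event to (channel, message) pairs based on its labels."""
    ticket = event["ticket"]
    message = ticket["summary"]  # reuse the title as-is; nothing is retyped
    pushes: List[Tuple[str, str]] = []
    for label in ticket.get("labels", []):
        for channel in ROUTES.get(label, []):
            if (channel, message) not in pushes:  # avoid duplicate sends
                pushes.append((channel, message))
    return pushes


def handle_webhook(body: str) -> List[Tuple[str, str]]:
    """Entry point for an incoming webhook body (a JSON string)."""
    event = json.loads(body)
    if event.get("type") not in PUSH_EVENTS:
        return []  # ignore events that should not trigger a push
    return plan_pushes(event)
```

Nothing runs until the ticketing system reports a change, so there is no polling loop, and the routing decision depends only on metadata that was attached to the ticket upstream.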
At Transposit, our mission is to help organizations accelerate the speed to resolution. By automating manual tasks, bringing together your entire stack, and creating a single source of truth, we give orgs the clarity, context, and process needed to execute incident response faster and with greater ease.
Communications during an incident should be an asset, not a barrier. Transposit helps organizations exchange information and automate communications across multiple channels with: