Create Shared Context During Incidents: How to Automate Investigation in Slack
From fully automated to human-in-the-loop — here’s how to automate incident investigation in Slack.
Picture this: It’s 5 am. An alert signals a critical system malfunction. Mark, still groggy, is thrust into a whirlwind of system logs and metrics, seeking answers. He’s pulled in different directions — AWS insights here, application metrics in Datadog. Slack messages flood in with teammates’ observations and theories. This maze of fragmented information feels like piecing together a jigsaw puzzle in the dark. There must be a better way.
With incident investigation typically being the longest and most expensive part of an incident, how can teams efficiently bring together data and context for a unified, collaborative approach?
The Imperative of Context
As soon as an incident occurs, the clock is ticking. The goals are clear: gauge impact, assess severity, and chart the remediation course. However, the journey is hindered by:
- Data Dispersion: Metrics spread across tools like AWS CloudWatch and AppDynamics mean manual, tedious searches.
- Blurred Views: Varied data sources can give team members different perspectives on the incident.
- Absent Records: The lack of a holistic decision-making trail complicates post-incident analysis.
How to Automate Incident Context in Slack
Transposit enables a human-in-the-loop approach to automation, meaning that you can pull the data you need when it’s most relevant. But many teams will find that they can go even further, fully automating some repetitive pieces of the investigative process.
Let’s break down how Transposit can help your team accelerate investigation while doing so collaboratively.
Seamless Slack Integration
For organizations that run on Slack, Transposit turns Slack into an incident command center, bringing:
- Unified Data: Minimized context-switching with consolidated data.
- Team Synergy: Real-time, cohesive insights ensure everyone’s on the same page.
- Transparency: Steps taken are visible to all, even to those joining in late.
- Documented Actions: Every measure taken gets recorded, offering a clear retrospective view.
Teams can take multiple approaches to automating investigation. Many teams start by creating scripts that can then be used by operators during the course of an incident — a human-in-the-loop approach. Team members can decide what data is necessary and when — easily running a script with a single click.
Fully Automated Workflows
Recognizing the patterns in your incident process helps you build a more robust automation strategy. Transposit automatically captures every action and conversation from your Slack channel, recording it all in a Timeline. By reviewing past incidents, it becomes clear what actions your team takes every time — the perfect fit for fully automated scenarios.
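As a rough illustration of that review step, the sketch below counts how often each action shows up across past incidents and returns the ones taken every time. The incident data, the action names, and the `actions_in_most_incidents` helper are all hypothetical; this is not Transposit's Timeline format or API, just one way the idea could be expressed.

```python
from collections import Counter


def actions_in_most_incidents(timelines, threshold=1.0):
    """Return actions seen in at least `threshold` (0.0-1.0) of the incidents.

    `timelines` is a list of incidents, each a list of action names.
    """
    if not timelines:
        return []
    counts = Counter()
    for actions in timelines:
        # Count each action once per incident, even if it was run repeatedly.
        counts.update(set(actions))
    total = len(timelines)
    return sorted(name for name, seen in counts.items() if seen / total >= threshold)


# Hypothetical action logs from three past incidents.
past_incidents = [
    ["check_aws_status", "pull_recent_commits", "restart_service"],
    ["check_aws_status", "pull_recent_commits", "pull_cpu_graph"],
    ["pull_recent_commits", "check_aws_status"],
]

print(actions_in_most_incidents(past_incidents))
# -> ['check_aws_status', 'pull_recent_commits']
```

Lowering `threshold` (for example to 0.6) would also surface actions that are common but not universal, which may be better suited to a one-click, human-in-the-loop script than to full automation.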
There are two ways you can fully automate investigative steps:
1. Trigger a script from incident creation
Many teams fully automate incident communication (automatically creating Slack channels and Zoom meetings). But why not add some initial investigation actions as well? These might include running an AWS service status check or pulling recent commits.
2. Trigger scripts when an incident has changed or a condition is met
You may also choose to be more precise with your automation. You can trigger scripts when an incident's state changes, for example when the severity is set to 0 or 1, or when the “impacted services” field is updated.
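The two trigger styles above can be sketched as a small rule function that compares an incident's previous and current state and decides which scripts to run. Everything here is an assumption made for illustration: the `IncidentState` fields, the script names, and the `scripts_to_run` function are hypothetical and do not represent Transposit's configuration or API.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class IncidentState:
    """Hypothetical snapshot of the incident fields the rules care about."""
    severity: Optional[int] = None
    impacted_services: FrozenSet[str] = frozenset()


def scripts_to_run(before: Optional[IncidentState], after: IncidentState) -> List[str]:
    """Return the names of scripts to trigger for a state transition.

    `before` is None when the incident has just been created.
    """
    scripts: List[str] = []

    if before is None:
        # Trigger style 1: run baseline investigation on incident creation.
        scripts += ["check_aws_status", "pull_recent_commits"]
        before = IncidentState()

    # Trigger style 2a: severity has just moved into the 0-1 range.
    if after.severity in (0, 1) and before.severity not in (0, 1):
        scripts.append("pull_cpu_memory_graphs")

    # Trigger style 2b: services newly added to the "impacted services" field.
    for service in sorted(after.impacted_services - before.impacted_services):
        scripts.append(f"pull_recent_commits:{service}")

    return scripts


# A new low-severity incident only gets the baseline checks.
print(scripts_to_run(None, IncidentState(severity=3)))

# Escalating to severity 1 and tagging a service triggers the targeted scripts.
print(scripts_to_run(
    IncidentState(severity=3),
    IncidentState(severity=1, impacted_services=frozenset({"checkout"})),
))
```

Comparing the old and new state, rather than looking only at the new one, keeps a script from firing again on every unrelated update to an incident that is already at severity 0 or 1.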
Here are some ways we see customers automating investigation:
- Pull recent commits from CircleCI, Jenkins, or GitHub to see whether any recent deployments failed.
- Check public status pages for AWS, GCP, Azure, DigitalOcean, CircleCI, GitHub, etc. to cover your third-party dependencies.
- Pull graphs from tools like CloudWatch, Azure Monitor, or Google Cloud Monitoring to check CPU, memory, or disk usage.
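To make the status page check concrete, here is a minimal sketch of a script that turns a status response into a one-line summary suitable for posting to a channel. It assumes a Statuspage-style JSON document with `status.indicator` and `status.description` fields, which is the format GitHub's public status endpoint uses to the best of our knowledge; other providers such as AWS publish status differently and would need their own parsing. The function names and the sample payload are illustrative, and `fetch_status` is defined but not called so the example runs without network access.

```python
import json
import urllib.request

# Assumed Statuspage-style endpoint; adjust per provider.
GITHUB_STATUS_URL = "https://www.githubstatus.com/api/v2/status.json"


def fetch_status(url, timeout=5.0):
    """Download and decode a status document (performs a network request)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize_status(name, payload):
    """Build a one-line summary from a Statuspage-style payload."""
    status = payload.get("status", {})
    indicator = status.get("indicator", "unknown")
    description = status.get("description", "no description")
    # In this format, an indicator of "none" means no reported problems.
    flag = "OK" if indicator == "none" else "CHECK"
    return f"[{flag}] {name}: {description} (indicator: {indicator})"


def slack_message(lines):
    """Wrap summary lines in the simple {"text": ...} body Slack webhooks accept."""
    return {"text": "\n".join(lines)}


# Sample payload shaped like the assumed format, used instead of a live call.
sample = {"status": {"indicator": "none", "description": "All Systems Operational"}}
print(summarize_status("GitHub", sample))
# -> [OK] GitHub: All Systems Operational (indicator: none)
```

In a live script, `summarize_status("GitHub", fetch_status(GITHUB_STATUS_URL))` would replace the sample payload, and the result of `slack_message([...])` could be sent to the incident channel.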
Ready to automate incident investigation?
The traditional ways of incident management, characterized by manual tasks, fragmented views, and lack of transparency, don’t keep pace with the agility of DevOps teams. By using Transposit in Slack, teams are empowered to quickly gather data, share context, and maintain a clear audit trail. Ultimately, by adding automation to this difficult and often time-consuming part of incident management, you’ll be able to resolve incidents faster, with less stress.