Introduction

New to setting up Transposit? Learn how to get started.

Each organization takes a different approach to setting up and configuring its incident resolution processes, taking into account its unique needs, teams, and structure.

No two processes look exactly alike, but two styles are commonly used: ITIL, as taught through Information Technology Infrastructure Library certifications, and the approach used in DevOps and site reliability engineering (SRE).

The ITIL incident management process

The ITIL approach to incident management relies on a formal incident management plan, structured as defined steps that map to roles. The ITIL incident management process is one of the most widely adopted IT frameworks. It follows these steps:

  1. Identification and logging. Identify an incident through testing, user feedback, infrastructure monitoring, or another measure, and log the incident for future reference.

    To log an incident, record:

    • The exact or approximate date and time of the occurrence
    • A brief description of the incident that includes title and error code where applicable
    • The name of the person who logged the incident
    • Details of the person assigned to the incident for follow-up
    • The current status of the incident
    • Relevant attachments, including technical discussions, decisions, and approvals

  2. Classification and prioritization. Categorize the incident based on its type (e.g., software, hardware, or service request). Prioritize the incident based on its impact, severity, and level of risk so that data from tracked incidents can inform better business decisions and problem management.

  3. Investigation and analysis. Investigate the details of the incident to determine how to resolve it, and gather information to prevent it from happening again. To determine the root cause, form and test hypotheses until one yields a diagnosis.

  4. Resolution and recovery. After diagnosing the incident and determining how to resolve it, implement the resolution, test it, and bring the system back to its previous working condition.

  5. Closure. Retest the solution. If everything is working as intended and the user who reported the incident confirms that the service is restored, the incident is marked as resolved and closed.
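To make the logging and classification steps above more concrete, here is a minimal Python sketch of an incident record. It is an illustration only, not part of ITIL or of any product: the class and field names, the status values, the 1 to 3 scoring scale, and the P1/P2/P3 labels are all assumptions. Level of risk is left out of the score for brevity.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Category(Enum):
    SOFTWARE = "software"
    HARDWARE = "hardware"
    SERVICE_REQUEST = "service request"


class Status(Enum):
    LOGGED = "logged"
    CLASSIFIED = "classified"
    RESOLVED = "resolved"
    CLOSED = "closed"


def compute_priority(impact: int, severity: int) -> str:
    """Map hypothetical 1-3 impact and severity scores to a priority label."""
    for name, value in (("impact", impact), ("severity", severity)):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2, or 3")
    score = impact * severity
    if score >= 6:
        return "P1"  # most urgent
    if score >= 3:
        return "P2"
    return "P3"


@dataclass
class Incident:
    # Fields mirror the items recorded when an incident is logged.
    title: str
    description: str
    logged_by: str
    assignee: str
    occurred_at: datetime
    error_code: Optional[str] = None
    status: Status = Status.LOGGED
    category: Optional[Category] = None
    priority: Optional[str] = None
    attachments: List[str] = field(default_factory=list)

    def classify(self, category: Category, impact: int, severity: int) -> None:
        """Record the category and the priority derived from the scores."""
        self.category = category
        self.priority = compute_priority(impact, severity)
        self.status = Status.CLASSIFIED

    def close(self, reporter_confirmed: bool) -> None:
        """Close only after resolution and confirmation from the reporter."""
        if self.status is not Status.RESOLVED or not reporter_confirmed:
            raise ValueError("incident must be resolved and confirmed first")
        self.status = Status.CLOSED


# Made-up example data, for illustration only.
incident = Incident(
    title="Checkout returns HTTP 500",
    description="Payments fail at the final checkout step.",
    logged_by="alice",
    assignee="bob",
    occurred_at=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    error_code="HTTP 500",
)
incident.classify(Category.SOFTWARE, impact=3, severity=2)
print(incident.priority)  # P1
```

Keeping status and priority on the same record as the log entry means that tracked incidents can later be filtered and counted by category or priority, which is what allows the data to feed problem management.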

The DevOps and SRE incident management process

A newer, less structured, but equally effective approach to incident management stems from DevOps teams and SREs. It’s more of a culture than a framework, and several key elements define its character.

  • Preparedness. DevOps and SRE engineers value data and put metrics front and center within incident management. Through the continuous refinement of measures that monitor performance and identify issues, detection becomes proactive and forms a part of everyday operations. This approach prevents incidents from becoming serious by making sure they are detected early and met with a plan. Furthermore, with the right telemetry, predictive analysis can be used to foresee incidents and even prevent them outright. Each incident teaches the team how to better prepare for the next.

  • Collaboration. While the ITIL incident management framework maps to individual roles, teams take the spotlight with DevOps and SRE incident management. There’s never just one person responsible for resolution, because pooling resources supports efficiency and valuable insights can be found across an organization. It’s about skills, not job titles. Still, the people who built the system know it best and are certain to be involved in fixing it.

  • Continuous learning. Including engineering teams in the incident management process holds them accountable for their work and prevents conflict between departments. This ensures that solving problems takes priority. End-to-end involvement means that with each incident, engineers learn from mistakes and improve multiple capabilities. The knowledge they gain helps them to prepare for and solve future incidents, as well as to minimize incident occurrences by adjusting the way they build.

These key elements demonstrate how comprehensive the DevOps and SRE incident management approach is. While there’s no standard set of steps, the process does follow some general stages:

  1. Detection. Through strategic monitoring of whole systems using continuously optimized tools, teams routinely expose vulnerabilities and detect incidents early as part of their everyday work. Detection isn’t limited to a single role.

  2. Response and resolution. After an issue is detected, the incident commander takes charge of coordinating the response. This includes bringing the right people together to investigate the incident, gathering data to determine actions to take, and communicating with internal and external stakeholders.

  3. Analysis and preparation. Incidents serve as learning experiences and spark continuous improvement. As part of a post-incident review or retrospective, the data is analyzed — without blame — and applied to action plans, documentation, and runbooks for future incidents or to development work that could prevent them.
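As one illustration of the detection stage, the Python sketch below flags a potential incident when a metric stays above a threshold for several consecutive readings. The class name, the 5% threshold, the window of three samples, and the sample readings are all assumptions made for this example. Real monitoring tools offer far richer alerting rules.

```python
from collections import deque


class ErrorRateDetector:
    """Flag an incident when the error rate stays above a threshold."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        # Keep only the most recent `window` readings.
        self.samples = deque(maxlen=window)

    def observe(self, error_rate: float) -> bool:
        """Record a reading; return True if every reading in a full window is high."""
        self.samples.append(error_rate)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold for s in self.samples)


# Made-up readings, for illustration only.
detector = ErrorRateDetector(threshold=0.05, window=3)
readings = [0.01, 0.08, 0.09, 0.07, 0.02]
alerts = [detector.observe(r) for r in readings]
print(alerts)  # [False, False, False, True, False]
```

Requiring a full window of high readings trades a little detection speed for fewer false alarms from a single noisy sample. Tuning the threshold and window is part of the continuous refinement described earlier.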

Best practices for incident management

The right steps and the right tools form the foundation of effective incident management, but results depend on taking the right approach. Four key practices optimize the incident management process:

  • Automate incident response. Automating typical actions, like creating a Jira ticket and Zoom meeting or running an AWS service check, takes the pressure off strained response teams.
  • Create consistent processes. Codifying free-form processes into runbooks improves consistency and speed, ensures new teammates can execute incident management processes, and enables more thorough and accurate retrospectives. Consistent processes also allow humans to be brought in strategically to use their judgement and insight when it’s most valuable.
  • Use a system of engagement. Consolidating all data and communication about an incident in one location, whether within Slack, Microsoft Teams, or another platform, facilitates effective collaboration and decision-making.
  • Centralize stakeholder updates. Centralizing updates ensures stakeholders can easily stay informed of an incident’s status through a single dashboard without interrupting others’ work. It also supports the collaboration of on-call engineers in a more insulated environment that promotes focus and actionability.
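The first two practices can be pictured as a runbook expressed in code: an ordered list of steps, some automated and some left to human judgement, with every step recorded for the retrospective. The Python sketch below is a hypothetical illustration. The `Step` class, the step functions, and the `run` helper are invented for this example, and the placeholder actions do not call any real ticketing or meeting API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str
    action: Callable[[Dict], str]  # receives incident context, returns a note
    automated: bool = True


def create_ticket(ctx: Dict) -> str:
    # Placeholder: a real step would call the ticketing system here.
    ctx["ticket"] = f"TICKET-{ctx['incident_id']}"
    return f"created {ctx['ticket']}"


def open_bridge(ctx: Dict) -> str:
    # Placeholder: a real step would start a call and share the link.
    ctx["bridge"] = f"bridge-for-{ctx['incident_id']}"
    return f"opened {ctx['bridge']}"


def decide_rollback(ctx: Dict) -> str:
    # Manual step: the runbook only records that a human decision is pending.
    return "waiting for on-call engineer to decide on rollback"


def run(runbook: List[Step], ctx: Dict) -> List[str]:
    """Execute steps in order and return a timeline for the retrospective."""
    timeline = []
    for step in runbook:
        kind = "auto" if step.automated else "manual"
        timeline.append(f"[{kind}] {step.name}: {step.action(ctx)}")
    return timeline


runbook = [
    Step("Create ticket", create_ticket),
    Step("Open bridge", open_bridge),
    Step("Decide on rollback", decide_rollback, automated=False),
]
for line in run(runbook, {"incident_id": 42}):
    print(line)
```

Because every step runs in the same order and leaves a line in the timeline, a new teammate can execute the process without prior knowledge, and the retrospective starts from a complete record rather than from memory.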

Next Steps