Each organization sets up and configures its incident resolution process differently to account for its unique needs, teams, and structure.
No two processes look exactly alike, but two styles are commonly used: ITIL (Information Technology Infrastructure Library), which is taught through ITIL certifications, and the approach used by DevOps and site reliability engineering (SRE) teams.
The IT approach to incident management uses a strong incident management plan, structured with defined steps that map to roles. The ITIL incident management process is one of the most widely adopted IT frameworks. It follows these steps:
Identification and logging. Identify an incident through testing, user feedback, infrastructure monitoring, or another measure, and log the incident for future reference.
To log an incident, record:
Classification and prioritization. Categorize the incident based on its type (e.g., software, hardware, or service request). Prioritize the incident based on its impact, severity, and level of risk so that data from tracked incidents can inform business decisions and problem management.
Investigation and analysis. Investigate the details of the incident to determine how to resolve it, and gather information to prevent it from happening again. Form a hypothesis about the root cause and test it to reach a diagnosis.
Resolution and recovery. After diagnosing the incident and determining how to resolve it, implement the resolution, test it, and bring the system back to its previous working condition.
Closure. Retest the solution. If everything is working as intended and the user who reported the incident confirms that the service is restored, mark the incident as resolved.
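The steps above can be sketched as a small state machine. This is a minimal illustration, not part of ITIL itself: the `Incident` class, the status names, the 1 to 3 rating scale, and the priority cutoffs are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    LOGGED = 1
    CLASSIFIED = 2
    INVESTIGATING = 3
    RESOLVED = 4
    CLOSED = 5


# Each status may only advance to the next step in the process.
NEXT_STATUS = {
    Status.LOGGED: Status.CLASSIFIED,
    Status.CLASSIFIED: Status.INVESTIGATING,
    Status.INVESTIGATING: Status.RESOLVED,
    Status.RESOLVED: Status.CLOSED,
}


def priority(impact: int, severity: int, risk: int) -> str:
    """Map 1-3 ratings (3 = worst) to a label. The cutoffs are hypothetical."""
    for rating in (impact, severity, risk):
        if rating not in (1, 2, 3):
            raise ValueError("ratings must be 1, 2, or 3")
    score = impact + severity + risk  # ranges from 3 to 9
    if score >= 8:
        return "P1"
    if score >= 6:
        return "P2"
    return "P3"


@dataclass
class Incident:
    summary: str
    category: str = "unclassified"
    priority: str = "unset"
    status: Status = Status.LOGGED
    history: list = field(default_factory=list)

    def advance(self) -> None:
        """Move to the next step, keeping a record of completed steps."""
        if self.status is Status.CLOSED:
            raise ValueError("incident is already closed")
        self.history.append(self.status)
        self.status = NEXT_STATUS[self.status]

    def classify(self, category: str, impact: int, severity: int, risk: int) -> None:
        """Record the category and priority, then mark the incident classified."""
        if self.status is not Status.LOGGED:
            raise ValueError("only a newly logged incident can be classified")
        self.category = category
        self.priority = priority(impact, severity, risk)
        self.advance()


incident = Incident("Checkout page returns errors")
incident.classify("software", impact=3, severity=3, risk=2)
print(incident.priority, incident.status.name)  # P1 CLASSIFIED
```

Forcing every incident through the same ordered transitions reflects the structured, step-by-step character of this approach. A real service desk tool would also track owners, timestamps, and escalation rules.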
A newer, less structured, but equally effective approach to incident management stems from DevOps teams and SREs. It’s more of a culture than a framework, and several key elements define its character.
Preparedness. DevOps and SRE engineers value data and put metrics front and center within incident management. Through the continuous refinement of measures that monitor performance and identify issues, detection becomes proactive and forms a part of everyday operations. This approach prevents incidents from becoming serious by making sure they are detected early and met with a plan. Furthermore, with the right telemetry, predictive analysis can be used to foresee incidents and even prevent them outright. Each incident teaches the team how to better prepare for the next.
Collaboration. While the ITIL incident management framework maps to individual roles, teams take the spotlight with DevOps and SRE incident management. There’s never just one person responsible for resolution, because pooling resources supports efficiency, and valuable insights can be found across an organization. It’s about skills, not job titles. Still, the people who built the system know it best and are certain to be involved in fixing it.
Continuous learning. Including engineering teams in the incident management process holds them accountable for their work and prevents conflict between departments, which keeps the focus on solving problems. End-to-end involvement means that with each incident, engineers learn from mistakes and improve multiple capabilities. The knowledge they gain helps them to prepare for and solve future incidents, as well as to minimize incident occurrences by adjusting the way they build.
These key elements demonstrate how comprehensive the DevOps and SRE incident management approach is. While there’s no standard set of steps, the process does follow some general stages:
Detection. Through strategic monitoring of whole systems using continuously optimized tools, teams regularly expose vulnerabilities and detect incidents early as part of their regular work. Detection isn’t limited to a single role.
Response and resolution. After an issue is detected, the incident commander takes charge of coordinating the response. This includes bringing the right people together to investigate the incident, gathering data to determine actions to take, and communicating with internal and external stakeholders.
Analysis and preparation. Incidents serve as learning experiences and spark continuous improvement. As part of a post-incident review or retrospective, the data is analyzed — without blame — and applied to action plans, documentation, and runbooks for future incidents or to development work that could prevent them.
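As one illustration of metric-driven detection, the sketch below flags a possible incident when the error rate over a sliding window of recent intervals crosses a threshold. The `ErrorRateMonitor` class, the window size, and the threshold are hypothetical; real monitoring tools offer far richer alerting rules.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks errors and requests over the most recent intervals."""

    def __init__(self, window: int = 5, threshold: float = 0.2):
        # deque(maxlen=...) discards the oldest interval automatically.
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, errors: int, requests: int) -> bool:
        """Record one interval; return True if the windowed error rate exceeds the threshold."""
        self.samples.append((errors, requests))
        total_errors = sum(e for e, _ in self.samples)
        total_requests = sum(r for _, r in self.samples)
        if total_requests == 0:
            return False  # no traffic, so nothing to judge
        return total_errors / total_requests > self.threshold


monitor = ErrorRateMonitor(window=3, threshold=0.1)
readings = [(1, 100), (2, 100), (30, 100), (40, 100)]  # (errors, requests) per interval
alerts = [monitor.observe(errors, requests) for errors, requests in readings]
print(alerts)  # [False, False, True, True]
```

Averaging over a window rather than alerting on a single interval is a design choice that trades a little detection speed for fewer false alarms. Tuning the window and threshold over time is one concrete form of the continuous refinement described above.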
The right steps and the right tools form the foundation of effective incident management, but results depend on taking the right approach. Four key practices optimize the incident management process: