Incident Management

Incident management is any process used in IT operations or DevOps for logging, recording, and resolving events that hinder business performance to restore service as quickly as possible.

Iggy
Apr 5th, 2022

In this post you will learn:

What is incident management?
Why is incident management important?
What does the process look like?
Incident management for DevOps and SRE
How incident management gets done
Best practices for incident management

Disruptive incidents can arise in any organization; therefore, incident management is imperative to combat outages, secure services, and ensure reliability. IT incidents can range from minor events that require nothing more than a review to major service interruptions that cause loss of revenue or reputational damage. The work of managing them, which is often urgent and complex, puts strain on IT teams. This makes incident management a critical success factor for any organization.

What is incident management?

Incident management is any process used in IT operations or DevOps for logging, recording, and resolving events that hinder business performance to restore service as quickly as possible. For example, network latency issues, container failures, unresponsive DNS servers, and outages caused by unoptimized database queries all count as incidents.

Distinct from processes for resolving bugs, defects, or problems that surface during testing, incident management applies to issues that arise when a product is live. Its core purpose is to resolve incidents quickly and efficiently.

However, the review process that follows an incident helps to identify causes and generates learnings that can mitigate future incidents. This step shares themes with problem management, which focuses on streamlining operations to address problems at their root.

Why is incident management important?

An incident management process enables organizations to confront issues immediately and mitigate negative consequences, which can be significant.

Revenue and customer satisfaction

The loss of customers and of business revenue can directly result from an incident and the response that follows. Poorly managed incidents keep organizations from delivering the level of service that customers expect. Customers may experience obstacles to their own productivity and bottom line or other frustrations that affect their happiness—and their loyalty.

Compliance

Many cybersecurity and data protection regulations require organizations to use incident management to protect sensitive data. Failure to establish a formal process or to prevent a breach could incur financial penalties and cause reputational damage.

Stress

Incidents can occur 24/7. This means many professionals in this field work on-call, often in a state of urgency, and burnout can manifest easily. An effective process can harmonize monitoring systems to minimize alerts, so on-call staff won’t be notified unnecessarily, and manage other pain points to lower stress. An organization’s approach to incident management factors into its success with hiring and retaining staff in these critical roles.
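As a rough illustration of minimizing unnecessary notifications, the sketch below suppresses repeat alerts for the same service and check inside a quiet window. The Alert fields, the suppress_duplicates function, and the 10-minute window are assumptions made for this example, not part of any particular monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    service: str        # hypothetical field: service that raised the alert
    check: str          # hypothetical field: name of the failing check
    fired_at: datetime  # when the monitoring system raised the alert


def suppress_duplicates(alerts, window=timedelta(minutes=10)):
    """Return only the alerts that should notify on-call staff.

    An alert is suppressed when the same (service, check) pair has
    already notified someone less than `window` ago.
    """
    last_paged = {}
    to_page = []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        key = (alert.service, alert.check)
        previous = last_paged.get(key)
        if previous is None or alert.fired_at - previous >= window:
            to_page.append(alert)
            last_paged[key] = alert.fired_at
    return to_page
```

With a window like this, a check that keeps failing every minute would notify on-call staff once per window rather than once per failure. The right window length depends on the team and the service.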

What does the process look like?

Each organization takes a different approach to incident management that accounts for its unique needs, teams, and structure. No two processes look exactly alike, but two styles are commonly used: ITIL (the Information Technology Infrastructure Library, taught through ITIL certifications), and DevOps and site reliability engineering (SRE).

The ITIL incident management process

The IT approach to incident management uses a strong incident management plan, structured with defined steps that map to roles. The ITIL incident management process is one of the most widely adopted IT frameworks. It follows these steps:

1. Identification and logging

Identify an incident through testing, user feedback, infrastructure monitoring, or another measure, and log the incident for future reference.

To log an incident, record:

The exact or approximate date and time of the occurrence
A brief description of the incident that includes title and error code where applicable
The name of the person who logged the incident
Details of the person assigned to the incident for follow-up
The current status of the incident
Relevant attachments, including technical discussions, decisions, and approvals
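The fields listed above map naturally onto a small record type. The following is a minimal sketch of such a record; the class name, field names, and status values are illustrative assumptions rather than a schema prescribed by ITIL.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional


class Status(Enum):
    # Assumed status values; real trackers define their own workflow.
    OPEN = "open"
    IN_PROGRESS = "in_progress"
    RESOLVED = "resolved"
    CLOSED = "closed"


@dataclass
class IncidentRecord:
    occurred_at: datetime             # exact or approximate time of occurrence
    title: str                        # short title for the incident
    description: str                  # brief description of what happened
    logged_by: str                    # person who logged the incident
    error_code: Optional[str] = None  # included only where applicable
    assignee: Optional[str] = None    # person assigned for follow-up
    status: Status = Status.OPEN      # current status of the incident
    attachments: List[str] = field(default_factory=list)  # links to discussions, decisions, approvals


record = IncidentRecord(
    occurred_at=datetime(2022, 4, 5, 9, 30),
    title="DNS server unresponsive",
    description="Internal resolver stopped answering queries.",
    logged_by="on-call engineer",
)
```

Only the fields that are always known at logging time are required here; the assignee, error code, and attachments can be filled in as the incident progresses.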

2. Classification and prioritization

Categorize the incident based on its type (e.g., software, hardware, or service request). Prioritize the incident based on its impact, severity, and level of risk. Consistent classification also means that data from tracked incidents can inform better business decisions and problem management.
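One simple way to turn impact, severity, and risk into a priority is to rate each on a small scale and combine the ratings. The sketch below assumes a three-level scale, an additive score, and P1 to P4 labels; the thresholds are arbitrary choices for illustration, and each organization would tune its own.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def prioritize(impact: Level, severity: Level, risk: Level) -> str:
    """Map three ratings to a priority label (assumed thresholds)."""
    score = impact + severity + risk  # ranges from 3 to 9
    if score >= 8:
        return "P1"
    if score >= 6:
        return "P2"
    if score >= 4:
        return "P3"
    return "P4"
```

For instance, prioritize(Level.HIGH, Level.HIGH, Level.MEDIUM) scores 8 and lands in the top tier under these assumed thresholds.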

3. Investigation and analysis

Investigate the details of the incident to determine how to resolve it, and gather information to prevent it from happening again. Form a hypothesis about the root cause, then test it to arrive at a diagnosis.

4. Resolution and recovery

After diagnosing the incident and determining how to resolve it, implement the resolution, test it, and bring the system back to its previous working condition.

5. Closure

Retest the solution. If everything is working as intended and the user who reported the incident confirms that the service is restored, mark the incident as resolved and close it.

The DevOps and SRE incident management process

A newer, less structured, but equally effective approach to incident management stems from DevOps teams and SREs. It’s more of a culture than a framework, and several key elements define its character.

Preparedness

DevOps and SRE engineers value data and put metrics front and center within incident management. Through the continuous refinement of measures that monitor performance and identify issues, detection becomes proactive and forms a part of everyday operations. This approach prevents incidents from becoming serious by making sure they are detected early and met with a plan. Furthermore, with the right telemetry, predictive analysis can be used to foresee incidents and even prevent them outright. Each incident teaches the team how to better prepare for the next.

Collaboration

While the ITIL incident management framework maps to individual roles, teams take the spotlight with DevOps and SRE incident management. There’s never just one person responsible for resolution because pooling resources supports efficiency and valuable insights can be found across an organization. It’s about skills, not job title. Still, the people who built the system know it best and are certain to be involved in fixing it.

Continuous learning

Including engineering teams in the incident management process holds them accountable for their work and prevents conflict between departments. This ensures that solving problems takes priority. End-to-end involvement means that with each incident, engineers learn from mistakes and improve multiple capabilities. The knowledge they gain helps them to prepare for and solve future incidents, as well as to minimize incident occurrences by adjusting the way they build.

These key elements demonstrate how comprehensive the DevOps and SRE incident management approach is. While there’s no standard set of steps, the process does follow some general stages:

1. Detection

Through strategic monitoring of whole systems using continuously optimized tools, teams regularly expose vulnerabilities and detect incidents early as part of their regular work. Detection isn’t limited to a single role.

2. Response and resolution

After an issue is detected, the incident commander takes charge of coordinating the response. This includes bringing the right people together to investigate the incident, gathering data to determine actions to take, and communicating with internal and external stakeholders.

3. Analysis and preparation

Incidents serve as learning experiences and spark continuous improvement. As part of a post-incident review or retrospective, the data is analyzed — without blame — and applied to action plans, documentation, and runbooks for future incidents or to development work that could prevent them.

How incident management gets done

People drive effective incident management, and assigning roles to different contributors helps manage the process. These are three common roles in incident management:

Commanders lead coordination and execution, making sure the right people are involved and removing roadblocks. They’re in charge of updating key internal stakeholders and an external-facing status page.
Investigators support the commander in running investigations, pulling and analyzing logs, reviewing metrics, and determining a course of action for mitigating each issue.
External communicators ensure updates from the commander reach the right external stakeholders, such as customers and partners.

Effective incident management involves storing, filtering, and managing data in a centralized way. This allows teams to address problems systematically, instead of on an ad-hoc or reactive basis, giving them more oversight and improving their ability to stop problems early.

Using the right processes and tools promotes clear communication, both to stakeholders and among collaborators, and ensures lessons learned from incidents can be applied in the future:

Monitoring and analytics systems provide a continuous, holistic view of infrastructure health and supply data to support detection.
Service desks can make reporting incidents easy for users.
Alerting functions quickly notify the right people when an incident is detected.
Incident trackers and dashboards consolidate information about an incident and convey its status in real time.
Documentation tools store relevant analyses, insights, processes, and plans for reference.
Instant messaging and virtual meeting services keep teams and stakeholders connected and facilitate collaboration.

Best practices for incident management

The right steps and the right tools form the foundation of effective incident management, but results depend on taking the right approach. Four key practices optimize the incident management process:

Automate incident response: Automating typical actions, like creating a Jira ticket and Zoom meeting or running an AWS service check, takes the pressure off strained response teams.
Create consistent processes: Codifying free-form processes into runbooks improves consistency and speed, ensures new teammates can execute incident management processes, and enables more thorough and accurate retrospectives. Consistent processes also allow humans to be brought in strategically to use their judgement and insight when it’s most valuable.
Use a system of engagement: Consolidating all data and communication about an incident in one location, whether within Slack, Microsoft Teams, or another platform, facilitates effective collaboration and decision-making.
Centralize stakeholder updates: Centralizing updates ensures stakeholders can easily stay informed of an incident’s status through a single dashboard without interrupting others’ work. It also supports the collaboration of on-call engineers in a more insulated environment that promotes focus and actionability.
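To make the first two practices concrete, a codified runbook can be modeled as an ordered list of named steps that run automatically and hand off to a human when one fails. The sketch below is only an illustration: the step functions are placeholders that return strings instead of calling any real ticketing, meeting, or cloud API, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, str]
Step = Tuple[str, Callable[[Context], str]]


# Placeholder actions: a real runbook would call the team's own tools here.
def open_ticket(ctx: Context) -> str:
    return f"ticket opened for '{ctx['title']}'"


def start_meeting(ctx: Context) -> str:
    return f"meeting started for '{ctx['title']}'"


def check_service(ctx: Context) -> str:
    return f"health check requested for {ctx['service']}"


def run_runbook(steps: List[Step], context: Context) -> List[str]:
    """Run steps in order and return a timeline of what happened.

    Execution stops at the first failing step so a person can decide
    how to proceed.
    """
    timeline = []
    for name, action in steps:
        try:
            timeline.append(f"{name}: {action(context)}")
        except Exception as exc:
            timeline.append(f"{name}: FAILED ({exc}); escalate to a human")
            break
    return timeline


runbook = [
    ("Open ticket", open_ticket),
    ("Start meeting", start_meeting),
    ("Check service", check_service),
]
timeline = run_runbook(runbook, {"title": "API latency", "service": "api"})
```

Because each run returns a timeline, the same structure that automates the response also produces a record that could feed a retrospective.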

Modernize incident management with Transposit

Transposit’s fully integrated, data-driven approach brings people, process, and APIs together, helping teams accelerate incident response, reduce mean time to resolution (MTTR), and meet service level objectives (SLOs).
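For readers unfamiliar with the metric, MTTR as used here is the average time from detecting an incident to resolving it. The sketch below shows one way to compute it; treating each incident as a (detected, resolved) pair of timestamps is an assumption of this example and does not describe how any particular product calculates it.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_resolution(
    incidents: List[Tuple[datetime, datetime]]
) -> timedelta:
    """Average of (resolved - detected) over all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


history = [
    (datetime(2022, 4, 1, 10, 0), datetime(2022, 4, 1, 10, 30)),  # 30 minutes
    (datetime(2022, 4, 3, 22, 0), datetime(2022, 4, 3, 23, 30)),  # 90 minutes
]
mttr = mean_time_to_resolution(history)  # one hour for this sample data
```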

Automate the “every time there’s an incident” tasks: Automate repetitive actions like creating a Jira ticket, PagerDuty incident, Slack channel, and Zoom meeting.
Classify incidents swiftly, with data at your fingertips: Pull the graphs, logs, and data your teams need to determine customer impact and severity, collaboratively through chat or the Transposit app.
Dispatch, inform and notify — with a single click: Dispatch on-call teams through PagerDuty, inform stakeholders like customer success and executives in email or chat, and notify customers in Statuspage — all through a single chain of actions in a runbook.
Mitigate customer impact faster: Take action across your stack to mitigate incidents, from rolling back a recent Octopus release to scaling an ECS instance. Even auto-remediate based on CPU, RAM, or any other signal.
Turn failure into opportunity: Continuously learn and improve with automatic timelines and incident reports that provide the holistic picture teams need to prevent future incidents and ensure reliability. Learn more about modernizing incident management with Transposit.
