Another Thursday, another day of sluggish performance on the platform. The operations team is starting to see a pattern in the data. They’ve looked over a repository of recurring low-level incidents, and DNS hang-ups have been showing up regularly.
A workaround is in progress, but applying a bandage won’t fix the underlying problem: the system is vulnerable to DNS issues because it relies on only one provider. To solve the problem and make the service more resilient, the team needs to add a second DNS provider and balance queries across both.
They’ve looked at the trends to identify a systemic problem, gathered the information they need to address it, and made a plan to resolve it. The goal is to prevent future incidents that might have occurred without these efforts and improve service overall. All in a Thursday’s work for teams practicing problem management.
What is problem management?
Problem management is a process for holistically evaluating systems, processes, and performance to find and implement better ways of working that prevent incidents or generally improve the service. It means looking deeply at patterns that point to contributing factors and finding solutions that last. The problem management framework helps drive continuous improvement, and it’s an essential part of IT service management.
What is the problem management process?
The problem management process follows a defined structure:
- Anticipate: Keep tabs on systems and workflows and identify ways to optimize them. Act as a resiliency advocate, calling for improvements that anticipate and prevent potential problems.
- Detect: Seek out problems that need to be managed. Monitoring performance and reliability data for trends, reviewing past incident retrospectives to identify patterns, and auditing team processes are all ways to detect a problem.
- Evaluate: Categorize the problem and decide its priority. It might stay at this stage for a while if there’s more important work to do.
- Investigate: Look into the problem and identify its contributing factors. Create a record within the known error database.
- Plan: Plot out how to remediate the problem.
- Resolve: In the best-case scenario, the previous steps would bring clarity to the problem and produce a good plan for fixing it. The fix would eliminate the problem and prevent future incidents, bringing the problem to a close. However, sometimes the process doesn’t produce a full solution. In this case, it may be necessary to build a temporary workaround, then go through the steps again.
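One way to picture the steps above is as a small state machine attached to each problem record. The sketch below is illustrative only: the `Problem` class, its field names, and the `Stage` enum are hypothetical, Anticipate is treated as ongoing background work rather than a stage of a single record, and sending an unresolved problem back to Detect is one assumed reading of "go through the steps again."

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    DETECT = "detect"
    EVALUATE = "evaluate"
    INVESTIGATE = "investigate"
    PLAN = "plan"
    RESOLVE = "resolve"
    CLOSED = "closed"


# Order in which a detected problem moves through the process.
PIPELINE = [Stage.DETECT, Stage.EVALUATE, Stage.INVESTIGATE, Stage.PLAN, Stage.RESOLVE]


@dataclass
class Problem:
    """Hypothetical problem record; names are illustrative, not a real schema."""
    title: str
    stage: Stage = Stage.DETECT
    workaround: Optional[str] = None
    history: List[Stage] = field(default_factory=list)

    def advance(self) -> None:
        """Move the problem to the next stage in the pipeline."""
        if self.stage in (Stage.RESOLVE, Stage.CLOSED):
            raise ValueError("use resolve() to leave the Resolve stage")
        self.history.append(self.stage)
        self.stage = PIPELINE[PIPELINE.index(self.stage) + 1]

    def resolve(self, fixed: bool, workaround: Optional[str] = None) -> None:
        """Close the problem, or record a workaround and restart the cycle."""
        if self.stage is not Stage.RESOLVE:
            raise ValueError("problem has not reached the Resolve stage")
        self.history.append(self.stage)
        if fixed:
            self.stage = Stage.CLOSED
        else:
            # Assumption: an unresolved problem goes back to the Detect stage.
            self.workaround = workaround
            self.stage = Stage.DETECT
```

Keeping the workaround on the record itself means an open problem with a workaround stays visible as unfinished work instead of looking closed.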
Proactivity is a key feature of effective IT problem management.
Managing problems proactively
For effective problem management, teams work proactively to prevent incidents. While it’s impossible to see into the future, there’s still plenty teams can do to manage problems before they arise. Preventative action is essential.
For example, a team could configure its servers with the flexibility to handle fluctuating loads. Balancing the load across multiple containers could prevent an outage that surfaces as 503 Service Unavailable errors.
This approach could also include identifying clients affected by an issue before they notice and report it themselves. With something like a memory leak in a current build, leadership could decide to send an email to those affected, encouraging them to upgrade.
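The load-balancing idea mentioned above can be sketched in a few lines. This is a minimal round-robin example under stated assumptions, not a production balancer: the `RoundRobinBalancer` class, the backend names, and the `is_healthy` callback are all hypothetical, and a real deployment would normally rely on an existing load balancer or orchestrator rather than hand-written code.

```python
from itertools import cycle
from typing import Callable, Iterable


class RoundRobinBalancer:
    """Rotate requests across backends, skipping any that fail a health check."""

    def __init__(self, backends: Iterable[str], is_healthy: Callable[[str], bool]):
        self._backends = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        self._rotation = cycle(self._backends)
        self._is_healthy = is_healthy

    def pick(self) -> str:
        """Return the next healthy backend, or raise if none is available."""
        # Try each backend at most once per call so this never loops forever.
        for _ in range(len(self._backends)):
            candidate = next(self._rotation)
            if self._is_healthy(candidate):
                return candidate
        # In an HTTP service, this is where a 503 would be returned.
        raise RuntimeError("no healthy backend available")


# Hypothetical usage: container-b is down, so traffic rotates between a and c.
down = {"container-b"}
balancer = RoundRobinBalancer(
    ["container-a", "container-b", "container-c"],
    is_healthy=lambda name: name not in down,
)
picks = [balancer.pick() for _ in range(4)]
```

With one container out of service, requests keep flowing to the remaining two; the failure case only appears when every backend is unhealthy at once.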
A workaround is a temporary solution. It acts as a bandage to keep incidents at bay while the team continues to investigate the target problem. Teams use workarounds when there’s no clear way forward, but it’s important not to leave them in place for too long. Problem management must resume.
Because they’re not designed to last, workarounds can break — and become problems themselves. So, problem management includes a sub-process for workaround management.
Different roles contribute to different parts of the process.
Who does problem management?
Problem management relies on several key roles:
- DevOps: This role comes with a culture of continuous improvement and chases problems with gusto. In other words, DevOps is proactive. It's the lifeblood of problem management.
- Site reliability engineer (SRE): This role applies a developer skill set to automate and improve the resilience of the service. SREs identify and resolve problems within their wheelhouse.
- Customer support engineer: People in this role are frontline workers who handle client-reported issues (and deliver the fixes). They must respond reactively.
- Leadership: This role doesn’t make the fixes, but it approves them. Leaders oversee the process in a way that’s active and engaged.
Though these roles are dedicated to their areas of expertise, they need to work together, too. It helps when they’re empowered to stay connected through the tools they use.
Helpful problem management tools support tracking:
- Ticketing system: Teams track identified problems, and the progress they make on them, using tickets. Spotless records set the team up for success.
- Logging system: This is the machine side of tracking. Teams use the data recorded in log files to identify problems.
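As a rough illustration of using log data to spot a recurring problem, the sketch below counts ERROR lines per component and reports the ones that cross a threshold. The log line layout, the component names, and the `recurring_errors` function are assumptions made for this example; real log formats vary and usually call for a dedicated log platform.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed line layout: "<timestamp> <LEVEL> <component>: <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<component>[\w.-]+):\s+(?P<message>.*)$"
)


def recurring_errors(lines: Iterable[str], threshold: int = 3) -> List[Tuple[str, int]]:
    """Return (component, count) pairs for components with >= threshold ERROR lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        # Lines that do not fit the assumed layout are ignored.
        if match and match.group("level") == "ERROR":
            counts[match.group("component")] += 1
    return [(name, n) for name, n in counts.most_common() if n >= threshold]


# Hypothetical sample data for illustration.
sample = [
    "2024-05-02T10:00:01Z ERROR dns-resolver: lookup timed out",
    "2024-05-02T10:00:05Z INFO api: request served",
    "2024-05-02T10:03:11Z ERROR dns-resolver: lookup timed out",
    "2024-05-02T10:04:00Z ERROR api: upstream unavailable",
    "2024-05-02T10:09:42Z ERROR dns-resolver: SERVFAIL from provider",
]
flagged = recurring_errors(sample)
```

Here only the `dns-resolver` component crosses the default threshold of three errors, which is the kind of signal that would prompt a problem ticket.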
Leveraging tools is one best practice to follow for this process — but not the only one.
What are the best practices for problem management?
All the best practices encourage transparency and communication:
- Version control: Using a system of version control allows code to be continuously improved. It makes changes visible and trackable, encouraging accountability within the team.
- Retrospective: Including a review of the process as a formal step after resolution just makes sense. Problem management teams investigate problems deeply before acting on them. Looking back on their work lets them evaluate their solution, too. This sets them up to recognize patterns and take proactive, preventative action in the future.
Get to the root of the problem with Transposit
Transposit ensures teams have accurate and comprehensive data from both humans and machines to thoroughly investigate incidents, execute root-cause analyses, and ultimately drive continuous improvement.
- Capture human and machine data: Automatic incident timelines record the full history of actions taken throughout the system, providing the audit trail teams need to analyze incidents and problems.
- No-stress post-incident reviews: Transposit’s customizable post-incident export ensures teams have the data from both humans and machines to thoroughly investigate incidents and execute root-cause analyses.
- Drive continuous improvement: Visualize and analyze your team's response to incidents and events with comprehensive analytics from MTTR to MTTA, and more.
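For readers unfamiliar with the metrics named above, here is a small worked example of how MTTA and MTTR can be computed from incident timestamps. It is a generic sketch and does not represent Transposit's API or data model: the `Incident` class and its fields are hypothetical, MTTA is taken as mean time to acknowledge, and MTTR is assumed to mean mean time to resolve, measured from when the incident was opened.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record with three timestamps."""
    opened: datetime
    acknowledged: datetime
    resolved: datetime


def _mean(deltas: List[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents: List[Incident]) -> timedelta:
    """Mean time from opening an incident to acknowledging it."""
    return _mean([i.acknowledged - i.opened for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time from opening an incident to resolving it (assumed definition)."""
    return _mean([i.resolved - i.opened for i in incidents])


# Illustrative data: two incidents on the same day.
incidents = [
    Incident(datetime(2024, 5, 2, 10, 0), datetime(2024, 5, 2, 10, 4), datetime(2024, 5, 2, 10, 40)),
    Incident(datetime(2024, 5, 2, 12, 0), datetime(2024, 5, 2, 12, 10), datetime(2024, 5, 2, 13, 20)),
]
```

For this sample, acknowledgement took 4 and 10 minutes, so MTTA is 7 minutes; resolution took 40 and 80 minutes, so MTTR is 60 minutes.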