Preparing for Your Super Bowl: A Guide for DevOps Teams to Get Ahead of Failure

Develop a game plan that enables operational maturity during the moments that matter

Laurel Frazier
Feb 10th, 2022

96.4 million viewers tuned in to Super Bowl LV on February 7, 2021. This past summer, US viewers streamed “a record 5.5 billion minutes of events across social media and online platforms such as NBCOlympics.com, the NBC Sports app, and the streaming service Peacock”, making the Tokyo 2020 Games (held in 2021) the most-streamed Olympics ever.

When large-scale sporting events like these occur, streaming platforms, broadcast networks, and social media light up with activity. Just like the athletes preparing for competition, Operations teams must prepare for a surge in activity that can last anywhere from a few hours on a Sunday to several weeks as Olympic competition unfolds. While these key sporting moments may not impact your organization, every industry experiences its own inflection points throughout the year. Many of these moments are predictable, expected, and happen year after year.

I spoke with Steve Stevens, Transposit’s Head of Customer Success & Support, about how he tackled these moments as a resilience engineering advocate & problem manager at Hulu, and throughout his 15+ years in IT Operations at Verizon.

Steve shared that early on, he discovered the importance of asking the right questions to jumpstart reflection and identify the true impact of any possible incidents. Instead of asking “what are your concerns with your service for XYZ event?” which would often garner replies like “no concerns here,” he began to ask “if this failed, what impact would it have on our customers or our brand reputation? If we didn’t do this change, would this impact our customers or our business?” These questions allowed for a more holistic assessment of the potential risks or failures and drove more meaningful conversations between development and operations teams.

Steve’s biggest piece of advice:

Failure must be seen as both a possibility and an opportunity.

If we embrace failure as one of our potential outcomes, we understand the impact that failure has on our business operations and our customer experience, and we can take deliberate steps to enhance service reliability and resiliency.

Together, we compiled some tips for TechOps teams on how to prepare for any big day(s) ahead.

Exercise your operational muscle

Use historical outages and failures as ways to perform wargame exercises.

  • Wargaming must be based on realistic cases. Start from historical outages and failures, then vary some of the details so the team practices more than one version of each scenario.
  • Wargaming should be approached in a similar manner to the fire, earthquake, or other drills we’ve done in the workplace or at school. These exercises should take place in a safe space that allows for failure so that we can learn and share best practices without judgment. Then, when an actual incident occurs, people will be comfortable and confident enough to take necessary and deliberate action without panic, because they are prepared.

Use wargaming and chaos engineering exercises as a checklist to ensure you are setting yourself up for success:

  • Do you have the right tools to detect failure, take corrective action, or engage teams quickly to mitigate failures?
  • How do your tools provide solutions to known problems or vulnerabilities?
  • Do you have alerts in place that notify you of failures before customers notice them?
  • Are you monitoring for the right conditions at the right time?
  • Are you embracing human-in-the-loop automation?
  • Do you have the skills necessary to resolve failure? Keep in mind that external vendors, upstream dependencies, etc. should be taken into account.
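One way to picture the human-in-the-loop question above is a minimal Python sketch, shown here under stated assumptions. Automation prepares a remediation, but a person decides whether it runs. The `Remediation` class, the `handle_alert` function, the alert text, and the `restart_cache` action are all hypothetical names invented for illustration and are not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Remediation:
    """A proposed fix that automation has prepared but not yet applied."""
    alert: str
    action: str
    run: Callable[[], str]


def handle_alert(remediation: Remediation,
                 approve: Callable[[Remediation], bool]) -> str:
    """Run the prepared action only if a human approves it."""
    if approve(remediation):
        return remediation.run()
    # Declined proposals are handed back to people rather than retried.
    return f"escalated: {remediation.alert} needs manual handling"


def restart_cache() -> str:
    # Hypothetical placeholder for a real action, such as an API call.
    return "cache restarted"


proposal = Remediation(
    alert="cache hit rate below 50%",
    action="restart cache tier",
    run=restart_cache,
)

# In practice `approve` might prompt the on-call engineer in chat.
# The decisions are hard-coded here so the sketch runs on its own.
approved_result = handle_alert(proposal, approve=lambda r: True)
declined_result = handle_alert(proposal, approve=lambda r: False)
print(approved_result)
print(declined_result)
```

Passing the approval step in as a function keeps the decision with a person while still letting the routine work be prepared ahead of time.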

Review game tape: Take inventory of your processes and current conditions

Review your processes to ensure they are mature and effective, and assess the status quo. Here are some questions to get you started:

  • Is your documentation (e.g., runbooks) up-to-date?
  • Have you validated internal service owner contact information? What about external vendor contact data?
  • Have you reviewed your major incident processes, and are they still relevant?
  • Did previous incidents yield lessons that should drive process improvements, and have those improvements been implemented?
  • What are potential areas of vulnerability you face today?
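The documentation question above lends itself to a simple automated check. The following Python sketch assumes you can export a "last reviewed" date for each runbook. The runbook names, the dates, and the 180-day review interval are made-up values for illustration, not recommendations.

```python
from datetime import date, timedelta

# Assumed review interval; pick whatever matches your own policy.
MAX_AGE = timedelta(days=180)

# Hypothetical inventory: runbook name -> date it was last reviewed.
runbooks = {
    "video-player-outage": date(2021, 3, 1),
    "login-failures": date(2022, 1, 15),
    "cdn-failover": date(2021, 11, 20),
}


def stale_runbooks(last_reviewed, today, max_age=MAX_AGE):
    """Return the names of runbooks not reviewed within max_age, sorted."""
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if today - reviewed > max_age
    )


overdue = stale_runbooks(runbooks, today=date(2022, 2, 10))
print(overdue)
```

The same pattern could be applied to service owner and vendor contact records, flagging any entry that has not been confirmed recently.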

Additionally, consider implementing change pauses (sometimes called change freezes) to avoid introducing new points of failure close to the main event.
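A change pause can be enforced with something as small as a date check in a deployment pipeline. The sketch below is one possible approach, not a prescribed one. The function name `in_change_pause`, the three-day lead time, the twelve-hour cooldown, and the event times are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def in_change_pause(now, event_start, event_end,
                    lead=timedelta(days=3),
                    cooldown=timedelta(hours=12)):
    """Return True if `now` falls inside the pause window.

    The window opens `lead` before the event starts and closes
    `cooldown` after it ends. Both defaults are assumed values.
    """
    return event_start - lead <= now <= event_end + cooldown


# Hypothetical event times, used only to exercise the check.
event_start = datetime(2022, 2, 13, 18, 30)
event_end = datetime(2022, 2, 13, 22, 30)

# The day before the event: changes should be held back.
blocked = in_change_pause(datetime(2022, 2, 12, 9, 0), event_start, event_end)

# Nearly two weeks before the event: changes can proceed.
allowed = not in_change_pause(datetime(2022, 2, 1, 9, 0), event_start, event_end)

print(blocked, allowed)
```

A pipeline step could call a check like this and require explicit sign-off for any change attempted while the window is open.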

Know your team and rally behind your shared goals

  • Cultivate a collaborative culture where it is safe to fail. Be honest in sharing knowledge and best practices, but also concerns about potential vulnerabilities, so that the team knows where failures are most likely and is not caught by surprise.
  • Take retrospectives seriously: Ask meaningful questions on recovery and how to identify incidents closer to the point of failure. Ensure that feedback during retrospectives turns into meaningful action rather than empty promises.
  • Keep the customer’s needs and experiences at the forefront, and use these to develop contingency plans.
  • Don’t freak out! Trust in your plans, processes, tools, people, and preparation.

Your winning strategy

Every Super Bowl champion or Olympian will tell you that talent and access to top-of-the-line equipment & training facilities will only take an athlete so far. Success at the highest level requires the right mindset, alongside an incredible amount of preparation. All-star TechOps teams must approach their operations with the same attitude — putting together the right mix of tools, processes, people, and perspectives to ensure they are prepared for the moments that matter.

To learn more about how Transposit’s connected workflow can deliver the visibility, context, and actionability your team needs to hone its own winning strategy, connect with us.
