Chaos Conf Recap: Failure Is Inevitable—Create a Culture of Resilience

Where there is chaos, there can be calm, if only we have the right culture, tools, and processes

Jessica Abelson · Oct 9, 2020

We had an awesome few days at Gremlin's Chaos Conf and were proud to serve as the Diversity and Inclusion sponsor. At the largest chaos engineering conference, we heard from the top dogs (and cats) in the industry, and our very own DevOps Cat was there to add some commentary on these su-purr-b talks.

A central theme throughout the conference was that failure is inevitable: incidents are simply a fact of life. This is especially true now, as the pandemic has put more stress on our systems through a sheer increase in volume, pushing them in new ways on both the technical and the human side. Instead of running from failure, organizations should normalize it and build a culture of resilience to the unexpected while using tools and processes that put humans and machines in a symbiotic relationship.

Break Things to Learn Things

Kolton Andrus, CEO and Founder of Gremlin, started things off with a bang, sharing first-hand accounts of how Gremlin has helped teams be resilient in the face of failure. DevOps Cat knows all too well that failure is a part of life, so he gave a little splash to prove his point.

Comic says: Someone has to break the system first so the team can learn how to recover it. Image: Cat tipping over coffee with another looking on.

Luckily for engineers, Gremlin can be that "someone" who breaks things, so they don't have to create the mess themselves.

The Human Controller and Why Communication Matters

As good as our machines are, they always need a human controller. And humans need effective communication to collaborate in stressful situations. Adrian Cockcroft, VP of Cloud Architecture Strategy at AWS, has seen how confusion arises when people see different information at different times and rely on outdated runbooks. Confused human controllers disagree about what to do, if anything.

In his years in the field, his perception of failure has changed. People used to say, "You can only be as strong as your weakest link," but he believes this should be changed to, "But the last strand that breaks is not the cause of failure!" Incidents start long before an alert and even before the deploy that set the incident in motion. The cause goes back to the human and machine processes that left the system vulnerable to begin with. And when incidents do inevitably hit, he said, alerts need to be reduced to actionable insights.

Jim Severino, Security TPM at Atlassian, gave us a great view into how a large organization with thousands of engineers around the globe runs incident management. He knows that incidents are inevitable, but it's how you choose to respond that makes all the difference. DevOps Cat agreed: 100% uptime is purr fantasy.

Comic: Incidents are inevitable. Image: Cat walked over wet painting.

And when incidents do inevitably happen, he said, the first and most critical step is to broadcast the incident. Using automation to communicate internally and externally has a huge return on investment by reducing confusion at the time of the incident and more quickly getting things back to green.
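To make the "broadcast first" idea concrete, here is a minimal sketch of what automated incident communication could look like. It is not Atlassian's tooling or any real product's API: the Incident fields, the severity label, and the channel names are all hypothetical, and plain Python lists stand in for a chat room and a status page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Incident:
    # All fields are illustrative assumptions, not a real incident schema.
    title: str
    severity: str         # e.g. "SEV1"; the labels are hypothetical
    summary: str          # safe to share with customers
    internal_detail: str  # for responders only


def internal_message(incident: Incident) -> str:
    """Full context for the people working the incident."""
    return (f"[{incident.severity}] {incident.title}: "
            f"{incident.summary} ({incident.internal_detail})")


def external_message(incident: Incident) -> str:
    """Customer-facing wording that leaves out internal detail."""
    return f"We are investigating an issue: {incident.summary}"


def broadcast(incident: Incident,
              internal_channels: Dict[str, Callable[[str], None]],
              external_channels: Dict[str, Callable[[str], None]]) -> List[str]:
    """Send the right message to every channel; return the channels notified."""
    notified = []
    for name, send in internal_channels.items():
        send(internal_message(incident))
        notified.append(name)
    for name, send in external_channels.items():
        send(external_message(incident))
        notified.append(name)
    return notified


# Lists stand in for real channels (chat room, status page).
chat_room: List[str] = []
status_page: List[str] = []

incident = Incident(
    title="Checkout errors",
    severity="SEV1",
    summary="Some customers cannot complete checkout.",
    internal_detail="error rate spiked after the last deploy",
)
notified = broadcast(incident,
                     {"chat_room": chat_room.append},
                     {"status_page": status_page.append})
print(notified)
print(status_page[0])
```

The point of the sketch is the split between audiences: one call announces the incident everywhere at once, while responders and customers each see wording meant for them.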

Adopting Continuous Learning and Eliminating Institutional Knowledge

One of the main facets of chaos engineering is to learn, but too often those learnings become institutional knowledge, which is held by only a few and lost when they leave the org.

Liz Fong-Jones, Principal Developer Advocate for SRE & Observability at Honeycomb, emphasized the need to "hypothesize, test, and learn" through Chaos Engineering. DevOps Cat was smitten with this idea, knowing the great pawsibilities that come when observability, chaos engineering, and smart runbooks work together to create a culture of resilience.
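For readers who like to see the loop spelled out, here is a toy sketch of "hypothesize, test, and learn." It does not use Gremlin's or Honeycomb's APIs. The fake service, the 300 ms injected delay, and the 200 ms threshold are all made-up assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ExperimentResult:
    hypothesis: str
    observed: float
    held: bool  # True if the system behaved as hypothesized


def run_experiment(hypothesis: str,
                   inject_fault: Callable[[], None],
                   measure: Callable[[], float],
                   rollback: Callable[[], None],
                   threshold: float) -> ExperimentResult:
    """Inject a fault, observe the system, and always roll the fault back."""
    inject_fault()
    try:
        observed = measure()
    finally:
        rollback()  # undo the fault even if measuring fails
    return ExperimentResult(hypothesis, observed, observed <= threshold)


# A stand-in "service": just a dict holding a latency figure (hypothetical).
service: Dict[str, float] = {"latency_ms": 120.0}


def add_delay() -> None:
    service["latency_ms"] += 300.0


def remove_delay() -> None:
    service["latency_ms"] -= 300.0


result = run_experiment(
    hypothesis="Latency stays under 200 ms when a dependency slows down",
    inject_fault=add_delay,
    measure=lambda: service["latency_ms"],
    rollback=remove_delay,
    threshold=200.0,
)

# Learn: record the outcome where the whole team can read it.
verdict = "held" if result.held else "did not hold"
print(f"{result.hypothesis}: {verdict} (observed {result.observed} ms)")
```

In this toy run the hypothesis does not hold, which is the useful outcome: the team learns about a weakness in a controlled experiment instead of during a real incident.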

Comic: Hypothesize, test, and learn. Image: Observability with bees (Honeycomb) x Chaos Engineering with Gremlin x Smart Runbooks with Iggy (Transposit)

But while learning is critical, Tyler Wells, Sr. Director of Engineering - SRE/Platforms at Twilio, warned of the dangers of "tribal knowledge": "If we have to look at our very senior member or someone who is the architect of the piece to answer a number of questions about the system, that tells me that we have a lot of tribal knowledge. That's something we want to eliminate."

The solution? Doug Campbell, Sr. SRE at Grubhub, explained how they've implemented processes to "evangelize and educate." For their org, documenting use cases of chaos experiments in internal technical documentation means everyone is on the same page, without having to engage an SRE team.

In Conclusion

Many more talks hit home what we may hate to hear but need to hear: failure is inevitable. As with anything in life, facing the facts will get us closer to our end goals: stronger services, happier customers, and resilient cultures.

What's our main takeaway from Chaos Conf? Where there is chaos, there can be calm, if only we have the right culture, tools, and processes.

Come Talk With Us at DevOps Enterprise Summit

We'll be at DevOps Enterprise Summit October 13-15. Come visit our booth to learn more about how chaos engineering and incident management automation work together to make stronger products.
