What Happens When DevOps Cat and Chaos Collide?

Do DevOps Cat and chaos mix?

Laurel Frazier · Oct 7, 2020

We’re learning so much about how to embrace chaos at Chaos Conf this week. But what about DevOps Cat? When we last saw him, he was getting ready to respond to a 3 AM alert that the web service was down by running his human’s runbook.

Comic: "This happened at 3 AM." Alert: "Web service is down. Threshold crossed: 10 of 10 data points." Image: The cat is woken up on the bed, then sits at the computer. Comic: "It's an all-paws-on-deck type of situation. Luckily, my human already had her runbook set up, so I can just run it?"

Runbooks are sets of troubleshooting steps and tips that are valuable when responding to an incident or carrying out other operational tasks. Let’s see what DevOps Cat’s runbook says...

Comic: "The runbook says I should run this script by copying this command onto the command line... :(" Again, the cat in the story saves the day. Comic: "ERROR: Cannot scale web service. Service name not found."

Oh no! The action the runbook suggested, copying a specific command, wasn’t as helpful as DevOps Cat expected. The command failed, and now he is paw-deep in an error message. Can chaos be the answer to a speedier response?

Comic: "When was the last time the runbook was tested?" (No one knows.) "Well, Gremlin + Transposit can help. If you run more incident management fire drills with Gremlin using your Transposit runbooks, and automate the relevant pieces, your runbooks will be up to date when you need them!" Comic: "Next time an alert goes off, it won't be so bad." Image: A runbook stamped with a "validated with Gremlin" date.

Of course! A runbook is only as useful as its content. Remember, runbooks should always be actionable, accessible, accurate, authoritative, and adaptable. In this case, the runbook was not accurate: its command referred to a service name that could not be found. If you validate and maintain runbooks consistently through chaos engineering's rapid feedback loops, the improvements will already be in place the next time you face an incident.

Comic: "So when an alert goes off at 3 AM again..." Image: The cat follows the validated runbook to scale the service. Comic: "It's hard being on call for my human. Be kind to those who need to be on call!" Image: The cat and Gremlin saved the day, again!

You know what they say: a fire drill a day keeps the alerts at bay! So be kind to yourself and your on-call team by staying prepared.

You can download the full comic here.

What would you like to see DevOps Cat tackle next? Let us know @transposit and stay tuned for DevOps Cat’s next adventure, brought to you by Yoko Li.
