Do you remember the global fallout of the Y2K computing disaster of the year 2000? Neither do we. Here's why.
The countdown hit midnight on January 1st, 2000, and... things were pretty uneventful. At least, that was the reality for many people working in IT at the time of Y2K, also known as the Year 2000 Problem or the Millennium bug. Y2K is often a punchline now, but the lead-up to it was months of collective IT operations work across every industry that was running computer systems at the time. For many, this work began two years earlier and spanned whole companies. It was anything but a joke.
For me, Y2K was a time for school photos and celebrations. I didn't fully comprehend what was going on within IT at the time. My Microsoft Access 97 databases that I was running on my home PC for fun weren't exactly mission-critical.
It's been 20 years since Y2K, and during January, I will take a look at how far we've come since 1999 and whether the rise of new software patterns and industry trends has helped or hurt how we operate our systems.
Today's computer systems now dwarf what we had in 1999. We've introduced more complexity in all sorts of ways: microservices, cloud infrastructure, integrations, mobile applications, IoT, and more. As Colin Horgan said in their piece, We're Finally Learning the Lesson of Y2K — and It's Too Late, "in the last two decades, we have continued unabated in building ever more complex computer systems." In 1999, we were not as reliant on computers for every facet of our lives as we are today. "Y2K should have made us question our faith in the machines," said Colin. Instead, we went full steam ahead.
Meanwhile, many industries have shifted away from the traditional mainframe to more modern systems, but these systems are actually more complex to operate and require specific skills and knowledge. For example, have you ever looked at the Amazon Web Services product page? Whole certifications now exist just to understand one cloud infrastructure provider, and this isn't unique to AWS; the same is true of Google Cloud and Microsoft Azure.
To make things even more complex on top of new infrastructure, we've added new software architecture patterns, like microservices, and experienced a massive growth in APIs. It's nothing for some large companies now to be running hundreds of different microservices and integrating with SaaS-based services through APIs up and down the stack. Talk about the potential for cascading failures, right?
Lastly, the growing number of mobile and IoT devices has added another layer of complexity. In 1999, personal computing had already gained steam. The software was no longer only in the basement or data center for you to update. But as the popularity of smartphones and IoT devices has grown over the last 20 years, the number of devices out there has skyrocketed. With more devices and less control over them, certain things get more complicated, like downloading and installing important patches to keep the devices secure.
In 1999, it took thousands of hours of spreadsheet work to keep track of the status of different machines in a fleet. What updates had been installed? Was a machine waiting on specific patches? If Y2K happened today with the increased complexity of our systems, "from an operation standpoint... it definitely would be a lot easier today to just understand what's going on with your fleet," says Matt Stratton, DevOps Advocate at PagerDuty, who was working as a sysadmin at Heller Financial during Y2K. Back then, there was no such thing as a patch management tool, and automation was still mostly a thing of the future.
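To make those fleet questions concrete, here's a minimal sketch of the kind of check that used to live in spreadsheets. Everything in it is an assumption for illustration: the `Machine` record, the hostnames, and the patch IDs are made up, and real patch management tools do far more than this.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """One machine in the fleet (hypothetical record, not a real tool's schema)."""
    hostname: str
    installed_patches: set[str] = field(default_factory=set)


def missing_patches(machine: Machine, required: set[str]) -> set[str]:
    """Return the required patches this machine has not installed yet."""
    return required - machine.installed_patches


def fleet_report(fleet: list[Machine], required: set[str]) -> dict[str, list[str]]:
    """Map each non-compliant hostname to its sorted list of missing patches."""
    report = {}
    for machine in fleet:
        missing = missing_patches(machine, required)
        if missing:
            report[machine.hostname] = sorted(missing)
    return report


# Illustrative data only; hostnames and patch IDs are made up.
REQUIRED = {"y2k-os-fix", "y2k-db-fix"}
FLEET = [
    Machine("mail01", {"y2k-os-fix", "y2k-db-fix"}),
    Machine("files02", {"y2k-os-fix"}),
    Machine("desk117"),
]

if __name__ == "__main__":
    for host, patches in fleet_report(FLEET, REQUIRED).items():
        print(f"{host}: waiting on {', '.join(patches)}")
```

Fully patched machines drop out of the report, so what's left is the to-do list: which hosts still need attention, and for what.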
One of Matt's favorite automation stories, though, is from Y2K. At the last minute, he got the idea to write a script that would automate the shutdown process for the whole organization. He saved it onto a floppy disk and removed the script from his computer. He remembers "being very paranoid about that disk because [he] was like, this is like the big red button of this company." When 8 pm rolled around on New Year's Eve, it was time for his little script to shine. He ran it, and by 8:15 pm, everything was shut down, and they could go home.
That was not how they had expected the night to go. Matt and most of his coworkers thought they would be busy all night. Thanks to his script, they weren't. Looking back on it, Matt thinks about how "it sounds really funny 20 years later to be like, duh, why would you not do that? But it was revolutionary for the team to consider… we could write a script to do that." At the time, automation just wasn't a part of how IT thought about operations tooling.
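For a sense of how small a "big red button" can be, here's a rough sketch of a fleet shutdown loop. To be clear, this is not Matt's script, which isn't shown anywhere in this post. The function names and hostnames are hypothetical, and the default action is a dry run that only prints what it would do; a real version would swap in whatever remote shutdown mechanism your environment actually uses.

```python
from typing import Callable, Iterable


def dry_run_shutdown(host: str) -> bool:
    """Stand-in for a real shutdown call; only prints what it would do."""
    print(f"[dry run] would shut down {host}")
    return True


def shut_down_fleet(
    hosts: Iterable[str],
    shutdown: Callable[[str], bool] = dry_run_shutdown,
) -> list[str]:
    """Try to shut down every host and return the ones that failed."""
    failed = []
    for host in hosts:
        try:
            ok = shutdown(host)
        except Exception as exc:  # keep going so one bad host doesn't stop the run
            print(f"{host}: {exc}")
            ok = False
        if not ok:
            failed.append(host)
    return failed


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    leftovers = shut_down_fleet(["mail01", "files02", "desk117"])
    print(f"hosts needing manual attention: {leftovers}")
```

The loop keeps going when a single host fails and hands back the stragglers, so the humans only have to deal with the exceptions instead of walking the whole fleet by hand.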
Andy Piper, who was a developer on the then-middleware infrastructure team at the UK Post Office, recalls that there were "fewer moving parts, and they were under better control / commercial contracts." Teams had fewer tools at their disposal, but the problems they were trying to solve didn't involve microservices, mobile applications, or IoT devices.
Matt McLarty, Global Leader of API Strategy at Mulesoft, who was working as a developer at a bank in Toronto, Canada during Y2K, isn't sure how much tooling would help with runtime failures, because "fundamentally it's an old architectural problem." For him, "the big change would be just how you would do the resilience engineering" and how big cascading failures are handled by everyone from on-call teams to system architects.
Going into 2020, we have a whole suite of monitoring and observability tooling at our disposal to practice DevOps and operate our systems, but "the problem about modern observability practices [is that] they're still all based around when a thing happens that we didn't know to plan for, which is great, but how do you plan for [it]?" asks Matt Stratton. The next Y2K-like bug will likely not be related to dates and times, because we are more aware of that class of problem after Y2K. It will be something the industry as a whole isn't thinking about. It will be an unknown unknown. It will be something "we couldn't predict based on an assumption," says Matt Stratton.
So, while we've come a long way from the IT operations tooling of 1999, we're still in the same situation when it comes to unknown unknowns, even with the significant changes brought by the cloud, DevOps, and modern agile practices.
Throughout January, I'll continue to look at how far we've come since Y2K and what has stayed the same. In the second post, we'll explore how we've repeated Y2K history in the last 20 years and how DevOps can help us improve, and in the third post, we'll look at how ideas like chaos engineering and resilience could be used to manage change in our complex systems to avoid a future Y2K.