🌀What Happened at CrowdStrike? A Global Tech Outage!
This post aims to break down the events in a technical way that everyone can understand. A programmer's perspective.
Hi, I’m Wajid Khan. I explain computer stuff in a simple and engaging way, so that even non-techies can easily understand it, delivered to your inbox biweekly. Join me on an under-the-hood tech journey.
An unprecedented IT failure occurred over the weekend when computers worldwide were rendered useless by a software update gone wrong, in what is quite possibly the largest IT outage in history.
It grounded planes, knocked banks offline, and brought machines in hospitals to a standstill.
🚨The goal of this post is to explain the events in a technical manner that can be understood by everyone.
Let's start with the basics. Every device in the world that runs software, like your phone, your laptop, or even a TV at the airport, is built on an operating system.
The operating system is like the brain of a computer. It makes everything happen, like Word on your laptop or flight times on TV.
Microsoft Windows is the most popular desktop operating system, no question. Windows was running on more than 70% of desktops last month, which probably shocked the Mac fans (15%). Commercial systems beyond ordinary phones and laptops, such as factory robots, hospital devices, and airline systems, also lean heavily on Windows.
If the operating system malfunctions, the device stops working entirely. That is exactly what happened in the CrowdStrike outage: the software caused severe problems inside Windows, leaving computers unable to start.
When it comes to outages, nothing is quite as unsettling as a non-responsive operating system that refuses to boot up.
The only way to fix it is to manually reboot the computer in a special mode that allows you to remove the offending code. Yet the people who use these devices are typically not tech-savvy and likely know little or nothing about operating systems. That is part of why the outage took so long to resolve: affected customers needed significant assistance and guidance from CrowdStrike.
What led to this outage? Who exactly is CrowdStrike?
If you grew up in the '90s like me, you probably had some form of Norton installed on your computer.
Norton is a brand known for its antivirus software. It resides on your computer and aims to safeguard it from viruses that hackers and malicious individuals try to spread online.
Antivirus software uses a few generic methods, including examining files that seem suspicious or unfamiliar. Norton has a team of researchers dedicated to discovering the latest vulnerabilities and malicious software deployed by hackers. They keep your computer up to date by adding new items to its search database, and if any of those items is detected, an alarm sounds. These updates are published to the software constantly, so your computer always knows what to look for.
CrowdStrike essentially does the same thing, just with better marketing in 2024. Instead of targeting consumers like you and me, it sells its software primarily to businesses. A total of 300 companies from the Fortune 500 use CrowdStrike's services, including well-known names like Intel and Ericsson. That is why the outage hit so many devices across so many industries.
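For the programmers reading, the "search database" idea can be sketched in a few lines of Python. This is a simplified illustration, not how Norton or CrowdStrike actually work: the hash-based matching, the names `KNOWN_BAD_HASHES`, `is_flagged`, and `apply_update`, and the sample byte strings are all assumptions made up for this example.

```python
import hashlib

# Hypothetical signature database: SHA-256 hashes of files known to be malicious.
KNOWN_BAD_HASHES = {
    hashlib.sha256(b"pretend this is malware").hexdigest(),
}

def is_flagged(file_bytes: bytes, signatures: set[str]) -> bool:
    """Return True if the file's hash matches a known-bad signature."""
    return hashlib.sha256(file_bytes).hexdigest() in signatures

def apply_update(signatures: set[str], new_hashes: set[str]) -> set[str]:
    """Simulate a vendor update that adds new signatures to search for."""
    return signatures | new_hashes
```

Before an update, a brand-new threat slips through because its hash is not in the database. After `apply_update` adds it, `is_flagged` starts raising the alarm. Real products use far richer detection than exact hashes, but the update loop has the same shape.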
Virus protection and the OS.
Deep access to a computer's operating system is necessary for antivirus software to function properly.
It keeps an eye on all the little things an operating system does, so it needs to be able to see them and shut them down if it doesn't like what it sees. That gives this kind of software access to the most sensitive and confidential information on a computer, more than any other software.
🚨Antivirus software can watch, and block, almost everything your operating system does.
Antivirus software having special access to operating systems has always been controversial. Back in 2008, a researcher wrote a whitepaper about how antivirus software could be vulnerable itself, and how things could go really wrong because of that.
On Friday, CrowdStrike rolled out a software update that included a new Channel File. Remember that antivirus companies have teams dedicated to researching the most recent vulnerabilities. Once they find a new one, they update their software to start searching for it. That's what a CrowdStrike Channel File does: it tells the software how to detect a specific kind of malicious activity. This one had a seemingly harmless name:
C-00000291.sys
Unfortunately, this file triggered a serious bug: the software tried to read data that did not exist. Developers often use the term "Null Pointer Exception" for a situation where a program tries to access nonexistent data, leading to unpredictable outcomes (CrowdStrike's later analysis described the fault more precisely as an out-of-bounds memory read, a close cousin of the same mistake). Software engineers often underestimate this type of bug because, in ordinary applications, it usually has little impact.
As I have progressed in my coding journey, I have encountered a multitude of instances where Null Pointer Exceptions have occurred in my code.
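Here is a minimal sketch of that kind of bug in Python. Python has no raw pointers, so `None` stands in for "data that isn't there"; this is an analogy, not CrowdStrike's actual code, and the names `find_rule`, `unsafe_action`, and `safe_action` are hypothetical.

```python
from typing import Optional

def find_rule(rules: dict[str, dict], name: str) -> Optional[dict]:
    """Look up a rule by name; returns None when the rule does not exist."""
    return rules.get(name)

def unsafe_action(rules: dict[str, dict], name: str) -> str:
    """Assume the rule exists. Raises TypeError when find_rule returns None."""
    rule = find_rule(rules, name)
    return rule["action"]

def safe_action(rules: dict[str, dict], name: str, default: str = "ignore") -> str:
    """Check for missing data before using it."""
    rule = find_rule(rules, name)
    if rule is None:
        return default
    return rule["action"]
```

In an everyday program, the failure in `unsafe_action` produces an error message and, at worst, one crashed app. The one-line check in `safe_action` is all it takes to avoid it, which is why bugs like this are so easy to shrug off.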
However, this was not an ordinary situation. Because CrowdStrike's software is deeply integrated into the operating system, its malfunction brought down the whole system. The faulty file delivered in the update was loaded during Windows startup, so affected machines could not boot. The result: the Blue Screen of Death (BSOD).
The dreaded blue screen in Windows can bring you to a full stop due to a serious system crash
Why Windows specifically?
Next, let's discuss why this issue affected only Microsoft Windows. CrowdStrike also offers products for macOS and Linux, the other two major desktop operating systems, yet those machines were not affected.
🚨In brief, this particular channel file was related to a vulnerability in Microsoft Windows rather than in macOS or Linux.
Windows is the dominant operating system in the enterprise and commercial world, leaving little room for macOS or Linux on most devices.
However, there is more to the story. When it comes to building Windows, Microsoft takes a different approach than Apple, favoring a more open system with less tightly integrated software and hardware.
Apple limits the level of control that software like CrowdStrike has over the operating system.
What caused the delay in resolving this issue?
Incidents like this are usually resolved quickly. However, even though CrowdStrike released a fix for the problematic file as soon as it found out about the issue, some systems remained offline for hours, and in some cases even days.
This is another consequence of the privileged relationship with the operating system that we discussed earlier. Standard software can receive updates over the internet to fix bugs, but a computer that cannot boot cannot download a fix. To restore the operating system on many devices running different versions of CrowdStrike, users had to manually restart in safe mode, a special startup mode intended for troubleshooting problems like this, and then manually locate and delete the C-00000291.sys file.
Most users of these devices do not have advanced computer skills. Walking people through the steps involved, such as using the command line, demands substantial assistance.
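To make the "locate and delete" step concrete, here is a rough Python sketch of what a cleanup script could look like. It is an illustration under assumptions, not CrowdStrike's official remediation: the function name `remove_faulty_channel_files` and the wildcard pattern are made up for this post, and the demo runs against a throwaway temporary folder rather than a real Windows driver directory.

```python
import tempfile
from pathlib import Path

def remove_faulty_channel_files(folder: Path, pattern: str = "C-00000291*.sys") -> list[str]:
    """Delete files in `folder` that match `pattern`; return the names removed."""
    removed = []
    for path in sorted(folder.glob(pattern)):
        path.unlink()  # delete the matching file
        removed.append(path.name)
    return removed

# Demo in a temporary folder (a stand-in, not a real driver directory).
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "C-00000291.sys").write_bytes(b"")  # the faulty file
    (folder / "C-00000290.sys").write_bytes(b"")  # an unrelated file to keep
    removed = remove_faulty_channel_files(folder)
    remaining = sorted(p.name for p in folder.iterdir())
```

The script itself is the easy part. The hard part was that each affected machine first had to be restarted by hand into safe mode before anything like this could run, one computer at a time.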