A guide (for non-technical folks) to the CrowdStrike outage

What happened?

On July 19, 2024, 8.5 million devices were affected by a massive IT outage caused by a software update from cybersecurity company CrowdStrike. Planes were delayed, hospitals couldn’t carry out critical procedures, and emergency services were suspended, among other disruptions to foundational societal services.

But wait, wasn’t it a Microsoft outage?

While some have described this as a Microsoft outage, mostly because of the blue screen of death and because it only affected Microsoft Windows machines, it was a content update from CrowdStrike that caused the issue. However, it may lead Microsoft to make fundamental changes to what level of access software can have, and when, to prevent the issue from happening again.

How did this happen?

This incident is more complex than it may seem (or than how it is often talked about).

Most software applications operate in user mode, which is a more restrictive mode in the operating system that only gives applications access to specific areas of the computer’s memory and instructions. This is really beneficial because it limits what the application can do, or what damage it can cause.

Some applications (far fewer) also run code in kernel mode. Kernel mode gives access to low-level system information unavailable in user mode. It also gives direct access to the hardware (like graphics cards, audio, Bluetooth, etc.). Unfortunately, this access is a tradeoff; while more access is better for specific applications, it also poses a risk. When an application in user mode crashes, only that application crashes. When an application in kernel mode crashes, it crashes the entire system - total reboot. Best case scenario, you may still be able to boot up the operating system…worst case, you get stuck in what’s called a boot loop and have to take manual, in-person action to fix it.

Endpoint security software operates in both user mode and kernel mode. Ideally, as much of the application as possible is delivered in user mode, and only mission-critical capabilities operate in kernel mode, for safety reasons. Endpoint security tools use kernel mode because of the access it gives to low-level system information. All kernel drivers that operate on Windows machines go through some amount of testing and certification, endpoint security drivers more than most.

Critically, endpoint security vendors also design their drivers to load before the operating system. There’s a window prior to operating system (AKA Windows, macOS, etc.) boot where kernel drivers and other activities can start first. This is to ensure that threat actors cannot boot before the security software and disable it. However, it also means that when an issue occurs, it puts the system in a boot loop, pretty much ensuring you can’t access the operating system without taking in-person, manual action.

What is a (simplified) explanation of what technically happened?

A little background: CrowdStrike releases configuration files that are treated as inputs to the kernel driver. The configuration files are how the application knows what potential attacker activity looks like and how to act on it. Configuration files are different from kernel drivers and don’t go through the same level of testing by Microsoft as kernel drivers do.

Keep in mind, this is an oversimplification of very complex software, but here is the gist. CrowdStrike deployed a configuration file to all Windows machines that, when read by the kernel driver (the sensor), caused an error. The sensor expected 21 input parameters in the configuration file, but this particular file turned out to contain only 20. It tried to access a parameter that did not exist, which caused the crash.
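To make that concrete, here is a toy sketch in C of the kind of mistake described above - the code loops over the number of inputs it expects rather than the number the configuration actually provides. This is not CrowdStrike’s actual code; the names and values are stand-ins for illustration only.

    #include <stdio.h>

    #define EXPECTED_INPUTS 21   /* what the sensor assumes the config provides */
    #define ACTUAL_INPUTS   20   /* what this particular config file supplied */

    int main(void) {
        /* Stand-in for the values parsed out of a configuration file. */
        const char *inputs[ACTUAL_INPUTS];
        for (int i = 0; i < ACTUAL_INPUTS; i++) {
            inputs[i] = "some detection input";
        }

        /* The faulty pattern: iterate over the count the code EXPECTS,
           not the count the file actually CONTAINS. */
        for (int i = 0; i < EXPECTED_INPUTS; i++) {
            /* When i == 20, this reads past the end of the array - an
               out-of-bounds access. In a user-mode program that would
               likely crash just this program; inside a kernel driver it
               takes down the whole machine (the blue screen of death). */
            printf("input %d: %s\n", i, inputs[i]);
        }
        return 0;
    }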

There were other things that went wrong that enabled this issue: incomplete testing of configuration files, and a lack of input count validation and associated error handling in the kernel driver. There must also be coordination between the multiple teams involved in deploying these updates: the detection engineers, the developers building the kernel driver and its components (the Template Types, Content Validator, and Content Interpreter), as well as product security and QA.

Based on the root cause analysis, it seems the move to configuration files with 21 parameters instead of 20 happened within the past year, which would make this a recent architectural change.

It’s worth noting that CrowdStrike does significant testing on its software updates, especially anything having to do with the sensor. A software update like you’d see with a kernel driver is used to update the underlying application (think patches, upgrades, etc.), while a configuration update (as in this case) is used to update the detection logic. The distinction may seem slight, but it makes a big difference in this instance. CrowdStrike issues regular software updates, but has flexible versioning and rollout options, which help enterprises with batch deployments. Configuration (channel) updates, however, are not subject to the same versioning as software updates. This is for several reasons: channel updates happen much more frequently, potentially multiple times a day depending on new malware, and the updates typically have a lower impact on the system.

How do we avoid this happening again?

CrowdStrike needs to update its testing and QA procedures to address the gaps outlined above. That includes:

  • More complete testing of configuration files

  • Validating the input count and implementing error handling (see the sketch after this list)

  • Aligning multiple teams working on various parts of the application architecture to ensure proper design decisions and testing are done.

  • Canary deployments, which basically means rolling out configuration updates in batches instead of to every customer at once.
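To illustrate the second bullet, here is what validating the input count with basic error handling could look like, in the same toy C style as before (again, hypothetical names and structure, not CrowdStrike’s actual code): a configuration that doesn’t supply every expected input is rejected and logged instead of being read past its end.

    #include <stdbool.h>
    #include <stdio.h>

    #define EXPECTED_INPUTS 21

    /* Use the configuration only if it supplies every input the sensor
       expects; otherwise log the problem and refuse the file rather than
       reading past the end of the array. */
    static bool load_config(const char **inputs, int actual_count) {
        if (actual_count < EXPECTED_INPUTS) {
            fprintf(stderr, "config rejected: expected %d inputs, got %d\n",
                    EXPECTED_INPUTS, actual_count);
            return false;   /* error handled; no out-of-bounds read, no crash */
        }
        for (int i = 0; i < EXPECTED_INPUTS; i++) {
            printf("input %d: %s\n", i, inputs[i]);
        }
        return true;
    }

    int main(void) {
        /* A configuration with too few inputs, like the July 19 file. */
        const char *too_few[20];
        for (int i = 0; i < 20; i++) {
            too_few[i] = "some detection input";
        }
        load_config(too_few, 20);   /* rejected instead of crashing */
        return 0;
    }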

They may also need to rearchitect the sensor in the future to prevent these issues.

For users, this gets a little difficult. Some have suggested versioning or, even worse, local testing by the user for configuration updates - I don’t see how this is sustainable in the long term. It’s a nice idea, but the point of the tool is to protect the organization in real time. If you implement versioning, some systems will receive protection later than others, which could lead to a breach. If you implement local testing by the user, the operational cost of managing the infrastructure to test each content update before you apply it will be extreme.

The best thing users can do is understand what happened and ask their endpoint security vendor, regardless of which one they use, what parts of the application take action in the kernel, what the risk is, and how they are mitigating that risk.

Allie Mellen

I am a computer engineer by training who has spent the past decade in engineering, research, and technical consulting roles at multiple venture-backed startups, as well as research roles at MIT and BU. I ran my own successful engineering and development consultancy for a number of years, where I also worked with multiple nonprofits to teach engineering and entrepreneurship to students and minorities. I got started in security as a hacker researching vulnerabilities in IoT devices, which culminated in a talk at Black Hat USA. Now, I am an analyst on the security and risk team at Forrester, where I am a frequent speaker at security conferences globally teaching about security and pushing the boundaries of the industry.

https://hackerxbella.xyz