CrowdStrike and the Formidable Blue Screen of Death (BSOD)
2024-07-22 14:01:05 Author: mp.weixin.qq.com

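Software updates occasionally cause problems, but large-scale incidents like the CrowdStrike event are rare. We ("we" here being Microsoft) currently estimate that CrowdStrike's update affected 8.5 million Windows devices, or less than one percent of all Windows machines. Although the affected share is small, the economic and social impact was still very broad, because CrowdStrike is used by many enterprises that run critical services.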

Excerpt source: https://blogs.microsoft.com/blog/2024/07/20/helping-our-customers-through-the-crowdstrike-outage/

After reading the original article, one pattern stands out:

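When faced with an obvious failure, the instinctive reaction is to propose a new mechanism that fixes the problem at hand. But a more complex mechanism introduced to address a potential problem brings its own security problems at that new layer of complexity, which is why Windows must have weighed security in its design. Simplicity improves robustness, while complexity can add security, and wanting both at once is unrealistic. One has to choose the lesser of two evils, so finding a balance point between benefit and risk really does matter.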

The full text of the original article follows:

Millions of machines around the world crashed a few days ago with the dreaded “Blue Screen of Death” (BSOD). Banks, airports, hospitals, and many other businesses were affected, all of them running the Windows OS and CrowdStrike’s Endpoint Detection and Response (EDR) software. What was going on? How can such a calamity happen in one fell swoop?

At first, foul play was suspected – a cyber security attack perhaps. But it turned out that a bad update of CrowdStrike’s “Falcon” software agent caused all this mess. What is a BSOD anyway?

Code running on Windows can run in two primary modes – user mode and kernel mode. User mode is restricted in its access, so code running there cannot harm the OS. This is the mode applications run in – Explorer, Word, Notepad, and any other application. Kernel mode, however, has (almost) unlimited power. But, as the American superhero movies like to say, “With great power comes great responsibility” – and this is where things can go wrong.

Kernel code is trusted, by definition, because it can do anything. The Windows kernel is the main part of kernel space, but third-party components, called device drivers, may be installed in the kernel as well. Classic device drivers are required to manage hardware devices – connecting them to the rest of the OS. The fact that you can move your cursor with the mouse, see something on the screen, hear game sounds, etc., means there are device drivers dealing with the hardware correctly, some of which are not written by Microsoft.

If a driver misbehaves, for example by causing an exception, the system crashes with the infamous BSOD. This is not some kind of punishment, but a protection mechanism. If a driver (or any other kernel component) misbehaves, it is best to stop immediately rather than let the code continue executing, because continuing might cause more damage, such as data corruption, and might even prevent Windows from starting successfully.
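
The fail-stop idea can be sketched with a toy model. This is not how Windows is implemented; `ToySystem`, `BugCheck`, and `faulty` are hypothetical names made up for illustration, and Python exceptions stand in for hardware faults. The only point is the asymmetry: a fault in "user mode" removes one process, while an unhandled fault in "kernel mode" halts everything.

```python
class BugCheck(Exception):
    """Toy stand-in for a Windows bug check (the event behind a BSOD)."""


class ToySystem:
    """A deliberately tiny model: one 'kernel' plus a set of user processes."""

    def __init__(self):
        self.halted = False
        self.processes = set()

    def run_user(self, name, code):
        """Run code in 'user mode': a fault only terminates that process."""
        if self.halted:
            raise RuntimeError("system is halted")
        self.processes.add(name)
        try:
            code()
        except Exception:
            self.processes.discard(name)  # only the faulting process dies
            return False
        return True

    def run_kernel(self, name, code):
        """Run code in 'kernel mode': an unhandled fault halts everything."""
        if self.halted:
            raise RuntimeError("system is halted")
        try:
            code()
        except Exception as exc:
            self.halted = True        # fail-stop: nothing else gets to run
            self.processes.clear()
            raise BugCheck(f"{name}: {exc!r}") from exc


def faulty():
    """Hypothetical buggy code path."""
    raise MemoryError("invalid memory access")
```

Calling `run_user("buggy_app", faulty)` returns `False` and leaves other processes alone, whereas `run_kernel("bad_driver", faulty)` raises `BugCheck` and leaves the model halted.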

Third-party drivers are the only entities over which Microsoft has no full control. Drivers must be properly signed, but signing only guarantees that they have not been tampered with; it does not guarantee quality.
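
A small sketch of why a valid signature says nothing about quality. Real driver signing uses public-key certificates, not the scheme below; as an assumption for illustration, an HMAC from Python's standard library stands in for the signature, and `SIGNING_KEY`, `buggy_driver`, and `load_and_run` are hypothetical names.

```python
import hashlib
import hmac

# Assumption: a shared secret stands in for a real code-signing key.
SIGNING_KEY = b"hypothetical-vendor-key"


def sign(image: bytes) -> bytes:
    """Return a tag that binds the key to these exact bytes."""
    return hmac.new(SIGNING_KEY, image, hashlib.sha256).digest()


def verify(image: bytes, signature: bytes) -> bool:
    """True only if the image is byte-for-byte what was signed."""
    return hmac.compare_digest(sign(image), signature)


# A 'driver' whose logic contains an obvious bug (division by zero).
buggy_driver = b"def driver_entry():\n    return 1 // 0\n"
signature = sign(buggy_driver)

# Changing even one byte invalidates the signature.
tampered = buggy_driver.replace(b"1 // 0", b"2 // 0")


def load_and_run(image: bytes, signature: bytes):
    """Refuse tampered images, then run whatever the signed image contains."""
    if not verify(image, signature):
        raise PermissionError("signature check failed")
    namespace = {}
    exec(image.decode(), namespace)  # the check says nothing about behavior
    return namespace["driver_entry"]()
```

Here `load_and_run(tampered, signature)` is rejected with `PermissionError`, while `load_and_run(buggy_driver, signature)` passes the check and then fails with `ZeroDivisionError`: the integrity check succeeded, and the bug shipped anyway.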

Most Windows systems have some antivirus or EDR software protecting them. By default, you get Windows Defender, but there are more powerful EDRs out there, CrowdStrike’s Falcon being one of the leaders in this space.

The “incident” involved a bad update that caused a BSOD when Windows restarted. Restarting again did not help: a bug in the driver was hit as soon as it was loaded and initialized, so the BSOD appeared immediately on every boot. The only recourse was to boot the system in Safe Mode, where only a minimal set of drivers is loaded, disable the problematic driver, and reboot again. Unfortunately, this has to be done manually on millions of machines.
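
The boot loop and the Safe Mode workaround can be illustrated with a toy boot sequence. This is a sketch under simplifying assumptions, not the Windows boot process: the driver names, the `essential` flag, and the `boot` function are all hypothetical.

```python
class BootFailure(Exception):
    """Toy stand-in for a BSOD during startup."""


def ok_init():
    return None


def bad_init():
    raise ValueError("bug hit while the driver initializes")


# (name, essential, init) tuples; names are made up for illustration.
DRIVERS = [
    ("disk", True, ok_init),
    ("edr_sensor", False, bad_init),
    ("display", False, ok_init),
]


def boot(drivers, disabled=frozenset(), safe_mode=False):
    """Load drivers in order; any failing init aborts the whole boot.

    In safe mode only drivers marked essential are loaded.
    Returns the list of driver names that were loaded.
    """
    loaded = []
    for name, essential, init in drivers:
        if name in disabled or (safe_mode and not essential):
            continue
        try:
            init()
        except Exception as exc:
            raise BootFailure(name) from exc
        loaded.append(name)
    return loaded
```

In this model `boot(DRIVERS)` raises `BootFailure` no matter how often it is retried, `boot(DRIVERS, safe_mode=True)` succeeds because the faulty driver is never loaded, and `boot(DRIVERS, disabled={"edr_sensor"})` corresponds to the normal boot after the manual fix.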

The Windows kernel treats all kernel components in the same way, regardless of whether a component is from Microsoft or not. Any driver, no matter how insignificant, that causes an unhandled exception will crash the system with a BSOD. Maybe it would be wise to somehow designate drivers as “important” or “not that important” so they may be treated differently in case of failure. But that is not how Windows works, and in any case, an anti-malware driver is likely to be tagged as “important”.

This entire incident certainly raises questions and concerns – a single point of failure has shown itself in full force. Perhaps a different approach to handling kernel components should be considered.

Personally, I was never comfortable with Windows’ uniform treatment of kernel components and drivers, but in practice it’s unclear what would be a good way to deal with such exceptions. One alternative is to write drivers in user mode, and many are written this way, especially those handling relatively slow devices, like USB devices. This works well, but is not good enough for an EDR’s needs.
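
The isolation benefit of user-mode drivers comes from the process boundary, which can be shown with ordinary processes. The sketch below is only an analogy, assuming that a child Python process plays the role of a driver host; `run_hosted_driver`, `FAULTY_DRIVER`, and `HEALTHY_DRIVER` are hypothetical names.

```python
import subprocess
import sys

# Hypothetical 'driver' bodies, passed to a child interpreter as source text.
FAULTY_DRIVER = "raise RuntimeError('driver bug')"
HEALTHY_DRIVER = "print('device ready')"


def run_hosted_driver(source: str) -> tuple[bool, str]:
    """Run 'driver' code in a separate process and report how it ended.

    A crash in the child shows up here as a non-zero exit code; the
    calling process (the rest of the 'system') keeps running.
    """
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.returncode == 0, result.stdout.strip()
```

A failed `run_hosted_driver(FAULTY_DRIVER)` leaves the caller free to log the failure or restart the host. What the sketch cannot show is the cost: every interaction crosses a process boundary, which is part of why this model suits slow devices better than an EDR sensor.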

Perhaps specifically for anti-malware drivers, any unhandled exception should be treated differently by disabling the driver in some way. That is not easy to do: the driver has typically registered callbacks with the kernel (e.g. via PsSetCreateProcessNotifyRoutineEx and others), and those callbacks may be running right now. How would the kernel “disable” it safely? I don’t think there is a safe way to do that without significant changes in driver protocols. The best one can do (in theory) is issue a BSOD, but disable the offending driver automatically, so that the next restart will work, and a warning can be issued to the user.
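
The "crash once, then disable" idea can be sketched as a toy state machine. Windows does not work this way today, and the `Machine` class, its `auto_disable` flag, and the driver names are hypothetical. The model also ignores the hard part named above, namely callbacks that are already registered or running.

```python
class Machine:
    """Toy machine whose 'disabled' set survives reboots, like a stored flag."""

    def __init__(self, drivers):
        self.drivers = dict(drivers)  # name -> init callable
        self.disabled = set()
        self.warnings = []

    def boot(self, auto_disable=False):
        """Return True if the boot completes, False if it ends in a toy BSOD."""
        for name, init in self.drivers.items():
            if name in self.disabled:
                continue
            try:
                init()
            except Exception:
                if auto_disable:
                    # Remember the offender so the next boot skips it.
                    self.disabled.add(name)
                    self.warnings.append(f"{name} was disabled after a crash")
                return False  # the crash itself still happens
        return True


def bad_init():
    raise ValueError("bug hit while the driver initializes")
```

With `auto_disable=False` every boot of a machine containing `bad_init` fails. With `auto_disable=True` the first boot still crashes, but the following one completes and `warnings` holds one message for the user.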

This is not ideal, but is certainly better than the alternative experienced in the recent crash. Why is it not ideal? First, the system will be unprotected, unless some default (like Defender) automatically takes over. Second, it’s difficult to determine with 100% certainty that the driver causing the crash is the ultimate culprit. For example, another (buggy) driver may write anywhere in kernel space, including memory that belongs to our anti-malware driver, and that write does not cause an immediate crash since the memory is valid. Later, our driver will stumble upon the bad data and crash. Even so, some kind of mechanism to prevent such a widespread crash must be put in place.
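
The misattribution problem can be made concrete with a toy shared address space. Everything here is hypothetical (`ToyKernelMemory`, `EdrDriver`, `BuggyDriver`, the offset, and the table); a `bytearray` stands in for kernel memory that every driver can write to.

```python
class ToyKernelMemory:
    """One shared, unprotected address space, as in kernel mode."""

    def __init__(self, size=64):
        self.cells = bytearray(size)


class EdrDriver:
    """Hypothetical anti-malware driver that keeps a table index at OFFSET."""

    OFFSET = 16
    TABLE = ("allow", "block", "audit")

    def __init__(self, memory):
        self.memory = memory
        self.memory.cells[self.OFFSET] = 1  # valid index into TABLE

    def on_process_create(self):
        # Trusts its own state; crashes if someone else corrupted it.
        return self.TABLE[self.memory.cells[self.OFFSET]]


class BuggyDriver:
    """Hypothetical unrelated driver with a stray pointer bug."""

    def __init__(self, memory):
        self.memory = memory

    def stray_write(self, address, value):
        # The write succeeds silently: the memory is valid, just not ours.
        self.memory.cells[address] = value
```

After `BuggyDriver.stray_write(16, 200)` nothing fails at the time of the write. The failure surfaces later as an `IndexError` inside `EdrDriver.on_process_create`, so a policy that blames the crashing driver would disable the wrong one.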

Translated from:

https://scorpiosoftware.net/2024/07/21/crowdstrike-and-the-formidable-bsod/


Article source: https://mp.weixin.qq.com/s?__biz=MzUyMTUwMzI3Ng==&mid=2247485555&idx=1&sn=5a789ed1e89c543536ebf953411fd778&chksm=f9db5f30ceacd62649065ebfcb748b27c4e0afb0c83212053dbfcf10af13842b2f2ba78f0b38&scene=58&subscene=0#rd