This week I am doing a deep dive into the CrowdStrike outage, to give you a summary of what happened and why it spread so far.
Last Friday, the business world woke up to the news that CrowdStrike, a large cyber-security firm whose products are widely used by businesses, had released a software update that disabled many Windows-based computers globally.
CrowdStrike is a large company with some 8,000 employees, headquartered in Austin, Texas.
They were founded in 2011, and their main area of expertise is providing endpoint security.
In plain English, that means running security software like virus detection on local machines, including Windows PCs. Ref: https://en.wikipedia.org/wiki/CrowdStrike
Last week on Friday the 19th of July, CrowdStrike released a software update for their vulnerability scanner called Falcon Sensor.
A bug in that update caused Windows PCs running it to experience a Blue Screen of Death (BSOD) on start-up, effectively disabling the computers as they could not start up normally.
As this update was sent out over the Internet to a huge number of customers running Falcon Sensor, the effects were felt globally and across many industries.
Many airline flights were cancelled, banking and healthcare services were disrupted, and even 911 call centers were affected.
By the end of trading Friday, CrowdStrike shares were down 11.0%.
Given how widespread Windows is in the corporate world, we are very vulnerable to these kinds of supply-chain issues.
Technologies like CI/CD make the delivery of software to remote machines very efficient, even if that software contains damaging bugs.
In their blog post about the incident, CrowdStrike mentioned it was a bad configuration issue in Falcon Sensor, rather than bad code.
The work-around involves starting the impacted Windows PCs in Safe Mode, and then finding and manually deleting the bad file from the update.
That sounds okay in theory, but what if an organization has tens of thousands of such PCs that require this manual work-around?
The problem is that the delivery mechanism for the initial update, namely the PC's internet connection, could not be used to deliver the fix, as the affected PCs were now offline!
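To make that manual step a little more concrete, here is a rough Python sketch of the "find and delete the bad file" logic. Treat it as an illustration only: the directory and file pattern below come from the publicly reported remediation guidance rather than from anything quoted in this post, the function name is my own, and a PC booted into Safe Mode will usually not have Python available at all. It defaults to a dry run so nothing is deleted unless you ask for it.

```python
from pathlib import Path

# Assumed values, based on publicly reported remediation guidance and not on
# this post. Verify them against the vendor's current instructions before use.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
BAD_FILE_PATTERN = "C-00000291*.sys"


def remove_matching_files(directory: Path, pattern: str, dry_run: bool = True) -> list[Path]:
    """Return the files in `directory` matching `pattern`.

    The files are only deleted when dry_run is False.
    """
    matches = sorted(p for p in directory.glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    if DRIVER_DIR.is_dir():
        # Dry run: report what would be removed without touching anything.
        for path in remove_matching_files(DRIVER_DIR, BAD_FILE_PATTERN, dry_run=True):
            print(f"Would delete: {path}")
    else:
        print(f"Directory not found: {DRIVER_DIR}")
```

The logic itself is trivial. The hard part is that somebody still has to get hands on every affected machine to run it, or to do the same thing by hand.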
CrowdStrike concluded their blog entry from the 20th of July by stating: "We understand how this issue occurred and we are doing a thorough root cause analysis to determine how this logic flaw occurred. This effort will be ongoing. We are committed to identifying any foundational or workflow improvements that we can make to strengthen our process. We will update our findings in the root cause analysis as the investigation progresses."
I look forward to reading that final Root Cause Analysis when they publish it.
Historically, this will go down as the largest IT outage of all time. It reminds me of the Y2K threat, but that was mitigated successfully thanks to advance warning.
This time there was no warning.
What I am working on this week:
Clearing binary logs from an old MySQL instance - disk space usage went from 94% to 9%.
Media I am enjoying this week:
Aegeon Science Fiction Illustrated issue 4
Download audio
File details: 14.8 MB MP3, 10 mins 15 secs duration.
Five.Today is a highly secure personal productivity application designed to help you manage your priorities more effectively, by focusing on the five most important tasks you need to achieve each day.
Our goal is to help you to keep track of all your tasks, notes and journals in one beautifully simple place, which is highly secure via end-to-end encryption. Visit the URL Five.Today to sign up for free!