CrowdStrike IT outage affected 8.5 million Windows devices, Microsoft says

MicroWave@lemmy.world · 3 months ago

thisbenzingring@lemmy.sdf.org · 3 months ago

All I know is that I had to personally fix 450 servers myself, and that doesn’t include the workstations that are probably still broken and will need to be fixed on Monday.

😮‍💨

qjkxbmwvz@startrek.website · 3 months ago

Is there any automation available for this? Do you fix them sequentially or can you parallelize the process? How long did it take to fix 450?

Real clustermess, but curious what fixing it looks like for the boots on the ground.

thisbenzingring@lemmy.sdf.org · edit-2 3 months ago

Thankfully I had cached credentials and our servers aren’t BitLocker’d. The majority of the servers had iLO consoles, but not all. Most of the servers are on virtual hosts, so once I got the failover cluster back, it wasn’t that hard just working my way through them. But the hardware servers without iLO required physically plugging in a monitor and keyboard to fix, which is time-consuming. 10 of them took a couple of hours.

I worked 11+ hours straight. No breaks or lunch. That got our production domain up and the backup system back on. The dev and test domains are probably half working. My boss was responsible for those and he’s not very efficient.

So I was able to do most of the work from my admin PC in my office.

For the majority of them, I’d use the Windows recovery menu that they were stuck at to make them boot into safe mode with network support (in case my cached credentials weren’t up to date). Then start a cmd and type out that famous command:

Del c:\windows\system32\drivers\crowdstrike\c-00000291*.sys

I’d autocomplete the folders with Tab, then type the 5 zeros… Probably gonna have that file name in my memory forever.

Edit: one painful self-inflicted problem was that my password is a 25-character random LastPass-generated password. But IDK how I managed it, I never typed it wrong. Yay for small wins.
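For anyone who wants to script that delete step once a machine is reachable, here is a minimal Python sketch of the same idea. It is not what was actually run above (that was a plain `del` in cmd). The function name `remove_channel_files` and the `dry_run` flag are made up for illustration, and it assumes Python is available on the box, which is usually not the case in safe mode or a recovery environment.

```python
import fnmatch
from pathlib import Path
from typing import List

# Same wildcard as the `del` command in the comment above.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"

# Directory from the comment above; pass a different one if the OS volume
# is mounted under another drive letter.
DEFAULT_DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")


def remove_channel_files(driver_dir: Path = DEFAULT_DRIVER_DIR,
                         dry_run: bool = True) -> List[Path]:
    """Return the files matching the pattern; delete them unless dry_run is True.

    Hypothetical helper for illustration only.
    """
    if not driver_dir.is_dir():
        return []  # nothing to do if the directory is missing

    pattern = CHANNEL_FILE_PATTERN.lower()
    # Compare lowercased names so matching is case-insensitive everywhere,
    # like the Windows filesystem the original command ran on.
    matches = sorted(
        p for p in driver_dir.iterdir()
        if p.is_file() and fnmatch.fnmatchcase(p.name.lower(), pattern)
    )
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `remove_channel_files()` with the default `dry_run=True` only lists what would be removed; pass `dry_run=False` to actually delete.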

magikmw@lemm.ee · 3 months ago

You need to boot into emergency mode and replace a file. Afaik it’s not very automatable.

Jtee@lemmy.world · 3 months ago

Especially if you have bitlocker enabled. Can’t boot to safe mode without entering the key, which typically only IT has access to.

magikmw@lemm.ee · 3 months ago

You can give the key to the user and force a replacement on the next DC connection, but try getting people to enter a 48-digit recovery key over the phone… Not automatable anyway.

HeyJoe@lemmy.world · 3 months ago

Servers would probably be way easier than workstations if you ask me. If they were virtual, just bring up the remote console and you can do it all remotely. Even if they were physical, I would hope they have an IP KVM attached to each server so they can be accessed remotely as well. 450 sucks, but at least they theoretically could have done every one of them without going anywhere.

There are options to do workstations remotely as well, but almost nobody ever uses those services, so those probably need to be touched one by one.

prashanthvsdvn@lemmy.world · 3 months ago

I read this in a passing YouTube comment, but I think it would theoretically be possible to set up an iPXE boot server that boots a Windows PE environment and deploys the fix from there. Then all you have to do on the affected machines is configure the boot option to point at the iPXE server you set up. Not fully sure if it’s feasible or not, though.
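As a rough sketch of what the boot-server side of that idea might look like, the Python snippet below renders an iPXE script that chain-loads a Windows PE image through wimboot, following the file layout shown in the iPXE project's wimboot documentation. The function name `render_ipxe_script` and the server URL are hypothetical. It also assumes the WinPE image has already been customized to run the delete on startup, which is not shown here, and BitLocker-protected machines would still need their recovery key.

```python
def render_ipxe_script(base_url: str) -> str:
    """Build an iPXE script that boots Windows PE via wimboot.

    base_url is a hypothetical HTTP location serving wimboot plus the
    WinPE files (boot/bcd, boot/boot.sdi, sources/boot.wim).
    """
    base = base_url.rstrip("/")
    lines = [
        "#!ipxe",
        f"kernel {base}/wimboot",
        # Each initrd line fetches a file and names it as wimboot expects.
        f"initrd {base}/boot/bcd BCD",
        f"initrd {base}/boot/boot.sdi boot.sdi",
        f"initrd {base}/sources/boot.wim boot.wim",
        "boot",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical server address, for illustration only.
    print(render_ipxe_script("http://pxe.example.internal/winpe"))
```

Whether this is practical at scale depends on each machine being able to network-boot at all, which is the same uncertainty raised above.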

LavenderDay3544@lemmy.world · 3 months ago

deleted by creator

thisbenzingring@lemmy.sdf.org · 3 months ago

Because my expertise is Windows and that’s the environment I get paid to administer. We have Linux servers too, but they didn’t have any of these problems. BUT they have had their own issues in the past, and finding Linux system admins isn’t really as easy as you might expect. Running your own Linux system at home is not the same as running a 175 TB Ceph node.