When the Crowd Strikes!
ProTop saw the issues first and sent out warnings to our customers.
So, this happened during the CrowdStrike event.
It is quite worrying as a DBA to watch resources drop off the network one by one. It is somehow worse than everything disappearing all at once.
You are already planning responses to scenarios A, B, and C, so each new change forces you to re-evaluate how complex the recovery could be and how it needs to be sequenced. You also have to ask at which point you will be needed in each customer's response plan for those scenarios.
As my on-call shift started last Friday, the ProTop Portal began alerting me (see the alert image below) to a missing heartbeat for one of our Windows-based ProTop customers. This was the first indication that it was going to be a rough day in IT.
I checked the ProTop Portal to see if the customer was performing maintenance and had simply forgotten to schedule an outage or set ProTop to maintenance mode.
Unfortunately, there were no signs of a clean DB shutdown before the heartbeat stopped. (In a glass-half-full sort of way, there were also no signs of an abnormal shutdown.)
I called the customer's after-hours support number and reached their tech, who was at home getting ready for work. I informed him that all of their resources had stopped sending a heartbeat and that this might be a network-related issue. He said he would log in to see what was going on, and we rang off.
A short while later, I started to see noise in another customer's chat group indicating that some of their developers were stuck in a BSOD (Blue Screen of Death) loop on their PCs and that this was affecting servers across the network. They used CrowdStrike AV across the board, and the impact was high. Below is a short message exchange I had with one of our customers.
I called the first customer back to ask whether they were using CrowdStrike security, and the answer was yes. The tech also mentioned that their DNS and VPN servers were affected, so he would have to drive into the office to have a look. We had a possible root cause.
A short while later, I received a page saying that another ProTop customer's primary database server had stopped sending a heartbeat. I called their IT support desk (who were not having a good day!), and they were very glad to receive information about the possible root cause.
As updates propagated through the network, the customer's DR server also stopped sending a heartbeat. You can also see the replication target's response to the source going down. Given the severity of the issue, this information can now be used for audit purposes and to calculate response times and recovery times that can be fed back into the customer's DRP (Disaster Recovery Plan) and BCP (Business Continuity Plan).
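For readers who want to run that calculation themselves, here is a minimal sketch of how response and recovery times could be derived from recorded event timestamps. It is not ProTop's method or API: the function name, the field names, and the timestamps are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def incident_metrics(last_heartbeat, alert_raised, responder_engaged, heartbeat_restored):
    """Compute the intervals a DRP/BCP review typically asks for.

    All four arguments are datetime objects. The names are hypothetical,
    not fields exported by any particular monitoring tool.
    """
    return {
        # How long the outage went unnoticed
        "time_to_detect": alert_raised - last_heartbeat,
        # How long it took a person to pick up the alert
        "time_to_respond": responder_engaged - alert_raised,
        # Total time the resource was not reporting
        "time_to_recover": heartbeat_restored - last_heartbeat,
    }


# Illustrative timestamps only, not data from the actual incident
metrics = incident_metrics(
    last_heartbeat=datetime(2024, 1, 1, 5, 0),
    alert_raised=datetime(2024, 1, 1, 5, 5),
    responder_engaged=datetime(2024, 1, 1, 5, 20),
    heartbeat_restored=datetime(2024, 1, 1, 9, 30),
)

for name, value in metrics.items():
    print(f"{name}: {value}")
```

Comparing these intervals against the targets written into the DRP and BCP shows where the plan held up and where it needs revisiting.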
This is where working with a team of the world's best DBAs and Progress experts makes all the difference. At the point where my attention would have been divided too far, I was able to reach out to my colleagues and hand over the overflow so that no customer was left queuing for our assistance on a day when they needed us.
While some disasters are not avoidable, having ProTop as a reliable OpenEdge monitoring tool allowed our customers to detect the problems early and take action even when their servers had technically stopped working. Once the dust had settled, having all the events recorded in real time in the ProTop Portal gave our customers' teams a detailed record to review.