New DBA Series: 99 Ways to CRASH a Database
The best way to avoid disasters is to learn from mistakes, preferably someone else’s. Memorize this series or polish your resume-writing skills!
Nectar Daloglou of OmegaServe joins us this week to share some of the less-than-optimal events that he’s seen over his 20+ year career as a QAD technical expert and OpenEdge DBA. Subscribe to our mailing list and keep an eye out for his follow-up blogs featuring more of these career-limiting blunders!
Human Error: Bob crashed the database. Don’t be Bob.
Human error is defined as something being done that was “not intended by the actor; not desired by a set of rules or an external observer, or that led the task or system outside its acceptable limits”. In short, it is a deviation from intention, expectation, or desirability.
Blablabla…who are we kidding? Someone screwed up, and hopefully that someone wasn’t you! Maybe someone shut down a database (or a server) because they were looking at the wrong PuTTY window or remote desktop connection. Or, perhaps, they wrote a cleanup script that really, I mean REALLY, cleaned things up…
Delete delete delete your career…
Deleting or overwriting a critical file can be, depending on the file, an unrecoverable error. Get out your backup and your AI files. You do have AI enabled, don’t you?
Pesky wild cards can ruin your day:
rm *.d*
deletes dbname.db, dbname.d1, etc.
rm * .d*
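Notice the extra [accidental] space? The shell now hands rm two separate patterns: “*”, which matches every non-hidden file in the current directory, and “.d*”, which usually matches nothing. You will get a “.d* not found” error, but only after the “rm *” has deleted everything in the current directory!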
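To see the difference without risking anything, here is a minimal sketch that builds a throwaway directory and uses echo to print what the shell would pass to rm. The function name show_expansion and the file notes.txt are made up for illustration, and the sketch assumes a Bourne-style shell (bash, sh) with default globbing settings.

```shell
# Print what each pattern expands to; echo stands in front of rm, so nothing is deleted.
show_expansion() {
    local demo_dir
    demo_dir=$(mktemp -d) || return 1
    (
        cd "$demo_dir" || exit 1
        touch dbname.db dbname.d1 notes.txt   # hypothetical file names
        echo rm *.d*     # intended pattern: only names containing ".d" match
        echo rm * .d*    # stray space: "*" matches every non-hidden file
    )
    rm -rf -- "$demo_dir"   # remove only the scratch directory created above
}

show_expansion
```

The first line printed lists only the two database files. The second also lists notes.txt and ends with a literal .d*, which is the argument rm would later complain about.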
Redirecting output might obliterate your data:
echo > dbname.db
is not the same as
echo dbname.db
The former overwrites the file, leaving nothing but a single newline. Oops. On the plus side, you can recreate dbname.db from the structure file and the prostrct builddb command, assuming you have an up-to-date structure file. It is not so simple with other extent types. If you zero out a data extent, you will need to restore and roll forward. Or fail over to your OpenEdge Replication hot spare.
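One guard worth knowing about is the shell's noclobber option, which makes a plain > refuse to overwrite an existing file. The sketch below is a demonstration on a temporary file, not something from the original author's toolkit; the function name noclobber_demo is made up, and the temporary file merely stands in for a database extent.

```shell
# The function body is a subshell, so the noclobber setting does not leak into the caller.
noclobber_demo() (
    set -o noclobber
    demo_file=$(mktemp) || exit 1            # stand-in for a database extent
    echo "important data" >| "$demo_file"    # >| deliberately overrides noclobber
    # With noclobber on, a plain > against an existing file fails instead of truncating it.
    if ! echo > "$demo_file"; then
        echo "overwrite refused"
    fi
    cat "$demo_file"                         # the original content is still there
    rm -f -- "$demo_file"
)

noclobber_demo
```

The shell prints its own error message for the refused redirect, followed by "overwrite refused" and the untouched file content. Note that noclobber only protects against >. It does nothing for cp, mv, or a program that opens the file itself.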
Test first …
If you must delete, verify! List the files that will be deleted first:
ls filestodelete*
or
echo rm filestodelete*
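If you want the verification step built in, the idea can be wrapped in a small helper. This is only a sketch: confirm_rm is a hypothetical function, not a standard command, and it assumes a Bourne-style shell.

```shell
# confirm_rm: list the targets, then delete only after an explicit "yes".
confirm_rm() {
    local answer
    if [ "$#" -eq 0 ]; then
        echo "usage: confirm_rm FILE..." >&2
        return 2
    fi
    echo "The following would be deleted:"
    # ls fails if any name does not exist (for example an unmatched pattern),
    # and in that case we stop before deleting anything.
    ls -ld -- "$@" || return 1
    printf 'Type yes to delete: '
    read -r answer
    if [ "$answer" = "yes" ]; then
        rm -- "$@"
    else
        echo "Nothing deleted."
    fi
}
```

Called as confirm_rm filestodelete*, it shows the expanded list before anything happens. Called with the fat-fingered "* .d*", the unmatched .d* makes ls fail, and the function stops without deleting a thing.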
PRIVILEGED USERS ALWAYS PROCEED WITH CAUTION: Reduce the likelihood of deleting/overwriting files by using root and other privileged user accounts (like the account that owns the database files) only when necessary, and even then, be careful.
Don’t push the red button (unwittingly)!
An accidental server shutdown is a bad way to start the day. A “fat finger” can turn a “sign out” into a “shut down”.
Communication breakdown
The project team (left hand) not knowing what the maintenance team (right hand) is doing can lead to “unplanned” planned outages like a server decommission, point-and-click VM shutdown, you name it. Yes, we have seen production VMs stopped and deleted.
Monitor your status from a different server
If your alerts come from the production server and the production server is offline, then you won’t get any alerts…which is the expected behaviour when the system is running normally.
Wait…what!?!
If you rely on alerts generated from the same server as the one being monitored, then you are at risk of not getting any alerts at all. This happens all the time!
ProTop generates a “no heartbeat” alert if it stops receiving data from a monitored server, alerting support staff to the interruption in data flow and monitoring. And the ProTop web portals in our global network monitor each other, ensuring that each can send alerts to you and to us! We monitor the monitor *and* the monitors of the monitor.
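The general idea of a heartbeat check is simple enough to sketch. The following is not how ProTop implements it. It is a generic illustration that assumes the production server regularly delivers an epoch timestamp to a file on a separate monitoring host; the function names, the file, and the threshold are all hypothetical.

```shell
# Succeeds (exit 0) when the last heartbeat is older than the allowed silence.
heartbeat_is_stale() {
    local last_seen=$1   # epoch seconds of the most recent heartbeat
    local now=$2         # current epoch seconds
    local max_age=$3     # allowed silence, in seconds
    [ $(( now - last_seen )) -gt "$max_age" ]
}

# Meant to run on the monitoring host, never on the server being monitored.
check_heartbeat() {
    local heartbeat_file=$1   # holds one epoch timestamp written by production
    local max_age=$2
    local last_seen
    if [ ! -r "$heartbeat_file" ]; then
        echo "ALERT: no heartbeat file at $heartbeat_file"
        return 1
    fi
    last_seen=$(cat "$heartbeat_file")
    if heartbeat_is_stale "$last_seen" "$(date +%s)" "$max_age"; then
        echo "ALERT: no heartbeat for more than $max_age seconds"
        return 1
    fi
    echo "OK"
}
```

The point of the design is that silence itself is the alarm: if production goes dark, the timestamp stops advancing and the check on the other host trips.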
Proshut, bloody proshut
Please do not use the interactive proshut menu unless you actually really truly intend to shut down the database. Please. The “2” (Unconditional shutdown) is just a little too close to the “1” (Disconnect a user) for my liking.
If you need to disconnect a user, use “proshut DB -C list” to get a list of connected users, then “proshut DB -C disconnect <user#>” to disconnect the user.
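To keep fingers away from the menu altogether, the disconnect command can be wrapped in a helper that refuses anything that is not a user number. This is a sketch under assumptions: disconnect_user and the PROSHUT override variable are made up for illustration, and only the proshut invocation itself comes from the commands above.

```shell
# disconnect_user DB USER_NUMBER: non-interactive disconnect, no menu involved.
disconnect_user() {
    local db=$1 usr=$2
    if [ -z "$db" ]; then
        echo "usage: disconnect_user DB USER_NUMBER" >&2
        return 2
    fi
    case "$usr" in
        ''|*[!0-9]*)
            # Reject empty or non-numeric input before proshut is ever called.
            echo "usage: disconnect_user DB USER_NUMBER" >&2
            return 2
            ;;
    esac
    # PROSHUT is a hypothetical override so the wrapper can be dry-run with echo.
    "${PROSHUT:-proshut}" "$db" -C disconnect "$usr"
}
```

Running PROSHUT=echo disconnect_user DB 42 prints the arguments that would be passed to proshut without touching a database, which is a cheap way to check the wrapper before using it in production.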
To err is human …
You would do well to commit the cautions above to a place in your memory that rings a bell when you think to use any of these commands, especially in production. They should give you pause BEFORE you are forced into exercising your disaster recovery plan… and finding out just how divinely forgiving your boss’s boss is.