
New DBA Series: Please Don't CRASH Your Database AGAIN!!

Paul Koufalis

Read on for more soul crushing, database corrupting, career killing ways to CRASH your OpenEdge database!

Linux "Out of Memory Killer"

A Linux self-preservation mechanism, the OoMK terminates a process when the server runs out of memory, typically after memory has been over-committed. Unsurprisingly, the OoMK tends to kill the process consuming the most memory, which, most likely, is the database broker process.

A quick Google search will turn up numerous interesting articles on how to avoid this situation, including ways to tune "oom_adj" (or "oom_score_adj", which replaces it on newer kernels). In practice, though, it is unlikely that your database will be spared, and you are better off making sure that you have ample memory to handle the expected workload on your server.
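Since the real fix is having enough memory, a periodic headroom check is a cheap safety net. Here is a minimal sketch, assuming a Linux kernel that reports MemAvailable in /proc/meminfo. The function names and the threshold are illustrative, not part of any OpenEdge tooling.

```shell
# Print available memory as a whole-number percentage of total memory.
# $1: a meminfo-format file (defaults to /proc/meminfo on Linux).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END {
             if (total <= 0) exit 1
             printf "%d\n", avail * 100 / total
         }' "${1:-/proc/meminfo}"
}

# Return 0 if headroom is at or above the threshold, 1 if below it,
# 2 if the meminfo data could not be read.
# $1: minimum acceptable percentage, $2: optional meminfo-format file.
mem_headroom_ok() {
    pct=$(mem_available_pct "${2:-/proc/meminfo}") || return 2
    if [ "$pct" -lt "$1" ]; then
        echo "WARNING: only ${pct}% of memory available (threshold ${1}%)" >&2
        return 1
    fi
}
```

Something like `mem_headroom_ok 15 || mail_the_dba` in cron would do, where `mail_the_dba` stands in for whatever alerting you already have.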

A Simple Hostname Change

Simply put, the hostname is stored in the <dbname>.lk file and it must match the actual server hostname. If you change the hostname while a database is up... well... you know what is going to happen!

The log file will show:

<filename>: HOSTNAME is <hostname>, expected <hostname>. (4192)
<file-name> is not a valid .lk file for this server. (4196)
The Host parameter supplied was <host>, however, this Host is <host>. (5149)
SYSTEM ERROR: Can't attach shared memory with segment_id

and boom!

For more information, see this KB entry.

And while we're on the subject of the database lock file: never delete it while the database is open! This will instantly crash the database and could corrupt it, necessitating recovery from backup and roll forward (along with downtime and data loss). See this KB entry about corruption due to lock file deletion.
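Before a hostname change, it is worth confirming that no database is still open. A rough sketch, assuming that an open database leaves a .lk file next to its .db file and that all your databases live under one directory tree (the function names are made up for illustration):

```shell
# List OpenEdge lock files (*.lk) found under a directory tree.
# $1: directory that holds the databases.
list_lock_files() {
    find "$1" -type f -name '*.lk' 2>/dev/null
}

# Return 0 only if no .lk file is found under the directory.
no_open_databases() {
    [ -z "$(list_lock_files "$1")" ]
}
```

Run `no_open_databases /path/to/dbs` and proceed only if it succeeds. A .lk file can also be left behind by an earlier crash, so a hit means "go and look", not "delete the file".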

Out of Disk Space

How embarrassing. Not only will the database crash, but you risk corruption if BI notes are only partially written. This has to be one of the most inexcusable reasons for a database crash!

The larger your BI file tends to grow, the more available disk space you should have in your BI partition. When you restart your database after a crash, it needs to go through crash recovery, redoing any unwritten committed transactions and undoing any uncommitted transactions. These changes require more BI writes, further growing the BI file. And if you run out of disk space during crash recovery, the next restart will require even more disk space, and so on. Monitoring the available space in your BI partition is vitally important.
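If you have nothing else in place, even a few lines of shell will do for watching the BI partition. This is a sketch only: it relies on the POSIX `df -P` output format, the function names are hypothetical, and the right threshold depends on how large your BI file gets.

```shell
# Read `df -P` output on stdin and print the free-space percentage
# of the first filesystem listed (100 minus the Capacity column).
parse_df_free_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Print the free-space percentage of the filesystem holding $1.
fs_free_pct() {
    df -P "$1" | parse_df_free_pct
}

# Return 1 and warn if the filesystem holding $1 has less than
# $2 percent free. A failed df counts as 0% free.
check_bi_space() {
    free=$(fs_free_pct "$1")
    if [ "${free:-0}" -lt "$2" ]; then
        echo "WARNING: $1 has only ${free:-0}% free (threshold ${2}%)" >&2
        return 1
    fi
}
```

Call `check_bi_space` with the directory that holds your BI extents and a minimum free percentage, and wire the non-zero exit status into your alerting.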

Lock Table Overflow

Oufff... this one hurts because, similar to disk space, it is easily avoidable by implementing a tool like ProTop Monitoring and Advanced Alerting.

Every time the lock table fills, the broker uses 72 bytes of shared memory from the -Mxs (excess shared memory) pool to increase the size of the lock table by one entry. If this happens often enough, the -Mxs pool of shared memory is exhausted and the database crashes with errors similar to:

(14394) Out of free shared memory.
(6495) Out of free shared memory. Use -Mxs to increase.
(1185) Out of free shared memory. Use -Mxs to increase.

This KB entry explains in detail why this happens. When the database broker starts, it reserves some shared memory space (the -Mxs pool) for situations where it might need to expand a memory structure like the lock table. Once this space runs out, the broker is stuck: it cannot allocate more shared memory, and it cannot continue without it. The unfortunate result is an abnormal shutdown of the database.

Suffice it to say, please monitor lock table usage and the lock table high water mark. ProTop trends this data over months and years and alerts you to impending problems, allowing you to proactively increase the size of -Mxs and pinpoint exactly when a change in your OpenEdge environment might have led to an increase in lock table usage.
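To put the 72-byte figure in perspective, here is some back-of-the-envelope arithmetic as shell functions. It assumes -Mxs is expressed in kilobytes and pretends the whole pool is available to the lock table, which it is not (other structures draw on it too), so treat the result as an upper bound. The function names are illustrative.

```shell
# Upper bound on the extra lock entries an -Mxs pool could supply,
# at 72 bytes per entry.
# $1: -Mxs value, assumed to be in kilobytes.
mxs_lock_entry_budget() {
    echo $(( $1 * 1024 / 72 ))
}

# Lock table high water mark as a whole-number percentage of -L.
# $1: high water mark, $2: -L (configured lock table entries).
lock_hwm_pct() {
    echo $(( $1 * 100 / $2 ))
}
```

With -Mxs at 1024, `mxs_lock_entry_budget 1024` gives at most 14563 overflow entries. A high water mark of 7200 against -L 8192 works out to 87% with `lock_hwm_pct 7200 8192`, which is already close enough to the limit to deserve a look.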

Hardware Failure

From inexcusable we jump to inevitable: hardware failure. It is inevitable that your OpenEdge environment will get hit by some hardware failure, be it disk, controller, memory or anything in between. This is why we repeat and repeat: make sure you have a solid and tested recovery plan!

Human error

Here's a war story from a colleague who shall remain nameless:

"I was in a database directory and wanted to run a dbanalysis. My intent was to type:
proutil dbname -C dbanalys > dbname.dba

Instead, I typed:
proutil dbname -C dbanalys > dbname.db

Oops. Yes, I replaced the database’s control area with a dbanalysis report. Instant crash, obviously. Luckily, it was only a dev database, and I had an up-to-date structure file. So, I was able to recreate the .db file with prostrct builddb. Not my finest moment."

Lessons learned:

Always ensure you have an up-to-date structure file, e.g. add prostrct list to your nightly backup script.
More generally, never perform an OS write operation (create/update/delete files) in the database directory. Run the command somewhere else, or direct the output to a separate fully-qualified path.
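Both lessons can be turned into small guard rails. The following is a sketch, not a hardened tool: the function names are invented, it only catches output aimed at the database directory itself (not its subdirectories), and it assumes `proutil` and `prostrct` are on the PATH.

```shell
# Print the physical absolute path of a directory.
abs_dir() {
    (cd "$1" 2>/dev/null && pwd -P)
}

# Return 0 if output file $2 would land outside database directory $1.
output_is_outside_db_dir() {
    db_dir=$(abs_dir "$1") || return 2
    out_dir=$(abs_dir "$(dirname "$2")") || return 2
    [ "$db_dir" != "$out_dir" ]
}

# Run dbanalys with the report written outside the database directory.
# $1: database path (without .db), $2: report file.
safe_dbanalys() {
    if ! output_is_outside_db_dir "$(dirname "$1")" "$2"; then
        echo "Refusing to write $2 into the database directory" >&2
        return 1
    fi
    proutil "$1" -C dbanalys > "$2"
}

# Refresh the structure file; intended for a nightly backup script.
# $1: database path, $2: structure (.st) file to write.
save_structure() {
    prostrct list "$1" "$2"
}
```

With this in place, the fat-fingered command from the war story gets refused before the shell redirect ever touches the .db file.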

Conclusion

The crashes might differ, but the lesson remains the same: learn from other DBAs' mistakes, or find creative ways to explain why you didn't put in place solutions to predict and prevent these common causes of database crashes!

Deploying the best OpenEdge Monitoring and Advanced Alerting Tool in the Galaxy probably wouldn’t hurt either.
