User Died, DB Crashed, Business MAD!
Anyone who has been around the OpenEdge world for a while has had to face the dreaded “User died holding shared memory latch. ABNORMAL SHUTDOWN” crash. While this sometimes happens as a result of a Progress OpenEdge bug, more often than not it’s a human error. Here are three key points to understand in order to minimize the chance of crashing your database when trying to terminate a shared memory client.
The UNIX kill command doesn’t kill anything
The ill-named kill command simply sends a signal to a process: SIGHUP, SIGINT, SIGKILL… Some of those signals could be fatal, but mostly it’s the job of Progress’ C programmers to incorporate signal handling code in the OpenEdge executables. For the OpenEdge AVM, the common signal handlers are documented and, unless you’re running an old version, are fairly well behaved. SIGINT will cause the AVM to restart from its start-up program. SIGHUP will simulate a user hanging up his terminal session and gracefully end the AVM. SIGXCPU and SIGFPE are usually fatal and well behaved. Usually – be careful!
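To make the point concrete, here is a minimal Python sketch (not OpenEdge code) of a process that installs its own SIGHUP handler and then signals itself. The names `received` and `on_signal` are made up for this illustration; the AVM's real handlers are written in C and do far more.

```python
import os
import signal

received = []  # names of the signals our handler has seen


def on_signal(signum, frame):
    # A real program would clean up and exit here; this one only takes note.
    received.append(signal.Signals(signum).name)


# Install a handler, as a well-behaved program does for trappable signals.
previous = signal.signal(signal.SIGHUP, on_signal)

# This is all that `kill -s SIGHUP <pid>` does: deliver a signal to a process.
os.kill(os.getpid(), signal.SIGHUP)

signal.signal(signal.SIGHUP, previous)  # restore the earlier disposition
print(received)
```

The process is still running after the "kill": what happens next is entirely up to the handler.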
Two warnings:
- I strongly suggest that you use the signal names, not the numbers (kill -s SIGHUP rather than kill -1) as the name/number combinations are not the same on all platforms.
- Even if SIGHUP, SIGINT and SIGTERM are typically well digested by the AVM, ANY signal sent via kill could cause the database to crash and there is no 100% safe way to terminate a shared memory client. Caveat emptor!
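If you are curious which number a name maps to on a given machine, a short Python sketch like the following will tell you. The helper name `signal_number` is made up for this example. SIGUSR1 is included because it is a common case that differs: it is 10 on Linux x86 but 30 on macOS.

```python
import signal


def signal_number(name):
    """Return this platform's number for a signal name such as 'SIGHUP'."""
    return signal.Signals[name].value


# Print the local name/number pairs; compare the output across your platforms.
for name in ("SIGHUP", "SIGINT", "SIGTERM", "SIGKILL", "SIGUSR1"):
    print(name, signal_number(name))
```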
The dangers of untrappable signals
The real danger comes from untrappable signals like SIGKILL (the famous “kill -9”) as the AVM is not given a chance to clean up and exit gracefully. The process simply disappears and the watchdog process or broker is left to clean up any mess. If that mess included holding a shared memory latch, the broker shuts down to ensure DB integrity. The lesson here is simple: JUST DON’T USE kill -s SIGKILL (kill -9).
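"Untrappable" is meant literally: the operating system refuses to let a program install a handler for SIGKILL. The Python sketch below (the helper `can_trap` is a name invented for this example) tries to install a do-nothing handler for a few signals and reports which attempts the kernel accepts.

```python
import signal


def can_trap(sig):
    """Return True if this process is allowed to install a handler for sig."""
    try:
        previous = signal.signal(sig, lambda signum, frame: None)
    except OSError:
        return False  # the kernel refuses: this signal cannot be trapped
    signal.signal(sig, previous)  # put the original disposition back
    return True


for sig in (signal.SIGHUP, signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
    print(sig.name, "trappable" if can_trap(sig) else "untrappable")
```

No amount of careful signal handling code in an executable can change the SIGKILL result.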
Trigger-happy DBAs and sysadmins
The most common human error, besides using kill -9, is impatience. The DBA/sysadmin sends a SIGHUP or SIGINT but nothing seems to happen. Worse, the process seems to be using CPU and doing disk I/O so the DBA assumes that the process is ignoring him (and nobody likes to be ignored). The DBA repeats the kill command, or uses a different kill signal, escalating and eventually causing a database crash. Don’t do this. In reality, the process was likely undoing the work that was interrupted by the first signal, and, after it finishes its cleanup, will exit gracefully.
Instead follow these steps:
- Monitor the _Connect._Connect-Disconnect flag in the Virtual System Tables (VSTs). If it contains a value of 1, the process received your signal and is working on it
- Monitor the _UserIO table to see if the process is doing anything in the database
- Check if the process is connected to multiple databases. You may think the process is doing nothing when it’s doing something in another DB
- Check if the process has any children. If the AVM spawned a child (for example, a print job or a file transfer) and that child is stuck, the parent cannot die until the child goes away
- If the process still seems unresponsive, check the process’ open files with lsof and check for open shared memory segments using pmap (svmon on AIX). If the process has no database files open and, especially, no database shared memory segments attached, it is likely safe to use a more fatal kill signal. YMMV
- Use killprosession: This free tool is available as part of the ProTop download at https://protop.com/download/. Killprosession automates these steps and a few more, taking the guesswork out of killing a shared memory client
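As a rough illustration of the "one signal, then wait and look" discipline, here is a minimal Python sketch. It is not how killprosession works, and it does not touch the VSTs. All function names, the timeout and the polling interval are assumptions made for this example, and the shared memory check assumes Linux, where System V segments show up in /proc/<pid>/maps with a path starting with /SYSV.

```python
import os
import signal
import time


def process_exists(pid):
    """Return True while a process with this PID is still present."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only checks the PID
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # it exists, but belongs to another user
    return True


def sysv_segments(maps_text):
    """List System V shared memory mappings in /proc/<pid>/maps text (Linux)."""
    segments = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)  # address perms offset dev inode path
        if len(fields) == 6 and fields[5].startswith("/SYSV"):
            segments.append(fields[5])
    return segments


def hang_up_and_wait(pid, timeout=600.0, poll_interval=5.0, sig=signal.SIGHUP):
    """Send ONE signal, then wait. Return True if the process exited in time."""
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        return True  # already gone
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not process_exists(pid):
            return True
        time.sleep(poll_interval)
    return False  # still there: investigate, never escalate blindly
```

If `hang_up_and_wait` returns False, the next move is to investigate rather than send another signal: for example, read /proc/<pid>/maps, pass its text to `sysv_segments`, and treat any non-empty result as a reason to keep waiting or to plan a shutdown.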
Killprosession
Paul Koufalis first wrote killprosession almost 20 years ago. He was working at a customer with around 1500 concurrent shared memory users and the need to terminate shared memory processes was an almost daily occurrence, eating up a considerable amount of the local DBA’s time. With killprosession, when a long-running open transaction was detected, or if an important business process was blocked, an alert was generated and sent to the help desk, not the DBA! The help desk would follow a preset script: a) contact the user; b) remote desktop into their PC; c) killprosession. If the problem persisted or the situation deviated from their script, it was then escalated to the DBA. Killprosession tells you exactly what the process is doing (or not doing), helping you make an informed decision about what to do next. It’s included with the free ProTop download.
Safety First
No matter the situation, your first responsibility is to protect the database. It does happen occasionally that an OpenEdge client process just won’t die, and you may need to plan a shutdown to safely get rid of it. Be patient, avoid untrappable signals, use killprosession to automate the process and please don’t hesitate to reach out to one of the experts at WSS if you have any questions.