
Insane Adventures In Version 8 Corruption

Paul Koufalis

Friday. 11:00 AM. I get a phone call from a managed services provider who knows nothing about OpenEdge. Their customer had been down since around 5:00 AM and no one knew what to do. Oh, and by the way, this is version 8 on Windows.

Wait…what…version 8 !?!

Yup.

The MSP called their usual Progress go-to guy and he said “Whoa!!! This is waaaay over my head. Call White Star Software.”

Do I need to add that their last good backup was 7 months old? I cannot make this stuff up.

Does anyone want to guess if after-imaging was enabled?

Is it bad, doctor?

Yes, Billy. It’s bad. You’re already dead but you just don’t know it yet.

Here’s the error message:

SYSTEM ERROR: File e:\db\dbname.d17 too small -2147483648, blocksize 8192 extend failed. (4524)

Seems simple enough: the variable length extent grew to 2 GB. It happens. But if you were around in the V6-7-8 days, you’re maybe starting to sweat a little bit around now.
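A quick aside on that negative number. One plausible reading, and it is only my assumption since the error text does not say so, is that the extent size was held in a signed 32-bit integer, which wraps around at exactly 2 GB. The helper name `as_int32` below is made up for illustration; this is a sketch of the arithmetic, not of anything inside Progress.

```python
# Sketch: what a byte count looks like if it is stored in a signed
# 32-bit integer (ASSUMPTION: this is how version 8 tracked extent size).

TWO_GB = 2 ** 31  # 2,147,483,648 bytes


def as_int32(n: int) -> int:
    """Reinterpret the low 32 bits of n as a signed 32-bit integer."""
    return ((n + 2 ** 31) % 2 ** 32) - 2 ** 31


if __name__ == "__main__":
    # One byte below the limit still fits.
    print(as_int32(TWO_GB - 1))
    # Exactly 2 GB wraps to the value shown in the error message.
    print(as_int32(TWO_GB))
```

Under that assumption, a file of exactly 2 GB reports a length of -2147483648, which would explain why the extend logic thinks the file is "too small".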

Easy fix right? Uh…no…

If this were an OE10 or OE11 database, there would be lots of options. None of these worked back in version 8.

Enable Large Files: Nope nope nope…did not exist in version 8.
prostrct add: Same error message.
probkup -norecover: I got it! I’ll run a probkup -norecover, which foregoes crash recovery, then restore into a new structure. BEEEEP!!! Fail.
procopy: Nope.
truncate bi: Nothing
truncate bi -F !?!: Zip. Nada. Niente.
_progres -RO: Easy peasy – I’ll connect in RO, dump the DB. But sadly, no dice.
dbrpr -F: I could get in, but none of the various scan and fix options helped.

We interrupt this story for this special announcement

I always take a physical copy of the DB to try crazy stuff on. I never ever EVER touch the real database. To all you people running tight on disk space, or laughing at us when we complain about SANs or RAID 5 being slow: this is why. Everything is super/great/perfect/amazing during normal times, but when time is critical and you need to do something unusual (like physically copy a database), your so-called cost saving measures end up costing you millions.

Meanwhile, back in the waiting room…

You know, in television hospital dramas, when the surgeon walks slowly and seriously into the waiting room, and the family stands up with that scared look on their face. I was the surgeon.

The database is on life support, 4 interns are taking turns doing CPR, and we need to know what your end-of-life instructions are. How long do you want to continue paying us to attempt [increasingly crazy] heroic measures to resuscitate the database? When do you accept your fate, restore the backup from 7 months ago, and…well…I don’t know what.

Just a little more please…

The customer asked us to keep trying, so it was time to try some less-than-supported ideas. I had some tricks up my sleeve, but I also thought it prudent to reach out to some of the smartest DBAs in the world. At this point, someone might have a crazy idea about something they did once back in 1999…

WARNING: DO NOT ATTEMPT ANY OF THE STUFF BELOW. It’s dumb. It’s crazy.

Absolutely insane attempt #1

Hey – the variable length extent is full, right? Prostrct add doesn’t work, so I’m going to hack the database with a hex editor to change the 2 GB variable length extent to a fixed length extent, and add a new variable length extent.

Did I mention that you shouldn’t try this at home?

The good news: it worked. The database recognized the new extent.

The bad news: still getting the same error message.

I was hoping the broker would “notice” the new extent and try to grow into it, rather than expanding the 2GB extent, but I was wrong.

Absolutely insane attempt #2

I identified the last few blocks in the variable length extent, then, using dbrpr, I reformatted them to empty blocks. Similar to my “add an extent” idea, I was hoping the broker could use those now empty blocks to do what it needed to do. Of course, I rebuilt the free and RM chains so that the broker would know about these newly freed blocks.

Result: FAIL. Still getting the same error message.

Absolutely insane attempts #3-7

It’s now around 6pm, and I’m exchanging emails with a couple of my colleagues. George Potemkin from Progress Technologies LLC in Russia was still awake watching Premier League football and he started feeding me ideas. Most of them I had already tried, and the rest failed. Things were not looking good. Until…

Absolutely insane attempt #8

George suggested that I physically shrink the file, so I chopped 32 KB from the end of the variable length extent using dd. For those that don’t know, dd is a simple UNIX utility to dump a disk or file, and you can specify things like block size and length. Luckily there is a Windows version of dd that I was able to use.

dd if=<the file> of=<another file> bs=1024 count=(4 blocks less than the 2 GB limit).
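For the curious, here is the arithmetic behind that count, plus a rough Python equivalent of the dd copy. It is a sketch only, and the same warning applies: do not do this. It assumes the extent was exactly 2 GB (2^31 bytes) and uses the 8192-byte database block size from the error message; the names `dd_count` and `copy_truncated` are made up for illustration.

```python
import os
import tempfile

DB_BLOCK_SIZE = 8192   # database block size, from the error message
BLOCKS_TO_DROP = 4     # 4 x 8192 bytes = 32 KB chopped off the end
DD_BS = 1024           # the bs= value passed to dd


def dd_count(limit_bytes: int = 2 ** 31) -> int:
    """Number of bs-sized units to copy, stopping 4 DB blocks short.

    ASSUMPTION: the extent is exactly limit_bytes long.
    """
    return (limit_bytes - BLOCKS_TO_DROP * DB_BLOCK_SIZE) // DD_BS


def copy_truncated(src: str, dst: str, drop_bytes: int) -> int:
    """Copy src to dst, leaving out the last drop_bytes. Returns bytes written."""
    keep = os.path.getsize(src) - drop_bytes
    if keep < 0:
        raise ValueError("cannot drop more bytes than the file contains")
    remaining = keep
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while remaining > 0:
            chunk = fin.read(min(1024 * 1024, remaining))
            if not chunk:
                break
            fout.write(chunk)
            remaining -= len(chunk)
    return keep - remaining


if __name__ == "__main__":
    print(dd_count())  # count= value for bs=1024 on a 2 GB extent

    # Tiny demonstration on a scratch file of 10 fake blocks.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "extent.copy")
        dst = os.path.join(tmp, "extent.chopped")
        with open(src, "wb") as f:
            f.write(b"\0" * (10 * DB_BLOCK_SIZE))
        copy_truncated(src, dst, BLOCKS_TO_DROP * DB_BLOCK_SIZE)
        print(os.path.getsize(dst))
```

With bs=1024, a full 2 GB is 2,097,152 units, so stopping 32 KB short means copying 2,097,120 of them. Note that the sketch, like dd, writes to a new file and never modifies the source.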

Next step: truncate bi: success.

_progres db -1: success

Holy smokes…the patient has a heartbeat.

Not out of the woods yet…

I need to do this on the real database now. But again, we never touch the real database, so I take another physical copy, prostrct repair, and go. SUCCESS !!

Time to dump and load. Remember that I literally chopped off the last 4 blocks from the database. I have no idea what was in there but whatever it was, it’s gone, so chances are the database is not in a very clean state. I suppose I could have hex dumped those blocks but by now I’m 8 hours into this journey with a D&L of a type 0 storage area database (i.e. everything in the “schema area”) ahead of me.

The binary dump and load, surprisingly, completed without errors. No fragments. No index issues. 4 hours later I had a brand new database. There was likely some logical data corruption in there, related to the chopped records, but physically it was all good.

A few take home lessons

1. If you’re running your business on version 8 or version 9, you’re insane.

2. If you’re not monitoring, you’re insane. Yeah yeah I know we sell ProTop Monitoring and Alerting subscriptions but there’s a reason ProTop exists: to protect you against these kinds of things. The excuses I hear are mind blowing. “We already have monitoring” is my favourite. Suuuure you do. I hear the Toronto Maple Leafs are going to win the Stanley Cup this year, too. Somebody call Bill Barilko.

3. Again, for the disk space misers out there: I ended up needing space for 7 copies of the database: the original one (1), the one I was defiling with my crazy stunts (2), the 7-month-old backup file (3), the restore of the 7-month-old backup (4 – because we had to check if it was a usable backup), the copy on which to attempt the solution (5), the dump files (6), and the new database (7). Do you have enough disk space available to do that? And is it fast disk space? Or will each copy/restore take 4 hours? Now do you understand why we harp on disk IO bandwidth?

4. There’s a reason why White Star Software is often referred to as “the last phone call”. We get the call after everyone else has given up.

5. Treat your friends and colleagues well. Always. We all need each other’s help at some point or other.

6. If you’re running your business on version 8 or version 9, you’re insane.