Dump and Load Strategies Part II: Dumping Data
If you read Part I, you’ve already analyzed your data and designed your Type II storage areas, and you’re ready to dump your existing data and load it into the new structure. If you’re still using Type I storage areas, the dump will probably be the slowest part of the entire D&L process, so don’t despair: the load and the index rebuild should take considerably less time than the dump.
You have three choices when it comes to dumping your data:
1. ASCII Dictionary Dump
2. ASCII Custom Dump
3. Binary Dump
This blog post is going to concentrate on the binary dump as that’s what most of you are going to use. Unless the database is very small or you need to do some conversions during the dump, there’s really no reason to use the dictionary dump for anything other than users and sequence current values.
What’s a Binary Dump?
Basically the binary dump utility takes each record, exactly as it is stored in the physical database, and dumps it to a file. There are no conversion, formatting, century or other such issues to worry about and as a result, the binary dump is the fastest method to dump your data. Unless you say otherwise, the primary index will be used (more on this later). The basic syntax is:
_proutil <DB> -C dump <table name> <output directory>
Example: _proutil sports -C dump customer /scratch/dump
The astute observer may be saying to himself, “Wait a second! I have to run a separate command-line _proutil for each table?” Why yes, that is correct, Sir! Fortunately for you, this blog post has everything you need.
Generating a Binary Dump Script
This very simple 4GL program is all you need:
/* Write one binary dump command per application table to a ksh script */
OUTPUT TO mydb.bin-dump.ksh.
FOR EACH _file WHERE _file._tbl-type = "T" NO-LOCK:
    PUT UNFORMATTED "_proutil $DB -C dump " + _file._file-name + " /scratch/dump" SKIP.
END.
OUTPUT CLOSE.
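Save it as something like gen-dump.p (the name is arbitrary), run it in batch against the database, and then execute the generated script with $DB set and $DLC/bin on your PATH. Roughly:
# Illustrative only - program and database names are placeholders
mkdir -p /scratch/dump
pro sports -p gen-dump.p -b
chmod +x mydb.bin-dump.ksh
export DB=sports
./mydb.bin-dump.ksh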
Warp Speed Mr. Sulu
I must admit, so far this post has been a big disappointment. If you made it this far, I thank you from the bottom of my heart and plan to reward you with more interesting tidbits: how to make it FAST!!
Let’s start with the database. Stop everything else running on the server and jack up -B as high as the server can handle without swapping. And while I haven’t specifically benchmarked lruskips during dumps, I have a feeling -lruskips 100 or more will help (let me know if you try it). Admittedly, I am glossing over a geeky technical discussion of -B usage during dumps with Type I versus Type II storage areas, and you may get equal performance with a smaller -B, but you’ll do little harm with a big -B as long as the server doesn’t start swapping. One last point: don’t forget to increase -n if necessary to support multiple concurrent dump sessions.
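On a dedicated dump box the startup might look something like the line below. The numbers are purely illustrative, not recommendations: size -B to your actual RAM and -n to the number of parallel dump sessions you plan to run.
proserve sports -B 1000000 -lruskips 100 -n 50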
Now for the actual dump processes: I’m usually a lazy pig and I just start all the dumps in parallel. This is not optimal but typically 95% of the dumps finish in the first 5 minutes and then you’re left with those 5 big tables. Optimizing this is usually an exercise in futility because no matter how you tweak it, you’re left waiting for the 1-2 monster tables to finish dumping. If you’re really, truly, deeply offended by such an egregious use of system resources, then just start the 5 or so big table dumps in parallel and run all the rest in series. Oh – and read the section on multi-threaded dumps below.
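Here is a rough sketch of that approach, assuming the generated mydb.bin-dump.ksh from above; the big-table names are made up:
# Kick off the monster tables in parallel (placeholder names)
for t in ordhist invhist ardetail; do
    _proutil $DB -C dump $t /scratch/dump > dump.$t.log 2>&1 &
done
# Run everything else one at a time, then wait for the monsters to finish
grep -v -E "ordhist|invhist|ardetail" mydb.bin-dump.ksh | ksh
wait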
Time Keeps on Ticking, Ticking, Ticking…into the Future…
I know exactly where you are right now as I have been here a hundred times. 99% of the tables dump in minutes. Three or four of the tables dump in about an hour or two. ONE <CENSORED> table has been dumping for the past 12 hours with no end in sight. This situation was the inspiration for my presentation “27 Days to 27 Hours” – another consultant told a customer that he needed 27 days to dump that one table. Hmm…I wonder why the customer asked for a second opinion?
A number of options are available to you (command-line sketches follow this list):
Pick a different index: A lot of these monster tables are created and physically laid out in time order, but the primary index is something like customer number. During the dump you end up jumping all over the place, reading and re-reading the same database blocks, because the dump order doesn’t match the physical order of the data. Dumping by an index that roughly follows the creation order keeps the reads much closer to sequential.
Rebuild and/or compact the index: In databases that have not been maintained in years, the index you are using to dump might be in horrible shape. Sometimes planning an index rebuild or compact the week before the dump can save you hours.
Use no index: If the data is in a Type II storage area and you’re using 10.2A or later, you can tell the dump utility to forego usage of any index and simply dump the data in whatever order it already is in the data blocks. Careful though: there were bugs in early implementations (see http://knowledgebase.progress.com/articles/Article/P180607 for example).
Create an index: If the dump order is not as important as the dump-and-load down time, create a small index on a tiny field. I have lived through situations where the time to create the index, build it, and then dump the data was still much shorter than dumping with one of the existing indexes.
Use multi-threaded dump: Use “-thread 1 -threadnum <#>” to specify the number of parallel threads to use to dump a table. Note that the dump utility might ignore you. See http://knowledgebase.progress.com/articles/Article/P141020. CAREFUL HERE: if after an hour you have the last 4 big tables dumping and each one is using 4 threads, that’s 16 threads running. If your box has 8-16 CPUs then you’re probably in good shape. If it has 2 CPUs then you may want to reconsider my earlier advice to just run everything in parallel.
Use dumpspecified: If this is a historical table and you know that old data will never change, dump it ahead of time using “_proutil <db> -C dumpspecified”. If you’re paranoid that data might change, you can write a program that does a BUFFER-COMPARE of the records in the old database versus the new database (see the sketch after this list). Run the program after you finish the dump and load, and if by some chance a record or two changed, simply update the new DB. Your only risk is that someone will access the “old” records during your first day or so running the new DB in production.
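For reference, the command lines for these options look roughly like the sketches below. The table name, index number, thread count, field and date value are all placeholders, and exact behavior varies by release, so check the KB articles above and test on a copy first.
_proutil $DB -C dump ordhist /scratch/dump -index 12              # dump using a specific index
_proutil $DB -C dump ordhist /scratch/dump -index 0               # no index: scan a Type II area
_proutil $DB -C dump ordhist /scratch/dump -thread 1 -threadnum 4 # multi-threaded dump
_proutil $DB -C dumpspecified ordhist.order-date LT 01-01-2014 /scratch/dump   # old rows only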
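And here is a bare-bones sketch of that BUFFER-COMPARE safety net, assuming both databases are connected with distinct logical names (olddb and newdb) and using made-up table and key names. It is an illustration, not production code:
/* Re-check pre-dumped history records after go-live.              */
/* olddb/newdb, ordhist and order-num are placeholders.            */
DEFINE VARIABLE lSame AS LOGICAL NO-UNDO.

FOR EACH olddb.ordhist NO-LOCK:
    FIND newdb.ordhist WHERE newdb.ordhist.order-num = olddb.ordhist.order-num
        EXCLUSIVE-LOCK NO-ERROR.
    IF NOT AVAILABLE newdb.ordhist THEN DO:
        MESSAGE "Missing from new DB:" olddb.ordhist.order-num.
        NEXT.
    END.
    BUFFER-COMPARE olddb.ordhist TO newdb.ordhist SAVE RESULT IN lSame.
    IF NOT lSame THEN
        BUFFER-COPY olddb.ordhist TO newdb.ordhist.  /* bring the changed record across */
END.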
Miscellaneous Best Practices
I shouldn’t need to tell you this, but I know you, so listen up!
1. Make sure you have a good backup if your plan involves modifying the database in any way
2. If you can, run an idxcheck on the database in the days or weeks before the dump. This can be run online (see the command sketches after this list).
3. Make sure no one can accidentally login to the database and modify data while you’re dumping
1. Restart the DB without any “-S” brokers (4GL and/or SQL)
2. Rename $DLC/bin/_progres to $DLC/bin/_progres.TOUCHE_PAS
4. Generate a table analysis before the dump
5. Compare the number of records in the table analysis versus the number of records dumped by binary dump
6. WRITE out a detailed maintenance plan. One of my biggest pet peeves is people who do DB maintenance without a written plan. Like you’re not going to forget to chmod something at 4:00 AM.
7. Test, test and retest your dump process
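The idxcheck and tabanalys runs mentioned above look something like this; prompts and options vary by release, and you’ll want to keep the tabanalys output somewhere handy to compare against your dump logs:
_proutil $DB -C idxcheck
_proutil $DB -C tabanalys > tabanalys.before.txt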
Stay tuned for Part III coming soon to a theatre near you. And as always, your comments and questions are welcome.
Paul
White Star Software