Unix Operating Systems Software

Can You Compress Berkeley DB Databases?

Paul Gear asks: "I want to create a database using Berkeley DB that includes a lot of textual information. Because of the bulk of the data and its obvious compressibility, I was wondering whether it is possible to have the DB libraries compress it automatically at the file level, rather than me compressing each (rather small) data item before putting it in, which would yield much less compression. Section 4.1 of the paper "Challenges in Embedded Database System Administration" talks about automatic compression, but that is the only place in the documentation where it is mentioned. Can anyone point me in the right direction?"
This discussion has been archived. No new comments can be posted.

Can You Compress Berkeley DB Databases?

Comments Filter:
  • by sql*kitten ( 1359 )
    Which OS are you using? NT can compress files transparently to applications. Just right-click the file, select "Properties" then check the appropriate box. You won't need to modify any of your code.
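
For anyone who would rather set the NTFS compression bit from code than from Explorer, here is a minimal sketch. The filename is only a placeholder, and the assumption is that the file sits on an NTFS volume; the FSCTL_SET_COMPRESSION control it uses should be the same thing the Properties checkbox toggles, so Berkeley DB keeps reading and writing the file unchanged while NTFS compresses the clusters underneath.

    /* Sketch: turn on NTFS compression for an existing file from C, so the
     * application (and Berkeley DB) keeps using the file unchanged while NTFS
     * compresses the clusters underneath.  "mydata.db" is only a placeholder. */
    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int main(void)
    {
        USHORT format = COMPRESSION_FORMAT_DEFAULT;  /* let NTFS pick the format */
        DWORD  returned;
        HANDLE h = CreateFile("mydata.db", GENERIC_READ | GENERIC_WRITE,
                              0, NULL, OPEN_EXISTING, 0, NULL);

        if (h == INVALID_HANDLE_VALUE) {
            fprintf(stderr, "CreateFile failed: %lu\n", (unsigned long)GetLastError());
            return 1;
        }
        if (!DeviceIoControl(h, FSCTL_SET_COMPRESSION, &format, sizeof(format),
                             NULL, 0, &returned, NULL)) {
            fprintf(stderr, "FSCTL_SET_COMPRESSION failed: %lu\n",
                    (unsigned long)GetLastError());
            CloseHandle(h);
            return 1;
        }
        CloseHandle(h);
        return 0;
    }
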
  • Look at LZO.
    http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html

    I've used it in an embedded app to decompress/overlay main applications from ROM to RAM and can vouch for its decompression speed.
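
A rough sketch of what calling LZO on a single record looks like follows. The record text and buffer sizes are purely illustrative, and the header path is an assumption: the 1.x releases use <lzo1x.h>, later ones <lzo/lzo1x.h>.

    /* Sketch: compress one record with LZO (lzo1x) before storing it, then
     * decompress it again. */
    #include <stdio.h>
    #include <string.h>
    #include <lzo/lzo1x.h>

    int main(void)
    {
        const char *text = "some fairly repetitive textual record, stored many times over";
        lzo_uint in_len = (lzo_uint)strlen(text) + 1;
        unsigned char out[4096];    /* worst case is in_len + in_len/16 + 64 + 3 */
        unsigned char back[4096];
        lzo_uint out_len = 0, back_len = sizeof(back);
        /* aligned work memory, as the LZO documentation recommends */
        static lzo_align_t wrkmem[(LZO1X_1_MEM_COMPRESS + sizeof(lzo_align_t) - 1)
                                  / sizeof(lzo_align_t)];

        if (lzo_init() != LZO_E_OK)
            return 1;
        if (lzo1x_1_compress((const unsigned char *)text, in_len,
                             out, &out_len, wrkmem) != LZO_E_OK)
            return 1;
        printf("%lu bytes in, %lu bytes out\n",
               (unsigned long)in_len, (unsigned long)out_len);

        /* Decompression is the fast direction, which is what makes LZO practical
         * for unpacking code or data at load time. */
        if (lzo1x_decompress(out, out_len, back, &back_len, NULL) != LZO_E_OK)
            return 1;
        return 0;
    }
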
  • Reading through a compressed random-access file may not be as big of a win as you think, since the db will need to decompress things to determine where your data is. OTOH, if you have a fast processor and slower media (CDROM) and plenty of RAM, the drawbacks will disappear after a few spins through due to caching.
  • The mifluz project [senga.org] uses a compressed Berkeley DB for word indexing. Depending on your application, much of the code may not be useful to you, but the db/ directory contains the code for creating compressed Berkeley DB B-Tree databases.

    You should consider contacting Loic Dachary--his address is on the Senga project pages.

  • If you use zlib's replacements for fread, fseek, etc., things will be VERY slow. I tried this approach with WordNet [princeton.edu] databases and it sucked. Well, WordNet does a lot of seeks, but still.

    Teaching the Berkeley DB functions to use libz would be extremely painful too. I'd say your best bet is to try the new Linux filesystem extension mentioned already by someone else. But I'm not sure how efficient it is at uncompressing and recompressing data that is read and/or mmapped. In any case, you will not be modifying any code -- just the file's attribute.

    I'm afraid you'll defeat most of the DB's tricks that rely on knowing the sector size and other filesystem details.

    You could also just uncompress the file into memory and then hand it to the database functions, but then you lose the automatic synchronization with the file on the filesystem, which for many people is the main reason for using Berkeley DB (or gdbm) in the first place.
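
For reference, the "replacements for fread, fseek" mentioned above are presumably zlib's gz* routines (gzopen, gzread, gzseek). A minimal sketch with a made-up filename shows the shape of that interface and why a seek-heavy reader crawls: every seek forces sequential decompression up to the target offset, and a backward seek restarts from the beginning of the file.

    /* Sketch of zlib's stdio-style wrappers; "wordnet-data.gz" is made up. */
    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        char buf[256];
        int n;
        gzFile f = gzopen("wordnet-data.gz", "rb");

        if (f == NULL)
            return 1;
        gzseek(f, 12345L, SEEK_SET);      /* behaves like fseek(), but must decompress
                                             everything up to the requested offset */
        n = gzread(f, buf, sizeof(buf));  /* behaves like fread() on the uncompressed data */
        gzclose(f);
        return n < 0;
    }
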

  • I know this is probably not an option, but have you considered switching to a different database system, like one based on SQL? Of course, as I say this, I'm doing web applications for my company using Berkeley DB. I just wanted to know how much of a hassle it would be to switch. I'm sure we could all go into the many virtues and pitfalls of all the database systems out there.
  • FWIW, Linux can do this on ext2fs as well, with chattr +c filename. There are analogs in other Unix operating systems and filesystems as well. :-) (man chattr for more info)
  • Although compressing the data might seem like a good idea, in practice you will curse day and night unless you take certain precautions. Using a compressed filesystem will save you plenty of space, but you will pay in the form of CPU usage. Decompressing data on the fly is costly, and in the case of database operations it can be downright nasty. A simple workaround might be to store the main data (indexes) on an uncompressed filesystem and keep only the larger blob fields in a separate table stored compressed; that way you could still do lightning-fast searches and selects, slowing down to grab the memos only when absolutely required.
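
A minimal sketch of that split layout follows. The paths are hypothetical (the "compressed" directory standing in for a chattr +c directory or a compressed mount), and the DB->open() call uses the Berkeley DB 3.x signature; later 4.1+ releases add a transaction argument after the handle.

    /* Sketch: small index records in one database on a normal filesystem, bulky
     * memo/blob values in a second database that lives on a compressed filesystem. */
    #include <db.h>

    int main(void)
    {
        DB *index_db, *memo_db;

        if (db_create(&index_db, NULL, 0) || db_create(&memo_db, NULL, 0))
            return 1;

        /* Fast, uncompressed filesystem: searches and selects stay quick. */
        if (index_db->open(index_db, "/data/plain/index.db", NULL,
                           DB_BTREE, DB_CREATE, 0664) != 0)
            return 1;

        /* Compressed filesystem: only touched when a lookup actually needs the
         * large memo body. */
        if (memo_db->open(memo_db, "/data/compressed/memos.db", NULL,
                          DB_BTREE, DB_CREATE, 0664) != 0)
            return 1;

        index_db->close(index_db, 0);
        memo_db->close(memo_db, 0);
        return 0;
    }
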
  • by Bazman ( 4849 ) on Thursday January 04, 2001 @02:37AM (#532132) Journal
    From the zlibc web site [linux.lu]:

    Zlibc is a read-only compressed file-system emulation. It allows executables to uncompress their data files on the fly. No kernel patch, no re-compilation of the executables and the libraries is needed. Using gzip -9, a compression ratio of 1:3 can easily be achieved! (See examples below). This program has (almost) the same effect as a (read-only) compressed file system.

    See the web page for more.

    Baz

  • by Deven ( 13090 ) <deven@ties.org> on Thursday January 04, 2001 @08:43AM (#532133) Homepage
    Reading through a compressed random-access file may not be as big of a win as you think, since the db will need to decompress things to determine where your data is. OTOH, if you have a fast processor and slower media (CDROM) and plenty of RAM, the drawbacks will disappear after a few spins through due to caching.

    The caching won't save you from uncompressing the blocks repeatedly. If you really want to compress the database metadata, you basically need a block-oriented compressed filesystem that allows random access within compressed files. I don't know if such a thing already exists, but it's effectively what you'd be writing to do it...

    I'd just use zlib to compress the individual entries and not try to compress the entire database as a whole. I've done this before, and it actually works better than you'd think. Even with data entries as small as 50-100 bytes, you get reasonable compression. Yes, you'd get much better compression across the entire database, but you can't hope to access a fully-compressed database without uncompressing it or doing a lot of work to make random-access possible. (And like I said, at that point you might as well be making a compressed filesystem.)
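
A minimal sketch of that per-entry approach, wrapping zlib's compress2()/uncompress() around DB->put()/DB->get(). The filename is a placeholder, the key and value are illustrative, and the open() call again assumes the Berkeley DB 3.x signature (4.1+ adds a transaction argument).

    /* Sketch: deflate each value before DB->put() and inflate it after DB->get(). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>
    #include <db.h>

    int main(void)
    {
        const char *key_str = "entry-42";
        const char *value   = "a rather small, but still compressible, text record";
        uLong  src_len  = (uLong)strlen(value) + 1;
        uLongf comp_len = src_len + src_len / 1000 + 64;   /* generous worst case */
        Bytef *comp = malloc(comp_len);
        DB *dbp;
        DBT key, data;

        if (comp == NULL)
            return 1;

        /* Compress the single value at the highest level. */
        if (compress2(comp, &comp_len, (const Bytef *)value, src_len,
                      Z_BEST_COMPRESSION) != Z_OK)
            return 1;

        if (db_create(&dbp, NULL, 0) != 0)
            return 1;
        if (dbp->open(dbp, "compressed.db", NULL, DB_BTREE, DB_CREATE, 0664) != 0)
            return 1;

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data  = (void *)key_str;
        key.size  = (u_int32_t)strlen(key_str) + 1;
        data.data = comp;
        data.size = (u_int32_t)comp_len;
        if (dbp->put(dbp, NULL, &key, &data, 0) != 0)
            return 1;

        /* Read it back and inflate.  A real application needs some way to recover
         * the original length; a fixed-size buffer keeps the sketch short. */
        memset(&data, 0, sizeof(data));
        if (dbp->get(dbp, NULL, &key, &data, 0) == 0) {
            Bytef  out[1024];
            uLongf out_len = sizeof(out);
            if (uncompress(out, &out_len, data.data, data.size) == Z_OK)
                printf("stored %lu bytes, recovered %lu\n",
                       (unsigned long)comp_len, (unsigned long)out_len);
        }

        dbp->close(dbp, 0);
        free(comp);
        return 0;
    }
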
