Best Format for Archive Distribution?

Meostro asks: "I'm looking for the best format to use to distribute arbitrary datasets. Tarballs compressed with gzip seem to be the most common thing out there, with zip coming in a close second. What advanced compression packages are the most widely recognized or available on the widest array of systems? Cross-platform compatibility is my most important goal, followed by compression ratio, decompression time, compression time and extra features (solid archives, support for multiple files, etc.). I'm starting up a free data site to provide test data for anything you can imagine: images for compression and format interpretation, text and audio for language processing, programming language examples to test parsing, and more. I hope this will grow to be a significant (read: multi-gigabyte) archive, so I want to start off right with my distribution format. Right now the plan is data.tar.bz2, but I'm open to anything that will give me better compression as long as it's available for Linux, Windows and Mac."
  • One other choice (Score:4, Insightful)

    by gowen ( 141411 ) <gwowen@gmail.com> on Wednesday March 09, 2005 @11:29AM (#11888724) Homepage Journal
    tar.gz and tar.bz2 are ok for small archives (20MB or so), but if you're dealing with large archives there's only one solution.

    RAR -- cross platform, built in integrity checking, and when used with Parity files, makes splitting and reassembling archives an absolute doddle.
    • Re:One other choice (Score:5, Informative)

      by MindStalker ( 22827 ) <mindstalker@[ ]il.com ['gma' in gap]> on Wednesday March 09, 2005 @11:50AM (#11888989) Journal
      Similar to RAR, I've found that ACE (www.winace.com) in maximum compression compresses most things better than RAR and is similar in functionality (it supports RAR as well).
      Its Linux version is called unace, and there is a macunace as well. Sadly these programs are a bit harder to find on the website, but they are there.

      Luckily, Gentoo knows it, so you can simply emerge unace.
      • by harrkev ( 623093 ) <kevin@harrelson.gmail@com> on Wednesday March 09, 2005 @12:04PM (#11889170) Homepage
        One problem with this is that it is not a common format. For limited use (one-time distribution, short-term backup), this is OK. But what about long-term archives?

        If you want to de-compress this stuff in 10 or 20 years, will you be able to find software then that can handle it? Especially if the new cell processors somehow become popular, will Windows BOHICA 2025 edition be able to run 20-year-old binaries in order to read this thing?

        If the source is available, the job is easier in Linux, but if the format is not actively maintained, it may take a lot of work to modify the program to run on whatever Linux looks like in 20 years.
    • Re:One other choice (Score:3, Informative)

      by Meostro ( 788797 )
      I've used it on Windows forever, and I know I obtained unrar for Linux and AIX, but how cross-platform is RAR, really? Does it come standard in most distributions? If it does, then it's probably an excellent choice; I've compressed some stuff almost 2:1 over bzip2 using RAR...
    • While there are Free rar unpackers [debian.org], the primary packer/unpacker has a proprietary license. The submitter is catering to open-source developers, so it is a poor choice.

      md5 probably provides enough integrity checking for test data & split/cat make splitting/reassembling ANYTHING easy.

      RAR doesn't have a monopoly on integrating these, though. 7zip certainly has many of the service features available in rar.
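
      A minimal sketch of that workflow (file and directory names here are just placeholders, and md5sum/split/cat are assumed to be the usual GNU tools):

        # build the archive and record a checksum for later verification
        tar cjf dataset.tar.bz2 dataset/
        md5sum dataset.tar.bz2 > dataset.tar.bz2.md5

        # split into 10MB pieces for distribution
        split -b 10m dataset.tar.bz2 dataset.tar.bz2.part-

        # the recipient reassembles and verifies
        cat dataset.tar.bz2.part-* > dataset.tar.bz2
        md5sum -c dataset.tar.bz2.md5
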
      • md5 probably provides enough integrity checking for test data & split/cat make splitting/reassembling ANYTHING easy.
        But RAR/PAR encoding uses a Reed-Solomon error correction scheme, so you can send a few additional files and the data blocks within them can be used to replace *any* lost data blocks from the original set. It's really crafty.

        You're right about the license, though.
        • But RAR/PAR encoding uses a Reed-Solomon error correction scheme, so you can send a few additional files and the data blocks within them can be used to replace *any* lost data blocks from the original set. It's really crafty.

          You're right. It is clever & is better than md5 for repair purposes. Which is why I said md5 was probably enough for test data (I think bz2 archives are the way to go, so I'm a bit biased).

          But, once again, this is not a unique feature of rar. I know I've seen proprietary ZIP pro

          • It bears mentioning that RAR was designed from the get-go to support the features mentioned, and has for what, seven years now? What you mention is an effort to backport the technology that hasn't even been done yet.

            I'm not trying to trash 7-zip, but I'm also not going to try 7-zip when RAR has the track record it does for doing what it was designed to do.
            • It bears mentioning that RAR was designed from the get-go to support the features mentioned, and has for what, seven years now?

              RAR was not designed from the get-go to support Reed-Solomon repair. This was a feature added for 2.0 (released 1996). So, it has supported it for 8 years, but it was also back-ported.

              You also don't need to use RAR to use the same file recovery mechanisms. Use PAR [sourceforge.net] on ANY type of file to benefit from it!
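
              As an illustrative sketch (assuming the par2cmdline tool from the linked PAR project; file names are placeholders):

                # generate ~10% recovery data for an archive - works on ANY file type
                par2 create -r10 dataset.tar.bz2
                # later, check the download and rebuild damaged blocks if needed
                par2 verify dataset.tar.bz2.par2
                par2 repair dataset.tar.bz2.par2
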

      • 7-Zip is a one-man project that needs help to add features, or at least to manage its SourceForge support and RFE forums.

        Drag-n-drop has been requested for almost 2 years, and now some of its users are defecting to TUGZIP because of it.
        http://sourceforge.net/tracker/index.php?func=detail&aid=663095&group_id=14481&atid=364481 [sourceforge.net]

        Either the guy is too busy, doesn't care or just doesn't want to share control.

        Maybe it's time to fork 7-zip?
    • RAR -- cross platform, built in integrity checking, and when used with Parity files, makes splitting and reassembling archives an absolute doddle.

      Now that's what I call comedy!

    • I don't know about anybody else, but when I come across one of those nasty RAR files I have to run WinRAR under wine to open it since the Debian unrar packages are fucked. That is not anywhere close to convenient.
    • Zip does all these things as well. I couldn't say whether RAR or Zip does them better.
    • Don't forget http://p7zip.sf.net/ [sf.net] when talking about large archives; the 7-Zip formats regularly beat RAR. I had a 280MB file compress down to 54MB with rar a -m5, and down to 17MB with 7za -ultra. 7-Zip has the added benefit of being less encumbered than RAR or ACE, and more open in its use of algorithms.
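
      For reference, a rough p7zip sketch (the archive and directory names are placeholders):

        # maximum compression; .7z archives are solid by default
        7za a -t7z -mx=9 dataset.7z dataset/
        # extract, preserving the directory structure
        7za x dataset.7z
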
  • Zip (Score:3, Insightful)

    by isaac ( 2852 ) on Wednesday March 09, 2005 @11:29AM (#11888731)
    Zip.

    Zip is miles more common than anything else and compresses better (generally) than gzip. It's supported out of the box on almost every OS either natively or with bundled software. Even Solaris comes with unzip.

    Forget .tar.bz2 unless your audience is the type of people you'd expect to have cygwin or 3rd-party compression tools installed on their windows peecees.

    -Isaac

    • My experience has been that ZIP doesn't compress as well as gzip, let alone bzip2. But yes, almost everyone can handle ZIPs.

      BTW, WinZIP can handle .tar.gz, I'm not sure whether it can handle .tar.bz2 as well.
    • Re:Zip (Score:3, Informative)

      by EvilIdler ( 21087 )
      Zip and gzip use the same compression.

      Zip compresses each file in an archive individually.

      Tar+gzip compresses the entire contents as a whole - meaning better compression than zip archives (unless you add uncompressed files to an archive, THEN compress the entire archive...)

      WinZip supports tar+gzip archives, from what I remember, but WinRAR supports .gz, .tar.gz, .bz2 and .tar.bz2 files, so why use anything else on Windows?

      Then again, you could use solid RAR archives. Generally the best size+performance ratio I
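
      A quick way to see the per-file versus whole-stream difference on your own data (paths are placeholders; assumes the usual zip, tar and gzip command-line tools):

        # zip compresses each file separately...
        zip -r -9 dataset.zip dataset/
        # ...while tar+gzip compresses the concatenated stream as one unit,
        # so redundancy across similar files gets exploited
        tar cf - dataset/ | gzip -9 > dataset.tar.gz
        ls -l dataset.zip dataset.tar.gz
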
      • Re:Zip (Score:3, Informative)

        by JimDabell ( 42870 )

        Zip and gzip use the same compression.

        According to the ZIP file format specification [pkware.com], ZIP can use a dynamic LZW algorithm.

        The whole reason gzip exists is because the standard UNIX compress uses LZW [gzip.org] - which, until recently, was protected by a patent (that was the problem with GIFs).

        Instead of using LZW, gzip uses the unprotected LZ algorithm, which doesn't contain the improvements that Welch (the 'W' in LZW) made.

        So not only do they not use the same algorithm, but that's the whole point of gzi

        • by isaac ( 2852 )
          So not only do they not use the same algorithm, but that's the whole point of gzip in the first place!

          Thank you. I thought this was common knowledge.

          -Isaac

        • It looks like you're munging compress and zip together here. gzip was created in response to the patent status of the algorithm in compress, and the GP said that gzip uses the same algorithm as zip.

          So not only do they not use the same algorithm, but that's the whole point of gzip in the first place!

          Well, let's look at a quote from the gzip page you linked:

          The first version of the compression algorithm used by gzip appeared in zip 0.9, publicly released on July 11th 1991.

          There you have it. Zip uses d

          • gzip was created in response to the patent status of the algorithm in compress

            I know. That's what I said. compress uses LZW.

            the GP said that gzip uses the same algorithm as zip.

            ZIP can use multiple algorithms, one of them being LZW - the very algorithm that gzip was created to avoid. ZIP and compress both use this algorithm, gzip does not.

            Zip uses deflate, just like gzip does. Sure, newer versions of zip can use LZW

            No, deflate is just one of the algorithms that ZIP can use. LZW is

            • I asked a question: "but how many programs actually generate zip files that use [LZW]?" Please answer it.

              Actually, I've done some research, and a few [frugalcorner.com] sources [info-zip.org] tell me that LZW is called "shrink" in zip vernacular and was only commonly used in the days of PKZip 1.1. It moved to Deflate as the default after that, and indeed, Info-Zip's unzip utility doesn't even enable unshrink by default. If LZW in zip files were common, that wouldn't be a very pragmatic thing to do, would it?

              Every zip utility out th

  • CPIO (Score:4, Interesting)

    by DarkDust ( 239124 ) * <marc@darkdust.net> on Wednesday March 09, 2005 @11:31AM (#11888752) Homepage
    I prefer .cpio.bz2 because, unlike tar, cpio can handle special devices just fine (or am I missing some switch for tar which makes it able to handle devices and links?). Since it's also in the POSIX standard, this should be pretty portable as well.
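
    Roughly, that looks like this (the directory name is a placeholder; add -H newc if your cpio supports the portable SVR4 format):

      # create a .cpio.bz2, preserving devices and links
      find dataset -depth -print | cpio -o | bzip2 -9 > dataset.cpio.bz2
      # extract
      bzip2 -dc dataset.cpio.bz2 | cpio -idm
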
    • Re:CPIO (Score:3, Informative)

      by Meostro ( 788797 )
      Probably a good idea in general, but:
      1. No obvious support on Windows for cpio
      2. These are going to be test files, so I shouldn't have to worry about special devices / links / sparse files.

      I know our admin here at work switched the backups from tar to cpio for exactly that reason, but it's just not universal enough to justify the departure from "normal".
    • Use a decent tar implementation. GNU tar handles block special devices just fine. It archives the block special devices, not the data you get if you open the contents of the device and read from it.

      Kirby

    • by hey! ( 33014 )
      Yes, not to mention its charming syntax.


      Once you get it to do something (other than the "find . -depth | cpio -pdl /destdir" kind of thing that is part of your fingers' auxiliary programming), why not round it off with breakfast at Milliways?

    • Get a better tar. :) For instance, the bsdtar from FreeBSD can handle cpio, pax, and several different tar formats (for creating and extracting).

      Or, use pax. It's got a much nicer syntax than cpio, and can also handle cpio, pax, and tar formats.

      We used to use a horrible combination of cpio, bzip2, and split to image our servers. It was a royal pain to use, especially if you only wanted 1 or 2 files out of the backup. We switched to pax on Linux and bsdtar on *BSD, and everything is just hunky-dory.

      Not com
  • by node 3 ( 115640 ) on Wednesday March 09, 2005 @11:37AM (#11888831)
    Zip is probably the most commonly installed archiver across all systems.

    tar/tar.gz/tar.bz2 is supported out-of-the-box on Linux and Mac OS X, but can throw Windows users for a loop (easily remedied, but they aren't likely to have an untar tool installed, and will find the file extension at least a bit odd). For some data tar.bz2 will result in noticeably smaller files, but at a greater cost of compression/decompression time.

    After that, you're not really going to find an archival format that's really common.

    In the end, it depends on what type of data you are archiving, and your target audience, but unless you have a specific reason otherwise, zip with an md5 checksum file is probably the solution of least effort (just make sure you back-up the archive--don't want to have a problem with the only copy you have!).
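
    That "least effort" option is only a couple of commands (names are placeholders; assumes Info-ZIP's zip and GNU md5sum):

      zip -r dataset.zip dataset/
      md5sum dataset.zip > dataset.zip.md5
      # recipients can then verify the download with: md5sum -c dataset.zip.md5
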
  • Multi-format (Score:5, Insightful)

    by sporktoast ( 246027 ) on Wednesday March 09, 2005 @11:37AM (#11888833) Homepage

    Have you considered going multi-format?

    Either increase the size of your storage to handle 2 or 3 of the more popular and widely available formats (zip, rar, tar.(gz|bz2)), or use compression-on-the-fly libraries (behind a cache to reduce server load). This would allow the recipient to decide, and end up supporting perhaps a larger population.

    • ...handle 2 or 3 of the more popular and widely available formats...

      That's where I'm leaning: something like .tar.gz for universality and RAR or similar for those that can handle it. I might offer "hot" sets in other formats, so the most popular stuff would be the most accessible, but random, esoteric stuff would only come in one or two flavors.
    • ... increase the size of your storage to handle 2 or 3 of the more popular and widely available formats (zip, rar, tar.(gz|bz2)

      That is amusing. I've thought about having a .sig that says "bzip2, saving disk space on servers by using twice the disk space". Yes, bzip2 is a better compressor, but is it really significant today? I rarely if ever come across bandwidth problems. Disk space on my end is cheap. Bandwidth on my end is cheap. In fact, I don't keep commonly available archives on my computer an
      • That is amusing. I've thought about having a .sig that says "bzip2, saving disk space on servers by using twice the disk space". Yes, bzip2 is a better compressor, but is it really significant today? I rarely if ever come across bandwidth problems. Disk space on my end is cheap. Bandwidth on my end is cheap. In fact, I don't keep commonly available archives on my computer any more in any compression format. Odds are if I need the copy again, I can download and have it on my harddisk in less than 5 minutes

        • If I have a 10MB gz file that gets downloaded a thousand times, that's 10GB transfer. I could also store an 8MB bz2 version of the same thing, and if half of the people get that one instead of the 10MB version, that's only 9GB transfer. Since storage is cheap versus transfer, and since my storage:transfer ratio is going to be low, it makes much more sense to "waste" some extra disk space to save myself as much bandwidth as possible. Also from a server-load point of view, I can serve 111 more files in the sa
  • by gus goose ( 306978 ) on Wednesday March 09, 2005 @11:38AM (#11888839) Journal
    I have found that some formats are far better at some data types than others.

    e.g. for:
    text, ASCII, documents: use any of bzip2, gzip, zip.
    audio: use nothing if it's already MP3/AAC/etc.; FLAC for other "raw" formats.
    video: use the most appropriate encoding (MPEG-4/DivX, etc.) and then don't try to compress.

    bzip encodes/decodes slower, but has typically better compression ratios.

    So, use whatever people commonly use for the data type you are compressing.

    gus
    • I think the point is this:

      (lotsa files) -> compress -> (one archive) -> de-compress -> (lotsa files)

      Audio and video codecs do not create an archive, and I think that the point is to have a general process, without having a bunch of exceptions based on file type.

      BTW: all audio and video codecs (except FLAC) are lossy. Data out != data in.

      • Lossy = very bad; I'm looking for a few general-purpose archivers. I'm sure there will be some MPEG stuff up there, if for nothing other than a file format sample, but most stuff isn't going to be lossy-able and still make sense.

        I'd be perfectly happy to have separate archivers for different formats, but my main concern is universality, even above compression ratio. FLAC or Monkey's for audio sounds great, but I need to be sure that everyone will be able to handle it. That's why I might be stuck with .z
  • My summary.. (Score:4, Informative)

    by Chris_Jefferson ( 581445 ) on Wednesday March 09, 2005 @11:41AM (#11888871) Homepage
    I find that for my data (your data may be different) I tend to get the ordering "zip, bzip2, rar, 7zip" (from largest to smallest output), with rar and 7zip often being much smaller than bzip2 (my data tends to contain lots of similar large files, which tends to lead to unusually large differences between compressors).

    Everyone has an unzipping program. I find on Windows more people have (and get) rar than bzip2, particularly if they are afraid of command lines. 7zip gives the best compression of the lot, so it is particularly useful for big datasets (I often send around 9-20GB data sets), but it eats memory like nobody's business.

    It really comes down to how much you want to make people download compared to how much trouble you want them to go to.

    If you want to be "minimal effort", I'd advise providing a .zip along with other things, perhaps listing the size next to the files so people can see it's much bigger (like most sourceforge projects) for those windows users who can't be bothered to get anything else.
  • I'm starting up a free data site to provide test data for anything you can imagine: images for compression and format interpretation, text and audio for language processing, programming language examples to test parsing, and more.

    Your audience will be developers. Hopefully F/OSS developers. So distribute in a developer-friendly format. BZ2 files can be decompressed with Free software & most of the proprietary applications out there that would be decompressing alternative formats.

    While 7-zip is a nic

    • RAR has an open decompression library that allows for derivative works that decompress RAR formats. you can link it, modify it, use it, redistribute it modified, whatever, as long as you don't try to reverse-engineer the compression scheme. go download UnRAR [rarsoft.com] and read the damned license.
      • I acknowledged this in another post. I also use an LGPLed unrarer. But the RAR compression algorithm is, as the license makes very clear, "proprietary." I've seen very little F/OSS distributed as RAR archives, and I don't think it is coincidence.
        • The WinRar license states:
          Neither RAR binary code, WinRAR binary code, UnRAR source or UnRAR binary code may be used or reverse engineered to re-create the RAR compression algorithm, which is proprietary, without written permission of the author.

          and the source code:
          The unRAR sources may be used in any software to handle RAR archives without limitations free of charge, but cannot be used to re-create the RAR compression algorithm, which is proprietary. Distribution of modified unRAR sources in separate
          • the RAR compression algorithm, which is proprietary

            And there lies the problem that I mentioned. The compression algorithm isn't Free.

            So the guy is basically paranoid about keeping his "trade secret" a secret, which makes perfect sense from a business perspective.

            But why bother catering to his business when you could use something like 7-zip?

            As far as FOSS is concerned, he even presents decompression code free for all to use.

            It also makes sense from a business perspective: formats which are distributed gai

    • 7zip? Unmature? It is perfectly usable. I finally deleted my licensed copy of Winzip because 7zip does its work better. 'nuff said.
      • 7zip? Unmature? It is perfectly usable. I finally deleted my licensed copy of Winzip because 7zip does its work better. 'nuff said.

        7-zip is also on the win32 boxes I administer & winzip isn't. But this kind of use is not enough to prove the kind of maturity that I'm talking about. One piece of evidence is that most users aren't using it for 7zip archives. Another is that the *nix version of 7-zip is about 8 months old and listed as beta [sourceforge.net]. Only in October did KDE [kde.org] and GNOME [gnome.org] add support IN THEIR CVS. S

  • by Blakey Rat ( 99501 ) on Wednesday March 09, 2005 @11:47AM (#11888946)
    Sure, maybe your 17 MB file is 13.5 with .zip and 13.24 with .7z or whatever, but what it all comes down to is that every current operating system supports .zip files out-of-the-box. Do you think that extra 3 seconds of download is worth making your customer hunt down and install an entirely different program just to see your file? It's not. Why make things harder? .zip is the standard, use it.
    • The goal is not always to get data to a customer. What if I want to store some files for myself, as I often do, or am transferring data to a computer that I know supports a given compression format? While I will agree that for mass data distribution, more common formats like .zip are the way to go, one should not make a habit of using zip in all cases.
    • I recently tried unpacking a bzip2 package under Windows. It took me ages to find something that would recognize it and extract it. Which is a shame, because it is a nice format... at least if you aren't doing this a lot, since it takes more time.

      However, WinZip out of the box will open tarballs and of course zip. And gzip / unzip are pretty much universal on *nix. I have, however, found that very large tarballs (100+ MB) can be a problem with WinZip, but that was a long time ago.

      And I would never, for the origin

  • My experience is that Zip doesn't handle archiving multiple files with the same name. Zip fails if you have a directory structure like:
    foo.txt
    /images/foo.txt
    I've also seen zip fail completely trying to compress a directory structure containing a very large number of small files (> 10,000).

    I always use RAR unless I know the recipient can't handle a RAR file.
  • by BinLadenMyHero ( 688544 ) <binladen@9[ ]ls.org ['hel' in gap]> on Wednesday March 09, 2005 @11:55AM (#11889041) Journal
    ...avoid closed formats.
    Using Free software will help you achieve your number one goal: that everyone can access the data, now and forever.
    • Don't worry about cross-platform stuff. Choose something open and it can be ported, even to new, currently non-existent platforms.

      Open algorithms, open source, no BS, makes your choice easy.
      • If the point of this is to make data available, then cross-platform availability needs to be my primary concern.

        I can spec and write and offer to the public the SuperXtreme Archive format, and make my data available only in that format. Unless there is a compelling reason to switch to SXA for other purposes (general adoption), it's essentially proprietary to my site and won't really be of any use to anyone.

        OSS is not the be-all and end-all of utility or availability, only of portability.
  • Make two versions of your file available. Use "zip" for its universality, so anyone on any platform can get your file, if they want.

    Then, make a more efficiently compressed one for those who know how to download and use it. Bzip2 seems to be the current favorite, especially for text.
  • Don't forget about security issues. If you intend to mail these files as attachments, ZIP and RAR may be blocked by mail servers because both can be executable under Windows. tar.bz2 may be more difficult for a Windows user to figure out, but at least it's not going to infect their computer without a lot of work on their part.

    My other comment is to do some experiments with *your* data -- which format actually yields the best compression rate, and how much more time do you spend doing the compression / un
    • My other comment is to do some experiments with *your* data -- which format actually yields the best compression rate, and how much more time do you spend doing the compression / uncompression. Is the extra time you spend worth the 5% you get?

      My data doesn't exist yet, that's part of the problem. I need one or a few good general-purpose archive formats, the particulars of which may perform better on one dataset than another. Decompression time matters because it will affect end-users, but compression ti
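
      When the data does exist, a rough way to run that experiment (assumes a bash-like shell and a representative sample file; names are placeholders):

        for c in "gzip -9" "bzip2 -9"; do
            echo "== $c =="
            time $c -c sample.dat > sample.out   # compression time
            ls -l sample.out                     # compressed size
        done
        rm -f sample.out
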
  • I've got a rather technical format comparison chart started up [1]. It's still a draft, but it's pretty complete.

    It doesn't directly address relative compression ratios nor benchmarks. And it's mostly about the formats themselves, not the libraries that implement them. But it's still good for a bird's-eye view, I think.

    [1] http://darbinding.sourceforge.net/about_dar.php [sourceforge.net] (The chart is at the bottom of the page.)

  • is the best compression mechanism I've seen. Getting your data back is a bitch, though.
    • You can always get your data back just fine from /dev/random. You just have to figure out where it starts there, which can sometimes be difficult.

        • I found that the restoring process can be enhanced greatly by disabling any form of CRC checking when reading from /dev/random...
  • by mqx ( 792882 ) on Wednesday March 09, 2005 @12:26PM (#11889371)

    These formats (rar, arj, 7zip, whatever) are not as widely supported as (a) tar/gz, tar/bz2, (b) gzip/bzip2, (c) zip. You can get (a)-(c) on just about every platform. Numerous times I've downloaded "unzip" to make (c) work. It's so simple.

    Another point not mentioned elsewhere: virus scanners: at least with the popular tar/zip formats, you know that virus scanners understand them and can look for problems.

    Sure, you may get a few extra features or a little more room out of non-standard archivers, but that's largely not an issue.

    • I found tar to be a dated format that has no checksumming of individual files. I ran into a situation where a large tarball was made, and tar tf foo.tar was run to verify it. A later attempt at extraction failed due to corruption.

      There are horrors that arise with tar. First, there are multiple tar record formats. The original tar only supported 14-character file names (original unix file system limitation). Along came a second tar format, but even that ended up with variants. Most people are using the GNU tar fo
  • Even though it's not free, I'm quite fond of Stuffit's sitx format. The expander is available as a free (as in beer, not as in speech) download from http://www.stuffit.com/ [stuffit.com], as well as being included on the Mac platform.
    • by Anonymous Coward
      Making sure your data doesn't get corrupted should be more of an issue than how compressed you can get it. Sadly, Stuffit is the only thing I can find that has error correction. I'm surprised, because error correction is as old as the hills. The Bose-Chaudhuri algorithm comes to mind.

      I just went on a search for some ten year old data and the first place I found it, it was corrupted. Thank goodness for redundancy. I finally found an uncorrupted version but it took me a couple of days which could have been p
      • RAR at least has error correction built in; StuffIt isn't the only thing out there.

        In this case, corruption is not an issue. I intend to keep redundant backups of the original datasets, so even if the web-based archives get corrupted I will be able to recover the data.
    • The new version of Stuffit looks like it will rock. That 25-30% JPEG compression, plus a generally competitive algorithm shows great potential.
      • Re:stuffit (Score:3, Informative)

        by Hes Nikke ( 237581 )
        While we are on the subject of JPEG compression, I recently launched http://jpgcrunch.com/ [jpgcrunch.com], which reduces JPEG file sizes losslessly. That can help things too :D

        *Battens down the hatches for an incoming barrage of slashdot traffic*
  • For Longevity (Score:5, Insightful)

    by 4of12 ( 97621 ) on Wednesday March 09, 2005 @12:42PM (#11889544) Homepage Journal

    Pick any system for which the source code is available, e.g. .tar.bz2

    Anything else is gambling.

    I still gamble, but only that a C compiler will exist in the future.

  • LZIP (Score:1, Funny)

    by Victor_Os ( 677960 )
    lzip, of course http://sourceforge.net/projects/lzip/ [sourceforge.net]
  • My Preference (Score:3, Informative)

    by vbrtrmn ( 62760 ) on Wednesday March 09, 2005 @02:00PM (#11890715) Homepage
    For my own personal archives, I have taken the methods from the masters in USENET.

    OS X & UNIX: I'm lazy, so it's just tar.gz.

    For Win32, I back up a lot more files than under *nix.

    Compression
    WinRAR [rarlab.com]
    Compression Method: Best
    Split to Volumes: 20MB
    Parity
    QuickPar [quickpar.org.uk]
    With general settings.

    I back-up to decent quality DVD media, as I have had a lot of problems with CD media rotting [mv.com] after about a year.
  • The various types of data you mention are not uniformly compressible by any single algorithm. Therefore your "compression ratio" criterion is dubious. Compression ratio on what sort of data?

    Dictionary-based compression schemes work well on data which might be described as "linguistic," i.e., data which has some kind of grammar describing it. English text, machine code (binaries), source code, HTML, etc. It won't work very well at all on audio or image data, at least not without some kind of preprocessing

    • The various types of data you mention are not uniformly compressible by any single algorithm. Therefore your "compression ratio" criterion is dubious. Compression ratio on what sort of data?

      Compression on any sort of data. Give me a good general-purpose compressor. Give me a good one just for audio, just for video, just for text, whatever. I have no idea what kinds of data I'm going to get, or how much of each kind there might be. A custom compressor for audio (FLAC) will almost always outperform a ge
      • Compression on any sort of data. Give me a good general-purpose compressor.

        There isn't any such thing, unless you use a very narrow definition of "general purpose." To me, general purpose would imply that I could throw any sort of data at it I please, and it would do well, provided the data is not just random data. Since we're working with very vague definitions I can only give very vague suggestions.

        To answer your specific questions, a good dictionary compressor is the Flate algorithm used by gzip. Ve

        • There isn't any such thing, unless you use a very narrow definition of "general purpose." To me, general purpose would imply that I could throw any sort of data at it I please, and it would do well, provided the data is not just random data. Since we're working with very vague definitions I can only give very vague suggestions.

          My "general purpose" is exactly what you said, without the "would do well" part, or substituting "would not do horribly" in its place. I know that all algorithms have strengths and
          • Anything where minute details don't matter will probably be lossy, but probably 80% or more will be text, code and other data that needs to be lossless.

            Whether text compression needs to be lossless is actually debatable. I'm gonna veer a little off topic here, but hey, it's Slashdot...

            Suppose you are compressing English text by Huffman encoding entire words at a time. However, people make typos, so the actual set of words to be encoded will be larger than a set where there were no typos. By first runni

  • by Trelane ( 16124 ) on Wednesday March 09, 2005 @03:14PM (#11891748) Journal
    I've used both gzip and bzip2. I rather like bzip2 for plain text data files, but there is a rather large cost--compressing and uncompressing can take a much longer time than with gzip! These are important considerations to make, especially if you're gonna need to pull this data back off the shelf anytime soon. For this reason (time), I currently use gzip for intermediate-range storage.
  • gzip is really horrible at error recovery. Trying to recover data from a damaged tar.gz file is hard, because gzip does not keep byte boundaries. bzip2 is much better in this respect, and it is much easier to recover from problems.
    Never back up using tar.gz - use tar.bz2 instead.
    Not on topic, but I had to say it, 'cause I've been hurt (nothing like finding that your .tar.gz backup was damaged just when you need it).
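
    For what it's worth, bzip2 ships a recovery helper for exactly this case (the file name is a placeholder):

      # bzip2 stores data in independent blocks, so undamaged blocks can be salvaged
      bzip2recover damaged.tar.bz2          # writes numbered rec*.bz2 pieces
      bzip2 -dc rec*.bz2 | tar xvf -        # extract whatever is still intact
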
  • uharc (Score:2, Informative)

    by biryokumaru ( 822262 ) *
    No one seems to remember UHARC these days. UHARC is still a tighter compression than RAR or bz2. It's hard to come by, though. Personally I still use tar.bz2, 'cause no one can decompress UHARCs, and it's usually a little better than RARs, at least for source files. I think it was open source at some point, which solves the long-term dilemma faced with flash-in-the-pan formats... if you could find the source.
  • If you are making a site for people to download stuff from, then why not use real-time web compression?

    1. Your web server keeps the uncompressed version of the file on the server, 2. the user starts the download from the browser, 3. the web server compresses it on the fly and delivers it to the browser, which unzips it when it's done.

    This saves you the time of having to zip and unzip all the time and it SERVES your original purpose.
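
    This is just standard HTTP content negotiation: the browser advertises Accept-Encoding, and the server (e.g. Apache's mod_deflate or mod_gzip) compresses the response on the fly. A quick client-side check looks roughly like this (the URL is a placeholder):

      curl -s -D - -o /dev/null -H "Accept-Encoding: gzip" http://example.org/data/sample.txt | grep -i content-encoding
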
  • The HVSC [c64.org] is a 40-50MB (compressed) collection of a huge number of small files. They provide the current version in zip and rar format, with a set of incremental upgrades as zips. I would start by looking at their model.

    Though personally, I prefer 7-zip and Stuffit (can't wait for the new version).

  • by Detritus ( 11846 ) on Thursday March 10, 2005 @05:50AM (#11897932) Homepage
    When choosing a standard, don't forget to check how the format deals with large files and large archives. I've run into numerous problems with software that can't deal with anything larger than 2 GB, either for individual files or total archive size.
  • Maybe this is a bit off topic, but whatever happened to the ARJ format? It used to be king. One of its best features was being able to arj something onto multiple floppy disks.

    Can anyone enlighten me on the fate of this once most favoured compression algorithm?
  • ...is to upload the material to a Gmail account, then send the recipient the account name and key. Let Google handle the data compression, backup, system maintenance, etc.
