Best Format for Archive Distribution?

Meostro asks: "I'm looking for the best format to use to distribute arbitrary datasets. Tarballs compressed with gzip seem to be the most common thing out there, with zip coming in a close second. What advanced compression packages are the most widely recognized or available on the widest array of systems? Cross-platform compatibility is my most important goal, followed by compression ratio, decompression time, compression time and extra features (solid archives, support for multiple files, etc.). I'm starting up a free data site to provide test data for anything you can imagine: images for compression and format interpretation, text and audio for language processing, programming language examples to test parsing, and more. I hope this will grow to be a significant (read: multi-gigabyte) archive, so I want to start off right with my distribution format. Right now the plan is data.tar.bz2, but I'm open to anything that will give me better compression as long as it's available for Linux, Windows and Mac."
  • One other choice (Score:4, Insightful)

    by gowen ( 141411 ) <gwowen@gmail.com> on Wednesday March 09, 2005 @11:29AM (#11888724) Homepage Journal
    tar.gz and tar.bz2 are ok for small archives (20MB or so), but if you're dealing with large archives there's only one solution.

    RAR -- cross platform, built in integrity checking, and when used with Parity files, makes splitting and reassembling archives an absolute doddle.
    • Re:One other choice (Score:5, Informative)

      by MindStalker ( 22827 ) <mindstalker@[ ]il.com ['gma' in gap]> on Wednesday March 09, 2005 @11:50AM (#11888989) Journal
      Similar to RAR, I've found that ACE (www.winace.com) in maximum compression compresses most things better than RAR and is similar in functionality (it supports RAR as well).
      Its Linux version is called unace, and there is a macunace as well. Sadly these programs are a bit harder to find on the website, but they are there.

      Luckily, Gentoo knows it, so you can simply emerge unace.
      • by harrkev ( 623093 ) <kevin@harrelson.gmail@com> on Wednesday March 09, 2005 @12:04PM (#11889170) Homepage
        One problem with this is that it is not a common format. For limited use (one-time distribution, short-term backup), this is OK. But what about long-term archives?

        If you want to de-compress this stuff in 10 or 20 years, will you be able to find software then that can handle it? Especially if the new cell processors somehow become popular, will Windows BOHICA 2025 edition be able to run 20-year-old binaries in order to read this thing?

        If the source is available, the job is easier in Linux, but if the format is not actively maintained, it may take a lot of work to modify the program to run on whatever Linux looks like in 20 years.
    • Re:One other choice (Score:3, Informative)

      by Meostro ( 788797 )
      I've used it on Windows forever, and I know I obtained unrar for Linux and AIX, but how cross-platform is RAR, really? Does it come standard in most distributions? If it does, then it's probably an excellent choice; I've compressed some stuff almost 2:1 over bzip2 using RAR...
    • While there are Free rar unpackers [debian.org], the primary packer/unpacker has a proprietary license. The submitter is catering to open-source developers, so it is a poor choice.

      md5 probably provides enough integrity checking for test data & split/cat make splitting/reassembling ANYTHING easy.

      RAR doesn't have a monopoly on integrating these, though. 7zip certainly has many of the service features available in rar.
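
      A minimal sketch of that workflow (file and directory names here are just placeholders, and md5sum/split/cat are assumed to be the usual GNU tools):

        # build the archive and record a checksum for later verification
        tar cjf dataset.tar.bz2 dataset/
        md5sum dataset.tar.bz2 > dataset.tar.bz2.md5

        # split into 10MB pieces for distribution
        split -b 10m dataset.tar.bz2 dataset.tar.bz2.part-

        # the recipient reassembles and verifies
        cat dataset.tar.bz2.part-* > dataset.tar.bz2
        md5sum -c dataset.tar.bz2.md5
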
      • md5 probably provides enough integrity checking for test data & split/cat make splitting/reassembling ANYTHING easy.
        But RAR/PAR encoding uses a Reed-Solomon error correction scheme, so you can send a few additional files and the data blocks within them can be used to replace *any* lost data blocks from the original set. It's really crafty.

        You're right about the license, though.
        • But RAR/PAR encoding uses a Reed-Solomon error correction scheme, so you can send a few additional files and the data blocks within them can be used to replace *any* lost data blocks from the original set. It's really crafty.

          You're right. It is clever & is better than md5 for repair purposes. Which is why I said md5 was probably enough for test data (I think bz2 archives are the way to go, so I'm a bit biased).

          But, once again, this is not a unique feature of rar. I know I've seen proprietary ZIP pro

          • It bears mentioning that RAR was designed from the get-go to support the features mentioned, and has for what, seven years now? What you mention is an effort to backport the technology that hasn't even been done yet.

            I'm not trying to trash 7-zip, but I'm also not going to try 7-zip when RAR has the track record it does for doing what it was designed to do.
            • It bears mentioning that RAR was designed from the get-go to support the features mentioned, and has for what, seven years now?

              RAR was not designed from the get-go to support Reed-Solomon repair. This was a feature added for 2.0 (released 1996). So, it has supported it for 8 years, but it was also back-ported.

              You also don't need to use RAR to use the same file recovery mechanisms. Use PAR [sourceforge.net] on ANY type of file to benefit from it!
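
              As an illustrative sketch (assuming the par2cmdline tool from the linked PAR project; file names are placeholders):

                # generate ~10% recovery data for an archive - works on ANY file type
                par2 create -r10 dataset.tar.bz2
                # later, check the download and rebuild damaged blocks if needed
                par2 verify dataset.tar.bz2.par2
                par2 repair dataset.tar.bz2.par2
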

      • 7-Zip is a one-man project that needs help to add features, or at least to manage its SourceForge support and RFE forums.

        Drag-n-drop has been requested for almost 2 years, and now some of its users are defecting to TUGZIP because of it.
        http://sourceforge.net/tracker/index.php?func=detail&aid=663095&group_id=14481&atid=364481 [sourceforge.net]

        Either the guy is too busy, doesn't care or just doesn't want to share control.

        Maybe it's time to fork 7-zip?
    • RAR -- cross platform, built in integrity checking, and when used with Parity files, makes splitting and reassembling archives an absolute doddle.

      Now that's what I call comedy!

    • I don't know about anybody else, but when I come across one of those nasty RAR files I have to run WinRAR under wine to open it since the Debian unrar packages are fucked. That is not anywhere close to convenient.
    • Zip does all these things as well. I couldn't say whether RAR or Zip does them better.
    • Don't forget http://p7zip.sf.net/ [sf.net] when talking about large archives; the 7-Zip formats regularly beat RAR. I had a 280MB file compress down to 54MB with rar a -m5, and down to 17MB with 7za -ultra. 7-Zip has the added benefit of being less encumbered than RAR or ACE, and more open in its use of algorithms.
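
      For reference, a rough p7zip sketch (the archive and directory names are placeholders):

        # maximum compression; .7z archives are solid by default
        7za a -t7z -mx=9 dataset.7z dataset/
        # extract, preserving the directory structure
        7za x dataset.7z
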
  • Zip (Score:3, Insightful)

    by isaac ( 2852 ) on Wednesday March 09, 2005 @11:29AM (#11888731)
    Zip.

    Zip is miles more common than anything else and compresses better (generally) than gzip. It's supported out of the box on almost every OS either natively or with bundled software. Even Solaris comes with unzip.

    Forget .tar.bz2 unless your audience is the type of people you'd expect to have cygwin or 3rd-party compression tools installed on their windows peecees.

    -Isaac

    • My experience has been that ZIP doesn't compress as well as gzip, let alone bzip2. But yes, almost everyone can handle ZIPs.

      BTW, WinZIP can handle .tar.gz, I'm not sure whether it can handle .tar.bz2 as well.
    • Re:Zip (Score:3, Informative)

      by EvilIdler ( 21087 )
      Zip and gzip use the same compression.

      Zip compresses each file in an archive individually.

      Tar+gzip compresses the entire contents as a whole - meaning better compression than zip archives (unless you add uncompressed files to an archive, THEN compress the entire archive...)

      WinZip supports tar+gzip archives, from what I remember, but WinRAR supports .gz, .tar.gz, .bz2 and .tar.bz2 files, so why use anything else on Windows?

      Then again, you could use solid RAR archives. Generally the best size+performance ratio I
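
      A quick way to see the per-file versus whole-stream difference on your own data (paths are placeholders; assumes the usual zip, tar and gzip command-line tools):

        # zip compresses each file separately...
        zip -r -9 dataset.zip dataset/
        # ...while tar+gzip compresses the concatenated stream as one unit,
        # so redundancy across similar files gets exploited
        tar cf - dataset/ | gzip -9 > dataset.tar.gz
        ls -l dataset.zip dataset.tar.gz
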
      • Re:Zip (Score:3, Informative)

        by JimDabell ( 42870 )

        Zip and gzip use the same compression.

        According to the ZIP file format specification [pkware.com], ZIP can use a dynamic LZW algorithm.

        The whole reason gzip exists is because the standard UNIX compress uses LZW [gzip.org] - which, until recently, was protected by a patent (that was the problem with GIFs).

        Instead of using LZW, gzip uses the unprotected LZ algorithm, which doesn't contain the improvements that Welch (the 'W' in LZW) made.

        So not only do they not use the same algorithm, but that's the whole point of gzi

        • by isaac ( 2852 )
          So not only do they not use the same algorithm, but that's the whole point of gzip in the first place!

          Thank you. I thought this was common knowledge.

          -Isaac

        • It looks like you're munging compress and zip together here. gzip was created in response to the patent status of the algorithm in compress, and the GP said that gzip uses the same algorithm as zip.

          So not only do they not use the same algorithm, but that's the whole point of gzip in the first place!

          Well, let's look at a quote from the gzip page you linked:

          The first version of the compression algorithm used by gzip appeared in zip 0.9, publicly released on July 11th 1991.

          There you have it. Zip uses d

          • gzip was created in response to the patent status of the algorithm in compress

            I know. That's what I said. compress uses LZW.

            the GP said that gzip uses the same algorithm as zip.

            ZIP can use multiple algorithms, one of them being LZW - the very algorithm that gzip was created to avoid. ZIP and compress both use this algorithm, gzip does not.

            Zip uses deflate, just like gzip does. Sure, newer versions of zip can use LZW

            No, deflate is just one of the algorithms that ZIP can use. LZW is

            • I asked a question: "but how many programs actually generate zip files that use [LZW]?" Please answer it.

              Actually, I've done some research, and a few [frugalcorner.com] sources [info-zip.org] tell me that LZW is called "shrink" in zip vernacular and was only commonly used in the days of PKZip 1.1. It moved to Deflate as the default after that, and indeed, Info-Zip's unzip utility doesn't even enable unshrink by default. If LZW in zip files were common, that wouldn't be a very pragmatic thing to do, would it?

              Every zip utility out th

  • CPIO (Score:4, Interesting)

    by DarkDust ( 239124 ) * <marc@darkdust.net> on Wednesday March 09, 2005 @11:31AM (#11888752) Homepage
    I prefer .cpio.bz2 because, unlike tar, cpio can handle special devices just fine (or am I missing some switch for tar which makes it able to handle devices and links?). Since it's also in the POSIX standard, this should be pretty portable as well.
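
    Roughly, that looks like this (the directory name is a placeholder; add -H newc if your cpio supports the portable SVR4 format):

      # create a .cpio.bz2, preserving devices and links
      find dataset -depth -print | cpio -o | bzip2 -9 > dataset.cpio.bz2
      # extract
      bzip2 -dc dataset.cpio.bz2 | cpio -idm
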
    • Re:CPIO (Score:3, Informative)

      by Meostro ( 788797 )
      Probably a good idea in general, but:
      1. No obvious support on Windows for cpio
      2. These are going to be test files, so I shouldn't have to worry about special devices / links / sparse files.

      I know our admin here at work switched the backups from tar to cpio for exactly that reason, but it's just not universal enough to justify the departure from "normal".
    • Use a decent tar implementation. GNU tar handles block special devices just fine. It archives the block special devices, not the data you get if you open the contents of the device and read from it.

      Kirby

    • by hey! ( 33014 )
      Yes, not to mention its charming syntax.


      Once you get it to do something (other than the "find . -depth | cpio -pdl /destdir" kind of thing that is part of your fingers' auxiliary programming), why not round it off with breakfast at Milliways?

    • Get a better tar. :) For instance, the bsdtar from FreeBSD can handle cpio, pax, and several different tar formats (for creating and extracting).

      Or, use pax. It's got a much nicer syntax than cpio, and can also handle cpio, pax, and tar formats.

      We used to use a horrible combination of cpio, bzip2, and split to image our servers. It was a royal pain to use, especially if you only wanted 1 or 2 files out of the backup. We switched to pax on Linux and bsdtar on *BSD, and everything is just hunky-dory.

      Not com
  • by node 3 ( 115640 ) on Wednesday March 09, 2005 @11:37AM (#11888831)
    Zip is probably the most commonly installed archiver across all systems.

    tar/tar.gz/tar.bz2 is supported out-of-the-box on Linux and Mac OS X, but can throw Windows users for a loop (easily remedied, but they aren't likely to have an untar tool installed, and will find the file extension at least a bit odd). For some data tar.bz2 will result in noticeably smaller files, but at a greater cost of compression/decompression time.

    After that, you're not really going to find an archival format that's really common.

    In the end, it depends on what type of data you are archiving, and your target audience, but unless you have a specific reason otherwise, zip with an md5 checksum file is probably the solution of least effort (just make sure you back-up the archive--don't want to have a problem with the only copy you have!).
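
    That "least effort" option is only a couple of commands (names are placeholders; assumes Info-ZIP's zip and GNU md5sum):

      zip -r dataset.zip dataset/
      md5sum dataset.zip > dataset.zip.md5
      # recipients can then verify the download with: md5sum -c dataset.zip.md5
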
  • Multi-format (Score:5, Insightful)

    by sporktoast ( 246027 ) on Wednesday March 09, 2005 @11:37AM (#11888833) Homepage

    Have you considered going multi-format?

    Either increase the size of your storage to handle 2 or 3 of the more popular and widely available formats (zip, rar, tar.(gz|bz2)), or use compression-on-the-fly libraries (behind a cache to reduce server load). This would allow the recipient to decide, and end up supporting perhaps a larger population.

    • ...handle 2 or 3 of the more popular and widely available formats...

      That's where I'm leaning: something like .tar.gz for universality and RAR or similar for those that can handle it. I might offer "hot" sets in other formats, so the most popular stuff would be the most accessible, but random, esoteric stuff would only come in one or two flavors.
    • ... increase the size of your storage to handle 2 or 3 of the more popular and widely available formats (zip, rar, tar.(gz|bz2)

      That is amusing. I've thought about having a .sig that says "bzip2, saving disk space on servers by using twice the disk space". Yes, bzip2 is a better compressor, but is it really significant today? I rarely if ever come across bandwidth problems. Disk space on my end is cheap. Bandwidth on my end is cheap. In fact, I don't keep commonly available archives on my computer an
      • That is amusing. I've thought about having a .sig that says "bzip2, saving disk space on servers by using twice the disk space". Yes, bzip2 is a better compressor, but is it really significant today? I rarely if ever come across bandwidth problems. Disk space on my end is cheap. Bandwidth on my end is cheap. In fact, I don't keep commonly available archives on my computer any more in any compression format. Odds are if I need the copy again, I can download and have it on my harddisk in less than 5 minutes

        • If I have a 10MB gz file that gets downloaded a thousand times, that's 10GB transfer. I could also store an 8MB bz2 version of the same thing, and if half of the people get that one instead of the 10MB version, that's only 9GB transfer. Since storage is cheap versus transfer, and since my storage:transfer ratio is going to be low, it makes much more sense to "waste" some extra disk space to save myself as much bandwidth as possible. Also from a server-load point of view, I can serve 111 more files in the sa
  • by gus goose ( 306978 ) on Wednesday March 09, 2005 @11:38AM (#11888839) Journal
    I have found that some formats are far better at some data types than others.

    e.g. for:
    text, ASCII, documents: use any of bzip2, gzip, zip.
    audio: use nothing if it's already MP3/AAC/etc.; FLAC for other "raw" formats.
    video: use the most appropriate encoding (MPEG-4/DivX, etc.) and then don't try to compress.

    bzip encodes/decodes slower, but has typically better compression ratios.

    So, use whatever people commonly use for the data type you are compressing.

    gus
    • I think the point is this:

      (lotsa files) -> compress -> (one archive) -> de-compress -> (lotsa files)

      Audio and video codecs do not create an archive, and I think that the point is to have a general process, without having a bunch of exceptions based on file type.

      BTW: all audio and video codecs (except FLAC) are lossy. Data out != data in.

      • Lossy = very bad; I'm looking for a few general-purpose archivers. I'm sure there will be some MPEG stuff up there, if for nothing other than a file format sample, but most stuff isn't going to be lossy-able and still make sense.

        I'd be perfectly happy to have separate archivers for different formats, but my main concern is universality, even above compression ratio. FLAC or Monkey's for audio sounds great, but I need to be sure that everyone will be able to handle it. That's why I might be stuck with .z
  • My summary.. (Score:4, Informative)

    by Chris_Jefferson ( 581445 ) on Wednesday March 09, 2005 @11:41AM (#11888871) Homepage
    I find that for my data (your data may be different) I tend to get the ordering "zip, bzip2, rar, 7zip" (from largest to smallest output), with rar and 7zip often being much smaller than bzip2 (my data tends to contain lots of similar large files, which tends to lead to unusually large differences between compressors).

    Everyone has an unzipping program. I find on Windows more people have (and get) rar than bzip2, particularly if they are afraid of command lines. 7zip gives the best compression of the lot, so it is particularly useful for big datasets (I often send around 9-20GB data sets), but it eats memory like nobody's business.

    It really comes down to how much you want to make people download compared to how much trouble you want them to go to.

    If you want to be "minimal effort", I'd advise providing a .zip along with other things, perhaps listing the size next to the files so people can see it's much bigger (like most sourceforge projects) for those windows users who can't be bothered to get anything else.
  • I'm starting up a free data site to provide test data for anything you can imagine: images for compression and format interpretation, text and audio for language processing, programming language examples to test parsing, and more.

    Your audience will be developers. Hopefully F/OSS developers. So distribute in a developer-friendly format. BZ2 files can be decompressed with Free software & most of the proprietary applications out there that would be decompressing alternative formats.

    While 7-zip is a nic

    • RAR has an open decompression library that allows for derivative works that decompress RAR formats. you can link it, modify it, use it, redistribute it modified, whatever, as long as you don't try to reverse-engineer the compression scheme. go download UnRAR [rarsoft.com] and read the damned license.
      • I acknowledged this in another post. I also use an LGPLed unrarer. But the RAR compression algorithm is, as the license makes very clear, "proprietary." I've seen very little F/OSS distributed as RAR archives, and I don't think it is coincidence.
        • The WinRar license states:
          Neither RAR binary code, WinRAR binary code, UnRAR source or UnRAR binary code may be used or reverse engineered to re-create the RAR compression algorithm, which is proprietary, without written permission of the author.

          and the source code:
          The unRAR sources may be used in any software to handle RAR archives without limitations free of charge, but cannot be used to re-create the RAR compression algorithm, which is proprietary. Distribution of modified unRAR sources in separate
          • the RAR compression algorithm, which is proprietary

            And there lies the problem that I mentioned. The compression algorithm isn't Free.

            So the guy is basically paranoid about keeping his "trade secret" a secret, which makes perfect sense from a business perspective.

            But why bother catering to his business when you could use something like 7-zip?

            As far as FOSS is concerned, he even presents decompression code free for all to use.

            It also makes sense from a business perspective: formats which are distributed gai

    • 7zip? Unmature? It is perfectly usable. I finally deleted my licensed copy of Winzip because 7zip does its work better. 'nuff said.
      • 7zip? Unmature? It is perfectly usable. I finally deleted my licensed copy of Winzip because 7zip does its work better. 'nuff said.

        7-zip is also on the win32 boxes I administer & winzip isn't. But this kind of use is not enough to prove the kind of maturity that I'm talking about. One piece of evidence is that most users aren't using it for 7zip archives. Another is that the *nix version of 7-zip is about 8 months old and listed as beta [sourceforge.net]. Only in October did KDE [kde.org] and GNOME [gnome.org] add support IN THEIR CVS. S

  • by Blakey Rat ( 99501 ) on Wednesday March 09, 2005 @11:47AM (#11888946)
    Sure, maybe your 17 MB file is 13.5 with .zip and 13.24 with .7z or whatever, but what it all comes down to is that every current operating system supports .zip files out-of-the-box. Do you think that extra 3 seconds of download is worth making your customer hunt down and install an entirely different program just to see your file? It's not. Why make things harder? .zip is the standard, use it.
    • The goal is not always to get data to a customer. What if I want to store some files for myself, as I often do, or am transferring data to a computer that I know supports a given compression format? While I will agree that for mass data distribution, more common formats like .zip are the way to go, one should not make a habit of using zip in all cases.
    • I recently tried unpacking a bzip2 package under Windows. It took me ages to find something that would recognize it and extract it. Which is a shame, because it is a nice format... at least if you aren't doing this a lot, since it takes more time.

      However, WinZip out of the box will open tarballs and of course zip. And gzip / unzip are pretty much universal on *nix. I have, however, found that very large tarballs (100+ MB) can be a problem with WinZip, but that was a long time ago.

      And I would never, for the origin

  • My experience is that Zip doesn't handle archiving multiple files with the same name. Zip fails if you have a directory structure like:
    foo.txt
    /images/foo.txt
    I've also seen zip fail completely trying to compress a directory structure containing a very large number of small files (> 10,000).

    I always use RAR unless I know the recipient can't handle a RAR file.
  • by BinLadenMyHero ( 688544 ) <binladen@9[ ]ls.org ['hel' in gap]> on Wednesday March 09, 2005 @11:55AM (#11889041) Journal
    ...avoid closed formats.
    Using Free software will help you achieve your number one goal: that everyone can access the data, now and forever.
    • Don't worry about cross-platform stuff. Choose something open and it can be ported, even to new, currently non-existent platforms.

      Open algorithms, open source, no BS, makes your choice easy.
      • If the point of this is to make data available, then cross-platform availability needs to be my primary concern.

        I can spec and write and offer to the public the SuperXtreme Archive format, and make my data available only in that format. Unless there is a compelling reason to switch to SXA for other purposes (general adoption), it's essentially proprietary to my site and won't really be of any use to anyone.

        OSS is not the be-all and end-all of utility or availability, only of portability.
  • Make two versions of your file available. Use "zip" for its universality, so anyone on any platform can get your file, if they want.

    Then, make a more efficiently compressed one for those who know how to download and use it. Bzip2 seems to be the current favorite, especially for text.
  • Don't forget about security issues. If you intend to mail these files as attachments, ZIP and RAR may be blocked by mail servers because both can be executable under Windows. tar.bz2 may be more difficult for a Windows user to figure out, but at least it's not going to infect their computer without a lot of work on their part.

    My other comment is to do some experiments with *your* data -- which format actually yields the best compression rate, and how much more time do you spend doing the compression / un
    • My other comment is to do some experiments with *your* data -- which format actually yields the best compression rate, and how much more time do you spend doing the compression / uncompression. Is the extra time you spend worth the 5% you get?

      My data doesn't exist yet, that's part of the problem. I need one or a few good general-purpose archive formats, the particulars of which may perform better on one dataset than another. Decompression time matters because it will affect end-users, but compression ti
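
      When the data does exist, a rough way to run that experiment (assumes a bash-like shell and a representative sample file; names are placeholders):

        for c in "gzip -9" "bzip2 -9"; do
            echo "== $c =="
            time $c -c sample.dat > sample.out   # compression time
            ls -l sample.out                     # compressed size
        done
        rm -f sample.out
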
  • I've got a rather technical format comparison chart started up [1]. It's still a draft, but it's pretty complete.

    It doesn't directly address relative compression ratios nor benchmarks. And it's mostly about the formats themselves, not the libraries that implement them. But it's still good for a bird's-eye view, I think.

    [1] http://darbinding.sourceforge.net/about_dar.php [sourceforge.net] (The chart is at the bottom of the page.)

  • is the best compression mechanism I've seen. Getting your data back is a bitch, though.
    • You can always get your data back just fine from /dev/random. You just have to figure out where it starts there, which can sometimes be difficult.

        • I found that the restoring process can be enhanced greatly by disabling any form of CRC checking when reading from /dev/random...
  • by mqx ( 792882 ) on Wednesday March 09, 2005 @12:26PM (#11889371)

    These formats (rar, arj, 7zip, whatever) are not as widely supported as (a) tar/gz, tar/bz2, (b) gzip/bzip2, (c) zip. You can get (a)-(c) on just about every platform. Numerous times I've downloaded "unzip" to make (c) work. It's so simple.

    Another point not mentioned elsewhere: virus scanners: at least with the popular tar/zip formats, you know that virus scanners understand them and can look for problems.

    Sure, you may get a few extra features or a little more room out of non-standard archivers, but that's largely not an issue.

    • I found tar to be a dated format that has no checksumming of individual files. I ran into a situation where a large tarball was made, and tar tf foo.tar was run to verify it. A later attempt at extraction failed due to corruption.

      There are horrors that arise with tar. First, there are multiple tar record formats. The original tar only supported 14-character file names (original unix file system limitation). Along came a second tar format, but even that ended up with variants. Most people are using the GNU tar fo
  • Even though it's not free, I'm quite fond of Stuffit's sitx format. The expander is available as a free (as in beer, not as in speech) download from http://www.stuffit.com/ [stuffit.com], as well as being included on the Mac platform.
    • by Anonymous Coward
      Making sure your data doesn't get corrupted should be more of an issue than how compressed you can get it. Sadly, Stuffit is the only thing I can find that has error correction. I'm surprised, because error correction is as old as the hills. The Bose-Chaudhuri algorithm comes to mind.

      I just went on a search for some ten year old data and the first place I found it, it was corrupted. Thank goodness for redundancy. I finally found an uncorrupted version but it took me a couple of days which could have been p
      • RAR at least has error correction built in; StuffIt isn't the only thing out there.

        In this case, corruption is not an issue. I intend to keep redundant backups of the original datasets, so even if the web-based archives get corrupted I will be able to recover the data.
    • The new version of Stuffit looks like it will rock. That 25-30% JPEG compression, plus a generally competitive algorithm shows great potential.
      • Re:stuffit (Score:3, Informative)

        by Hes Nikke ( 237581 )
        While we are on the subject of JPEG compression, I recently launched http://jpgcrunch.com/ [jpgcrunch.com], which reduces JPEG file sizes losslessly. That can help things too :D

        *Battens down the hatches for an incoming barrage of slashdot traffic*
  • For Longevity (Score:5, Insightful)

    by 4of12 ( 97621 ) on Wednesday March 09, 2005 @12:42PM (#11889544) Homepage Journal

    Pick any system for which the source code is available, e.g. .tar.bz2

    Anything else is gambling.

    I still gamble, but only that a C compiler will exist in the future.

  • LZIP (Score:1, Funny)

    by Victor_Os ( 677960 )
    lzip, of course http://sourceforge.net/projects/lzip/ [sourceforge.net]
  • My Preference (Score:3, Informative)

    by vbrtrmn ( 62760 ) on Wednesday March 09, 2005 @02:00PM (#11890715) Homepage
    For my own personal archives, I have taken the methods from the masters in USENET.

    OS X & UNIX: I'm lazy, so it's just tar.gz.

    For Win32, I back up a lot more files than under *nix.

    Compression
    WinRAR [rarlab.com]
    Compression Method: Best
    Split to Volumes: 20MB
    Parity
    QuickPar [quickpar.org.uk]
    With general settings.

    I back-up to decent quality DVD media, as I have had a lot of problems with CD media rotting [mv.com] after about a year.
  • The various types of data you mention are not uniformly compressible by any single algorithm. Therefore your "compression ratio" criterion is dubious. Compression ratio on what sort of data?

    Dictionary-based compression schemes work well on data which might be described as "linguistic," i.e., data which has some kind of grammar describing it. English text, machine code (binaries), source code, HTML, etc. It won't work very well at all on audio or image data, at least not without some kind of preprocessing

    • The various types of data you mention are not uniformly compressible by any single algorithm. Therefore your "compression ratio" criterion is dubious. Compression ratio on what sort of data?

      Compression on any sort of data. Give me a good general-purpose compressor. Give me a good one just for audio, just for video, just for text, whatever. I have no idea what kinds of data I'm going to get, or how much of each kind there might be. A custom compressor for audio (FLAC) will almost always outperform a ge
      • Compression on any sort of data. Give me a good general-purpose compressor.

        There isn't any such thing, unless you use a very narrow definition of "general purpose." To me, general purpose would imply that I could throw any sort of data at it I please, and it would do well, provided the data is not just random data. Since we're working with very vague definitions I can only give very vague suggestions.

        To answer your specific questions, a good dictionary compressor is the Flate algorithm used by gzip. Ve

        • There isn't any such thing, unless you use a very narrow definition of "general purpose." To me, general purpose would imply that I could throw any sort of data at it I please, and it would do well, provided the data is not just random data. Since we're working with very vague definitions I can only give very vague suggestions.

          My "general purpose" is exactly what you said, without the "would do well" part, or substituting "would not do horribly" in its place. I know that all algorithms have strengths and
          • Anything where minute details don't matter will probably be lossy, but probably 80% or more will be text, code and other data that needs to be lossless.

            Whether text compression needs to be lossless is actually debatable. I'm gonna veer a little off topic here, but hey, it's Slashdot...

            Suppose you are compressing English text by Huffman encoding entire words at a time. However, people make typos, so the actual set of words to be encoded will be larger than a set where there were no typos. By first runni

  • by Trelane ( 16124 ) on Wednesday March 09, 2005 @03:14PM (#11891748) Journal
    I've used both gzip and bzip2. I rather like bzip2 for plain text data files, but there is a rather large cost--compressing and uncompressing can take a much longer time than with gzip! These are important considerations to make, especially if you're gonna need to pull this data back off the shelf anytime soon. For this reason (time), I currently use gzip for intermediate-range storage.
  • gzip is really horrible at error recovery. Trying to recover data from a damaged tar.gz file is hard, because gzip does not keep byte boundaries. bzip2 is much better in this respect, and it is much easier to recover from problems.
    Never back up using tar.gz - use tar.bz2 instead.
    Not on topic, but I had to say it, 'cause I've been hurt (nothing like finding that your .tar.gz backup was damaged just when you need it).
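
    For what it's worth, bzip2 ships a recovery helper for exactly this case (the file name is a placeholder):

      # bzip2 stores data in independent blocks, so undamaged blocks can be salvaged
      bzip2recover damaged.tar.bz2          # writes numbered rec*.bz2 pieces
      bzip2 -dc rec*.bz2 | tar xvf -        # extract whatever is still intact
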
  • uharc (Score:2, Informative)

    by biryokumaru ( 822262 ) *
    No one seems to remember UHARC these days. UHARC is still a tighter compression than RAR or bz2. It's hard to come by, though. Personally I still use tar.bz2, 'cause no one can decompress UHARCs, and it's usually a little better than RARs, at least for source files. I think it was open source at some point, which solves the long-term dilemma faced with flash-in-the-pan formats... if you could find the source.
  • If you are making a site for people to download stuff from, then why not use real-time web compression?

    1. Your web server keeps the uncompressed version of the file on the server, 2. the user starts the download from the browser, 3. the web server compresses it on the fly and delivers it to the browser, which unzips it when it's done.

    This saves you the time of having to zip and unzip all the time and it SERVES your original purpose.
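
    This is just standard HTTP content negotiation: the browser advertises Accept-Encoding, and the server (e.g. Apache's mod_deflate or mod_gzip) compresses the response on the fly. A quick client-side check looks roughly like this (the URL is a placeholder):

      curl -s -D - -o /dev/null -H "Accept-Encoding: gzip" http://example.org/data/sample.txt | grep -i content-encoding
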
  • The HVSC [c64.org] is a 40-50MB (compressed) collection of a huge number of small files. They provide the current version in zip and rar format, with a set of incremental upgrades as zips. I would start by looking at their model.

    Though personally, I prefer 7-zip and Stuffit (can't wait for the new version).

  • by Detritus ( 11846 ) on Thursday March 10, 2005 @05:50AM (#11897932) Homepage
    When choosing a standard, don't forget to check how the format deals with large files and large archives. I've run into numerous problems with software that can't deal with anything larger than 2 GB, either for individual files or total archive size.
  • Maybe this is a bit off topic, but whatever happened to the ARJ format? It used to be king. One of its best features was being able to arj something onto multiple floppy disks.

    Can anyone enlighten me on the fate of this once most favoured compression algorithm?
  • ...is to upload the material to a Gmail account, then send the recipient the account name and key. Let Google handle the data compression, backup, system maintenance, etc.
