Best Format for Archive Distribution? 109
Meostro asks: "I'm looking for the best format to use to distribute arbitrary datasets. Tarballs compressed with gzip seem to be the most common thing out there, with zip coming in a close second. What advanced compression packages are the most widely recognized or available on the widest array of systems? Cross-platform compatibility is my most important goal, followed by compression ratio, decompression time, compression time and extra features (solid archives, support for multiple files, etc.). I'm starting up a free data site to provide test data for anything you can imagine: images for compression and format interpretation, text and audio for language processing, programming language examples to test parsing, and more. I hope this will grow to be a significant (read: multi-gigabyte) archive, so I want to start off right with my distribution format. Right now the plan is data.tar.bz2, but I'm open to anything that will give me better compression as long as it's available for Linux, Windows and Mac."
One other choice (Score:4, Insightful)
RAR -- cross platform, built in integrity checking, and when used with Parity files, makes splitting and reassembling archives an absolute doddle.
Re:One other choice (Score:5, Informative)
Its Linux version is called unace and there is a macunace as well. Sadly these programs are a bit harder to find on the website, but they are there.
Luckily Gentoo knows it, so you can simply emerge unace.
Re:One other choice (Score:5, Insightful)
If you want to decompress this stuff in 10 or 20 years, will you be able to find software then that can handle it? Especially if the new Cell processors somehow become popular, will Windows BOHICA 2025 edition be able to run 20-year-old binaries in order to read this thing?
If the source is available, the job is easier in Linux, but if the format is not actively maintained, it may take a lot of work to modify the program to run whatever Linux looks like in 20 years.
Re:One other choice (Score:4, Informative)
Give me a foo.tar.Z file from the early 80s and I can still uncompress it. Give me a foo.zip from a mid-80s BBS archive and I can still see what's inside.
Also, see graphics formats.
RAR's license is garbage (Score:3, Informative)
md5 probably provides enough integrity checking for test data, and split/cat make splitting/reassembling ANYTHING easy.
RAR doesn't have a monopoly on integrating these, though. 7zip certainly has many of the service features available in rar.
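The split/cat plus checksum workflow the parent describes needs nothing beyond coreutils. A minimal sketch (filenames illustrative; the seq-generated file stands in for a real archive):

```shell
# Split an "archive" into pieces, reassemble with cat, verify with md5sum.
set -e
cd "$(mktemp -d)"
seq 1 20000 > data.bin                 # stand-in for a large archive
md5sum data.bin > data.bin.md5         # record the original's checksum
split -b 16k data.bin data.bin.part-   # emits data.bin.part-aa, -ab, ...
rm data.bin                            # pretend we only have the pieces
cat data.bin.part-* > data.bin         # shell glob order matches split order
md5sum -c data.bin.md5                 # prints "data.bin: OK" if intact
```

Unlike PAR parity files this only detects corruption, it can't repair it, but for test data that may well be enough.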
Re:RAR's license is garbage (Score:3, Interesting)
You're right about the license, though.
Re:RAR's license is garbage (Score:2)
You're right. It is clever, and is better than md5 for repair purposes. Which is why I said md5 was probably enough for test data (I think bz2 archives are the way to go, so I'm a bit biased).
But, once again, this is not a unique feature of rar. I know I've seen proprietary ZIP pro
Re:RAR's license is garbage (Score:1)
I'm not trying to trash 7-zip, but I'm also not going to try 7-zip when RAR has the track record it does for doing what it was designed to do.
PAR File Recovery (Score:2)
7-zip: No Drag-n-drop (Score:3, Informative)
or at least manage its SourceForge support and RFE forums.
Drag-n-drop has been requested for almost 2 years now, and
some of its users are defecting to TUGZIP because of it.
http://sourceforge.net/tracker/index.php?func=detail&aid=663095&group_id=14481&atid=364481 [sourceforge.net]
Either the guy is too busy, doesn't care or just doesn't want to share control.
Maybe it's time to fork 7-zip?
Re:One other choice (Score:1)
Now that's what I call comedy!
Zip (Score:3, Insightful)
Zip is miles more common than anything else and compresses better (generally) than gzip. It's supported out of the box on almost every OS either natively or with bundled software. Even Solaris comes with unzip.
Forget
-Isaac
Re:Zip (Score:2)
BTW, WinZIP can handle
Re:Zip (Score:3, Informative)
Zip compresses each file in an archive individually.
Tar+gzip compresses the entire contents as a whole - meaning better
compression than zip archives (unless you add uncompressed files to
an archive, THEN compress the entire archive..)
WinZip supports tar+gzip archives, from what I remember, but WinRAR
supports
on Windows?
Then again, you could use solid RAR archives. Generally the best
size+performance ratio I
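The per-file vs. solid difference described above can be seen with plain gzip, no zip or rar tools required: two identical files compressed separately vs. as one stream. The duplicate costs almost nothing in the solid case (filenames here are illustrative).

```shell
# Per-file compression vs. one "solid" stream, using only gzip and cat.
set -e
cd "$(mktemp -d)"
seq 1 3000 > a.txt
cp a.txt b.txt                          # deliberate redundancy between files
gzip -c a.txt > a.txt.gz
gzip -c b.txt > b.txt.gz
per_file=$(( $(wc -c < a.txt.gz) + $(wc -c < b.txt.gz) ))
cat a.txt b.txt | gzip -c > solid.gz    # "solid": compress as one stream
solid=$(wc -c < solid.gz)
echo "per-file: ${per_file} bytes  solid: ${solid} bytes"
```

This is exactly why tar+gzip (one stream) usually beats zip (per-file) on archives with many similar files.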
Re:Zip (Score:3, Informative)
According to the ZIP file format specification [pkware.com], ZIP can use a dynamic LZW algorithm.
The whole reason gzip exists is because the standard UNIX compress uses LZW [gzip.org] - which, until recently, was protected by a patent (that was the problem with GIFs).
Instead of using LZW, gzip uses the unpatented LZ77 algorithm, which doesn't contain the improvements that Welch (the 'W' in LZW) made.
So not only do they not use the same algorithm, but that's the whole point of gzip in the first place!
Re:Zip (Score:2)
Thank you. I thought this was common knowledge.
-Isaac
Re:Zip (Score:2)
So not only do they not use the same algorithm, but that's the whole point of gzip in the first place!
Well, let's look at a quote from the gzip page you linked:
There you have it. Zip uses d
Re:Zip (Score:1)
I know. That's what I said. compress uses LZW.
ZIP can use multiple algorithms, one of them being LZW - the very algorithm that gzip was created to avoid. ZIP and compress both use this algorithm, gzip does not.
No, deflate is just one of the algorithms that ZIP can use. LZW is
Re:Zip (Score:2)
Actually, I've done some research, and a few [frugalcorner.com] sources [info-zip.org] tell me that LZW is called "shrink" in zip vernacular and was only commonly used in the days of PKZip 1.1. It moved to Deflate as the default after that, and indeed, Info-Zip's unzip utility doesn't even enable unshrink by default. If LZW in zip files were common, that wouldn't be a very pragmatic thing to do, would it?
Every zip utility out th
CPIO (Score:4, Interesting)
Re:CPIO (Score:3, Informative)
1. No obvious support on Windows for cpio
2. These are going to be test files, so I shouldn't have to worry about special devices / links / sparse files.
I know our admin here at work switched the backups from tar to cpio for exactly that reason, but it's just not universal enough to justify the departure from "normal".
Re:CPIO (Score:2)
Once you get it to do something (other than the "find . -depth | cpio -pdl
Re:CPIO (Score:2)
Or, use pax. It's got a much nicer syntax than cpio, and can also handle cpio, pax, and tar formats.
We used to use a horrible combination of cpio, bzip2, and split to image our servers. It was a royal pain to use, especially if you only wanted 1 or 2 files out of the backup. Switched to pax on Linux and bsdtar on *BSD, and everything is just hunky-dory.
Not com
It really depends... (Score:3, Insightful)
tar/tar.gz/tar.bz is supported out-of-the-box on Linux and Mac OS X, but can throw Windows users for a loop (easily remedied, but they aren't likely to have untar installed, and will find the file extension at least a bit odd). For some data tar.bz will result in noticeably smaller files, but at a greater cost of compression/decompression time.
After that, you're not really going to find an archival format that's really common.
In the end, it depends on what type of data you are archiving, and your target audience, but unless you have a specific reason otherwise, zip with an md5 checksum file is probably the solution of least effort (just make sure you back-up the archive--don't want to have a problem with the only copy you have!).
Multi-format (Score:5, Insightful)
Have you considered going multi-format?
Either increase the size of your storage to handle 2 or 3 of the more popular and widely available formats (zip, rar, tar.(gz|bz2)), or use compression-on-the-fly libraries (behind a cache to reduce server load). This would allow the recipient to decide, and end up supporting perhaps a larger population.
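Going multi-format is mostly a matter of building each archive from the same source tree. A minimal sketch (names illustrative; assumes gzip and bzip2 are installed, and zip or rar could be added the same way if their tools are available):

```shell
# Build the same dataset in two common formats from one source directory.
set -e
cd "$(mktemp -d)"
mkdir dataset
seq 1 1000 > dataset/sample.txt         # stand-in for real data files
tar czf dataset.tar.gz  dataset/        # gzip-compressed tarball
tar cjf dataset.tar.bz2 dataset/        # bzip2-compressed tarball
ls -l dataset.tar.gz dataset.tar.bz2    # let recipients pick their format
```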
Re:Multi-format (Score:2)
That's where I'm leaning, something like
Re:Multi-format (Score:2)
That is amusing. I've thought about having a
Look into specific compress utilities, not generic (Score:3, Informative)
e.g. for:
text, ASCII, documents: use any of bzip2, gzip, zip.
audio: use nothing if it's already MP3/AAC/etc.; FLAC for other "raw" formats.
video: use the most appropriate encoding (mpeg4/divx,etc) and then don't try to compress.
bzip encodes/decodes slower, but has typically better compression ratios.
So, use whatever people commonly use for the data type you are compressing.
gus
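The advice above amounts to a small dispatcher: pick a compressor (or none) by file type. A sketch of that mapping — the extensions and tool names below just mirror the comment, not an exhaustive rule set:

```shell
# Choose a compressor by file extension, per the "right tool per data type"
# advice. Returns the tool name; "none" means the data is already compressed.
pick_compressor() {
  case "$1" in
    *.txt|*.html|*.csv|*.xml)       echo bzip2 ;;  # text: general-purpose works well
    *.wav|*.aiff)                   echo flac  ;;  # raw audio: use a lossless codec
    *.mp3|*.aac|*.mpg|*.avi|*.mp4)  echo none  ;;  # already compressed: leave alone
    *)                              echo gzip  ;;  # unknown: cheap default
  esac
}
pick_compressor corpus.txt     # -> bzip2
pick_compressor session.wav    # -> flac
pick_compressor trailer.mp4    # -> none
```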
Re:Look into specific compress utilities, not gene (Score:1)
(lotsa files) -> compress -> (one archive) -> de-compress -> (lotsa files)
Audio and video codecs do not create an archive, and I think that the point is to have a general process, without having a bunch of exceptions based on file type.
BTW: all audio and video codecs (except FLAC) are lossy. Data out != data in.
Re:Look into specific compress utilities, not gene (Score:2)
I'd be perfectly happy to have separate archivers for different formats, but my main concern is universality, even above compression ratio. FLAC or Monkey's for audio sounds great, but I need to be sure that everyone will be able to handle it. That's why I might be stuck with
My summary.. (Score:4, Informative)
Everyone has an unzipping program. I find on Windows more people have (and get) rar than bzip2, particularly if they are afraid of command lines. 7zip gives the best compression of everyone, so is particularly useful for big datasets (I often send around 9-20GB data sets) but eats memory like no-one's business.
It really comes down to how much you want to make people download compared to how much trouble you want them to go to.
If you want to be "minimal effort", I'd advise providing a
Know your audience (Score:2)
Your audience will be developers. Hopefully F/OSS developers. So distribute in a developer-friendly format. BZ2 files can be decompressed with Free software & most of the proprietary applications out there that would be decompressing alternative formats.
While 7-zip is a nic
Re:Know your audience (Score:2)
Neither RAR binary code, WinRAR binary code, UnRAR source or UnRAR binary code may be used or reverse engineered to re-create the RAR compression algorithm, which is proprietary, without written permission of the author.
and the source code:
The unRAR sources may be used in any software to handle RAR archives without limitations free of charge, but cannot be used to re-create the RAR compression algorithm, which is proprietary. Distribution of modified unRAR sources in separate
Re:Know your audience (Score:2)
And there lies the problem that I mentioned. The compression algorithm isn't Free.
But why bother catering to his business when you could use something like 7-zip?
It also makes sense from a business perspective: formats which are distributed gai
Re:Know your audience (Score:2)
7-zip is also on the win32 boxes I administer & winzip isn't. But this kind of use is not enough to prove the kind of maturity that I'm talking about. One piece of evidence is that most users aren't using it for 7zip archives. Another is that the *nix version of 7-zip is about 8 months old and listed as beta [sourceforge.net]. Only in October did KDE [kde.org] and GNOME [gnome.org] add support IN THEIR CVS. S
.zip file ease-of-use beats out saving 4 bytes (Score:3, Interesting)
zip or tgz: yes. bzip2 sadly no. (Score:2)
I recently tried unpacking a bzip2 package under Windows. It took me ages to find something that would recognize it and extract it. Which is a shame, because it is a nice format, as long as you aren't doing this a lot, since it takes more time.
However, winzip out of the box will open tarballs and of course zip. And gzip / unzip are pretty much universal on *nix. I have however found that very large tarballs can be a problem with Winzip (like 100+ MB) but that was a long time ago.
And I would never, for the origin
Zip bad for multiple files with same name (Score:2)
foo.txt
I've also seen zip fail completely trying to compress a directory structure containing very large numbers of small files (> 10,000).
I always use RAR unless I know the recipient can't handle a RAR file.
Whatever you choose... (Score:4, Insightful)
Using Free software will help you achieve your number one goal: that everyone can access the data, now and forever.
Re:Whatever you choose...Agreed (Score:1)
Open algorithms, open source, no BS, makes your choice easy.
Re:Whatever you choose...Agreed (Score:2)
I can spec and write and offer to the public the SuperXtreme Archive format, and make my data available only in that format. Unless there is a compelling reason for others to switch to SXA (general adoption), it's essentially proprietary to my site and won't really be of any use to anyone.
OSS is not the be-all and end-all of utility or availability, only of portability.
Two formats (Score:2)
Then, make a more efficiently compressed one for those who know how to download and use it. Bzip2 seems to be the current favorite, especially for text.
Security (Score:1)
My other comment is to do some experiments with *your* data -- which format actually yields the best compression rate, and how much more time do you spend doing the compression / un
Re:Security (Score:2)
My data doesn't exist yet, that's part of the problem. I need one or a few good general-purpose archive formats, the particulars of which may perform better on one dataset than another. Decompression time matters because it will affect end-users, but compression ti
Technical format comparison chart (Score:2, Informative)
It doesn't directly address relative compression ratios nor benchmarks. And it's mostly about the formats themselves, not the libraries that implement them. But it's still good for a birds-eye view, I think.
[1] http://darbinding.sourceforge.net/about_dar.php [sourceforge.net] (The chart is at the bottom of the page.)
/dev/null (Score:1)
Re:/dev/null (Score:2)
You can always get your data back just fine from /dev/random. You just have to figure out where it starts there, which can sometimes be difficult.
don't use rar, arj, 7zip, etc (Score:4, Informative)
These formats (rar, arj, 7zip, whatever) are not as widely supported as (a) tar/gz, tar/bz2, (b) gzip/bzip2, (c) zip. You can get (a)-(c) on just about every platform. Numerous times I've downloaded "unzip" to make (c) work. It's so simple.
Another point not mentioned elsewhere: virus scanners: at least with the popular tar/zip formats, you know that virus scanners understand them and can look for problems.
Sure, you may get a few extra features or a little more room out of non-standard archivers, but that's largely not an issue.
tar is bad for integrity (Score:2)
There are horrors that arise with tar. First, there are multiple tar record formats. The original tar only supported 14-character file names (original unix file system limitation). Along came a second tar format, but even that ended up with variants. Most people are using the GNU tar fo
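One way to sidestep the tar-variant mess, at least with GNU tar, is to request the standardized POSIX pax format explicitly rather than taking the default GNU variant (this assumes GNU tar; the --format option is not in every tar implementation):

```shell
# Create a tarball in the portable POSIX pax format explicitly.
set -e
cd "$(mktemp -d)"
mkdir d
echo hello > d/file.txt
tar --format=posix -cf portable.tar d/   # POSIX.1-2001 pax record format
tar tf portable.tar                      # lists d/ and d/file.txt
```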
stuffit (Score:2)
Stuffit also has error correction (Score:1, Interesting)
I just went on a search for some ten year old data and the first place I found it, it was corrupted. Thank goodness for redundancy. I finally found an uncorrupted version but it took me a couple of days which could have been p
Re:Stuffit also has error correction (Score:2)
In this case, corruption is not an issue. I intend to keep redundant backups of the original datasets, so even if the web-based archives get corrupted I will be able to recover the data.
Re:stuffit (Score:3, Informative)
*Battens down the hatches for an incoming barrage of slashdot traffic*
For Longevity (Score:5, Insightful)
Pick any system for which the source code is available, eg .tar.bz2
Anything else is gambling.
I still gamble, but only that a C compiler will exist in the future.
LZIP (Score:1, Funny)
My Preference (Score:3, Informative)
OS X & UNIX: I'm lazy, just tar.gz.
For Win32, I back-up a lot more files under win32 than *nix.
Compression
WinRAR [rarlab.com]
Compression Method: Best
Split to Volumes: 20MB
Parity
QuickPar [quickpar.org.uk]
With general settings.
I back-up to decent quality DVD media, as I have had a lot of problems with CD media rotting [mv.com] after about a year.
You're looking for something that doesn't exist. (Score:2)
Dictionary-based compression schemes work well on data which might be described as "linguistic," i.e., data which has some kind of grammar describing it. English text, machine code (binaries), source code, HTML, etc. It won't work very well at all on audio or image data, at least not without some kind of preprocessing
Re:You're looking for something that doesn't exist (Score:2)
Compression on any sort of data. Give me a good general-purpose compressor. Give me a good one just for audio, just for video, just for text, whatever. I have no idea what kinds of data I'm going to get, or how much of each kind there might be. A custom compressor for audio (FLAC) will almost always outperform a ge
Re:You're looking for something that doesn't exist (Score:2)
There isn't any such thing, unless you use a very narrow definition of "general purpose." To me, general purpose would imply that I could throw any sort of data at it I please, and it would do well, provided the data is not just random data. Since we're working with very vague definitions I can only give very vague suggestions.
To answer your specific questions, a good dictionary compressor is the Deflate algorithm used by gzip. Ve
Re:You're looking for something that doesn't exist (Score:2)
My "general purpose" is exactly what you said, without the "would do well" part, or substituting "would not do horribly" in its place. I know that all algorithms have strengths and
Re:You're looking for something that doesn't exist (Score:2)
Whether text compression needs to be lossless is actually debatable. I'm gonna veer a little off topic here, but hey, it's Slashdot...
Suppose you are compressing English text by Huffman encoding entire words at a time. However, people make typos, so the actual set of words to be encoded will be larger than a set where there were no typos. By first runni
Re:You're looking for something that doesn't exist (Score:2)
Don't forget the concept of letter order [cam.ac.uk] in English (and some foreign) text, it might be possible to alpha-sort the interior of words and still present readable text, although there is an example included that shows it might not be the best idea:
A dootcr has aimttded the magltheuansr of a tageene ceacnr pintaet who deid [cam.ac.uk]
Re:You're looking for something that doesn't exist (Score:2)
BTW, I sent you an email suggesting a few more data sets you might host on your site.
Being one who also generates multi-GB of data.. (Score:3, Interesting)
not what you asked, but... (Score:2, Informative)
Never backup using tar.gz - use tar.bz2 instead.
Not on topic, but I had to say it, 'cause I've been hurt (nothing like finding that your
uharc (Score:2, Informative)
you need real time web compression (Score:2, Insightful)
1. Your web server can display the uncompressed version of the file on the server; 2. then the user starts the download from the browser; 3. the web server compresses it on the fly and delivers it to the browser, which unzips it when it's done.
This saves you time from having to zip and unzip all the time and it SERVES your original purpose.
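The on-the-fly idea looks like this at the shell level: stream files through tar and gzip with no staged archive on disk. A web server would send this pipe straight to the client; here it lands in a file just to show the pipeline (names illustrative):

```shell
# Compress a directory on the fly: no intermediate .tar is ever written.
set -e
cd "$(mktemp -d)"
mkdir data
seq 1 1000 > data/sample.txt
tar cf - data/ | gzip -c > stream.tar.gz   # tar to stdout, gzip the stream
tar tzf stream.tar.gz                      # verify the streamed archive
```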
Example: High Voltage SID Collection (Score:2)
Though personally, I prefer 7-zip and Stuffit (can't wait for the new version).
Large File/Archive Support (Score:3, Interesting)
Whatever happened to ARJ? (Score:1)
Can anyone enlighten me on the fate of this once most favoured compression algorithm?
Best way to distribute archives... (Score:1)