File Systems Best Suited for Archival Storage?
Amir Ansari asks: "There have been many comparisons between various archival media (hard drive, tape, magneto-optical, CD/DVD, and so on). Of course, the most important characteristics are permanence and portability, but what about the file systems involved? For instance, I routinely archive my data onto an external hard drive: easy to update and mirror, but which file system provides the best combination of reliability, future-proofing, data recovery, and availability across multiple platforms (Linux, OS X, BeOS/Zeta and Windows, in my case)? Open Source best guarantees the future availability of the standard and specification, but are file systems such as ext2 suitable for archival storage? Is journaling important?"
If you are worried about interoperability use FAT (Score:3, Insightful)
If you are not constantly editing the information (and you won't be; it's for archival purposes), the admittedly major downsides of not being journaled and being prone to fragmentation are non-issues. You might run into problems with capacity limits and/or file size limits, though.
Don't use FAT (Score:3, Interesting)
I use Ext3 for my backup drive, and this driver [fs-driver.org] for when I need to attach it to a Windows box.
Re: (Score:2)
The first limit is almost a non-issue. The FAT-32 filesystem supports up to 8 TB. Of course, Windows XP can't format a volume over 32 GB, but you can always create the volume in another way---in Windows 98, ME, or Vista; in Linux; or using a third-party formatting tool. Once you have created a larger volume, Windows (even XP) should be able to handle it just fine.
FAT Limitations [microsoft.com]
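On Linux, for example, making a big FAT32 volume is a one-liner (a sketch; assumes the dosfstools package, and /dev/sdb1 is just a stand-in for your external drive's partition):
# double-check the device name before running this!
mkfs.vfat -F 32 -n ARCHIVE /dev/sdb1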
individual file size limit of 4gb though... (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
It's a fine and stable filesystem in itself, but if it doesn't work on
Re: (Score:2)
I'm not really much of a fan of ext3. I recommend it for cross compatibility, but nothing else. I tried using ext3 for storage of MythTV recordings, and that turned out to be problematic - it would frequently crash if there was too much going on. Recording one show, transcoding another and watch
Re: (Score:1)
By "fanboy" I was more referring to my bringing up the Mac in a Windows/Linux-centered discussion, although I suppose I could put in my 2 and suggest using HFS... OS X and Ubuntu have coexisted nicely on my Powerbook for a few months now and I have yet to see any problems with the Linux support for HFS; on the other hand the Mac ext2 driver crashes constantly.
PSA: Worse comes to worst (Score:1, Interesting)
The best archival filesystem (Score:4, Funny)
Of course, that's just my opinion--then again, I could be wrong.
Re: (Score:3, Interesting)
Matrix codes (Score:1)
I'm sure that a while ago I read about a system that could print encoded data onto paper at a reasonably high density (e.g. not readable by a human, but easily decoded with a scanner). At a 'plucked out of the air' figure of 0.25mm x 0.25mm per 'bit', and an equally 'plucked out of the air' figure of 11 bits of data per byte (to allow for clocking and maybe some error correction), you'd fit about 80 kbytes on a single page of A4.
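Quick sanity check on that figure (back-of-the-envelope only; the 0.25 mm cell and 11-bits-per-byte overhead are the numbers plucked out of the air above, and margins are ignored):
# A4 is 210 mm x 297 mm; at 4 cells per mm that's 840 x 1188 = 997,920 cells per page
echo $(( 840 * 1188 / 11 ))   # ~90,000 bytes of payload, i.e. roughly 88 KB
Knock off printer margins and some sync/clock overhead and you land right around the 80 kbyte mark.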
What you read about is a matrix code [wikipedia.org] (no, not backwards kana). 4 pixels per mm times 25.4 mm per inch = 100 dpi. Olympus Dot Code [skynet.be], used by Nintendo's e-Reader, is 3 times finer than that, at a bit over 300 dpi, improving data density by an order of magnitude.
Re: (Score:2)
But in theory there's no reason why you can't do a 2D bar code at high resolution across a page. You wouldn't want to use regular toner though, since it sticks to the pages; you'd want to use real ink that sinks in. Preferably pigment inks instead of dyes, too.
The problem
Re: (Score:1)
Re: (Score:3, Insightful)
Try thin metal plates. A little more difficult to etch by hand (which can be alleviated by using the right malleability of gold), but well worth it for the long-term benefits of damage-resistance and portability.
Re: (Score:1)
Unfortunately you'll have to alloy it heavily in order to get the $/byte down!
Re: (Score:3, Insightful)
The downside of gold is that invading Conquistadors (or otherwise no-good people) might try to melt it down into bars or bullion, destroying your data.
Re: (Score:1)
Nobody expects the Spanish Inquisition! (Score:2)
Re: (Score:1)
Re: (Score:2)
Re: (Score:1)
Re: (Score:1)
Your plan has one flaw--if one tried to record data in gold, it might just happen that people in the future would find the medium (gold) much more valuable than the message.
One only needs to look at ancient civilizations for historical precedent. Alas, one man's gold is another man's date with a prostitute.
Re: (Score:2)
And let's define "byte" as "inscribed letter".
Re: (Score:1)
seriously, though, I was just commenting about this. Don't use HDDs for archives -- they fail. Use tape, DVD, or some other media-only (i.e. no embedded electronics to fail) device.
Re: (Score:1)
Don't overlook popularity (Score:3, Insightful)
Re: (Score:1)
You could set it up in a RAID-0 (if it's only 2 disks) or RAID-5 with 1 redundancy (3 or more disks). But you could also consider CDFS or UDF... put your data on CDs/DVDs. If you have overly large files, you could use RAR to break them up into little pieces to burn to CD. If you want free online storage for few gigs of very important files, just open up as many gmail accounts as you need, compress your files with RAR into 1MB pieces and upload ea
Re: (Score:2)
Last I checked, RAR compression isn't available on any default installation of Linux, Windows, or Mac.
RAR may be the best or most versatile, but every time I've had to un-RAR something, I either used a trial version or a cracked version.
Not something I want to trust 15 years later.
Re: (Score:2)
Re: (Score:2)
You meant to say RAID-1 (mirroring), I'm sure.
Re: (Score:3, Insightful)
Re:Don't overlook popularity (Score:4, Informative)
The killer feature back in the day was the first good implementation of disk splitting. But the compression still stands up now.
On my 'if I ever get free time' list is to implement RAR's file ordering in GNU tar to see if that helps gzip and bzip2 catch up to RAR's compression ratio. But I've no idea if/when I'll ever get around to that.
-- paid-up RAR user since 1996.
Re: (Score:2)
What RAR is good for is cross-platform file splitting, parity files (a concept borrowed from RAID 5 and co.), and the ability to archive without compressing. This is why it is used in the "scene" all the time.
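A sketch of all three in one go (sizes and names are arbitrary):
# store-only (-m0), 5% recovery record (-rr5%), split into 100 MB volumes (-v100m)
rar a -m0 -rr5% -v100m backup.rar ~/archive/
# later, "unrar x" on the first volume rejoins the set, and the recovery record
# can patch up minor media damage (exact volume naming varies by RAR version)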
Re: (Score:1)
The command we used:
rar a -m5 -s -mc63:128t -mdg -mcc -en -tsm0 -tsa0 -tsc0 -ri1:10 ${todaysDate}.rar "*"
-m5 == maximum compression
-s == solid archive, the real saver for multiple copies of same file
-mc63:128t == text compression (PPM algorithm
Re: (Score:3, Insightful)
I wouldn't say the same goes for RAR. It's a proprietary format, owned by a company and used mainly for piracy. I know you can extract it on many OS tod
Re: (Score:1)
How Archival? (Score:5, Insightful)
How far into the future are you going to need it? I understand the whole "not wanting to become unreadable" concern, but honestly, no one's going to bother re-implementing a filesystem to look at their old vacation photos. Pick a popular filesystem, and you'll be sure of support down the line. FAT's still doing just fine for itself, and the ISO filesystems for CDs and DVDs will be readable as long as people are making drives for them.
All of the data integrity features on filesystems aren't going to protect against disk failure/media wearing out, and error correction on that scale is beyond the scope of any one disk to handle. Like the department jokingly advised, parity files and other methods can handle this in a robust, media-spanning manner, and protect against everything from a few flipped bits to a whole-disk data loss (assuming you have enough parity data).
I think the reason not much talk about filesystems has been going on is because they're mostly irrelevant for this task. They're designed to handle the issues of a live environment; the issues that archives face are beyond the capability of how you choose to store your data on each piece of media to solve.
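At minimum, a periodic integrity check costs next to nothing; a minimal sketch (file names are placeholders):
sha256sum ~/archive/* > archive.sha256   # record checksums once, next to the data
sha256sum -c archive.sha256              # re-verify whenever the media is touched or migrated
That only detects rot, of course; for actual repair you want the parity-file approach discussed elsewhere in the thread.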
Re:How Archival? (Score:4, Interesting)
However, you're correct that both are ubiquitous standards and likely to be readable by all modern operating systems and should be for some time to come.
No Filesystem is Best (Score:2, Interesting)
Now which archive format is best - tar, cpio, etc.? I've heard that cpio is a much simpler underlying format.
And if you have the space, write the archive sequence multiple times on
Re:No Filesystem is Best (Score:5, Insightful)
If you made 2 copies of the archive on the media, and piece 10 of both sets die, you've lost everything. If you made 1 copy of the archive, and a 10% par set, any 10% of the pieces (data and parity both) could die and you'd still have your data. If you made a 100% par set, you could lose half of the data and parity and still recover. And it doesn't matter which portions.
Add to that the fact that if you lost piece 10 in archive 1, and piece 9 in archive 2, it would be not much fun to figure out the dead pieces and make a full archive again. With PAR2, the tool will do the work for you.
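With par2cmdline that works out to something like this (a sketch; the 10% figure and names are just examples):
split -b 100m -d backup.tar backup.tar.     # cut the archive into 100 MB pieces
par2 create -r10 backup.par2 backup.tar.*   # add 10% recovery data across all pieces
par2 verify backup.par2                     # later: see which pieces survived
par2 repair backup.par2                     # rebuild any missing or damaged pieces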
Re: (Score:2)
Pack unpar+untar+gunzip along with each part (Score:1)
the advantage of tar is that it has been around forever and will probably be around forever, even though "better" solutions like PAR have been created. I'd be concerned that somebody will come up with a "better" solution than PAR and implementations of PAR might be hard to find in the distant (e.g. decades away) future.
Any harder than implementations of tar, or even implementations of sh if you want to put GNU tar in .shar format? At least .shar is reasonably human readable and can be unpacked if you don't have a Bourne shell handy. My recommendation: On each volume, include two archives: a .shar archive containing the source code for a PAR reassembler, .tar unpacker, and .gzip decompressor in a widely supported language such as C, and a "part" of your .tar archive.
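With GNU sharutils, that layout is about this much work (a sketch; the source file names are hypothetical stand-ins):
shar par2-reassembler.c untar.c gunzip.c > tools.shar   # human-readable, self-unpacking bundle
cp tools.shar backup.tar.part07 /media/volume07/        # one bundle plus one tar piece per volume
Decades later, "sh tools.shar" unpacks the sources; failing that, open the .shar in any text editor and pull them out by hand.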
Re: (Score:2)
Re: (Score:2, Informative)
Son, look... (Score:1, Flamebait)
"There have been many comparisons between various archival media". ARCHIVAL is the key word. As in "we don't move these files around a lot". As in "It doesn't make much sense as an end user to discuss the undelying filesystem to something which is used to just have files sit around, as long as it's stable". Buy something that has dedicated commercial support for the next 20-40 years, like the LTO standard and call me in the morning.
bad advice (Score:4, Insightful)
You mean like DEC or any of the other out-of-business dinosaurs?
As someone who has been through this, I can only say: do NOT buy anything that depends on "dedicated commercial support"; the companies and industry standards you think are going to be around for "20-40 years" are probably either not going to be, or they are not going to give a damn about you.
Use open standards and open formats, with multi-vendor support; that's the only way to go. And you need to keep your eyes open and move to new formats and standards as the world changes.
If LTO is the right choice, it's the right choice because of that. But I'm not convinced that LTO is going to be long-lived enough as a standard, no matter how many companies have tied their fortunes to it right now.
Re: (Score:2)
DEC might be gone, but companies still support DLT hardware.
There is no best (Score:1)
Since you want reliability and portability, I recommend DVD+Rs. They are certainly more reliable than an external hard drive, and more portable too.
It really depends on how you use your archive. Since you carry your archive around, I would recommend against an external hard drive since they can be quite fragile.
Your file system choice depends a lot on the storage technology choice. Of course for the previously mentioned, 9660 would meet your needs the be
What about error correction? (Score:5, Interesting)
In hardware, there's often a checksum, ECC/Hamming code, parity bit, Reed-Solomon code, etc. to detect and/or correct for inadvertent bit flips. But, as far as I know, no error correcting information is ever stored within the filesystem itself. Certainly the filesystem tracks how many blocks are dedicated to a particular file, and how many bytes long the file is, and one can always hash the file twelve ways to Sunday to assure that it hasn't changed since it was originally hashed, but none of that helps repair errors to the file should the medium that's being used to store it decay beyond what's already correctable via the medium access hardware.
I can imagine scenarios where, for example, the RAM buffer in a hard drive is upset and perfectly encodes the wrong bit into a file (or even multiple stripes + parity in a RAID). In this case, the medium access hardware is useless (the data was, after all, encoded perfectly wrong), but ECC in the filesystem would detect and potentially correct the error the next time the file was read back, even if it were decades later. I appreciate that it would add overhead, and thus maybe shouldn't be the default, but I don't see it being even an option anywhere, and some people would pay the performance penalty to get the data integrity benefit.
Especially in instances like encrypted (or compressed, or both) loopback file systems where one bad bit can destroy an entire partition, why don't we have more data assurance layers available? Or have I just not found them?
Whining of which, what was the deal with GNU ecc? Everyone speaks of "oh, yeah, the algorithm was deeply flawed, bummer..." but I don't ever see any details
Re: (Score:2, Informative)
Re: (Score:2)
In an archival setting, I'd rather get back corrupted data than no data at all. A filesystem that aborts on checksum errors would therefore be a bad choice when faced with that problem.
The question isn't so theoretical. NAND flash requires forward error correc
Re: (Score:2)
Re: (Score:2)
.par2 (Score:2)
I've been wondering lately why no common file systems seem to implement error correcting codes (ECC/EDAC).
Because user mode tools such as PAR2 [wikipedia.org] already implement them.
I can imagine scenarios where, for example, the RAM buffer in a hard drive is upset and perfectly encodes the wrong bit into a file
Likewise, I can see scenarios where, for example, the RAM buffer in an application's main memory or in the file system's buffer is upset and perfectly encodes the wrong bit into a file.
Agreed (Score:2)
Being able to transparently divide files above 4gig and have them look like a single file to a supported OS would be gravy.
Re: (Score:2)
My understanding is th
This question keeps popping up (Score:4, Insightful)
My take on this problem is that you should use the best you reasonably can today. Then in 5 years' time, when there is a new technology out there, move over to that for archiving your new data AND move your old data over while you still have working hardware.
I went from floppy disks to LS-120 drives. From LS-120 drives to CDs. From CDs to DVDs. I'll go from DVDs to whichever of HD or BD seems best in a couple of years (unless something else crops up). I might use hard drives instead but I'm not sure yet. The point is I don't need to decide until I need to store that much.
If you're playing in the big leagues do the same with the various formats of giganto capacity tape storage etc.
Plan around the shelf-life and working life of the hardware you can get and the answer drops out.
Hardware isn't everything (Score:3, Insightful)
Sure, I could use archives with checksums or RAID, but it'd be nice if there was an option to sacrifice some speed and space on a single form of storage to improve the reliability without going to such cumbersome lengths.
Re: (Score:2)
I've not tried using dvdisaster but it does seem to fit these requirements.
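For reference, dvdisaster's command-line mode goes roughly like this (a sketch; I'd double-check the flags against the man page for your version):
dvdisaster -i backup.iso -e backup.ecc -c   # create an error-correction file for the image before burning
dvdisaster -i backup.iso -e backup.ecc -t   # later: test the (possibly degraded) image against the ECC data
dvdisaster -i backup.iso -e backup.ecc -f   # repair the image using the ECC data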
Simple... (Score:4, Interesting)
Ext2 fs mounted rw,sync. When just reading, or just writing, async can't possibly help performance. You're strictly limited by disk I/O. Async will, however, cause irrecoverable corruption if there's a system crash or power failure, which was a source of great frustration with Linux before the journaling filesystems came along.
Ext2 can be read by nearly every operating system out there, and doesn't have the numerous limitations of FAT32.
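Concretely (a sketch; the device and mount point are placeholders, and obviously only format a blank disk):
mkfs.ext2 -L archive /dev/sdb1
mount -t ext2 -o rw,sync /dev/sdb1 /mnt/archive   # synchronous writes, no async write-back cache to lose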
Which, incidentally, is the exact same answer I gave a few months ago, when the last guy wrote an Ask Slashdot to ask the exact same question...
Re: (Score:1)
Re: (Score:2)
Downloading Explore2fs isn't all that difficult.
Official or not, with Darwin and Ext2 both being open source, it should be quite easy for anyone who cares enough to want to do it.
With any archival process... T
Re: (Score:2, Insightful)
Re: (Score:2)
As I said... *IF* somebody cares enough. Apparently, no one does.
You're making a lot of assertions and speculation there. Most of which I don't happen to believe.
Even if your premise was true, there's no way I could possibly guess why nobody has felt the need to do it. And the fact that it doesn't exist certainl
Is "sticktion" still a problem? (Score:2, Interesting)
If you leave a drive in a closet for 10 years, will it still spin up?
Re: (Score:1)
Re: (Score:2)
Non-IT answer (Score:5, Interesting)
If you want archival storage, you need to have your data on- or near-line, and rewrite the data to the "new" hardware every couple of years. By choosing a filesystem that is current, you are more likely to be able to read it in a couple of years than if you (try to) stick with a single filesystem. I know this sounds like a lot of work, but if the data is truly worth archiving, it's worth keeping both the storage mechanism and format up to date.
Re: (Score:2)
I think it is better to use FAT32 since both can read it.
What do you think?
Worry about the hardware, not software (Score:5, Interesting)
Personally, I just throw stuff on external hard drives. 3-5 years later, the new drives are so much bigger, faster, and cheaper that it becomes economical to consolidate to a new drive. I still have data from a 286 that had nothing but floppies, an Apple ][e, and 2 dead Macintoshes. Even my old Windows 95 computer lives on as a VirtualPC image. I don't really use them that much, but the Apple ][e and 286 stuff is under 50 megs, and the VirtualPC image is 2GB. The images of the old Mac hard drives total less than 1GB... it's simply not worth deleting them and it's kind of fun to have my old computers still around, if only "virtually".
Re: (Score:3, Funny)
Re: (Score:2)
The problem that many organisations face today is the long term storage of data, however it is not a simple matter of just archiving data, it is knowing if you can retrieve and reuse that data. Alright I have over simplified
Ext2, GNU tar (Score:2)
Re: (Score:1)
For most people this is a recipe for disaster. He was smart enough to know what an ext2 partition was and just smart enough to destroy most all of the acc
I use ext3 (Score:3, Insightful)
- it is much better and more reliable than FAT32
- it is both open source and (relatively) widely used, so I expect there will always be some way to read it
- it can easily be read by attaching it to any machine and booting some Linux LiveCD or bootable USB.
- the OS which traditionally can read ext2/3 is itself open source and also widely used, so there is no fear that it would become unavailable
For archival and backup, I feel all these advantages far outweigh the slight inconvenience that the disks are not readable directly by Windows and Mac, requiring either a driver or a reboot into Linux.
The important point is to label the disks very clearly. Otherwise, someone connecting them to a Windows or Mac machine may believe the disk is empty and re-partition/re-format it! I would not only put a big explanatory label on the disk's case, but also name the volume something like "Linux-..." or "Linux-ext3-...", and also explain to persons involved (manager(s) + people handling the disks) that they are not readable in Windows (some people don't read even big labels...).
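Setting the label itself is a one-liner (a sketch; the device name is a placeholder, and ext2/3 labels max out at 16 characters):
e2label /dev/sdb1 Linux-ext3-2007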
Keep it simple (Score:2)
Tape (Score:3, Informative)
If it's your job, sure, you'll do it whether it's convenient or not.
If it isn't, you'll quickly get tired of messing with CDs, plugging/unplugging hard drives, etc. So I went with the most convenient media possible: tape. Stick a tape into the drive, walk away, and store it when the drive spits it out. It doesn't interfere with the computer's usage since nothing else uses tape.
For absolute convenience, get a tape robot from ebay. Then it can be completely automatic.
Filesystem: use plain tar to write to the tape. If you must use compression, compress files individually, not the whole tape.
Paranoid implementation: Tapes have file marks. You can ask the tape drive to give you file #1, for instance. You can use this to store some useful stuff in a format that will always be recoverable so long as you have a drive that can read the tape. Store like this:
File 1: Text document explaining what's all this stuff, and what's on the tape.
File 2: RFC for tar format
File 3: RFC for compression format
File 4: source for tar program
File 5: source for decompression program
File 6: backup
A tape formatted like this should be readable so long as a drive capable of reading the data on it survives. To ensure that, go with a popular tape format which is reliable, open, and has a high capacity (so that it's unlikely to become obsolete too fast).
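On Linux that layout falls out of the non-rewinding tape device almost for free; a sketch (device and file names are placeholders):
mt -f /dev/nst0 rewind
for f in README.txt tar-spec.txt gzip-rfc.txt tar-src.tar.gz gzip-src.tar.gz; do
    tar -cf /dev/nst0 "$f"      # each tar invocation becomes its own tape file
done
tar -cf /dev/nst0 ~/archive/    # file 6: the backup itself
# to read the backup later: mt -f /dev/nst0 rewind; mt -f /dev/nst0 fsf 5; tar -xf /dev/nst0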
Take a look at Venti (Score:1)
ZFS - FTW (Score:4, Informative)
I've thrown all kinds of problems at it, and it has yet to lose a single byte of data.
Add to that snapshots every x minutes, and you can look back in time as easily as reading a folder.
With RAIDZ2 in the latest releases, you can set up sets that can withstand the loss of 2 physical drives. If you couple multiple RAIDZ2 sets into a single pool, you've increased the redundancy even further. With plain old JBOD and multiple controllers, you can reach levels of availability that only expensive EMC/Hitachi/StorEdge systems have reached in the past.
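For instance (a sketch with made-up Solaris-style disk names):
zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
                  raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
zfs create tank/archive
zfs snapshot tank/archive@2007-06-01         # snapshots are cheap; take one per archive run
ls /tank/archive/.zfs/snapshot/2007-06-01/   # browse any old snapshot like a folder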
It's open source as well (although it's the Sun flavor at this time), and being worked on at www.opensolaris.org. I believe Sun is contemplating switching it to GPL at this time.
Re: (Score:2)
What are your parameters? (Score:2, Interesting)
Do you need storage for years, decades, centuries, millennia, 10,000 years, or longer?
Do you need an indexing system based on content or just on title/filename?
Can the data be printed out or carved into stone without losing important information?
Is this a go-to-jail-if-you-don't legal requirement, a may-go-bankrupt-if-you-don't business requirement, or a save-us-a-bunch-of-money-nice-thing-to-have requirement?
Do you think the cost of res
Re: (Score:1)
My hunch: For most applications involving less than 50 year data retention, making 2 copies of the raw data, to a currently supported stable media such as tape or archival DVD, stored in separate locations, is key. Make sure the data is both in the original format and in a published-standard format which is widely supported. Keep multiple machines that can read the data around for as long as you need the original format. Every few years or as needed, verify the data is intact, re-convert the data from the original format or, if that format is unreadable, the highest-fidelity published-standard format, to a currently-supported published standard, and save it to a currently-supported archival format.
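A rough sketch of the "two copies, published format, separate locations" step (assumes mkisofs/growisofs are available; paths are placeholders):
mkdir staging && cp -a originals exports-odf staging/   # original files plus an open-format export, side by side
mkisofs -R -J -o archive-2007.iso staging/              # master one image containing both
growisofs -dvd-compat -Z /dev/dvd=archive-2007.iso      # burn it; burn a second disc for the other location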
Interestingly enough, this is very similar to the process developed by the National Archives of Australia http://www.naa.gov.au/recordkeeping/preservation/digital/summary.html/ [naa.gov.au]. They are saving the 'original' document and a version converted to an open format (e.g. Open Document Format for word processing documents). If the format changes, they will use the converted version to generate something in the new format. They will be doing it for stuff that needs to be kept a lot longer than your arbitrary 50 ye
define really important (Score:1)
I'd say "really important" is stuff that needs to survive a collapse of technology or even civilization, but not the collapse of literacy. Things like the Rosetta Stone or the modern equivalents, basic instructions for subsistence farming, core religious texts such as the Bible and Koran, dictionaries, some history books, instructions for making a printing press and other basic inventions that could have been built 4,000 years a
for hard disk media? Sun's ZFS, hands down (Score:2)
"Why ZFS for home": - http://uadmin.blogspot.com/2006/05/why-zfs-for-hom e.html [blogspot.com]
"Here are ten reasons why you'll want to reformat all of your systems and use ZFS.": http://www.tech-recipes.com/rx/1446/zfs_ten_reason s_to_reformat_your_ [tech-recipes.com]...
And some more technical explanations from Sun's Chief Engineer: - http://blogs.sun.com/bonwick/entry/zfs_end_to_end_dat [sun.com]
Re: (Score:1)
Look! There's even some blogs about it. What could go wrong?
Re: (Score:2)
Now, couple this with Sun's test lab, where they've subjected ZFS to MILLIONS of intentionally data-disrupting incidents, such as reformatting hard drives in the pools, removing power from hard drives, writing random data to disks in the pools, pulling SCSI cables from systems, physically powering off the system, re-cabling the disks and boxes so that they are on
My recommendation: HFS Extended (Score:2)
ZFS (Score:2)
This is great. Previously I had implemented a fax archive for a client and it was getting corrupted periodically because of some ext3 fi
The only format universally accessible (Score:2)
Is the VFAT 32-Bit MS-DOS file system made available in Windows 95. On CD-ROM/DVD-ROM it's essentially the same as the "Joliet" format. It supports file names up to 63 characters, subdirectories, and blanks in file names. Now, I could be wrong but I think journaling is only important where you have transactional-based file systems, where you are doing update writes and want faster performance with the ability to recover in the event of failure of a transaction to finish, i.e. the computer is rebooted bef
My Suggestion (Score:2)
This cuts the field down. ISO 9660 would be a good bet, but is a bit "overkill". TAR format (which can be viewed as a "primitive" filesystem) would be my choice. Simple, can be read on all your target systems. If a tar client is not (for whatever reason) easily available, the data can still be simply extracted.
Bad point: the "directory" can only be obtained by scanning the entire byte stream. If that is tolerable (and, by indexing the fi
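Either way, the basic round trip is trivial (a sketch; names are placeholders), and saving the listing alongside the archive gives you a cheap up-front directory:
tar -cf archive-2007.tar photos/ documents/      # create the "primitive filesystem"
tar -tvf archive-2007.tar > archive-2007.index   # full listing, kept next to the archive as an index
tar -xf archive-2007.tar                         # extract everything back out on any of the target systems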