
Ask Slashdot: Free/Open Deduplication Software?
First time accepted submitter ltjohhed writes "We've been using deduplication products for backup purposes at my company for a couple of years now (DataDomain, NetApp, etc.). Although they've fully satisfied the customer needs in terms of functionality, they don't come cheap — whatever the brand. So we went looking for some free dedup software. OpenSolaris, using ZFS dedup, was the first that came to mind, but OpenSolaris' future doesn't look all that bright. Another possibility might be utilizing LessFS, if it's fully ready. What is Slashdotters' favourite dedup flavour? Is there any free dedup software out there that is ready for customer deployment?" Possibly helpful is this article about SDFS, which seems to be along the right lines; though the changelog appears stagnant, there's some active discussion.
I've wanted deduplication for a long time! (Score:4, Interesting)
That deduplication for NTFS is really interesting [fosketts.net], actually. It's not licensed technology but straight from Microsoft Research and it has some clever aspects to it.
Some technical details about the deduplication process:
Microsoft Research spent 2 years experimenting with algorithms to find the “cheapest” in terms of overhead. They select a chunk size for each data set. This is typically between 32 KB and 128 KB, but smaller chunks can be created as well. Microsoft claims that most real-world use cases are about 80 KB. The system processes all the data looking for “fingerprints” of split points and selects the “best” on the fly for each file.
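The split-point selection described above can be sketched with a rolling fingerprint. This is a generic content-defined chunking illustration, not Microsoft's actual (unpublished) algorithm: the 32 KB / 128 KB bounds come from the article, while the fingerprint function and mask are invented for the sketch.

```python
import os

MIN_CHUNK = 32 * 1024          # bounds from the article
MAX_CHUNK = 128 * 1024
MASK = (1 << 13) - 1           # a split point fires when the low 13 bits are 0

def chunk(data: bytes):
    """Split data at content-defined boundaries between MIN and MAX size."""
    chunks, start, fp = [], 0, 0
    for i, byte in enumerate(data):
        fp = ((fp << 1) + byte) & 0xFFFFFFFF   # toy rolling fingerprint
        size = i - start + 1
        if (size >= MIN_CHUNK and (fp & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, fp = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because boundaries depend on content rather than offsets, inserting a byte early in a file only disturbs the chunk containing the insertion; later chunks resynchronize and still deduplicate.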
After data is de-duplicated, Microsoft compresses the chunks and stores them in a special “chunk store” within NTFS. This is actually part of the System Volume store in the root of the volume, so dedupe is volume-level. The entire setup is self-describing, so a deduplicated NTFS volume can be read by another server without any external data.
There is some redundancy in the system as well. Any chunk that is referenced more than x times (100 by default) will be kept in a second location. All data in the filesystem is checksummed and will be proactively repaired. The same is done for the metadata. The deduplication service includes a scrubbing job as well as a file system optimization task to keep everything running smoothly.
Windows 8 deduplication cooperates with other elements of the operating system. The Windows caching layer is dedupe-aware, and this will greatly accelerate overall performance. Windows 8 also includes a new “express” library that makes compression “20 times faster”. Compressed files are not re-compressed based on filetype, so zip files, Office 2007+ files, etc will be skipped and just deduped.
New writes are not deduped – this is a post-process technology. The data deduplication service can be scheduled or can run in “background mode” and wait for idle time. Therefore, I/O impact is between “none and 2x” depending on type. Opening a file is less than 3% greater I/O and can be faster if it’s cached. Copying a large file can make some difference (e.g. 10 GB VHD) since it adds additional disk seeks, but multiple concurrent copies that share data can actually improve performance.
The most interesting thing is that Microsoft Research says it doesn't affect performance almost at all. So when are we going to see Linux equivalents? Linux is falling behind on new technologies.
Re:I've wanted deduplication for a long time! (Score:5, Insightful)
The most likely answer is when someone is willing to pay for it. What you have described above isn't a trivial effort, and it's unlikely someone is going to do the work for free, so it will have to wait until someone is willing to pay for the development. Even then, the developer may well keep it closed source in order to recoup the investment.
Acronis (Score:3)
Acronis Backup & Recovery 11 Advanced Server has deduplication (licensed addon) and runs on Linux. At roughly $2000 it ain't cheap. I've never used it so can't comment on how well it does.
Re: (Score:2, Informative)
Acronis is a bloody NIGHTMARE to deal with. We have a mixed shop here, and after seeing what Acronis does on Windows I vetoed the idea of having it on our mission critical linux servers.
I have never seen such a useless backup product before I started working with Acronis. Most backup systems let you set it up once and they WORK. Acronis is always getting itself wedged (dur, a metadata file I miswrote yesterday is corrupt, I will just hang), and when wedged it hangs ALL backup jobs, not just the one that is
Re: (Score:2)
I've been having to hand-hold Backup Exec for the better part of a year now. We more or less finally have it working, but I wouldn't speak too highly of it. We use BE's dedup features, and while it seems to work reasonably well, one day we went to make a standard administrative change to the folder (sharing it within BE to another media server), and *poof*, all of the data got corrupted. A week on the phone with Sym
Re: (Score:2)
The best backup solution is a damn rsync script that replicates to another server.
Re: (Score:2)
The best backup solution is a damn rsync script that replicates to another server.
Yes, because that provides the ability to restore previous backups, and if someone makes a mistake on the primary, the problem does not wipe out your only good copy, and...
Oh wait...
Re:Acronis (Score:4, Informative)
In Linux, I would avoid any backup system that doesn't support hard links, long paths, file attributes, file access control lists and SElinux contexts.
Some of the "offerings" out there are so Windows-centric that they can't even handle "Makefile" and "makefile" being in the same directory.
In Windows, I would require that it backs up and restores both the long and the short name of files, if short name support is enabled for the file system (default in Windows). Why? If "Microsoft Office" has a short name of MICROS~3 and it gets restored with MICROS~2, the apps won't work, because of registry entries using the short name.
I'd also look for one that can back up NTFS streams. Some apps store their registration keys in NTFS streams.
In all cases, Acronis does not measure up to what I require of a backup program. Also because the restore environment doesn't even work unless you have hardware compatible with its drivers. You may be able to back up, and even boot the restore environment, but not do an actual restore from it.
ArcServe is better - for Linux, it still lacks support for file attributes and the hardlink handling is rather peculiar during restore, but at least handles SElinux and dedup of the backup.
An option for dedup on Linux file systems would be nice - the easiest implementation would be COW hardlinks. But like for Microsoft's new NTFS, you'd need something that scans the file system for duplicates. And it better have an attrib for do-not-dedup too, because of how expensive COW can be for large files, or to avoid file fragmentation for specific files.
Re: (Score:2)
Acronis works fine for me.
I do a daily, full-disk backup to an external drive, excluding *.bt! and the steamapps folder, as well as whatever defaults Acronis uses for temp files.
Acronis writes out a full file every once every 2 weeks, and incremental files the rest of the time.
It keeps the last 2 months of backups before wiping them out.
I have indeed used my backups to restore. It's as simple as sticking the disc in (easier than finding a blank USB drive, keeping one for Acronis, or trying to finagle it in
Re: (Score:3, Informative)
We check backups and run a test restore on each and every server every month (we had this rule before we started with Acronis).
Acronis is awful. Frankly, someone should open a fraud investigation. Acronis products have no business being sold as enterprise backup solutions. Fucking ntbackup is far more reliable.
Right NOW I am wrestling with Acronis Backup & Recovery 11's retarded cousin, Acronis Recovery for Microsoft Exchange. I seriously want to sue the sadistic and incompetent assholes at Acronis for
Re: (Score:2)
Acronis Backup & Recovery 11 Advanced Server has deduplication (licensed addon) and runs on Linux. At roughly $2000 it ain't cheap. I've never used it so can't comment on how well it does.
At $2000 it is cheap. Ever price a Data Domain?
Re: (Score:2)
What you have described above isn't a trivial effort
nor is Linux itself, or any of the features it has. Being a bit tricky has never stopped any Linux developer from producing fantastic stuff; why should this be any different from something like LVM, ext4, or GFS?
Re: (Score:3)
Even Richard Stallman agreed that there are cases where GPL is not necessary. So stop being an ass.
Re:I've wanted deduplication for a long time! (Score:5, Insightful)
And now, even the next version of Windows Server will contain integrated data deduplication technology! [...] The most interesting thing is that Microsoft Research says it doesn't affect performance almost at all.
Well, ask anyone who lost documents on DoubleSpace volumes or got corrupted media files on Windows Home Server, and they will tell you: even if Microsoft Research says so, it's not something I would put on my production servers any time soon.
Re:I've wanted deduplication for a long time! (Score:5, Funny)
When will Adblock Plus block these Microsoft ads on Slashdot?
Re: (Score:3, Insightful)
This has got to be some sort of smear campaign against Microsoft, because I cannot believe that they would think that bludgeoning people with astro-turf is going to get them sales. The first two articles with MS shilling I saw (today!) I wrote off as just people sharing interesting stuff that happened to come from MS, but thanks to your over-heavy hand, the pattern is clear as a bell now.
So tell me, MS marketing people: are you seriously this incompetent, or did a new astro-turf campaign incentive get out of
Re: (Score:2, Insightful)
Heh, the comment is completely on topic, interesting and factual and yet we still have fucktards insisting that any mention of MS must have been paid for.
Get a life.
Re: (Score:2)
If you're referring to GPLJonas' comment as containing "perfect grammar, good use of white space, buzz words and directness" then I'll take that as a compliment! Everything but the first three and last paragraphs were copied from my blog post! (http://blog.fosketts.net/2012/01/03/microsoft-adds-data-deduplication-ntfs-windows-8/)
FWIW, I'm not a Microsoft shill or astroturfer. I'm a blogger who happened to write about MS' dedupe yesterday and it got quoted here. I only noticed this thread thanks to all the Slashdo
Re: (Score:2)
No, no, he has a point. A typical Slashdot poster can't go through two sentences without some grammatical faux pas.
Re: (Score:2, Informative)
Interesting link, but it doesn't look like Microsoft has actually released this yet and it is only slated to be released with Win 8 server, and it will come with some caveats.
FTFA:
"It is a server-only feature, like so many of Microsoft’s storage developments. But perhaps we might see it deployed in low-end or home servers in the future.
It is not supported on boot or system volumes.
Although it should work just fine on removable drives, deduplication requires NTFS so you can forget about FAT or exFAT. A
Re: (Score:2)
I compressed some of my Steam games recently to free up disk space on my laptop. If I remember correctly, most of them shrunk by 5-10% while some got around 30%.
So not great, but it freed up a few gigabytes that allowed me to install a few more smaller games.
Re: (Score:3)
NTFS's file compression actually rather sucks.
It's fine as long as you use it properly. I use it for IIS logfiles. I want to keep the logfiles but rarely actually access them, and they are append only, and they are plain text. Very high compression at a very small loss of performance.
Compressing binary data in your working set is, as you point out, probably a bad idea, but as long as you don't do anything stupid you shouldn't have any problems.
Re: (Score:3)
One instance by default is brain-dead. Lose that and they're ALL gone, if it happens between dedup and backup. You should have at least two copies, on different media, if possible, always, of any data with more than one reference, as well as backing up your data, of course.
deduplication is just compression (Score:2)
Both deduplication and conventional compression are a questionable idea on a file server which has many clients.
The most interesting thing is that Microsoft Research says it doesn't affect performance almost at all.
Yeah right. I'll wait for real usage numbers rather than "the vendor selling this stuff to me says it's fucking awesome".
Re: (Score:2)
Both deduplication and conventional compression are a questionable idea on a file server which has many clients.
Out of curiosity, why do you say that? I assume what you're saying is true regardless if we're talking Linux or Windows or any OS.
Re:I've wanted deduplication for a long time! (Score:4, Interesting)
I have often wondered why someone doesn't use the rsync algorithm as a basis for this kind of chunking and deduplication. I imagine a FUSE-based filesystem that breaks the application-level files into checksummed pieces and stores both the file fragments and file descriptions into an underlying filesystem. Then it could reconstruct the application-level files on demand, using the description to draw out the right fragments.
From an academic point of view, it already solves the same problem and just needs some repackaging. It breaks arbitrary data into phrases to be identified by checksum and located in another existing corpus of data. It just needs a metadata model to record the structure of the file as composed of these canonical phrases, rather than performing the actual file reconstruction immediately as rsync does now.
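The piece of rsync that makes this cheap is its weak rolling checksum, which can slide a window one byte in O(1) instead of rehashing the whole block. A minimal sketch (function names are mine, not rsync's):

```python
def weak_sums(block):
    """rsync-style weak checksum, split into its two 16-bit components."""
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * x for i, x in enumerate(block)) & 0xFFFF
    return a, b

def roll(a, b, outgoing, incoming, blocksize):
    """Slide the window one byte to the right in O(1)."""
    a = (a - outgoing + incoming) & 0xFFFF
    b = (b - blocksize * outgoing + a) & 0xFFFF
    return a, b
```

This is what lets a chunking filesystem test every byte offset for a known-block match at essentially no cost, then confirm candidates with a strong hash.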
From my cynical point of view, I realize someone may have patented the repackaging, in the same way that Apple seems to think they can re-patent every idea "on a smartphone".
Re: (Score:2)
Microsoft's NTFS technology relies on cleanups taking place in the background; it appears to be a nightmare waiting to happen... :|
How so? Worst case scenario is that some recent fraction of your data is possibly partly duplicated.
Re: (Score:3)
I think file-level de-dupe is usually a lot less effective because it can't accommodate files that differ only slightly but are otherwise the same, whereas block-level de-dupe works with everything.
I also don't know what happens in your scheme when you have "de-duped" a file that's the same in 4 different directories but then one application wants to change "its" version of the file. It sounds like it trashes the file for the three other uses of it since there's no way to automate copy-on-write with your s
Re: (Score:3, Informative)
Isn't de-dup a fairly trivial application for a DB of MD5sums, even if you don't have the chops to use the filesystem at a more fundamental level?
Yes, but in that case, two multi-GB files that share all of their data except one bit will not be deduplicated. The difference between your approach and Microsoft's is grounded in the same thought process that makes modern compression algorithms better than older ones:
First you treat all files separately, which is really simple but has the drawback of not cross-linking chances for compression/dedup across files. This is what deflate (ZIP/GZIP) and your approach to dedup do. The same data simply gets recompress
Re:I've wanted deduplication for a long time! (Score:5, Interesting)
Also, perhaps the reason that Linux does not support file-system compression on the fly is because it's a horrid idea, and should never actually be used?
Ah, the "Terrible idea" objection.
This is a common objection to implementing ideas on Linux - so common, in fact, that it's successfully held Linux back for at least ten years.
Multi-master LDAP replication? Terrible idea [ietf.org]. Remained terrible for several years after literally every commercial LDAP server on the planet supported multi-master replication, only became non-harmful when OpenLDAP started to support it in version 2.4.
Active Directory support? Such a terrible idea that it's held Samba development back by at least five years. Even now, when Windows Vista deprecates NT4-style policies and Windows 7 drops NT4 domain support altogether (which is about all you get from Samba 3), Samba 4 is still considered alpha software.
Some sort of centralised work-together system that integrates email, address book, calendars, task-list? Terrible idea. So much so that Exchange (despite being way too complicated for its own good) is still an extremely popular email solution and the closest you can get to a viable F/OSS alternative either requires your users to completely re-think how they collaborate (yuck) or buy the commercial version simply because the free version lacks vital features.
Free clue to all naysayers who work on F/OSS projects: If you spent as long trying to think of ways to make something work as you do thinking of objections to existing implementations and explaining how you're right and everyone else is wrong, you wouldn't be ten years behind the times.
Dragonfly BSD's HAMMER... (Score:5, Interesting)
...includes dedupe.
There was a blog entry a while ago where someone, on a machine with 256MB of RAM, deduped 600GB down to 400GB with fine performance. This is much unlike ZFS, which wants the entire dedupe tree in memory and requires gigs and gigs of RAM.
OpenSolaris but not FreeBSD? (Score:4, Informative)
ZFS in FreeBSD 9 has deduplication support. I've been running the betas / release candidates on my NAS for a little while (everything important is backed up, so I thought I'd give it a test). ZFS development in FreeBSD is funded by iXSystems [ixsystems.com], who sell expensive NAS and SAN systems so they have an incentive to keep it improving.
I have a ZFS filesystem using compression and deduplication for my backups from my Mac laptop. I copy entire files to it, but it only stores the differences.
Re:OpenSolaris but not FreeBSD? (Score:4, Informative)
People considering either dedup or compression on FreeBSD should be made blatantly aware of one issue which exists solely on FreeBSD. When using these features, you will find your system "stalling" intermittently during ZFS I/O (e.g. your SSH session stops accepting characters). Meaning, interactivity is *greatly* impacted when using dedup or compression. This problem affects RELENG_7 (which you shouldn't be using for ZFS anyway, too many bugs), RELENG_8, the new 9.x releases, and HEAD (10.x). Changing the compression algorithm to lzjb brings a big improvement, but the stalling is still easily noticeable.
My point is that I cannot imagine using either of these features on a system where users are actually on the machine trying to do interactive tasks, or on a machine used as a desktop. It's simply not plausible.
Here's confirmation and reading material for those who think my statements are bogus. The problem:
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012718.html [freebsd.org]
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012752.html [freebsd.org]
And how OpenIndiana/Illumos solved it:
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012726.html [freebsd.org]
Re: (Score:3)
So for now, let's not use compression on write-heavy volumes, where it adds next to no benefit anyway. Problem solved, or at least circumvented, no?
Yeah, and ZFS is still a major new feature to FreeBSD, regardless. I know people in the FreeBSD forums talk about running ZFS on root, but if you don't have the hardware you shouldn't run ZFS. We're in Unix land here, the OS isn't going to keep you from doing anything...but that doesn't mean you should!
Re: (Score:3)
ZFS deduplication works on a block level, so it's not only storing the differences. That would require byte-level deduplication.
For example, imagine I had a block size of two bytes, and had the following two files:
02468A
012468A
These two pieces of data are identical except for one inserted byte, but deduplication will be forced to store both files in their entirety. Why? Because even though only one byte changed, none of the blocks are identical:
02 46 8A
01 24 68 A-
This is a fairly common case in many kinds o
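The boundary-shift effect in the parent's example is easy to reproduce. A toy sketch using the same hypothetical two-byte blocks:

```python
BLOCK = 2   # toy block size from the example above

def block_set(data: bytes):
    """The set of fixed-size blocks a block-level dedup engine would see."""
    return {data[i:i + BLOCK] for i in range(0, len(data), BLOCK)}

original = b"02468A"
shifted = b"012468A"      # same content with one byte inserted at the front
print(block_set(original) & block_set(shifted))   # -> set(): nothing shared
```

Content-defined (variable-size) chunking avoids this by picking boundaries from the data itself, so chunks after the insertion point line up again.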
Re: (Score:3)
That's why Apple swapped their monolithic data blobs (for iPhoto, Aperture, etc) in to smaller files when Time Machine was released--like Sparse Bundle Disk Images, for example. Since the data is banded across multiple files, when some data is changed, you only need to back up the affected bands. I believe they marketed this as "Time Machine Aware" and advised third-party apps to adopt this approach. I thought Time Machine's approach was clever given the constraints of a traditional file system (in lieu
Re:OpenSolaris but not FreeBSD? (Score:5, Interesting)
I've got 8GB of RAM in the machine - RAM is so cheap now that it didn't seem worth skimping. It's a 1.6GHz AMD Fusion system. Over GigE, I was getting 40MB/s writes to the deduplicated filesystem, with the load on one core about 100%.
ZFS definitely likes RAM. I'm not sure what the minimum requirements are, but the general recommendation is 'as much as you can afford'. I think 8GB of SO-DIMMS for the mini-ITX board cost about £40, and maxed out its memory, so that was a pretty obvious choice. I'm not sure what happens when the deduplication tables don't fit into RAM, whether it degrades performance or degrades deduplication efficiency. Having 8GB means that a lot of the time it can satisfy reads from RAM.
I'm using it over WiFi 99% of the time, so I'm not too bothered about the performance: it can easily saturate the WiFi link without any problems. The compression ratio is 1.11x. ZFS only shows the deduplication ratio for the entire pool, not for individual filesystems. That's currently 1.06x for my system, but that's with 1.43TB of data in total and only 266GB on the deduplicated filesystem, so it's saving about a third there. Roughly speaking, the extra space used by RAID-Z and the space saved by dedup seem to balance each other out, so (on my backup filesystem) I am using 1GB of hard disk space for every 1GB of data, and still have redundancy so one disk out of the three can fail without losing any data.
Time Machine on OS X does clever things like make a new copy of a 10GB file if 1MB of it has changed, and the deduplication on the NAS translates to a huge space saving for that. For things like DV footage, I don't bother with the dedup.
Roll your own? (Score:2)
you could run a nightly script to find duplicates and then deduplicate them.
An example of this would be: find all new files since the last run, checksum them, and compare the checksums to those of all previously examined files. Once you find the likely duplicates, you can decide how careful you want to be about verifying identity of both data and metadata. For example, do you want to preserve the attribute dates if the data is identical? For some programs the creation dates matter. Likewise file permissions
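The checksum-and-compare pass might look like this sketch (names are illustrative; it treats equal SHA-256 digests as duplicate candidates, leaving byte-for-byte verification and the metadata policy to the caller):

```python
import hashlib
import os

def file_digest(path):
    """SHA-256 of a file, read in 64 KB pieces to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for piece in iter(lambda: f.read(1 << 16), b""):
            h.update(piece)
    return h.hexdigest()

def find_duplicates(root):
    """Map digest -> list of paths; lists longer than 1 are dup candidates."""
    seen = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            seen.setdefault(file_digest(p), []).append(p)
    return {d: ps for d, ps in seen.items() if len(ps) > 1}
```

A real nightly job would persist the digest table between runs and only hash files newer than the last run, as the comment suggests.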
Re: (Score:2)
Block-level (rather than file-level) de-duplication would be drastically better for virtual machine images, which are most of the largest files I back up. Email spool files (or pst files) are another example of large files that tend to change only in portions of them.
Re: (Score:2)
So long as you never add or remove data from the middle of the file, sure. Otherwise a one byte change in either direction can require re-storing all the rest of the unchanged data because it might not line up on the same block boundaries.
Re: (Score:2)
Certainly, there are many scenarios where block-level deduplication does wonders. It just doesn't work well in all cases. For example, trying to use deduplication to store incremental database backups. Record sizes are not fixed in a sql dump, so you're likely to waste large amounts of space for small differences. I do use ZFS deduplication to store incremental backups from a variety of machines, and generally it works well, but the SQL dumps were one of the annoying bits.
Re: (Score:2)
Think about things line Virtual Machine images - they could have a *lot* in common, but not everything. DeDupe gets useful once you can work at a sub-file level.
Re: (Score:2)
Nah, dirvish uses the native rsync --link-dest, which is easier.
Re: (Score:3)
Hard links are awesome, but they're limited to a per-file basis. SDFS [opendedup.org] and other block-level de-dupers will only store unique blocks. E.g. storing multiple virtual machine images -- as each image is one huge file, hard links do nothing.
FreeBSD has ZFS (Score:3, Informative)
FreeBSD, and FreeNAS which is based on FreeBSD, both come with ZFS. Neither is going away anytime soon.
I use both at home and am happy as a clam.
No dedup in FreeNAS (Score:5, Informative)
However, FreeNAS supports ZFS v15, which doesn't have support for deduplication.
Re: (Score:2, Funny)
Best get your eyes checked, I think that was our cat's anus you were in.
My wife and I love to sit around and de-dupe all night.
Lessfs is slow on Atom (Score:4, Informative)
I've used LessFS. On my "server", powered by an Intel Atom, it is very, very slow. It writes at about 5MB/sec, even when everything is inside a RAM disk.
You can't use a block size of 4KB, otherwise write speeds drop to around 256KB/sec; you need to use at least 64KB.
Re: (Score:3, Informative)
No you didn't.
You got 40MB writing to memory cache possibly, not the ZFS store.
I have a quad-disk, 8-core, 8 GB machine that ONLY does ZFS. Sustaining 40MB/s doesn't happen without special tuning, turning off write cache flushing, and a whole bunch of other stuff... unless I stay in memory buffers. Once that 8 gigs fills, or the 30-second default ZFS flush timeout hits, the machine comes to a standstill while the disks are flushed, and at that point the throughput rate drops well below 40MB/s since it i
Re:Lessfs is slow on Atom (Score:4, Interesting)
You got 40MB writing to memory cache possibly, not the ZFS store.
I did? That's interesting. I copied 500GB of data from an external FireWire disk attached to my laptop to the NAS via a GigE connection, yet the NAS only has 8GB of RAM. That's one hell of a compression algorithm they're using for the RAM cache...
Re: (Score:2)
With deduplication, RAM is more important than CPU speed. If you can fit the checksums for every block of data in RAM, then you're golden. If not, deduplication can get extremely slow, no matter how fast the processor.
Just store everything in /dev/null (Score:2)
Look, no duplicates!
Re:Just store everything in /dev/null (Score:5, Funny)
$ md5sum < /dev/null
d41d8cd98f00b204e9800998ecf8427e  -
$ cat aaah.png > /dev/null
$ md5sum < /dev/null
d41d8cd98f00b204e9800998ecf8427e  -
Duplicates!
BackupPC (Score:5, Interesting)
Check out BackupPC. Been using it for about 5 years at our company, admittedly a mostly Linux shop, with great results. Deduplication on a per-file basis, block-based transfers via the rsync protocol, and a good web-based UI (at least in terms of function). Thanks to deduplication we are getting about a 10:1 storage compression backing up servers and workstations: a total of 1.28 TB of backups in 130.88 GB of used space.
Backup or live FS? (Score:3, Interesting)
Your post doesn't make it clear if you're looking for a free backup product to replace DataDomain, NetApp, etc. or if you're now wanting to dedup on live filesystems.
If you're looking for a free backup product that supports deduplication, look at backuppc . Powerful and complex, but free. I've used it for years with good results.
dragonflybsd (Score:2, Informative)
So you want dragonfly BSD with a hammer filesystem.
An excellent and stable BSD and an excellent filesystem to go with it. And a very helpful community.
Why not just (Score:2)
Why not just do the following every once in a while:
1. go through all your files,
2. for each file, compute a checksum (e.g. using the unix tools md5sum or sha1sum),
3. for pairs of files giving similar checksum, compare them (optionally) and if equal remove one of them and make it a hard-link to the other.
It would surprise me if there were no free open-source script doing exactly this.
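Step 3 might look like this sketch (helper name is invented; it verifies contents byte-for-byte before linking, since checksums alone can collide). One caveat, raised in the replies below: after hard-linking, an in-place write through either name changes both.

```python
import filecmp
import os

def hardlink_duplicate(keep, dup):
    """Replace `dup` with a hard link to `keep` after verifying contents."""
    if os.path.samefile(keep, dup):
        return                              # already the same inode
    if not filecmp.cmp(keep, dup, shallow=False):
        raise ValueError("checksum collision: contents differ")
    tmp = dup + ".dedup-tmp"
    os.link(keep, tmp)                      # create the link first...
    os.replace(tmp, dup)                    # ...then swap it in atomically
```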
Re: (Score:2)
for pairs of files giving similar checksum
MD5 / SHA1 / most good hashing algorithms will give completely different hashes for a single bit of difference in the data
remove one of them and make it a hard-link to the other
What then happens when you write to one of the files?
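The avalanche effect being described is easy to demonstrate by flipping a single bit of the input:

```python
import hashlib

a = b"The quick brown fox jumps over the lazy dog"
b = bytes([a[0] ^ 0x01]) + a[1:]        # flip one bit of the first byte

da = hashlib.md5(a).hexdigest()         # 9e107d9d372bb6826bd81d3542a419d6
db = hashlib.md5(b).hexdigest()
bits_changed = bin(int(da, 16) ^ int(db, 16)).count("1")
print(bits_changed, "of 128 digest bits differ")
```

This is exactly why hash-based dedup only groups *identical* items: "similar checksum" is not a meaningful notion for cryptographic hashes.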
Dedup is just a marketing word.... (Score:3, Informative)
It needs an incredible amount of memory to operate effectively.
from my university notes:
5TB data, average blocksize 64K = 78125000 blocks
for each block the dedup needs 320 bytes so
78125000 x 320 byte = 25 GB dedup table
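Restating the notes' arithmetic as a quick check (the figures above use decimal units: 64K = 64,000 bytes, GB = 10^9 bytes):

```python
data = 5 * 10**12            # 5 TB of data
block = 64_000               # average block size
entry = 320                  # bytes of dedup-table state per block

n_blocks = data // block
table_bytes = n_blocks * entry
print(n_blocks, table_bytes)  # 78,125,000 blocks -> 25 GB of dedup table
```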
use compression instead. (eg zfs compression)
Re: (Score:2)
If you've got 5TB of files, 25GB doesn't seem so huge. Besides, you assume that it needs to keep track of every block, when you only need to hash every file. But yes, that is a massive burden on resources; an implementation would require loads of RAM or a large swap file, all just to save disk space at the cost of extra processor/RAM work.
Re: (Score:2)
besides you assume that it needs to keep track of every block, when you only need to hash every file
In the context of ZFS, you're working with block-level deduplication.
if you've got 5 tb of files, 25gb doesn't seem so huge
Absolutely true! ZFS will happily store its dedupe block lists in any cache device you've added to a pool, so add a $100 SSD to that pool and get a huge performance boost at the same time.
but yes, that is a massive burden on resources
Depends what you're using it for. Suppose you're running ZFS with dedupe on your virtualization farm's backend storage. Each of those VMs will start off with the same base image. Create a new VM by simply copying that base image and you'll only pay for t
Re:Dedup is just a marketing word.... (Score:5, Informative)
All dedup operations have a trade-off between disk I/O and memory use. The less memory you use, the more disk I/O you have to do, and vice versa.
Think of it like this: You have to scan every block on the disk at least once (or at least scan all the meta-data once, if the CRC/SHA/whatever is already recorded in meta-data). You generate (say) a 32-bit CRC for each block. You then [re]read the blocks whose CRCs match to determine if the CRC found a matching block or simply had a collision.
The memory requirement for an all-in-one pass like this is that you have to record each block's CRC plus other information... essentially unbounded from the point of view of filesystem design and so not desirable.
To reduce memory use you can reduce the scan space... on your first pass of the disk only record CRCs in the 0x0-0x7FFFFFFF range, and ignore 0x80000000-0xFFFFFFFF. In other words, now you are using HALF the memory but you have to do TWO passes on the disk drive to find all possible matches.
The method DragonFly's HAMMER uses is to allocate a fixed-sized memory buffer and start recording all CRCs as it scans the meta-data. When the memory buffer becomes full DragonFly dynamically deletes the highest-recorded CRC (and no longer records CRCs >= to that value) to make room. Once the pass is over another pass is started beginning with the remaining range. As many passes are taken as required to exhaust the CRC space.
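A toy sketch of that bounded-memory, multi-pass approach: the eviction rule (drop the highest recorded CRC when the buffer fills, defer everything at or above it to the next pass) follows the description above; everything else, including the in-memory list of blocks, is invented for illustration.

```python
import zlib

def crc_match_groups(blocks, max_entries):
    """Group block indices by CRC using at most max_entries table slots,
    sweeping the CRC space in as many passes as the memory bound forces."""
    groups, lo = [], 0
    while lo <= 0xFFFFFFFF:
        table = {}                     # crc -> [block indices], bounded
        hi = 0x100000000               # exclusive top of this pass's range
        for i, blk in enumerate(blocks):
            c = zlib.crc32(blk)
            if not (lo <= c < hi):
                continue
            if c in table:
                table[c].append(i)
            elif len(table) < max_entries:
                table[c] = [i]
            elif c < max(table):
                hi = max(table)        # buffer full: evict the highest CRC
                del table[hi]          # and stop recording anything >= it
                table[c] = [i]
            else:
                hi = c                 # can't record c; defer it and above
        groups += [g for g in table.values() if len(g) > 1]
        lo = hi                        # next pass resumes where we gave up
    return groups
```

Matching groups would then be read back and compared byte-for-byte before any block references are actually merged, exactly as the verification pass below requires.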
Because HAMMER stores a data CRC in meta-data the de-dup passes are mostly limited to just meta-data I/O, plus data reads only for those CRCs which collide, so it is fairly optimal.
This can be done with any sized CRC, but what you cannot do is avoid the verification pass... no matter how big your CRC or SHA-256 or whatever, you still have to physically verify that the duplicate blocks are, in fact, exact duplicates before you de-dup their block references. A larger CRC is preferable to reduce collisions, but diminishing returns build up fairly quickly relative to the actual amount of data that can be de-duplicated. 64 bits is a reasonable trade-off, but even 32 bits works relatively well.
In any case, most deduplication algorithms are going to do something similar unless they were really stupidly written to require unbounded memory use.
-Matt
Re: (Score:2)
You then [re]read the blocks whose CRCs match to determine if the CRC found a matching block or simply had a collision.
I know you already know this, but for the benefit of other readers: ZFS gives you the option of using no deduplication, using SHA256 as the checksum and trusting its results, or using SHA256 but verifying all collisions. I can't really see the point of not verifying collisions in "strong" hashing algorithms, though - it's not like you're going to get SHA256 collisions so frequently that you can't afford to explicitly verify them.
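For reference, those three modes map onto ZFS's `dedup` dataset property (the pool/dataset name below is a placeholder):

```shell
# no deduplication (the default)
zfs set dedup=off tank/backups

# dedupe trusting the SHA256 checksum alone
zfs set dedup=sha256 tank/backups

# dedupe, but byte-compare blocks whose hashes match first
zfs set dedup=sha256,verify tank/backups
```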
Re:Dedup is just a marketing word.... (Score:5, Informative)
Well, I can tell you why the option is there... it's not because of collisions, it's there to handle the case where there is a huge amount of actual duplication where the blocks would verify as perfect matches. In this case the de-duplication pass winds up having to read a lot of bulk-data to validate that the matches are, in fact, perfect, which can take a lot of time versus only having to read the meta-data.
Just on principle I think it's a bad idea to just trust a checksum, cryptographic hash, CRC, or whatever. Corruption is always an issue... even if the filesystem code itself is perfect and even if the disk subsystem is perfect there is so much code running in a single address space (i.e. the KERNEL itself) that it is possible to corrupt a filesystem just from hitting unrelated bugs in the kernel.
Not to mention radiation flipping a bit somewhere in the cpu or memory (even for ECC memory it is possible to get corruption, but the more likely case is in the billions of transistors making up a modern cpu, even with parity on the L1/L2/L3 caches).
Hell, I don't even trust IP's stupid simple 1's complement checksum in HAMMER's mirroring protocols. Once during my BEST Internet days we had a T3 which bugged out certain bit patterns in a way that actually got past the IP checksum... we only tracked it down because SSH caught it in its stream and screamed bloody murder.
If you de-duplicate trusting the meta-data hash, even a big one, what you can end up doing is turning nine good copies and one corrupted copy of a file into ten de-duped corrupted copies of the file.
I'm sure there are many data stores that just won't care if that happens every once in a while. Google's crawlers probably wouldn't care at all, so there is definitely a use for unverified checks like this. I don't plan on using a cryptographic hash as large as the one ZFS uses any time soon, but being able to optimally de-dup with 99.9999999999% accuracy is a reasonable argument for having one that big.
-Matt
Re: (Score:2)
Another side note on DragonFly's HAMMER: de-duplication is implemented both as a daily pass AND can also be enabled for live writes. The daily pass can find all duplicate blocks. The live dedup uses a small fixed in-kernel-memory LRU style record of recent data block CRCs to find de-duplication candidates during live writes. Performance impact is minimal either way as recently recorded CRCs also tend to still have their data in the buffer cache.
The live-dedup mostly exists to get some up-front deduplica
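That live-write side can be modeled as a small bounded map of recent block CRCs (a toy sketch with invented names, not HAMMER's actual code):

```python
import zlib
from collections import OrderedDict

class LiveDedup:
    """Toy model of write-path dedup with a bounded LRU of recent CRCs."""

    def __init__(self, max_entries=1024):
        self.recent = OrderedDict()   # crc -> block offset on "disk"
        self.max_entries = max_entries
        self.disk = {}                # offset -> data (stand-in for media)
        self.next_offset = 0

    def write(self, data):
        """Return the offset the data lives at, reusing a recent match."""
        crc = zlib.crc32(data)
        offset = self.recent.get(crc)
        # CRCs collide, so verify the bytes before sharing the block
        if offset is not None and self.disk[offset] == data:
            self.recent.move_to_end(crc)      # refresh LRU position
            return offset                     # deduplicated write
        offset = self.next_offset             # miss: allocate a new block
        self.next_offset += 1
        self.disk[offset] = data
        self.recent[crc] = offset
        self.recent.move_to_end(crc)
        if len(self.recent) > self.max_entries:
            self.recent.popitem(last=False)   # evict least-recent CRC
        return offset
```

Because only recently written CRCs are remembered, the memory cost stays fixed and verification reads tend to hit data still in the buffer cache, which matches the rationale given above.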
Re: (Score:2)
Re: (Score:3)
HAMMER's online deduplication doesn't need an incredible amount of memory because it will only match duplicate blocks whose checksums it has cached. If a checksum is not cached, HAMMER won't deduplicate the block even if a duplicate exists on disk, because it won't know the duplicate exists. That's my interpretation of the coder's explanation, anyhow ( http://leaf.dragonflybsd.org/mailarchive/users/2011-04/msg00044.html [dragonflybsd.org] ).
HAMMER's offline deduplication doesn't have this limitation because it can re-i
Re:Dedup is just a marketing word.... (Score:4, Informative)
Yes, this is correct.
For on-line de-duplication the most optimal case in my view is to only de-dup data which may already be present in the buffer cache from prior recent operations, so the on-line dedup only maintains a small in-kernel-memory table of recent CRCs. This catches common operations such as file and directory tree copying fairly nicely.
The off-line dedup catches everything using a fixed amount of memory and multiple passes (if necessary) on the meta-data, then bulk data reads only for those blocks which appear to be duplicates to verify that they are exact copies.
I've run dedup on a 2TB backup from a VM with as little as 192MB of RAM and it works. A more preferable setup would be to have a bit more memory, like a gigabyte, but more importantly to have a SSD large enough to cache the filesystem meta-data; a 40G SSD is usually enough for a 2TB filesystem. That makes the off-line dedup quite optimal, and it also makes other maintenance and administrative operations on the large filesystem (du, find, ls -lR, cpdup, even a smart diff, let alone rsync or other things one might want to run) go screaming fast without having to waste money buying a bigger system or waste money on excessive energy use.
-Matt
Re: (Score:3)
It does sound like a decent tradeoff. I can say from personal experience with my home file server, which has a deduplicated filesystem used for incremental backups, that ZFS deduplication is painfully slow when you don't have enough RAM. So the idea that you do a best-effort deduplication on write and a best-quality deduplication offline each night would be a big improvement.
What sort of timespans are involved in doing offline deduplications when multiple passes are required? Is it the kind of thing you can do ev
Re:Dedup is just a marketing word.... (Score:4, Informative)
For our production systems it depends 100% on the actual amount of duplicated data, since bulk data reads are needed to verify the duplication. The number of passes is almost irrelevant because they primarily scan meta-data N times, not bulk data (duplicated bulk data only has to be verified once).
The meta-data can be scanned much more quickly than the verification of duplicated bulk data because the meta-data is laid out on the physical disk fairly optimally for the B-Tree scan the de-dup code issues. So meta-data can be read from the hard disk at 40 MBytes/sec even without the use of a SSD to cache it. Of course, with DFly's swapcache and the meta-data cached on the SSD that scan runs at 200-300 MBytes/sec.
But in contrast, the bulk reads used to validate the duplicate data just aren't going to be laid out linearly on the disk. There's a lot of skipping around... so the more actual duplicate data we have the larger the percentage of the disk's surface we have to read to verify it.
This is an area where I could further optimize HAMMER's dedup code. Currently I do not sort the bulk data block numbers when running the data verification pass; worse, I am scanning a sorted CRC list, so the bulk data offsets are going to be seriously unsorted. Sorting them would definitely improve performance, probably quite a bit, but still not be anywhere near the 40 MBytes/sec the meta-data scan can achieve off the platter. It would not be a whole lot of programming, probably a day, but it currently isn't at the top of my list.
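The win from that optimization is easy to see in miniature (a hypothetical illustration, not HAMMER code): sort the distinct block offsets before issuing verification reads so the head sweeps the platter once instead of seeking per CRC match.

```python
def verification_order(candidate_pairs):
    """Given (offset_a, offset_b) pairs of suspected-duplicate blocks,
    return the distinct offsets in ascending order so verification
    reads are issued as one mostly-sequential sweep."""
    offsets = {off for pair in candidate_pairs for off in pair}
    return sorted(offsets)

def total_seek_distance(offsets):
    """Sum of head movement if blocks are read in the given order."""
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))
```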
What this means, in summary (and even with semi-sorting of the bulk data blocks), is that one can use a bounded amount of RAM without really affecting the efficiency of the off-line de-duplication.
-Matt
What is deduplication? (Score:5, Informative)
I had to Google to find out. Here's what I found: http://en.wikipedia.org/wiki/Data_deduplication [wikipedia.org]
Maybe everybody else is familiar with this term except for me, but I find it a bit off-putting that the submitter and the editors didn't offer a small bit of explanation.
Re: (Score:2, Interesting)
Seriously, at this point on Slashdot it's been talked about enough that, unless you bought your UID from someone, you should be fully aware of what it is from here alone.
Personal Project (Score:2)
There are a few hurdles with deduplication that a piece of software will likely need your input for:
- Does something like the directory/file names matter? If so, does case matter? What about comparing ASCII names against Unicode names? Do you compare the DOS 8.3 names also?
- What do you want to do when you find duplicates? Delete the d
Re: (Score:2)
If it's not 100% transparent to the user, it's not good enough. As far as the user is concerned, deduplication isn't happening. There are no evident changes to the filesystem they are using (no case changes, no meta-data issues, no links... 100% transparent to the user).
It's done at block level, below the filesystem.
Whether the data being deduplicated is a directory, an alternate stream, the file data, meta data... is irrelevant.
Say you've got a nice big 5GB Outlook.pst file. And you make a backup copy of it, a
write a script (Score:4, Interesting)
find . -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate
...and so on
Use rsync! (Score:2)
I just use rsync from the command line to do deduplication. Been working like a charm for years.
First I sync from the remote directory to a local base directory:
rsync --partial -z -vlhprtogH --delete root@www.mydomain.net:/etc/ /backup/server/www/etc/base/
Then I sync that to the daily backup. Files that have not changed are hard-linked between all the days that share them. It's very efficient and simple, and retrieving files is as simple as doing a directory search.
rsync -vlhprtogH --delete --link-dest=/ba
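The command above is cut off by the comment preview; the usual shape of the --link-dest pattern (paths here are hypothetical) is something like:

```shell
# hard-link today's snapshot against the previous one; unchanged
# files cost only a directory entry, not another copy of the data
rsync -vlhprtogH --delete \
    --link-dest=/backup/server/www/etc/daily.1/ \
    /backup/server/www/etc/base/ \
    /backup/server/www/etc/daily.0/
```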
BackupPC (Score:3, Informative)
http://backuppc.sourceforge.net/ [sourceforge.net]
Its primary disadvantage is the logical consequence of all those hard links. Duplicating the backup store, so you can send it offsite, is basically impossible with filesystem-level tools. You have to copy the entire filesystem to the offsite media, typically with dd.
It also can make your life difficult if you're trying to restore a lot of data all at once, like after a disaster. You take the offsite disks that you've copied with dd, hook them up, and start to run restores.
The hard links mean lots and lots of disk head seeks, so you are doing random i/o on your restore. This is really slow. If I ever have to do this, my plan is to buy a bunch of SSD's to copy my backup onto. Since there are no seeks on SSDs it will be much faster.
Easy, use OpenIndiana or NexentaStor (Score:5, Informative)
Squashfs creates deduped and compressed archives (Score:3)
Try Squashfs, which creates deduplicated and compressed filesystem archives (see http://www.linux-mag.com/id/7357/ for a good article).
If you're using Ubuntu, Debian, or Fedora, Squashfs will already be built into your distro kernel, and squashfs-tools will already be available in your distro's repository.
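As a quick sketch (paths are placeholders), building and browsing a deduplicated archive looks like:

```shell
# pack a directory tree into a compressed image; mksquashfs
# detects and de-duplicates identical files by default
mksquashfs /srv/data backup.sqsh -comp xz

# loop-mount the image read-only to browse or restore files
mount -o loop -t squashfs backup.sqsh /mnt/backup
```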
LVM magic (Score:3)
There's a form of deduplication supported by the Linux kernel, if you use the logical volume manager. If you create a base LVM device and then create a snapshot of that device, the snapshot only requires sufficient real estate on the host physical volume to store the diffs between the snapshot and the base. You can use this for "freezing" a file system to do back-ups, or for incremental back-ups, or whatever.
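A minimal sketch of that snapshot workflow (the volume-group and LV names are placeholders):

```shell
# create a base logical volume in volume group vg0
lvcreate --size 20G --name base vg0

# snapshot it; the snapshot only consumes space for blocks
# that change on either the origin or the snapshot
lvcreate --size 2G --snapshot --name nightly /dev/vg0/base
```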
My rather limited experience with this is that, if you have more than a few snapshots on a base device, your write performance degrades very rapidly. There's also a hard limit of 255 snapshots per device.
You can also do file-based deduplication with the "rsnapshot" tool, which has been available for many years.
Also, I haven't kept up, but I seem to recall that ZFS for Linux was promising this as a major selling point.
OpenBSD's Epitome (Score:3)
OpenBSD has had the Epitome [openbsd.org] deduplication framework for some time. I believe version 2 is considered production-ready.
FreeBSD + ZFS (Score:3)
Re: (Score:2)
The question is how bright the future will be with Oracle effectively going its own way with ZFS.
Re:FreeBSD (Score:5, Informative)
Re: (Score:2)
It means, however, that ZFS is now forked. ZFS volumes from future Solaris releases may be incompatible with future versions of FreeBSD (or IllumOS or whatnot). And the new approach that they're taking to define which new features are supported is flexible, but opens the door to situations where you can't mount a filesystem because your OS is missing some individual feature that the devs chose not to implement (ZFS currently goes by backwards compatible versions of the filesystem, the new forked version wor
Re: (Score:2)
It means, however, that ZFS is now forked.
True, but this is only a problem if you were planning on ever getting Oracle gear... since this is about free solutions, that shouldn't be an issue :)
Besides, the current ZFS implementation in FreeBSD is compatible with Oracle's version - so it's not currently a practical concern.