Data Storage

Ask Slashdot: Free/Open Deduplication Software? 306

Posted by timothy
from the the-dept-dept-the-from-dept-from-from dept.
First time accepted submitter ltjohhed writes "We've been using deduplication products, for backup purposes, at my company for a couple of years now (DataDomain, NetApp etc.). Although they've fully satisfied the customer needs in terms of functionality, they don't come cheap, whatever the brand. So we went looking for some free dedup software. OpenSolaris, using ZFS dedup, was the first that came to mind, but OpenSolaris' future doesn't look all that bright. Another possibility might be utilizing LessFS, if it's fully ready. What are Slashdotters' favourite dedup flavours? Is there any free dedup software out there that is ready for customer deployment?" Possibly helpful is this article about SDFS, which seems to be along the right lines; though the changelog appears stagnant, there's some active discussion.
This discussion has been archived. No new comments can be posted.

  • by GPLJonas (2545760) on Wednesday January 04, 2012 @03:54PM (#38588736)
    And now, even the next version of Windows Server will contain integrated data deduplication technology! So Linux devs better get working on similar features. I still cannot figure out how NTFS can support compressing files and folders but Linux cannot.

    That deduplication for NTFS is really interesting [fosketts.net], actually. It's not licensed technology but straight from Microsoft Research and it has some clever aspects to it.

    Some technical details about the deduplication process:

    Microsoft Research spent 2 years experimenting with algorithms to find the “cheapest” in terms of overhead. They select a chunk size for each data set. This is typically between 32 KB and 128 KB, but smaller chunks can be created as well. Microsoft claims that most real-world use cases are about 80 KB. The system processes all the data looking for “fingerprints” of split points and selects the “best” on the fly for each file.

    After data is de-duplicated, Microsoft compresses the chunks and stores them in a special “chunk store” within NTFS. This is actually part of the System Volume store in the root of the volume, so dedupe is volume-level. The entire setup is self-describing, so a deduplicated NTFS volume can be read by another server without any external data.

    There is some redundancy in the system as well. Any chunk that is referenced more than x times (100 by default) will be kept in a second location. All data in the filesystem is checksummed and will be proactively repaired. The same is done for the metadata. The deduplication service includes a scrubbing job as well as a file system optimization task to keep everything running smoothly.

    Windows 8 deduplication cooperates with other elements of the operating system. The Windows caching layer is dedupe-aware, and this will greatly accelerate overall performance. Windows 8 also includes a new “express” library that makes compression “20 times faster”. Compressed files are not re-compressed based on filetype, so zip files, Office 2007+ files, etc will be skipped and just deduped.

    New writes are not deduped – this is a post-process technology. The data deduplication service can be scheduled or can run in “background mode” and wait for idle time. Therefore, I/O impact is between “none and 2x” depending on type. Opening a file is less than 3% greater I/O and can be faster if it’s cached. Copying a large file can make some difference (e.g. 10 GB VHD) since it adds additional disk seeks, but multiple concurrent copies that share data can actually improve performance.
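The content-defined chunking described above can be sketched in a few lines. This is a toy illustration of the general technique (a rolling fingerprint over a trailing window, with 32-128 KB bounds), not Microsoft's actual algorithm; the window size, mask, and fingerprint function are invented for the example:

```python
import hashlib
import zlib
from collections import deque

def chunk_spans(data, window=48, mask=(1 << 13) - 1,
                min_size=32 * 1024, max_size=128 * 1024):
    """Yield (start, end) spans chosen by content, not by offset.

    A boundary is declared wherever the CRC of the trailing 48-byte
    window has its low 13 bits zero, subject to min/max chunk sizes
    mirroring the 32-128 KB range described above.
    """
    start = 0
    win = deque(maxlen=window)
    for i, byte in enumerate(data):
        win.append(byte)
        size = i - start + 1
        at_cut = (zlib.crc32(bytes(win)) & mask) == 0
        if size >= max_size or (size >= min_size and at_cut):
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)

def dedupe(data):
    """Store each unique chunk once, keyed by its SHA-256 digest."""
    store, recipe = {}, []
    for lo, hi in chunk_spans(data):
        chunk = data[lo:hi]
        key = hashlib.sha256(chunk).hexdigest()
        store.setdefault(key, chunk)   # duplicate chunks stored once
        recipe.append(key)
    return store, recipe
```

Because split points depend on the bytes themselves rather than on fixed offsets, inserting data near the front of a file only disturbs the chunks around the insertion, which is what makes this approach attractive for dedup.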

    The most interesting thing is that Microsoft Research says it doesn't affect performance almost at all. So when are we going to see Linux equivalents? Because Linux is getting behind on the new technologies.

    • by nanoflower (1077145) on Wednesday January 04, 2012 @03:59PM (#38588796)

      The most likely answer is when someone is willing to pay for it. What you have described above isn't a trivial effort, and it's unlikely someone is going to do the work for free, so it will have to wait until someone is willing to pay for the development. Even then, it's likely that the developer may keep it closed source in order to recoup the investment.

      • Acronis Backup & Recovery 11 Advanced Server has deduplication (licensed addon) and runs on Linux. At roughly $2000 it ain't cheap. I've never used it so can't comment on how well it does.

        • Re: (Score:2, Informative)

          by Anonymous Coward

          Acronis is a bloody NIGHTMARE to deal with. We have a mixed shop here, and after seeing what Acronis does on Windows I vetoed the idea of having it on our mission critical linux servers.

          I have never seen such a useless backup product before I started working with Acronis. Most backup systems let you set it up once and they WORK. Acronis is always getting itself wedged (dur, a metadata file I miswrote yesterday is corrupt, I will just hang), and when wedged it hangs ALL backup jobs, not just the one that is

          • Never used a product that required as much handholding? I see you've never used Backup Exec.

            I've been having to hand-hold Backup Exec for the better part of a year now. We more or less finally have it working, but I wouldn't speak too highly of it. We use BE's dedup features, and while it seems to work reasonably well, one day we went to make a standard administrative change to the folder (sharing it within BE to another media server), and *poof*, all of the data got corrupted. A week on the phone with Sym
            • by jd2112 (1535857) on Wednesday January 04, 2012 @05:42PM (#38590048)
              I've never used any backup solution on any platform that wasn't a complete piece of crap. Backing up to /dev/null is much faster and only slightly less reliable for restores.
              • by DavidTC (10147)

                The best backup solution is a damn rsync script that replicates to another server.

                • by afabbro (33948)

                  The best backup solution is a damn rsync script that replicates to another server.

                  Yes, because that provides the ability to restore previous backups, and if someone makes a mistake on the primary, the problem does not wipe out your only good copy, and...

                  Oh wait...

          • Re:Acronis (Score:4, Informative)

            by arth1 (260657) on Wednesday January 04, 2012 @04:52PM (#38589482) Homepage Journal

            In Linux, I would avoid any backup system that doesn't support hard links, long paths, file attributes, file access control lists and SElinux contexts.
            Some of the "offerings" out there are so Windows-centric that they can't even handle "Makefile" and "makefile" being in the same directory.

            In Windows, I would require that it backs up and restores both the long and the short name of files, if short name support is enabled for the file system (default in Windows). Why? If "Microsoft Office" has a short name of MICROS~3 and it gets restored with MICROS~2, the apps won't work, because of registry entries using the short name.
            I'd also look for one that can back up NTFS streams. Some apps store their registration keys in NTFS streams.

            In all cases, Acronis does not measure up to what I require of a backup program. Also because the restore environment doesn't even work unless you have hardware compatible with its drivers. You may be able to back up, and even boot the restore environment, but not do an actual restore from it.

            ArcServe is better - for Linux, it still lacks support for file attributes and the hardlink handling is rather peculiar during restore, but it at least handles SElinux contexts and dedup of the backup.

            An option for dedup on Linux file systems would be nice - the easiest implementation would be COW hardlinks. But like for Microsoft's new NTFS, you'd need something that scans the file system for duplicates. And it better have an attrib for do-not-dedup too, because of how expensive COW can be for large files, or to avoid file fragmentation for specific files.

          • Acronis works fine for me.

            I do a daily, full-disk backup to an external drive, excluding *.bt! and the steamapps folder, as well as whatever defaults Acronis uses for temp files.
            Acronis writes out a full file every once every 2 weeks, and incremental files the rest of the time.
            It keeps the last 2 months of backups before wiping them out.

            I have indeed used my backups to restore. It's as simple as sticking the disc in (easier than finding a blank USB drive, keeping one for Acronis, or trying to finagle it in

        • by jd2112 (1535857)

          Acronis Backup & Recovery 11 Advanced Server has deduplication (licensed addon) and runs on Linux. At roughly $2000 it ain't cheap. I've never used it so can't comment on how well it does.

          At $2000 it is cheap. Ever price a Data Domain?

      • by afidel (530433)
        Another potential pitfall to open source implementations is patents. Due to their usefulness as a product differentiator in the storage business many of the clever implementations have been patented.
      • by gbjbaanb (229885)

        What you have described above isn't a trivial effort

        nor is Linux itself, or any of the features that it has. Being a bit tricky has never stopped any linux developer from producing such fantastic stuff, why should this be any different to something like LVM, ext4, or GFS?

    • by lucm (889690) on Wednesday January 04, 2012 @04:08PM (#38588916)

      And now, even the next version of Windows Server will contain integrated data deduplication technology! [...] The most interesting thing is that Microsoft Research says it doesn't affect performance almost at all.

      Well, ask anyone who lost documents on DoubleSpace volumes or got corrupted media files on Windows Home Server: even if Microsoft Research says so, it's not something I would put on my production servers any time soon.

    • by Ironchew (1069966) on Wednesday January 04, 2012 @04:09PM (#38588924)

      When will Adblock Plus block these Microsoft ads on Slashdot?

    • Re: (Score:3, Insightful)

      This has got to be some sort of smear campaign against Microsoft, because I cannot believe that they would think that bludgeoning people with astro-turf is going to get them sales. The first two articles with MS shilling I saw (today!) I wrote off as just people sharing interesting stuff that happened to come from MS, but thanks to your over-heavy hand, the pattern is clear as a bell now.

      So tell me MS marketing people, are you seriously this incompetent, or did a new astro-turf campaign incentive get out of

      • Re: (Score:2, Insightful)

        Heh, the comment is completely on topic, interesting and factual and yet we still have fucktards insisting that any mention of MS must have been paid for.

        Get a life.

    • Re: (Score:2, Flamebait)

      by timeOday (582209)
      Oh boy, are they rolling dedup into WinFS [wikipedia.org]? Wow, just look at the timeline [wikipedia.org] in that article if you want to get discouraged about Microsoft delivering a new filesystem.
    • Re: (Score:2, Informative)

      by Anonymous Coward

      Interesting link, but it doesn't look like Microsoft has actually released this yet and it is only slated to be released with Win 8 server, and it will come with some caveats.

      FTFA:
      "It is a server-only feature, like so many of Microsoft’s storage developments. But perhaps we might see it deployed in low-end or home servers in the future.
      It is not supported on boot or system volumes.
      Although it should work just fine on removable drives, deduplication requires NTFS so you can forget about FAT or exFAT. A

    • by SuricouRaven (1897204) on Wednesday January 04, 2012 @04:35PM (#38589248)
      NTFS's file compression actually rather sucks. Space saving is minimal under all but ideal conditions. It's a common problem with filesystem-level compression - the need to be able to read a file without seeking very far, or reconstructing the entire stream. Compression ratio is seriously compromised to achieve that.
      • by 0123456 (636235)

        I compressed some of my Steam games recently to free up disk space on my laptop. If I remember correctly, most of them shrunk by 5-10% while some got around 30%.

        So not great, but it freed up a few gigabytes that allowed me to install a few more smaller games.

      • by jamesh (87723)

        NTFS's file compression actually rather sucks.

        It's fine as long as you use it properly. I use it for IIS logfiles. I want to keep the logfiles but rarely actually access them, and they are append only, and they are plain text. Very high compression at a very small loss of performance.

        Compressing binary data in your working set is, as you point out, probably a bad idea, but as long as you don't do anything stupid you shouldn't have any problems.

    • by dltaylor (7510)

      One instance by default is brain-dead. Lose that and they're ALL gone, if it happens between dedup and backup. You should have at least two copies, on different media, if possible, always, of any data with more than one reference, as well as backing up your data, of course.

    • Both deduplication and conventional compression are a questionable idea on a file server which has many clients.

      The most interesting thing is that Microsoft Research says it doesn't affect performance almost at all.

      Yeah right. I'll wait for real usage numbers rather than "the vendor selling this stuff to me says it's fucking awesome".

      • by afidel (530433)
        Since most file servers have about 95% unused processor cycles and a limited amount of disk I/O, both compression and dedupe can be significant wins, provided the randomness they add to the I/O profile is a smaller percentage than their effective compression (i.e. if they add 10% randomness to the I/O profile but provide 30% compression, then it's probably a net win). The fact that they potentially increase cache effectiveness is just gravy, since cache is a few orders of magnitude faster than spinning dis
      • by afabbro (33948)

        Both deduplication and conventional compression are a questionable idea on a file server which has many clients.

        Out of curiosity, why do you say that? I assume what you're saying is true regardless of whether we're talking Linux or Windows or any OS.

  • by Anonymous Coward on Wednesday January 04, 2012 @03:57PM (#38588772)

    ...includes dedupe.

    There was a blog entry a while ago where on a 256MB RAM machine someone was able to dedupe 600GB down to 400GB and the performance was fine. This is much unlike ZFS which wants the entire dedupe tree in memory and requires gigs and gigs of RAM.

  • by TheRaven64 (641858) on Wednesday January 04, 2012 @04:05PM (#38588868) Journal

    ZFS in FreeBSD 9 has deduplication support. I've been running the betas / release candidates on my NAS for a little while (everything important is backed up, so I thought I'd give it a test). ZFS development in FreeBSD is funded by iXSystems [ixsystems.com], who sell expensive NAS and SAN systems so they have an incentive to keep it improving.

    I have a ZFS filesystem using compression and deduplication for my backups from my Mac laptop. I copy entire files to it, but it only stores the differences.

    • by Anonymous Coward on Wednesday January 04, 2012 @04:57PM (#38589558)

      People considering either dedup or compression on FreeBSD should be made blatantly aware of one of the issues which exists solely on FreeBSD. When using these features, you will find your system "stalling" intermittently during ZFS I/O (e.g. your SSH session stops accepting characters, etc.). Meaning, interactivity is *greatly* impacted when using dedup or compression. This problem affects RELENG_7 (which you shouldn't be using for ZFS anyway, too many bugs), RELENG_8, the new 9.x releases, and HEAD (10.x). Changing the compression algorithm to lzjb brings a big improvement, but it's still easily noticeable.

      My point is that I cannot imagine using either of these features on a system where users are actually on the machine trying to do interactive tasks, or on a machine used as a desktop. It's simply not plausible.

      Here's confirmation and reading material for those who think my statements are bogus. The problem:

      http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012718.html [freebsd.org]
      http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012752.html [freebsd.org]

      And how OpenIndiana/Illumos solved it:

      http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012726.html [freebsd.org]

      • by Freultwah (739055)
        So for now, let's not use compression on write-heavy volumes, where it adds next to no benefit anyway. Problem solved, or at least circumvented, no?
        • by ThorGod (456163)

          So for now, let's not use compression on write-heavy volumes, where it adds next to no benefit anyway. Problem solved, or at least circumvented, no?

          Yeah, and ZFS is still a major new feature to FreeBSD, regardless. I know people in the FreeBSD forums talk about running ZFS on root, but if you don't have the hardware you shouldn't run ZFS. We're in Unix land here, the OS isn't going to keep you from doing anything...but that doesn't mean you should!

    • by Guspaz (556486)

      ZFS deduplication works on a block level, so it's not only storing the differences. That would require byte-level deduplication.

      For example, imagine I had a block size of two bytes, and had the following two files:

      02468A
      012468A

      These two pieces of data are identical except for one inserted byte, but deduplication will be forced to store both files in their entirety. Why? Because even though only one byte changed, none of the blocks are identical:

      02 46 8A
      01 24 68 A-

      This is a fairly common case in many kinds o
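The boundary-shift effect described above is easy to reproduce with a toy fixed-block chunker, using the same two strings from the example (hashes stand in for block pointers):

```python
import hashlib

def block_hashes(data, block_size=2):
    """Digest each fixed-size block; the last block may be short."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

a = b"02468A"
b = b"012468A"   # same bytes, one byte inserted at the front
c = b"02468B"    # same bytes, last byte overwritten in place

# The insertion shifts every boundary, so no block matches; the
# in-place overwrite only invalidates the one block it touched.
shared_after_insert = set(block_hashes(a)) & set(block_hashes(b))
shared_after_overwrite = set(block_hashes(a)) & set(block_hashes(c))
```

In-place modifications (as in VM images) still dedupe well under block-level schemes; insertions and deletions do not.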

      • In my cases, the biggest files are VM images. In this case, it is only storing the differences, because changing a few files in the VM only modifies a few blocks of the image but causes Time Machine to create a new backup copy of the entire image. This means that it's backing up 10GB files quite regularly, but I'm only storing a few MBs for each one.
        • That's why Apple swapped their monolithic data blobs (for iPhoto, Aperture, etc.) into smaller files when Time Machine was released--like Sparse Bundle Disk Images, for example. Since the data is banded across multiple files, when some data is changed, you only need to back up the affected bands. I believe they marketed this as "Time Machine Aware" and advised third-party apps to adopt this approach. I thought Time Machine's approach was clever given the constraints of a traditional file system (in lieu

  • you could run a nightly script to find duplicates and then deduplicate them.

    An example of this would be to find all new files since the last run, checksum them, and compare the checksums to those of all previously examined files. Once you find the likely duplicates you can decide how careful you want to be about verifying the identity of both data and metadata. For example, do you want to preserve the attribute dates if the data is identical? For some programs the creation dates matter. Likewise file permissions.
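A minimal sketch of such a nightly pass. The state-file layout (a JSON map of digest to path plus a last-run timestamp) is invented for the example; a real script would also want the metadata checks mentioned above:

```python
import hashlib
import json
import os
import time

def sha256_file(path):
    """Stream a file through SHA-256 in 1 MB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def nightly_pass(root, state_path):
    """One pass: checksum files modified since the last run and
    report any whose digest matches a previously seen file."""
    state = {"last_run": 0.0, "digests": {}}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    duplicates = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < state["last_run"]:
                continue  # unchanged since the previous run
            digest = sha256_file(path)
            seen = state["digests"].get(digest)
            if seen is not None and seen != path:
                duplicates.append((path, seen))
            else:
                state["digests"][digest] = path
    state["last_run"] = time.time()
    with open(state_path, "w") as f:
        json.dump(state, f)
    return duplicates
```

The mtime filter keeps each pass cheap; only new or changed files are re-hashed against the accumulated digest table.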

    • by timeOday (582209)

      an example of this would be to find all new files since the last run, checksum them, and compare the checksums to those of all previously examined files

      Block-level (rather than file-level) de-duplication would be drastically better for virtual machine images, which are most of the largest files I back up. Email spool files (or pst files) are another example of large files that tend to change only in portions of them.

      • by Guspaz (556486)

        So long as you never add or remove data from the middle of the file, sure. Otherwise a one byte change in either direction can require re-storing all the rest of the unchanged data because it might not line up on the same block boundaries.

        • Not for the two that he talked about. VM disk images contain filesystems, so they are perfect for deduplication (as I'm seeing on my NAS). If you modify 10MB in a VM disk image, then you will modify 10MB of blocks (possibly some more, depending on the filesystem), but you definitely won't be re-storing the entire thing. For mail spools, the common format for spools-in-a-single-file is mbox, where (precisely to avoid the problem you're talking about, where a small change generates a huge amount of I/O), n
          • by Guspaz (556486)

            Certainly, there are many scenarios where block-level deduplication does wonders. It just doesn't work well in all cases. For example, trying to use deduplication to store incremental database backups. Record sizes are not fixed in an SQL dump, so you're likely to waste large amounts of space for small differences. I do use ZFS deduplication to store incremental backups from a variety of machines, and generally it works well, but the SQL dumps were one of the annoying bits.

    • by mhotchin (791085)

      Think about things like Virtual Machine images - they could have a *lot* in common, but not everything. DeDupe gets useful once you can work at a sub-file level.

  • FreeBSD has ZFS (Score:3, Informative)

    by grub (11606) <slashdot@grub.net> on Wednesday January 04, 2012 @04:06PM (#38588896) Homepage Journal

    FreeBSD, and FreeNAS which is based on FreeBSD, both come with ZFS. Neither is going away anytime soon.
    I use both at home and am happy as a clam.
  • by Dwedit (232252) on Wednesday January 04, 2012 @04:09PM (#38588920) Homepage

    I've used LessFS. On my "server" powered by an Intel Atom, it is very, very slow. It writes at about 5MB/sec, even when everything is inside a RAM disk.
    You can't use a block size of 4KB, otherwise write speeds drop to around 256KB/sec; you need to use at least 64KB.

    • I have a NAS with a 1.6GHz AMD Fusion thingy, which should be in the same speed ballpark as an Atom. It happily got 40MB/s writing to the deduplicated filesystem with FreeBSD ZFS (with a kernel with all of the debug knobs turned on) over GigE. With a release kernel, I'd expect it to be a bit faster, but since I mainly use it over WiFi the bottleneck is generally somewhere else...
      • Re: (Score:3, Informative)

        by BitZtream (692029)

        No you didn't.

        You got 40MB/s writing to memory cache possibly, not the ZFS store.

        I have a quad disk, 8 core, 8 GIG machine that ONLY does ZFS. Sustaining 40MB/s doesn't happen without special tuning, turning off write cache flushing and a whole bunch of other stuff ... unless I stay in memory buffers. Once that 8 gig fills or the 30 second default timeout for ZFS to flush, the machine comes to a standstill while the disks are flushed, and at that point, the throughput rate drops well below 40MB/s since it i

      • by Guspaz (556486)

        With deduplication, RAM is more important than CPU speed. If you can fit the checksums for every block of data in RAM, then you're golden. If not, deduplication can get extremely slow, no matter how fast the processor.

  • Look, no duplicates!

  • BackupPC (Score:5, Interesting)

    by Anonymous Coward on Wednesday January 04, 2012 @04:11PM (#38588954)

    Check out BackupPC. Been using it for about 5 years at our company, admittedly a mostly Linux shop, with great results. Deduplication on a per-file basis, block-based transfers via the rsync protocol, and a good web-based UI (at least in terms of function). Thanks to deduplication we are getting about a 10:1 storage compression backing up servers and workstations: a total of 1.28 TB of backups in 130.88 GB of used space.

  • Backup or live FS? (Score:3, Interesting)

    by Anonymous Coward on Wednesday January 04, 2012 @04:14PM (#38588972)

    Your post doesn't make it clear if you're looking for a free backup product to replace DataDomain, NetApp, etc. or if you're now wanting to dedup on live filesystems.

    If you're looking for a free backup product that supports deduplication, look at BackupPC. Powerful and complex, but free. I've used it for years with good results.

  • dragonflybsd (Score:2, Informative)

    by Anonymous Coward

    So you want DragonFly BSD with the HAMMER filesystem.
    An excellent and stable BSD and an excellent filesystem to go with it. And a very helpful community.

  • Why not just do the following every once in a while:

    1. go through all your files,
    2. for each file, compute a checksum (e.g. using the unix tools md5sum or sha1sum),
    3. for pairs of files giving similar checksum, compare them (optionally) and if equal remove one of them and make it a hard-link to the other.

    It would surprise me if there was no free open-source script doing exactly this.
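A sketch of such a script: group files by digest, verify byte-for-byte (the optional comparison above), then swap in a hard link via a temporary name. The temp-name suffix is invented for the example, and hard-linking is only appropriate for files that will never be modified in place, since all links share one set of data:

```python
import filecmp
import hashlib
import os
from collections import defaultdict

def hardlink_duplicates(root):
    """Group files by SHA-256, verify byte-for-byte, then replace
    each duplicate with a hard link to the first copy found."""
    by_digest = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                by_digest[hashlib.sha256(f.read()).hexdigest()].append(path)
    for paths in by_digest.values():
        keeper = paths[0]
        for dup in paths[1:]:
            # Guard against the (astronomically unlikely) collision.
            if not filecmp.cmp(keeper, dup, shallow=False):
                continue
            tmp = dup + ".dedup-tmp"   # hypothetical suffix
            os.link(keeper, tmp)       # create the new link first,
            os.replace(tmp, dup)       # then atomically swap it in
```

The link-then-replace dance means a crash mid-run leaves either the old file or the new link in place, never neither.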

    • by shish (588640)

      for pairs of files giving similar checksum

      MD5 / SHA1 / most good hashing algorithms will give completely different hashes for a single bit of data difference

      remove one of them and make it a hard-link to the other

      What then happens when you write to one of the files?
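The first objection is easy to demonstrate: change one bit of input and compare the digests position by position. Any agreement between them is pure chance, so "similar checksum" is not a usable notion; only exact equality means anything:

```python
import hashlib

a = hashlib.sha256(b"deduplicate\x00").hexdigest()
b = hashlib.sha256(b"deduplicate\x01").hexdigest()  # one bit flipped

# Count hex positions where the digests happen to agree; with a good
# hash this averages about 1 in 16 per position, nothing more.
matching = sum(x == y for x, y in zip(a, b))
```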

  • by Anonymous Coward on Wednesday January 04, 2012 @04:23PM (#38589106)

    it needs an incredible amount of memory to operate effectively.
    from my university notes:
    5TB data, average blocksize 64K = 78125000 blocks
    for each block the dedup table needs 320 bytes, so
    78125000 x 320 bytes = 25 GB dedup table

    use compression instead. (e.g. zfs compression)
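The arithmetic above checks out (note it uses decimal units, i.e. 64 KB = 64,000 bytes):

```python
def dedup_table_bytes(data_bytes, avg_block=64_000, entry_bytes=320):
    """One dedup-table entry per block, as in the figures above."""
    return (data_bytes // avg_block) * entry_bytes

blocks = 5 * 10**12 // 64_000            # 78,125,000 blocks in 5 TB
table = dedup_table_bytes(5 * 10**12)    # 25,000,000,000 bytes = 25 GB
```

The same formula shows why smaller block sizes hurt so much: halving the block size doubles the table.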

    • by kesuki (321456)

      if you've got 5 tb of files, 25gb doesn't seem so huge. besides, you assume that it needs to keep track of every block, when you only need to hash every file. but yes, that is a massive burden on resources, and an implementation would require loads of ram or a large swapfile, all just to save disc space at the cost of extra processor/ram load.

      • besides you assume that it needs to keep track of every block, when you only need to hash every file

        In the context of ZFS, you're working with block-level deduplication.

        if you've got 5 tb of files, 25gb doesn't seem so huge

        Absolutely true! ZFS will happily store its dedupe block lists in any cache device you've added to a pool, so add a $100 SSD to that pool and get a huge performance boost at the same time.

        but yes, that is a massive burden on resources

        Depends what you're using it for. Suppose you're running ZFS with dedupe on your virtualization farm's backend storage. Each of those VMs will start off with the same base image. Create a new VM by simply copying that base image and you'll only pay for t

    • by m.dillon (147925) on Wednesday January 04, 2012 @04:46PM (#38589414) Homepage

      All dedup operations have a trade-off between disk I/O and memory use. The less memory you use the more disk I/O you have to do, and vice-versa.

      Think of it like this: You have to scan every block on the disk at least once (or at least scan all the meta-data at least once if the CRC/SHA/whatever is already recorded in meta-data). You generate (say) a 32 bit CRC for each block. You then [re]read the blocks whose CRCs match to determine if the CRC found a matching block or simply had a collision.

      The memory requirement for an all-in-one pass like this is that you have to record each block's CRC plus other information... essentially unbounded from the point of view of filesystem design and so not desirable.

      To reduce memory use you can reduce the scan space... on your first pass of the disk only record CRCs in the 0x0-0x7FFFFFFF range, and ignore 0x80000000-0xFFFFFFFF. In other words, now you are using HALF the memory but you have to do TWO passes on the disk drive to find all possible matches.

      The method DragonFly's HAMMER uses is to allocate a fixed-sized memory buffer and start recording all CRCs as it scans the meta-data. When the memory buffer becomes full DragonFly dynamically deletes the highest-recorded CRC (and no longer records CRCs >= that value) to make room. Once the pass is over another pass is started beginning with the remaining range. As many passes are taken as required to exhaust the CRC space.

      Because HAMMER stores a data CRC in meta-data the de-dup passes are mostly limited to just meta-data I/O, plus data reads only for those CRCs which collide, so it is fairly optimal.

      This can be done with any sized CRC but what you cannot do is avoid the verification pass... no matter how big your CRC is or your SHA-256 or whatever, you still have to physically verify that the duplicate blocks are, in fact, exact duplicates, before you de-dup their block references. A larger CRC is preferable to reduce collisions but diminishing returns build up fairly quickly relative to the actual amount of data that can be de-duplicated. 64 bits is a reasonable trade-off, but even 32 bits works relatively well.

      In any case, most deduplication algorithms are going to do something similar unless they were really stupidly written to require unbounded memory use.

      -Matt
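A simplified sketch of the bounded-memory, multi-pass strategy described above, with CRC-32 over in-memory blocks standing in for HAMMER's meta-data scan:

```python
import zlib

def find_duplicate_groups(blocks, max_entries):
    """Bounded-memory duplicate finding, per the passes described above.

    Each pass records CRCs from `floor` upward in a fixed-size table;
    when the table overflows, the highest CRC seen is evicted and
    becomes the ceiling for the rest of the pass. The next pass
    resumes at that ceiling, until the CRC space is exhausted.
    """
    groups = []
    floor = 0
    while floor <= 0xFFFFFFFF:
        table = {}                     # crc -> indices of blocks seen
        ceiling = 0x100000000
        for i, block in enumerate(blocks):
            crc = zlib.crc32(block)
            if not floor <= crc < ceiling:
                continue
            table.setdefault(crc, []).append(i)
            if len(table) > max_entries:
                ceiling = max(table)   # shrink the range this pass covers
                del table[ceiling]     # ...and drop that highest entry
        for idxs in table.values():
            if len(idxs) < 2:
                continue
            # Verification pass: a CRC match may still be a collision,
            # so compare actual block contents before grouping.
            by_content = {}
            for i in idxs:
                by_content.setdefault(blocks[i], []).append(i)
            groups.extend(g for g in by_content.values() if len(g) > 1)
        floor = ceiling
    return groups
```

Shrinking `max_entries` caps memory at the cost of more passes over the blocks, which is exactly the trade-off described at the top of the comment.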

      • You then [re]read the blocks whose CRCs match to determine if the CRC found a matching block or simply had a collision.

        I know you already know this, but for the benefit of other readers: ZFS gives you the option of using no deduplication, using SHA256 as the checksum and trusting its results, or using SHA256 but verifying all collisions. I can't really see the point of not verifying collisions in "strong" hashing algorithms, though - it's not like you're going to get SHA256 collisions so frequently that you can't afford to explicitly verify them.

        • by m.dillon (147925) on Wednesday January 04, 2012 @06:15PM (#38590374) Homepage

          Well, I can tell you why the option is there... it's not because of collisions, it's there to handle the case where there is a huge amount of actual duplication where the blocks would verify as perfect matches. In this case the de-duplication pass winds up having to read a lot of bulk-data to validate that the matches are, in fact, perfect, which can take a lot of time versus only having to read the meta-data.

          Just on principle I think it's a bad idea to just trust a checksum, cryptographic hash, CRC, or whatever. Corruption is always an issue... even if the filesystem code itself is perfect and even if the disk subsystem is perfect there is so much code running in a single address space (i.e. the KERNEL itself) that it is possible to corrupt a filesystem just from hitting unrelated bugs in the kernel.

          Not to mention radiation flipping a bit somewhere in the cpu or memory (even for ECC memory it is possible to get corruption, but the more likely case is in the billions of transistors making up a modern cpu, even with parity on the L1/L2/L3 caches).

          Hell, I don't even trust IP's stupid simple 1's complement checksum in HAMMER's mirroring protocols. Once during my BEST Internet days we had a T3 which bugged out certain bit patterns in a way that actually got past the IP checksum... we only tracked it down because SSH caught it in its stream and screamed bloody murder.

          If you de-duplicate trusting the meta-data hash, even a big one, what you can end up doing is turning 9 good and 1 corrupted copies of a file into 10 de-duped corrupted copies of the file.

          I'm sure there are many data stores that just won't care if that happens every once in a while. Google's crawlers probably wouldn't care at all, so there is definitely a use for unverified checks like this. I don't plan on using a cryptographic hash as large as the one ZFS uses any time soon, but being able to optimally de-dup with 99.9999999999% accuracy is a reasonable argument for having one that big.

          -Matt
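          The trade-off being described can be sketched as a toy content-addressed block store (this is illustrative only, not ZFS's or HAMMER's actual code; all names are made up):

```python
import hashlib

class DedupStore:
    """Toy content-addressed block store. Trusting the hash skips a
    re-read of the stored block; verifying re-reads and byte-compares
    before sharing, which is the extra bulk-data I/O discussed above."""

    def __init__(self, verify=True):
        self.blocks = {}          # digest -> block bytes (the "disk")
        self.verify = verify
        self.bytes_reread = 0     # extra bulk-data reads caused by verify

    def write(self, block: bytes) -> str:
        digest = hashlib.sha256(block).hexdigest()
        existing = self.blocks.get(digest)
        if existing is not None:
            if self.verify:
                # The expensive part: read the existing block back and
                # prove the match is real before sharing it.
                self.bytes_reread += len(existing)
                if existing != block:
                    raise RuntimeError("hash collision detected")
            return digest          # deduplicated: nothing new stored
        self.blocks[digest] = block
        return digest

store = DedupStore(verify=True)
store.write(b"A" * 4096)
store.write(b"A" * 4096)   # duplicate: byte-verified, then shared
assert len(store.blocks) == 1
assert store.bytes_reread == 4096
```

With verify=False the second write costs no data read at all, which is exactly why the unverified mode exists for workloads that can tolerate a vanishingly small corruption risk.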

    • by m.dillon (147925)

      Another side note on DragonFly's HAMMER: de-duplication is implemented both as a daily pass AND can also be enabled for live writes. The daily pass can find all duplicate blocks. The live dedup uses a small fixed in-kernel-memory LRU style record of recent data block CRCs to find de-duplication candidates during live writes. Performance impact is minimal either way as recently recorded CRCs also tend to still have their data in the buffer cache.

      The live-dedup mostly exists to get some up-front deduplication.
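      The in-memory LRU hint table described here can be sketched roughly as follows (a toy illustration, not HAMMER's kernel code; the capacity and names are invented, and a CRC hit only nominates a candidate that must still be byte-compared):

```python
import zlib
from collections import OrderedDict

class LiveDedupHint:
    """Bounded LRU of recent block CRCs. A hit means 'a recent write
    with the same CRC exists at this offset'; the caller must still
    compare the actual blocks before sharing them."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.crcs = OrderedDict()   # crc32 -> offset of a recent write

    def lookup_or_record(self, block: bytes, offset: int):
        crc = zlib.crc32(block)
        candidate = self.crcs.get(crc)      # possible earlier copy
        self.crcs[crc] = offset             # remember this write
        self.crcs.move_to_end(crc)
        if len(self.crcs) > self.capacity:
            self.crcs.popitem(last=False)   # evict least recently used
        return candidate

hints = LiveDedupHint(capacity=4)
assert hints.lookup_or_record(b"x" * 512, 0) is None   # first sighting
assert hints.lookup_or_record(b"x" * 512, 1) == 0      # candidate found
```

The point made above about the buffer cache falls out naturally: a CRC that is still in this table was recorded recently, so its data block is likely still cached and the confirming compare is cheap.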

    • by afidel (530433)
      So for my 14TB of backup data I'd need ~70GB of RAM. Cool, since 144GB is my most common build these days; it's really cheap now that 8GB DIMMs are only ~$200.
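      The arithmetic behind that estimate is worth spelling out (the figures are the commonly cited rule-of-thumb assumptions for the ZFS dedup table, not official numbers):

```python
# Back-of-the-envelope ZFS dedup table (DDT) sizing behind the
# "~5GB of RAM per TB of data" rule of thumb.
data_bytes = 14 * 2**40        # 14 TiB of backup data
block_size = 64 * 2**10        # assumed average block size: 64 KiB
ddt_entry = 320                # assumed bytes of DDT per unique block
blocks = data_bytes // block_size
ddt_gib = blocks * ddt_entry / 2**30
print(ddt_gib)                 # 70.0, matching the ~70GB estimate
```

Larger average blocks shrink the table proportionally, which is why mostly-big-file backup sets fare better than the rule of thumb suggests.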
  • by jdavidb (449077) on Wednesday January 04, 2012 @04:26PM (#38589134) Homepage Journal

    I had to Google to find out. Here's what I found: http://en.wikipedia.org/wiki/Data_deduplication [wikipedia.org]

    Maybe everybody else is familiar with this term except for me, but I find it a bit off-putting for the submitter and the editors to not offer a small bit of explanation.

    • Re: (Score:2, Interesting)

      by BitZtream (692029)

      Seriously, at this point on slashdot it's been talked about enough that unless you bought your UID from someone, you should be fully aware of what it is from here alone.

  • I have recently written some personal software that finds duplicate files/directories. I had no idea there was such a demand for something like this (and price for such software).

    There are a few hurdles with deduplication that a piece of software will likely need your input for:

    - Do the directory/file names matter? If so, does case matter? What about comparing ASCII names to Unicode names? Do you compare the DOS 8.3 names also?
    - What do you want to do when you find duplicates? Delete the d
    • by vux984 (928602)

      If it's not 100% transparent to the user, it's not good enough. As far as the user is concerned, deduplication isn't happening. There are no evident changes to the filesystem they are using (no case changes, no metadata issues, no links... 100% transparent to the user).

      It's done at the block level, below the filesystem.

      Whether the data being deduplicated is a directory, an alternate stream, the file data, meta data... is irrelevant.

      Say you've got a nice big 5GB Outlook.pst file. And you make a backup copy of it, a

  • write a script (Score:4, Interesting)

    by MichaelSmith (789609) on Wednesday January 04, 2012 @04:34PM (#38589220) Homepage Journal

    find . -type f -exec md5sum {} + | sort
    ...and so on
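    Fleshing that out a little (the backtick form breaks on spaces in names and very large file lists; uniq's -w and --all-repeated options are GNU coreutils behavior):

```shell
# Hash every file, sort by digest, then print only the runs whose
# 32-character md5 prefix repeats, i.e. candidate duplicate files.
find . -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate
```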

  • I just use rsync from the command line to do deduplication. Been working like a charm for years.

    First I sync from the remote directory to a local base directory:

    rsync --partial -z -vlhprtogH --delete root@www.mydomain.net:/etc/ /backup/server/www/etc/base/

    Then I sync that to the daily backup. Files that have not changed are hard-linked between all the days that share them. It's very efficient and simple, and retrieving files is as simple as doing a directory search.

    rsync -vlhprtogH --delete --link-dest=/ba
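    Spelled out fully, the daily step might look something like this (the paths and rotation scheme here are hypothetical, and `date -d` is GNU-specific); --link-dest makes rsync hard-link files that are unchanged relative to the previous day's tree instead of storing second copies:

```shell
#!/bin/sh
# Hypothetical layout: base/ is the live mirror, daily/YYYY-MM-DD/
# are the hard-linked daily snapshots.
today=$(date +%F)
yesterday=$(date -d yesterday +%F)    # GNU date; adjust for BSD date
rsync -vlhprtogH --delete \
    --link-dest=/backup/server/www/etc/daily/"$yesterday"/ \
    /backup/server/www/etc/base/ \
    /backup/server/www/etc/daily/"$today"/
```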

  • BackupPC (Score:3, Informative)

    by danpritts (54685) on Wednesday January 04, 2012 @05:14PM (#38589738) Homepage
    backuppc is backup software that does file-level deduplication via hard links on its backup store. Despite the name's suggestion that it is for backing up (presumably windows) PCs, it's great with *nix.

    http://backuppc.sourceforge.net/ [sourceforge.net]

    Its primary disadvantage is the logical consequence of all those hard links. Duplicating the backup store, so you can send it offsite, is basically impossible with filesystem-level tools. You have to copy the entire filesystem to the offsite media, typically with dd.

    It also can make your life difficult if you're trying to restore a lot of data all at once, like after a disaster. You take your offsite disks that you've dd-copied, hook them up, and start to run restores.

    The hard links mean lots and lots of disk head seeks, so you are doing random i/o on your restore. This is really slow. If I ever have to do this, my plan is to buy a bunch of SSD's to copy my backup onto. Since there are no seeks on SSDs it will be much faster.
  • by Zemplar (764598) on Wednesday January 04, 2012 @05:42PM (#38590054) Journal
    Yep, put a nail in OpenSolaris' coffin. Instead, I use and recommend OpenIndiana [openindiana.org] and NexentaStor [nexenta.com] (or Nexenta's community edition if you prefer).
  • by ploppy (468469) on Wednesday January 04, 2012 @05:48PM (#38590118)

    Try Squashfs, which creates deduplicated and compressed filesystem archives (see http://www.linux-mag.com/id/7357/ for a good article).

    If you're using Ubuntu, Debian, or Fedora, Squashfs will already be built into your distro kernel, and squashfs-tools will already be available in your distro repository.

  • by Urban Garlic (447282) on Wednesday January 04, 2012 @06:01PM (#38590234)

    There's a form of deduplication supported by the Linux kernel, if you use the logical volume manager. If you create a base LVM device and then create a snapshot of that device, the snapshot only requires sufficient real estate on the host physical volume to store the diffs between the snapshot and the base. You can use this for "freezing" a filesystem to do back-ups, or for incremental back-ups, or whatever.

    My rather limited experience with this is that, if you have more than a few snapshots on a base device, your write performance degrades very rapidly. There's also a hard limit of 255 snapshots per device.

    You can also do file-based deduplication with the "rsnapshot" tool, which has been available for many years.

    Also also, I haven't kept up, but I seem to recall that ZFS for linux was promising this as a major selling point.

  • by funkboy (71672) on Wednesday January 04, 2012 @07:49PM (#38591190) Homepage

    OpenBSD has had the Epitome [openbsd.org] deduplication framework for some time. I believe version 2 is considered production-ready.

  • by smash (1351) on Thursday January 05, 2012 @01:17AM (#38593130) Homepage Journal
    If you don't think OpenSolaris has a future, fair enough. FreeBSD does. FreeBSD currently supports ZFS v28, which has dedup. Be aware you need plenty of RAM.
