Ask Slashdot: What's a Good Tool To Detect Corrupted Files?
Volanin writes "Currently I use a triple-boot system on my MacBook: MacOS Lion, Windows 7, and Ubuntu Precise (where I spend the great majority of my time). To share files between these systems, I created a huge HFS+ home partition (the MacOS native format, which can also be read in Linux, and in Windows with Paragon HFS). But last week, while I was working in Ubuntu, my battery ran out and the computer suddenly powered off. When I powered it on again, the filesystem integrity was OK (after a disk check by MacOS), but many of my files' contents were silently corrupted (and my last backup was from August...). Mostly these are JPGs, MP3s, and MPG/MOV videos, with a few PDFs scattered around. I want to get rid of the corrupted files, since they just waste space, but the only way I have to check for corruption is opening them one by one. Is there a good set of tools to verify file integrity by filetype, so I can detect (and delete) my bad files?"
Linux Command: file (Score:1, Insightful)
Try running "file" from the command line on a few files you know to be corrupt. If file misidentifies them (e.g., reporting plain "data" instead of the expected type), you can loop through everything with a quick bash script and print the names of the bad ones, as in the sketch below. This all assumes you're comfortable with shell scripting.
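A minimal sketch of that loop, assuming corrupt JPEGs come back as plain "data" when probed (the directory and extension are placeholders for wherever your shared files live):

    # Flag JPEGs that "file" no longer recognizes as JPEG image data.
    for f in ~/shared/*.jpg; do
        if ! file -b "$f" | grep -q "JPEG image data"; then
            echo "suspect: $f"
        fi
    done

Note that this only catches files whose headers are damaged; a bit-flip in the middle of an otherwise valid JPEG will still identify as a JPEG.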
No easy answer (Score:2, Insightful)
1. Compare to your backup; files that match are OK (a sketch follows this list).
2. The AppleScript option others mentioned may help narrow the list further.
3. Backup regularly, and verify your backup procedure.
4. Anything else will cost you consulting rates.
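For step 1, a checksum dry run with rsync is one way to do the comparison; a minimal sketch, with the backup and live trees at placeholder paths:

    # -c forces a full content checksum (not just size/mtime), -n makes it
    # a dry run so nothing is copied; anything listed differs from the
    # backup (or is missing from the live tree).
    rsync -rcn --out-format='differs: %n' /mnt/backup/ ~/shared/

Anything rsync does not list is byte-identical to the backup and can be trusted as far as the backup itself can.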
Re:AppleScript (Score:4, Insightful)
But the open usually won't fail. Unless the error is within the header bytes of a movie or image, the media will open but will appear wrong. Worse, there is no reliable way to detect this corruption, because media file formats generally do not contain any sort of checksum. At best, you could write a script that looks for truncation (not enough bytes to complete a full macroblock), or a tool that computes the difference between adjacent pixels across macroblock boundaries and flags any picture with an obvious high-energy transition at a boundary. Even that cannot tell you whether the image is corrupt or simply compressed at a low quality setting with lots of blocking artifacts.
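One partial check along those lines: a full decode pass with ffmpeg will at least surface gross damage such as truncation or broken stream headers, though it cannot flag corruption that still decodes cleanly. A sketch, with the filename as a placeholder:

    # Decode the entire stream, discard the output, and log only decoder
    # errors. An empty errors.log means ffmpeg decoded everything without
    # complaint; it does NOT prove the pixels are what they should be.
    ffmpeg -v error -i suspect.mov -f null - 2> errors.log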
The short answer, however, is "no". Such corruption can't usually be detected programmatically.
Re:compare them to an intact backup (Score:2, Insightful)
Consider the possibility that the backup already contains corrupted files. I once had defective RAM where only one bit flipped occasionally. The machine was quite stable, so the defect went undetected and over a couple of months it silently corrupted hundreds of files. Unless he finds out what caused the crash, he can't be sure that the backup is alright.
Check why the files are corrupted (Score:5, Insightful)
I'd be asking myself why lots of files became corrupted from one dodgy filesystem event. Assuming HFS+ works like the filesystems I'm more familiar with, it will allocate sequential blocks for a file wherever it can, which means a random filesystem splat is really unlikely to corrupt loads and loads of files. You might expect filesystem corruption to make a load of files go missing (if a directory entry is corrupted) or to mangle a few files, but not to put random errors into loads of them.
I'd check whether files I write now get corrupted too; it might be a dodgy disk or bad RAM in your computer. A quick write-and-verify loop like the one below can help rule that out.
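A minimal smoke test, assuming you run it from the Ubuntu side (the filename is a placeholder):

    dd if=/dev/urandom of=testfile bs=1M count=256   # write fresh data
    md5sum testfile > testfile.md5                   # record its checksum
    sync                                             # flush writes to disk
    sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'   # force the next read to hit the disk (Linux-only)
    md5sum -c testfile.md5                           # re-read and compare

Repeated mismatches here point at hardware rather than a one-off filesystem event.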
The above might be complete paranoia, but I'm a paranoid person when it comes to my data, and silent corruption is the absolute worst form of corruption.
For next time, store MD5SUM files alongside your data so you can see exactly what got corrupted and what didn't (that is what I do for my digital picture and video archive).
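A sketch of that habit, with ~/pictures standing in for the archive root:

    # Generate checksums for every file in the archive once...
    find ~/pictures -type f -exec md5sum {} + > ~/pictures.md5
    # ...then verify after any crash; --quiet prints only the mismatches.
    md5sum -c --quiet ~/pictures.md5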
Re:compare them to an intact backup (Score:5, Insightful)
Well...
My first suspicion would be that the filesystem is messed up, not the actual files. Unless s/he had a lot of pending writes to all of these files, there is no reason anything should have actually overwritten or garbled them when the power cut out. Much more likely, a write to the filesystem's tables was impending or in progress, and that has scrambled where the filesystem thinks all the files' pieces are stored. If that is the case, modification date and size may be irrelevant, because those too are reported by the filesystem.
Aside from reading the raw sectors back and reassembling them (see the sketch below), however, I don't know that there's a remedy.
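If it comes to that, the usual route is to image the partition read-only and then carve files out of the image by signature; a sketch, with /dev/sda2 standing in for the actual HFS+ partition:

    # Take a raw, error-tolerant image first, so every recovery attempt
    # runs against the copy rather than the original disk.
    sudo dd if=/dev/sda2 of=hfs.img bs=4M conv=noerror,sync
    # PhotoRec (from the testdisk package) carves JPGs, MP3s, videos, and
    # PDFs out of the image by signature, ignoring the filesystem tables.
    photorec hfs.img

Carving recovers contents but not filenames or directory structure, so expect a pile of anonymously named files to sort through.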
Re:Newbie question hour? (Score:2, Insightful)