Data Recovery from ReiserFS RAID Array?

Ruatha asks: "We've recently had a problem with a ReiserFS RAID-5 array - two of the disks failed and, of course, some of the people using the array didn't have backups of their data...Ontrack have returned the disks because they can do nothing with them due to the FS we used on the array. Does anyone know of a company that can deal with data recovery from a ReiserFS RAID-5 array?"
  • Can the users who were not responsible enough to back up their data work some overtime and recover it the hard way? It doesn't seem to make much sense to go to all that trouble and expense to have an outside source recover data on the disks, when the end users showed lack of responsibility in the first place. Was there any sort of policy like this present at your company?
    • Often people responsible for backing up the data are different from the people responsible for creating the data.

      Just a thought.

      -Bill
    • Re:Responsibility (Score:4, Insightful)

      by foobar104 ( 206452 ) on Wednesday September 18, 2002 @01:00AM (#4279293) Journal
      Repeat after me, please: The purpose of IT is to help users, not the other way around.

      If an employee at this worthy's company lost data, it is the responsibility of the IT department to attempt to recover that data, within reason. That's what the IT department is for. This is a sensitive subject to me, because the IT department at my company closed down the IMAP port on our firewall tonight for what they called "security reasons," despite the fact that (1) we've been running IMAP over that connection for years now, and (2) the connection is encrypted with SSL. It literally took my yelling into the ear of the CTO over the phone, after calling him at home late in the evening, to get this problem fixed. The pervasive attitude of indignant hostility from IT departments in all sorts of industries is really starting to burn me up.

      If you worked for me in my IT department, and one of your RAIDs failed, and I had un-backed-up data on it, the only answer I'd want to hear from you is, "Yes, sir, we'll do the best we can and get right back to you." If I even heard a hint from you of the "you were irresponsible so it's not my problem" vein that you showed in your post, you'd find yourself being escorted out of the building carrying your stuff in a cardboard box. And we'd expect you to return the box.

      So just keep repeating it to yourself: The purpose of IT is to help users, not the other way around.
      • so users who shirk defined job responsibilities can foist the blame onto their unknowing IT departments?
        I find that pill a little hard to swallow. I would, however, examine the chain of decisions that allowed the implementation of what would appear to be an 'unsupported' FS for what is valuable, if not critical, data.

        Note: Now I know that ReiserFS is an actual standard with a vocal and dedicated following, and it serves a useful purpose as a journaling alternative to ext3. But if it cannot be accommodated in your disaster recovery plan, it would have to be considered 'unsupported' against that measure.

        All in all, I would have to agree that the critical misstep was in the bailiwick of IT, but I would hesitate to call it a failure in execution; it was a failure in planning, which has deeper roots in project management and communication.

        But hey, someone's gonna burn some oil trying to get back those numbers. Good luck to the original poster.
      • Re:Responsibility (Score:3, Interesting)

        by dubl-u ( 51156 )
        This is certainly true, but you should consider the flipside of it. The typical way it works with IT departments is that they are given unfunded mandates right and left. There is no possible way they can do everything with the money they have. What should happen is that some stuff should be taken off their plate. But they rarely have the political pull needed to do that, so what actually happens is that either everything is done poorly or the IT guys work on what they think is important.

        So before you go pointing fingers at the IT department's attitude, it would be good to ask, "Did they tell the management that they needed a way to back up those machines? And did the managers give them the necessary time and funds?"

        Every IT person I know with a bad attitude didn't start that way; they acquired it through years of crappy management.
      • Well, if I worked for you (thank god I'm an employer, not an employee) and you showed this kind of attitude, I would tell you to piss off.

        ...but I guess that's the American way of dealing with your subordinates.

      • SSL? It may have to do with an OpenSSL security problem.

        I have shut down access in order to fix a security problem. You can yell all you want, but if your admin is responsible, you'll be made to wait for things to be fixed, recompiled and tested before you get back access. That said, your admin should explain what the security problem is.

        If your admin doesn't care, then you get it back up immediately, but just risk losing access to lots of other things, whilst a fix is made.

        If your admin is good and you're the boss, you get a reason why and a brief summary of the risks, and you get to decide what to do. If you're not the boss, the admin shouldn't open holes NOR keep non-policy holes open for you.

        Things become more interesting if there's no fix...

        True, IT is to help users. But though yelling may seem to work, I'm not so confident about the long term effects.

        Admins could take the easy way out to avoid your yelling, rather than do what they should do.

        Cheerio,
        Link.
    • You think this Reiser RAID 5 setup was on the user's workstation? I doubt it.

      I think users were being responsible by saving to the server. How were they supposed to know that the IT staff's disaster recovery plan was going to fail?

      If they are going to a data recovery service then I have to assume there aren't any tape backups. Were they assuming that RAID 5 is a disaster recovery solution by itself? I have seen drives go out slowly while the backups were corrupt and/or incomplete. This is why you have to verify your backups regularly.

      Two drives out at the same time seems a little suspicious. Probably one drive was down for a while. Perhaps they need to monitor the status of the array; find better methods if it was monitored. I remember a story from a certification class, told by the tech for Standby Server (NetWare clustering). He got called in when a two-server cluster went completely down, and was chewed out all the way to the server room in front of all the bigwigs. After he checked it out, it turned out the cluster had failed over four months earlier, and now the second box had died too. Extremely stable boxes can get you too complacent about monitoring them.

      Maybe the higher powers wouldn't give them money to buy a tape drive to back it up. IT could teach them a lesson by sending the drives to a data recovery service so the big boys see that the cost of a tape drive is nothing compared to the alternatives. IT ends up taking the heat because they used a filesystem that the data recovery people couldn't deal with. Even if they were working under these conditions, there are things they could have done to minimize the problem. Ext3 seems a little better in this situation, even if there was a performance hit.

      Perhaps IT was just being lazy/stupid/etc. I don't know. I don't have all the facts so I will just have to say "what if".
    • Nah, sounds like a hardware bork if two of the three drives went. The *real* problem is a lack of reporting tools for the state of drives in raid arrays. Let's face it, if it's anything more complex than "drive has red light lit up, must replace" then you're a bit screwed to find out what's going on.

      And yes, I would like to know how it's supposed to be done.

      Dave
  • Could they have recovered from ext3?

    Dave
    • by 0x0d0a ( 568518 ) on Wednesday September 18, 2002 @01:30AM (#4279390) Journal
      Could they have recovered from ext3?

      Very likely, yes. Ext2 recovery techniques are well known and documented, and tools (if rudimentary) exist for recovering files from it. I believe that this translates well to ext3.

      Also, if you want someone to recover the data and you're willing to drop some money on it...I strongly suspect that there are people on the reiserfs team that would take on a recovery job quite happily. No one knows reiser better!

      That being said, there is currently no good, easy-to-use, powerful recovery tool for Linux filesystems, rather depressingly. Now, you *could* argue that this is because the filesystems are so great, but even so...
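
      For the simple "I deleted a file on ext2" case, the classic documented route is debugfs from e2fsprogs. A minimal sketch, assuming a made-up device name and whatever inode number lsdel happens to report:

        # debugfs opens the filesystem read-only by default
        debugfs /dev/hdb5
        # list recently deleted inodes (size, owner, deletion time)
        debugfs:  lsdel
        # dump the contents of a deleted inode (say 123456) somewhere safe
        debugfs:  dump <123456> /tmp/recovered.file

      That obviously doesn't cover the dead-hardware case in this story, but it's the kind of tooling being alluded to here.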
      • Very likely, yes. Ext2 recovery techniques are well known and documented, and tools (if rudimentary) exist for recovering files from it. I believe that this translates well to ext3.

        Translates exactly to ext3, in fact. If you ignore the journal, ext3 is ext2.
    • by Dashslot ( 23909 )
      Yes. I once managed to trash my /home partition. (Ironically, I was upgrading from ext2 to ext3 and I ran mke2fs -j instead of tune2fs -j!)

      To recover the files, all I did was...

      1) Used dd to make a copy of the partition

      2) Consulted /etc/magic for the headers of the file types I wanted

      3) Used a binary grep utility to find the starting points of those files

      4) Used dd to grab an arbitrarily sized chunk of the partition

      5) Found out that after exactly 48k of each file there was a 4k block of junk

      6) Used dd to assemble a new file with the 4k of junk missing

      7) Found what end tag each file type used and removed the junk after it (with dd, of course). Alternatively, I could have opened and re-saved the file in whatever app could handle it, and that would have trimmed the junk as well.
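
      In command form, the whole thing boils down to something like the rough sketch below. This is only an illustration: the device name, the JPEG magic bytes and the offsets are placeholders for whatever your own steps 2 and 3 turn up, and I'm using GNU grep's -abo options in place of a dedicated binary-grep tool.

        # 1) image the trashed partition so the original is never touched again
        dd if=/dev/hda6 of=/mnt/spare/home.img bs=1M
        # 2+3) find byte offsets of candidate file headers (JPEG magic shown)
        grep -abo $'\xff\xd8\xff' /mnt/spare/home.img | head
        # 4) carve a generous chunk starting at one of those offsets (slow but exact)
        dd if=/mnt/spare/home.img of=candidate.jpg bs=1 skip=123456789 count=2000000
        # 5-7) then strip the periodic junk blocks and trim past the end-of-file marker
        #      with more dd, or re-save the file in an app that tolerates trailing garbage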

      It took about 2 hours total to figure out how to do it, despite having no previous experience with filesystem debugging/etc

      Of course, I wasn't using RAID, but I would bet that your chances of recovering ext2/3 stuff are much much higher than reiser, whatever the underlying hardware.
  • from google --> datarec [datarec.com]
  • Am I the only one wondering how two disks failed at the same time? What happened? A power surge and no power strip or UPS? I don't really want to point fingers, but the odds of two disks failing at the same time seem really, REALLY freaking slim. Like, if this happens to you, go out and buy a lotto ticket.
    • It's not that uncommon. The same thing happened to an array at the university I study at. Remember, the disks are often equally old, and the chance of several disks in an array failing in the same year is quite good. When several fail within the same _year_, chances are pretty good two of them will fail on the same night.

      Not that it happens OFTEN, but it DOES happen, and more often than winning the lotto.
      • Re:What happened? (Score:3, Interesting)

        by photon317 ( 208409 )

        Simultaneous disk failure is about as rare as winning the lotto while simultaneously having you and your friend on the other side of the planet get struck by lightning - unless there's a common larger problem (e.g. power surge to both drives or something).

        As a practical example, 4 or 5 years ago I had a large number of disks attached to some large Oracle servers, roughly on the order of 600 or so hard drives in several arrays taking up several racks, all the same manufacturer/model, with a handful of groupings of revision/lot/date.

        This set of disks saw fairly constant and heavy activity for a few years while I was there. As you can imagine, with 600 disks and the usual MTBF numbers, we quite regularly had disk failures. We kept a few spares onsite and replaced them as they failed, then exchanged the dead drive for a new spare. As I roughly remember it, we probably averaged about one disk failure every 2-3 weeks. Two, perhaps three times, we had a double disk failure during a 24-hour period - but they were never close enough together that we didn't have plenty of time to replace the first (and in any case, the odds are slim that two failures out of 600 drives would happen to affect the same data).

        Of course, another point back at the original guy with the failed disks - don't use raid 5, chunk out some more money (disks are cheap) and do proper mirroring - and if you stripe use 1+0, not 0+1.
        • I've had multiple disks in a RAID go at once; in practice, failures might not be totally random events from normal wear and tear, but can instead come from hardware bugs in various subsystems, power supply problems, heat problems, etc. In essence, the physical (and electrical) proximity of all the drives seems to elevate the chance of multiple simultaneous failures.

          • Again, those (heat, power) are external events though, not random failure, as in what the MTBF is indicating. Controlling the external factors is the sysadmins' and datacenter ops peoples' job. As for wear and tear, no it won't cause simul-failure. As for hardware bugs - if they're in the drive's firmware it's remotely possible, but I don't know... never heard of it happening.
        • Re:What happened? (Score:2, Interesting)

          by bogeskov ( 63797 )
          At the company I work at, we had a scheduled power outage (cable checking in the ground outside the building, and possible replacement). As a result we went down to a minimum of running systems. No need to drain the diesel that fast :-), so we powered off the systems that weren't needed through the night.

          The next morning we powered everything up, and on all systems where the disks were more than 2.5 years old (and had been running for all that time), we lost about 20-25% of the disks. We've been told that this was because the disks actually went cold, and the reheating broke the disk platters. On most of the disks, within 2 hours of power-up you could hear the heads grinding against the platters, since the platters were wobbling.

          As a result more than one system had multiple disk failures, but the backups were good so there was no problem (except for the 4-hour delivery time on the replacement disks).


          • Yeah - lesson #1: never power off your disk arrays until you're about to throw them away. They should've brought in a generator. Making the mistake of letting your drives go cold is an external circumstance that will cause multiple failures.

            • Footnote: If you do ever suffer failed drives from this scenario (drives have been running for extended periods, like months to years, you power them off and allow them to cool, then try to use them again), there's a trick to reviving them that works a lot of the time:

              Hold the drive in one hand, with the flat metal surface which covers the disc platters facing up, and give it a good solid whack with the handle of a screwdriver. Another variant is to drop it on the floor from a height of 1-2 feet, again hitting the metal cover surface over the platters, not the circuit board side or the edges.

              I've seen many an IBM technician use this technique successfully on service calls, and I've done it myself a few times; it does work. The supposed theory is that the heads get stuck in the parked position, and you're jarring them loose again.
              • I've had good luck resurrecting drives with these less destructive steps:

                1. Warm the drives back up to their usual operating temperature. I find that an hour or so in a car on a sunny summer day works well for this.
                2. Holding the drive cover side up, quickly turn it clockwise and counter-clockwise.
                3. Hook it up, and try it while it is still warm.

                No need for sharp impacts, just a quick twist after warming the drive back up has worked for me many times.
    • I have a theory. I have no idea who posted this question or anything about them, but I saw this same thing happen once, and I bet this is their story: the RAID dropped one disk, but the same person who was in charge of their backups was also in charge of monitoring the RAID, i.e., nobody. So the RAID ran degraded for a while, then poof. Dropped another disk. All hell breaks loose. Film at eleven.

      Like I said, I've seen it happen. I'll give you three-to-two odds that that's just how it went down.
      • Heh, I was working on a big LDAP-based site once. We had two LDAP servers and a load balancer to fail one out if it shit itself. A couple of months in, we came in one morning to discover LDAP down and the site wiggling its feet in the air. Why didn't the load balancer fail over onto the other LDAP server?

        Once we'd done the autopsy, we discovered that it did.... days ago, and nobody noticed. Hence when the second LDAP server went as well, we were left high and dry.

        Dave
      • We do systems for small businesses where there is no IT staff. We use RAID controllers with alarms; when one of the disks fails, the noise is so loud that we get a call within a minute.
      • This is why you have a spare disk in the array that automatically gets used if a drive fails.
        • This is why you have a spare disk in the array that automatically gets used if a drive fails.

          Except, of course, nobody put a gun to your head and made you configure hot spares when you set up your LUNs. They're entirely optional. It's very easy to imagine a scenario in which a system administrator either neglected to or didn't realize he had to configure one or more hot spares. Hell, I don't have to imagine it. I see it all the time.

          Also, in some cases RAID set rebuilds impose a significant performance penalty. A RAID may not be configured to start a rebuild automatically. Instead, it's expected that a system administrator will be aware of the failed disk and either physically replace it or use the hot spare, then start a rebuild overnight or something like that. In that situation, all the hot spares in the world won't help you if nobody logs in and starts the rebuild.
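
          If the array in question is Linux software RAID (the raidtab/raidstart tools mentioned elsewhere in this thread), a hot spare is just one extra stanza in /etc/raidtab. A rough sketch only, with made-up device names:

            raiddev /dev/md0
                raid-level              5
                nr-raid-disks           3
                nr-spare-disks          1
                persistent-superblock   1
                parity-algorithm        left-symmetric
                chunk-size              64
                device                  /dev/sda1
                raid-disk               0
                device                  /dev/sdb1
                raid-disk               1
                device                  /dev/sdc1
                raid-disk               2
                device                  /dev/sdd1
                spare-disk              0

          With that in place, the md driver pulls in the spare and starts rebuilding on its own when a member dies; without the nr-spare-disks/spare-disk lines you're back to hoping a human notices the degraded array.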
    • It's not too rare at all, and stresses the importance of having good backup power availability and quality storage hardware.

      If disks run continuously, 24 hours a day for months and months, they are very likely to see one or more failures when they are powered on and off.

      The "very unlikely, buy a lotto ticket" problems occur if you get a bad lot of disks. I was at a customer site last year where several slowly failing disks corrupted the partity of a RAID-5 volume in an EMC array and wiped out the data when one of the disks eventually failed. All data on the volume was lost. I turns out there was a manufacturing problem with the disks, and all of these disks were packed on the same day.

      • Interesting -- it seems then that a decent strategy would be to mix disks from different manufacturers.

        That and adequate alarming.
        • Re:What happened? (Score:3, Informative)

          by duffbeer703 ( 177751 )
          That is not a smart strategy.

          A lot of storage is very finicky -- and odd low-level problems emerge when you mix disk manufacturers, or even disk models in some cases.

          When Sun 5200 fibre arrays first came out, if you got a batch of Sun-branded drives that were manufactured by different vendors, you would have all sorts of odd locking issues and other goodies.

          A good strategy for reliable storage:

          - Don't be cheap. You get what you pay for.
           - Plan for disk-to-tape (or DVD) backup, and actually test restores regularly (a quick sketch follows this list)
          - Use well-supported, stable versions of filesystems
          - Don't buy the latest and greatest unless there is a business reason to do so
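
          On the "actually test restores" point, even a crude check beats none. A hedged example, assuming GNU tar and a SCSI tape drive at /dev/st0 (the device and paths are illustrative only):

            # write the backup, rewind, then compare the tape against the live filesystem
            tar -cvf /dev/st0 /data
            mt -f /dev/st0 rewind
            tar -dvf /dev/st0
            # and periodically do a real spot restore into scratch space
            mt -f /dev/st0 rewind
            tar -xvf /dev/st0 -C /tmp/restore-test data/some-important-file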
  • by arcade ( 16638 ) on Wednesday September 18, 2002 @01:33AM (#4279395) Homepage
    First of all, since _two_ disks got screwed at the same time, you've lost the "normal" chance of getting the data back. RAID-5 ensures that if ONE disk fails, you can get the data back due to the parity-stuff - but not if two disks fail. You probably know this.

    So, what needs to be done is to get one of the disks back online again. That *should* be possible, and if nothing really, really bad happened it should, I think, in theory, be as easy as getting a lab to pull the disk platters out and put a new motor / new electronics on them. I'm not sure about this though ;)

    Preferably it could be done by the disk manufacturer.

    You could also check out an excellent company in Norway called IBAS. Check out http://www.ibas.com/america/index.html [ibas.com] for their american office. They are really excellent at data reconstruction.

  • Comment removed (Score:5, Informative)

    by account_deleted ( 4530225 ) on Wednesday September 18, 2002 @01:39AM (#4279412)
    Comment removed based on user account deletion
    • Re:quick fix (Score:3, Informative)

      by ChadN ( 21033 )
      I've had drives which were fixed by the same method; apparently heat is a major cause of this. I would also suggest trying to operate the drives in a cold room (or with some suitable extra cooling), to see if they'll work long enough to recover the data.
      • In systems like mine, with 3 HDs stacked one on top of the other, a cold room would not be enough. Heat dissipation was bad, and the HDs would quickly get too hot to the touch. I did no measurements, just put my hand there and felt, but it must have been about 60 C or more.

        I noticed that each HD model heats up more or less than others. So I got worried about this and bought HD coolers. Huge difference: now the HDs just get warm, not hot. Maybe it's not necessary when you're running just one HD, but when you have more than one, I think it's really worth it.

        You can find HD coolers in lots of places, but I suggest you do a Google search [google.com.br].
    • This is a very good idea, and if it doesn't help the situation then there may be more than just electronics failure involved.

      A few years ago I worked at a company where a RAID array in another group's server had one of those "win the lottery and get hit by lightning in one day" moments. Our own IT person got suspicious and started asking questions. It turns out that they had a single drive failure, but when they put in the replacement drive the guy had moved the remaining drives around in the bays because he wanted the newest drive in a particular position. This totally confused the poor controller.

      Instead of moving the drives back, he reformatted. And the last backup tape was bad, and the one before that. The weekly was only three days old though, and that one was readable. So all's well that ends well I guess.
  • Freeze it (Score:2, Interesting)

    by jsantala ( 139130 )
    I just heard a tip from a hardware guy. He says you can "freeze" a drive in a freezer to make it work for that one last spin. You'll have to be fast, though, and copy the stuff quickly somewhere else. After it warms up again, it's gone for good.

    He told me this since his laptop hard drive went bad a few days ago and he had successfully used the trick on it, saving all the data one would have thought lost.

    According to him this should work on drives with either broken electronics or broken mechanics. He said he has many theories about why it works, but has proven none. YMMV.
    • Re:Freeze it (Score:2, Interesting)

      by Keith_Beef ( 166050 )

      Beware of the condensation that will form when you take the drive out of the freezer into a warm room... and the time it takes to connect the data and power cables...

      It's probably a good idea to get the manufacturer's advice on the operating temperature range, too.

      If you really want to try the freezer method, I suggest that you connect the cables to the drive, then put the drive in a plastic bag (the type that you use for freezing leftovers). Make some holes to pass the cables to the outside, and seal the bag around the cables. Put a small bag of silica gel inside the bag, and seal the bag.

      If the cables are long enough, you should be able to put the bagged drive inside the freezer, and leave the cables dangling outside.

      If the cables are too short, then you could use those blue freezer blocks (sold for cool-bags) to lower the drive temperature, and keep the drive a bit cooler while you're recovering the data.

      Or if you want to go the whole hog, use some polystyrene to insulate a cardboard box, put the bagged drive in the box, along with a block of dry ice (cheaper than mineral water).

    • Re:Freeze it (Score:2, Informative)

      by Komarosu ( 538875 )
      I may get shouted at for slander (we've been "warned" by our suppliers and Fujitsu) but... we've had a few (more like 100+) Fujitsu drives fail over the last few months with "known manufacturing problems", and it's becoming more of a headache for us and our supplier. We've worked out that if you leave the hard drive to cool for a good few days, you will get a final 15-20 minutes out of it, enough to image the drive with Norton Ghost or dd. Note: just look out for Fujitsu MPG or MPAT 10GB drives...
  • by ComputerSlicer23 ( 516509 ) on Wednesday September 18, 2002 @04:02AM (#4279838)
    Assuming it's really, really necessary, there are ways of gleaning data off of a drive that has died. You can send your drives off to such a company, and for a thousand dollars a disk they take each one into a clean room, disassemble the entire platter assembly, physically prepare the platters so they are as good as can be, then put them under their own super-duper head assembly.

    Assuming you can find one of them (use Google, for god's sake), get them to read the drive linearly. At this point, they should be able to put each drive's contents onto new drives for you. Build a raidtab that matches the new drives they give you, do raidstart /dev/mdXX, run fsck.reiserfs on it, and mount the thing. Poof, like magic, you have data again.
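
    For the original poster's benefit, that last step is only a handful of commands once the recovered copies are in the box. A rough sketch, with made-up device names and assuming /etc/raidtab has already been edited to describe the replacement drives:

      # assemble the md device from the raidtab definition (raidtools era)
      raidstart /dev/md0
      # sanity-check the filesystem before trusting it (reiserfsck is the same tool as fsck.reiserfs)
      reiserfsck --check /dev/md0
      # mount read-only and copy everything off before attempting anything braver
      mount -t reiserfs -o ro /dev/md0 /mnt/recovery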

    I've heard of several people recovering Ph.D. and master's dissertations this way. It's why the DoD has such strict standards about the destruction and disposal of hard drives: it's very difficult to delete data from a drive so that it can't be recovered. It can just get out-of-control expensive to recover the data.

    If it's worth $5-10K, and you can deal with not having it for several days, this might actually work... If the company bitches, just tell them it's about what a good tape setup would have cost if they had bought one.

    The two guys who said try stripping the electronics are pretty brave SOB's, but that might be worth trying if you can't spring for some experts to deal with that for you.

    Kirby

    • That's exactly what the poster did.

      The problem is, instead of using a well-established filesystem like ext2, JFS, XFS, etc., he chose to put critical data on ReiserFS, which changes so often that the data recovery companies do not know how to recover from it.

      He should try another company, and if that fails he should take a good look at the configuration choices that he makes.
      • Hmmm, I'm not terribly sure on this. Follow the link on the story. It looks like what he was using was a rinky dink program to essentially look for inodes before they were written.

        I went and poked around some more, you might be right. It looks like they have several levels of support, one of which is some crappy shrink wrapped software. Not much it can do for broken drives. The other is a cleanroom pull apart setup like what I mentioned...

        To be honest, they are idiots if they can't get enough of the filesystem off of there. I'd put money on the fact that XFS and JFS are no better supported than Reiser. Reiser, I believe, is at least a secondary install choice on SuSE Linux. I believe XFS is only on the SGI-rebranded Red Hat. Meanwhile, I don't believe JFS is used by default on any system.

        If they can physically pull what used to be the bits off the disk with any reliability at all (and if they can't, the service is useless), then I should be able to mount the result and retrieve a lot of data. That in and of itself would be of extreme value to the poster. You should be able to handily mount/fsck it and get the data. I know I've done it before -- never with ReiserFS, but with ext2 on a drive that was fried in certain sections. Just used mount and tar, and I was done; I got 95% of the data off just with that. I'm not sure which of those groups they are, but all things considered, just pulling the bits from the disk should be more than enough, assuming the media isn't totally fried, to get the majority of the information off the disk.

        • Remember that IRIX and AIX use XFS and JFS, respectively.

          If you sent corrupted volumes to a shop that can recover commercial Unix disks, they should be able to recover data from a Linux box. JFS=JFS.
  • by martin ( 1336 ) <maxsec.gmail@com> on Wednesday September 18, 2002 @04:58AM (#4280003) Journal
    For me the guys are www.vogon-data-recovery.com

    There are others, but I always seem to see these guys at the forefront...

    just my 2 pence..

  • It is very obvious, but the same guys who develop ReiserFS offer support for this kind of problem.

    Go to Namesys.com, and purchase support there. Maybe they can help you.
  • I work on the design of hardware RAID controllers, and the original post is missing a key piece of information -- are the two drives really bad, or were they just marked as bad by the RAID driver? I ask because often (particularly with parallel SCSI drives) you have a bad connection or a bad drive which causes one drive to hang, which makes the RAID set go degraded; but since the bus is still corrupted, the next drive it tries to talk to will also appear to be bad, so the array is marked offline. This happens at least ten times as often as genuine double drive failures. How the RAID software reacts to double drive failures depends upon the author. You should have some kind of log or console printout -- the timestamp of the errors is the tell-tale clue: if the failures happened within a minute or so of each other, then you can be pretty sure that only the first failure is real.
    First things first -- put in a set of scratch drives and see if the bus and HBA are working OK. Test them thoroughly! Then, using a read-only tool like IOMeter, check each of your original data drives individually to see if it can read reliably across the platters (if using NT for this test, do NOT let DiskAdmin write signatures on the drives!!). Hopefully one of the drives will be bad -- if so, set it aside. Perhaps you are certain you know which drive was the first to fail -- remove that one if you know it. Reboot, and see if the array metadata is recognized. If not, concentrate your efforts on fixing the second drive to fail. If there is a gap in time between the first and second drive failures, then the data on the first drive to fail is no longer of any use to you, as it is out of sync with the other drives.
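    If the host is Linux, the same read-only surface check can be done with plain dd instead of IOMeter. A rough sketch with made-up device names -- nothing here writes to the suspect drives:

      # read every block of each original member and watch for I/O errors
      for d in /dev/sdb /dev/sdc /dev/sdd; do
          echo "=== $d ==="
          dd if="$d" of=/dev/null bs=64k conv=noerror
      done
      # the syslog is where the tell-tale failure timestamps end up
      grep -i 'scsi\|i/o error' /var/log/messages | less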
    If you have more details please post them here and I will try to give you more detailed advice.
    One other soapbox comment -- people who sell RAID technology should always provide some kind of metadata debugger because, as they say, sh!t happens.
    • Agreed, I see this fairly often. However, I see genuine dual drive failures too.

      No, the drives don't both fail at the same time. More typically the array is attached to a "custom" built system that does not have adequate alarming. In this case one drive fails and no one realizes it has failed. The system continues to operate for an extended period of time, possibly many months, and then a second drive fails. At that point it's all over.

      Now this is rare with mainstream systems like Compaq/HP and Dell. With these systems, a drive fails and the server emits an ear-piercing alarm that cannot be stopped without rebooting the system or replacing and rebuilding the drive. It's absolutely maddening, especially if you have to wait several hours with this alarm screaming away. But it makes genuine dual drive failures on these systems unheard of.

      • Most of the Dell controllers have a button on the controller to disable the siren. I was trapped in a closet working on a PowerEdge 2300 when I discovered that one. The poor users whined about the noise when it was quite a way away from them and behind a door. Poor babies.
  • I lost 40 gigabytes of (mostly irreplaceable!) data on a 40GB UFS filesystem.

    It seems I am SOL.

    I have searched and searched and SEARCHED high and low on Google and other search engines. There appears to be no UFS undelete tool. Ext2 undelete tools exist, but nothing for UFS. :(

    It was actually a FreeBSD 4.5 FFS filesystem to be exact, using soft updates. I unmounted the filesystem within seconds of realising the data was gone. (And no, I didn't do rm -rf; it was a bad mistake with find and -delete.)
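
    For anyone wondering how find manages to eat a filesystem: the classic trap is predicate order, since -delete acts on every file it reaches before any later test gets to filter it. This is purely an illustration of that kind of mistake, not necessarily the exact command used here:

      # intended: delete only the *.core files
      find . -name '*.core' -delete
      # catastrophic: -delete runs before -name tests anything, so everything under . goes
      find . -delete -name '*.core'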

    So... does anybody know, or do I just give up?

    D.
    • It's theoretically possible to do some undeletes manually using fsdb(8). I believe I saw a FAQ entry or handbook entry or something on one of the BSD user help sites about the general idea of how to recover from this, but... pretty much you're talking about an extreme amount of time spent per file, and mixed results.
  • You've got an array with multiple users, so I assume that this is a server and not a workstation. Or possibly it is a SAN with multiple servers (users)? What I don't understand is how the users are responsible for the missing backups.

    In the case of a server, it is obvious that a server level backup should be backing up the data of all users. If you are running a SAN and you are referring to certain servers not backing up their particular partitions, it is still inexplicable because you should be doing SAN level backups in this case.

    So, while it is easy and common to blame the users for not doing backups, this case sounds more like a management problem on the network side. If it's in the datacenter it should be backed up. Period! If it's not in the datacenter, well, we all take our chances don't we. If you run without backups it is completely unreasonable to expect any form of safety from total data loss.
  • and companies who specialize in data recovery have problems with ReiserFS RAID arrays, it looks as if computer forensics people might have the same problem. Thus, a ReiserFS array is the way to go.

    Of course, the normal precautions (encryption, performing sensitive tasks on a machine not on the network, and so on) still apply.
