Data Storage

Ask Slashdot: How Do You Test Storage Media? 297

Posted by timothy
from the give-her-some-storage-tarot-cards dept.
First time accepted submitter g7a writes "I've been given the task of testing new hardware for use in our servers. For memory, I can run it through things such as memtest for a few days to ascertain if there are any issues with the new memory. However, I've hit a bit of a brick wall when it comes to testing hard disks; there seems to be no definitive method for doing so. Aside from the obvious S.M.A.R.T. tests (i.e. long offline), are there any systems out there for testing hard disks to a similar level to that of memtest? Or any tried and tested methods for testing storage media?"
This discussion has been archived. No new comments can be posted.

  • SpinRite (Score:3, Informative)

    by alphax45 (675119) <kyle.alfred@gmail . c om> on Tuesday April 03, 2012 @01:30PM (#39562185)
    • Re: (Score:2, Informative)

      by Anonymous Coward

      Does their product actually do anything these days? Seems like the last time I used it was when you had the choice of an ARLL or RLL disk controller, haha...

      Anyway, I always stress test my drives with IOMeter. Leave them burning on lots of random IOPS to stress test the head positioners, and don't forget to power cycle them a good number of times. Spin-up is when most drive motors will fail and when the largest in-rush current occurs.

      • Re: (Score:3, Informative)

        by alphax45 (675119)
        Still works 100% as HDD tech is still the same - just don't use it on SSDs
        • Re:SpinRite (Score:5, Informative)

          by SuperTechnoNerd (964528) on Tuesday April 03, 2012 @02:22PM (#39562855)
          SpinRite is not that meaningful these days, since drives no longer give you low-level access to the media like in the days of old. Since you can't low-level format drives, which was one of SpinRite's strong points, save your money and use badblocks. Run badblocks in read/write mode with a random test pattern or a worst-case test pattern a few times. Then do a SMART long self-test. Keep an eye on the pending sector count and the reallocated sector count. A drive can remap a difficult sector without you ever knowing unless you look there. Also keep an eye on drive temperature; even a new drive can act flaky if it gets too hot.
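
          A minimal sketch of the "keep an eye on the counters" step, assuming smartmontools is installed and the drive reports the common ATA attribute names Reallocated_Sector_Ct and Current_Pending_Sector (names and layout vary by vendor, and SAS/NVMe drives report differently). The helper name smart_counters is made up for illustration; it only parses text, so it also works on saved smartctl -A output.

          ```shell
          # Hypothetical helper: print the raw values of the two counters mentioned
          # above from `smartctl -A` text on stdin. Assumes the usual ATA table where
          # the attribute name is column 2 and the raw value is column 10.
          smart_counters() {
              awk '$2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" {
                  print $2 "=" $10
              }'
          }

          # Usage (device name is an assumption; needs root):
          #   smartctl -A /dev/sdX | smart_counters
          ```

          Run it before and after each badblocks pass; counters that keep climbing are the thing to watch for.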
          • Re:SpinRite (Score:5, Informative)

            by NotBorg (829820) on Tuesday April 03, 2012 @04:21PM (#39564571)

            A drive can remap a difficult sector without you ever knowing unless you look there.

            Except when it happens a lot... then your drive is F***.............I.....N...............G slow for no apparent reason.

            We "test" our drives by filling them with whatever data we have lying around. We do this 5 to 10 times (depending on how soon we need the drives). What eventually happens with a bad drive is that the SMART counter ticks over to some magical number and starts reporting health issues (a requirement for some RMA processes). We also time each fill cycle. We expect the first two or three runs to take longer (EVERY drive these days will have relocations going on for the first few runs). For later runs we expect to see a more consistent fill time and the relocated sector count to stop climbing so alarmingly fast.

            There are bad sectors on your brand new drive. You can count on it. You have to make the drive find them and map around them because it won't happen in the factory. Write to every byte several times. Don't wait for it to happen naturally... you'll just hit performance problems and put yourself closer to warranty cutoff time. They're counting on you not finding a problem soon enough. You must burn them in or suffer later.

            • There are bad sectors on your brand new drive. You can count on it. You have to make the drive find them and map around them because it won't happen in the factory.

              In the MFM/RLL days, SCSI disks were tested in the factory and came with a list of known bad C/H/S locations, and also kept a list of bad sectors that developed afterwards. I forget whether the controller board had to skip those sectors during LBA translation or the OS had to avoid using them.

              When IDE drives came out, the 'factory list' suddenly disapp

        • by yacc143 (975862)

          LOL, RLL hard disks had capacities that are, by today's standards, somewhere between the CPU cache size and the RAM size of an average smartphone.
          (My first PC had a 20MB HDD)

          Put simply, a modern HDD is about as similar to an RLL HDD as a Cadillac is to a trike.

        • Re:SpinRite (Score:5, Informative)

          by linebackn (131821) on Tuesday April 03, 2012 @02:49PM (#39563255)

          >Still works 100% as HDD tech is still the same

          Not entirely true. Back in the days of MFM/RLL drives, SpinRite could perform a "low level" format on each track. This ensured every last magnetic 1 and 0 was re-written to the disk. Back in the day I witnessed many times when SpinRite would completely recover bad sectors, presumably damaged by electrical/controller issues rather than physical surface issues, and a full pattern test would prove the space was safe to use.

          Modern IDE drives don't allow low-level formatting, and as far as I know, even re-writing the user content of the drive does not re-write sector header data. Modern IDE drives also have hidden reserved space for "spare" tracks and space where they store their firmware, which likewise never gets tested or re-written.

          Additionally, on MFM/RLL drives SpinRite could use low-level formatting to optimize the sector interleave for the specific system. (You would be surprised, moving some disks and their ISA controllers to a faster system would actually require a higher interleave, slowing them down incredibly until SpinRite was run)

          Still, SpinRite is the only program that I know of that can do a controlled read/write pattern test and modify the underlying file system when needed.

          • Modern IDE drives don't allow low-level formatting, and as far as I know, even re-writing the user content of the drive does not re-write sector header data.

            Enhanced Secure Erase should get a bit further than the user content. When initiating that, the drive internally uses the more detailed information it knows about itself to perform a deeper erase (at least in theory). For this kind of operations I recommend Parted Magic [partedmagic.com] (works for nuking SSDs too).

        • Re:SpinRite (Score:5, Informative)

          by washu_k (1628007) on Tuesday April 03, 2012 @02:52PM (#39563301)
          Spinrite may do an OK job of exercising disks, but 90% of what it claims to do is BS.

          An easy test to prove that Spinrite is BS is run it against a USB key. Not a SATA SSD, but a USB flash drive. Make the USB key bootable with DOS, put Spinrite on and boot a PC with no other drives. Run its "tests" against the USB key. All the "low level" tests Spinrite claims to do will appear to work, but are impossible on a USB device.

          In fact, they are impossible on a modern mechanical HD as well. As yacc143 pointed out, modern drives are not the same as MFM/RLL drives of the past. The low level tests that Spinrite claims to do are simply impossible.

          It's also a terrible data recovery program, since it can only write recovered data back to the same disk. That's a data recovery 101 no-no, and Spinrite fails.
          • Re: (Score:3, Informative)

            by alanmeyer (174278)

            Spinrite may do an OK job of exercising disks, but 90% of what it claims to do is BS.

            This is a very uninformed opinion about Spinrite. Spinrite has a large population of testimonials that prove that "it works". Its main purpose is data recovery and data maintenance on magnetic-based rotational media.

            Your example of a USB drive is just another way of saying "flash", for which Spinrite is not targeted to fix.

            Indeed, there are no more "low level" commands like in the day of old HDD technology. However, Spinrite uses the standard ATA command set to do everything possible to get your data off

          • by mixmasta (36673)

            You misunderstand ... Spinrite exercises the drive and the drive heals itself. Steve's pretty clear about that.

      • by rickb928 (945187)

        SpinRite is excellent for testing. If your drives run as hot as the old Hitachi drives did, it doubles as a space heater or makeshift stove.

        Seriously, SpinRite will exercise a drive very well indeed. And it will tell you more than the manufacturer wanted you to know.

    • by NIN1385 (760712)
      SpinRite is a good one, have used.
    • by cpu6502 (1960974)

      What about motor failure? My last drive became inaccessible when the motor stopped spinning (6 months continuous spin, followed by power failure, followed by no spin).

      • Re:SpinRite (Score:5, Funny)

        by NEDHead (1651195) on Tuesday April 03, 2012 @02:07PM (#39562681)

        Did you restore power?

      • by rrohbeck (944847)

        That may have been stiction. It was a big issue for some time a couple of years ago. Media was textured in the landing zone so the heads wouldn't stick on the super smooth data surface but the head retraction mechanisms weren't perfect so the head did sometimes land in the data zone when power failed. Chances are it gets stuck there.
        These days everybody uses ramp loading and the head isn't allowed to touch the disk ever.
        Power cycling the system (after a full backup) to check if it comes back is still good a

    • Anything coming from GRC is, IMO, awesome by default.
  • by Quiet_Desperation (858215) on Tuesday April 03, 2012 @01:34PM (#39562219)

    I've hit a bit of a brick wall when it comes to testing hard disks

    Have you tried throwing them against the brick wall?

  • Why? (Score:5, Insightful)

    by headhot (137860) on Tuesday April 03, 2012 @01:35PM (#39562233) Homepage

    Even if your storage passes the test, it could fail the next day. What you should be doing is designing your storage to gracefully handle failure, like RAID 5 with spares.

    • by Shikaku (1129753)

      The point is to know whether it's faulty now, at the time of arrival, rather than 2 weeks down the line when it becomes a problem.

      • Re:Why? (Score:5, Insightful)

        by Shagg (99693) on Tuesday April 03, 2012 @01:40PM (#39562319)

        No, the point is to design your system so that if it fails 2 weeks down the line... it isn't a problem.

        • by na1led (1030470)
          I agree. You can't test when a piece of hardware is going to fail. I've purchased many Hard Drives for our servers, sometimes they last years, sometimes they fail after a few weeks. There is no way to tell.
          • by Sir_Sri (199544)

            That isn't really true. Lots of hard drives have various states of failure, and you might be able to write data to it even if it has SMART errors. There isn't a universal way to tell if a drive is going to permanently die.

            A classic example is a hard drive 'clicking'. The read head is contacting something intermittently, but it may still appear to work. You want to get that data off and onto another drive ASAP. Now if you get a drive out of the box like that, there's no point in even putting it into a ma

            • by djsmiley (752149)

              Ah yes, except the clicking is commonly caused by a lack of oil between the ball bearings in the motor.

              No such failure.

            • The click is a read error. Drive cannot read a sector, so it moves the arm all the way to the edge of the platter to recalibrate (that's the "click"), then moves it back in and tries to read the sector again.

        • by ByOhTek (1181381)

          I think both are important.

          If you have the time to test now, it will save you the hassle of swapping it out later.

          • Re:Why? (Score:5, Insightful)

            by Joce640k (829181) on Tuesday April 03, 2012 @01:56PM (#39562533) Homepage

            Point is: You can't 'test'.

            You can only tell if it's working, not when it's about to fail.

              If people could predict when hard drives were going to fail we wouldn't need RAID or backups.

            • Re:Why? (Score:5, Informative)

              by CAIMLAS (41445) on Tuesday April 03, 2012 @05:42PM (#39565751) Homepage

              To a degree, you can establish with certainty that everything is working.

              New equipment does tend to have ghosts. Given enough systems, with homogeneous roles, it doesn't matter: if it starts to fail, you pull it and put another one in.

              If you've got an environment with only a few servers with dedicated roles, having a new 'production server' go tits up is a very bad thing. For a system like this, you really do want to do a 'burn in' period, IMO for at least a couple weeks, where the system is not being depended upon. Your 4-year-old system doing the same thing at relatively diminished capability is not nearly as bad as doing a cut-over and having things go south, then.

              You do, however, want to do a "burn in" on that new equipment. My preference is to stress a new piece of equipment with something like building kernels (which will stress every significant subsystem to some degree) while doing file operations (e.g. something like bonnie++ if you're not copying files to the machine) for a period of at least a week without any stability or significant performance problems. This is due to the following subjective observations:

              * Getting a system with a defective disk is not unheard of these days, but it's not common, so it's not a serious concern.
              * Short of initial failure of the disk/DOA status, the disks will likely run a number of months before your first failure (depending on how many you've got, of course)
              * Instability, inconsistent behavior, flaky RAM, or odd behavior from RAID or NIC controllers, and 'ghosts' can almost invariably be traced back to the PDU or PSU. These seem to die within about two weeks to a month if they're defective/poorly designed. With a server, troubleshooting this can be a huge bitch due to how loud they are and the multiple-dependence issue on the PDU. This is kind of an end game for me, and I have a hard time trusting any of the equipment after I've had a PSU fail.
              * if you plan on taxing the system at all, you'll probably have a driver related performance problem somewhere down the line. Better to find it before you need the performance.
              * Every once in a while, you've got a bad solid state device (RAM, CPU, SSD). These seem to either work, or not work, if they pass initial "does it work?"

          • by Lumpy (12016)

            your "test" will tell you nothing at all.

            If you are a PHB that thinks that "testing" is important and apply it to everything then hire someone at $8.00 to waste their time.

            You might have them run some software to test the server enclosures and 19" racks at the same time.

      • Re: (Score:3, Informative)

        by jeffmeden (135043)

        Hard drives, amazingly, are tested pretty effectively before leaving the factory. During tests in a controlled environment it was demonstrated that hard drives show no "failure curve" at onset, but follow a very boring, linear progression throughout their lifespan. The result: if you don't screw up when you install it you have little to worry about on day 1 that is different from day 1000, which is the cold reality that all mechanical devices will fail.

        Cue the "but I have seen so many DOA drives from XYZc

        • A plastic strap won't save you from the drive head failing to move. I've seen this happen when a bunch of unemployed temp workers unload the truck. This is why it seems "batches" of similar drives fail if you are getting them from the same source... some asshole was throwing and kicking the boxes around.
          • Re: (Score:3, Insightful)

            by jeffmeden (135043)

            A plastic strap won't save you from the drive head failing to move. I've seen this happen when a bunch of unemployed temp workers unload the truck. This is why it seems "batches" of similar drives fail if you are getting them from the same source... some asshole was throwing and kicking the boxes around.

            If your static strap is made of (all) plastic, then you will have issues beyond shipping and handling woes...

        • During tests in a controlled environment it was demonstrated that hard drives show no "failure curve" at onset, but follow a very boring, linear progression throughout their lifespan.

          Interesting. Didn't the Google study on disk reliability [slashdot.org] show a distinct infant mortality spike in the beginning with a lowest failure rate between 1-2 years of age and after 2 years of age a sharp increase in failure rate quickly reaching a certain plateau? What you describe seems to be quite different.

          • by jeffmeden (135043)

            During tests in a controlled environment it was demonstrated that hard drives show no "failure curve" at onset, but follow a very boring, linear progression throughout their lifespan.

            Interesting. Didn't the Google study on disk reliability [slashdot.org] show a distinct infant mortality spike in the beginning with a lowest failure rate between 1-2 years of age and after 2 years of age a sharp increase in failure rate quickly reaching a certain plateau? What you describe seems to be quite different.

            Actually no that was the study I was referring to and it didn't show anything like the "Bathtub" curve you describe. The AFR for drives 0-1 year old was steady for the first year (at most 1% higher for drives new to 3 months than in the first year), and from 1-2 years it held steady and then 2-3 years it rose precipitously until in year 5 when it became statistically chaotic (likely due to drives that were suffering from more obscure failures than the normal bearing or platter wear-out.) Basically, it dem

            • There is a very slight bathtub type curve - all numbers rounded, it's about 3% AFR in the first quarter (i.e. about 0.75% failures in first quarter) and 2% for drives in the 3-12 month range (i.e. about 1.5%). If I read the statistics presentation there right 33% of first year failures look to happen in the first quarter, which is detectable but minor initial higher rate. That's dwarfed by 1-2 year AFR (about 8%) and 2-3 year AFR (about 9%), but drops slightly after that.

              They presented the AFRs rather tha
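
              The rounded arithmetic above can be checked with a few lines of awk. This is only a sketch: it assumes an annualized failure rate scales linearly with the length of the window, and the function name afr_share is made up for illustration.

              ```shell
              # Convert the rounded annualized failure rates (AFR) quoted above into the
              # percentage of drives failing in each window, then the first-quarter share.
              afr_share() {
                  awk 'BEGIN {
                      q1   = 3 * (3 / 12)   # 3% AFR over 3 months -> % failing in months 0-3
                      rest = 2 * (9 / 12)   # 2% AFR over 9 months -> % failing in months 3-12
                      printf "%.2f %.2f %.1f\n", q1, rest, 100 * q1 / (q1 + rest)
                  }'
              }
              afr_share   # prints: 0.75 1.50 33.3
              ```

              So 0.75 out of 2.25 percentage points of first-year failures land in the first quarter, which is where the 33% figure comes from.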

      • Re:Why? (Score:5, Interesting)

        by v1 (525388) on Tuesday April 03, 2012 @03:18PM (#39563753) Homepage Journal

        The point is to know whether it's faulty now, at the time of arrival, rather than 2 weeks down the line when it becomes a problem.

        I would disagree. I believe it's best to be able to identify the first moment a hard drive is starting to have problems, rather than the condition it's in when you get it.

        One reason is that most of your hard drives will eventually develop a problem, and only a small fraction of the drives you buy will arrive defective.

        Another reason is that nothing of value is on the new drive, you are risking only purchase price. A year from now, you may have important, possibly irreplaceable or at least inconvenient things to replace.

        I run a piece of custom software I wrote that does a slow "disk crawl", reading ~100 MB every 5 minutes. Over the course of a month it has read every block on the drive, and starts over. I get an email if an I/O error OR slow performance is encountered. I store a lot here; I have somewhere around 25TB of storage under the roof at home. Over the years I've been notified ~8 times of a failing drive. In all cases I was able to replace it before it became inaccessible. One of them failed to spin up ever again the day after I removed it from service. I consider this a very good system, and am surprised not to see a similar commercial offering. (it's a 5,600 line bash script!)

        SMART is only useful to possibly confirm that a drive has a problem. Only a fool relies on it to notify them when there's a problem. I've probably replaced somewhere around 750 hard drives here at work, and of those, under a dozen were still accessible and displaying a SMART failure. Many times I've had SMART toggle to failed while I was doing data recovery to a replacement drive, as I was fighting my way through I/O errors. Got some Cpt Obvious going on there I think.
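
        For anyone who wants to try the disk crawl idea, here is a rough sketch of a single crawl step. It is not the script described above: the function name crawl_chunk, its arguments, and the whole-second timing are all assumptions made for illustration, and it assumes GNU dd.

        ```shell
        # Hypothetical sketch of one "crawl" step: read count_mb MiB starting at
        # offset_mb MiB and report an I/O error or a read slower than max_secs.
        # Only ever reads from the device, never writes to it.
        crawl_chunk() {
            dev=$1 offset_mb=$2 count_mb=$3 max_secs=$4
            start=$(date +%s)
            if ! dd if="$dev" of=/dev/null bs=1M skip="$offset_mb" count="$count_mb" 2>/dev/null; then
                echo "IOERROR $dev at ${offset_mb}MiB"
                return 2
            fi
            elapsed=$(( $(date +%s) - start ))
            if [ "$elapsed" -gt "$max_secs" ]; then
                echo "SLOW $dev at ${offset_mb}MiB (${elapsed}s)"
                return 1
            fi
            echo "OK $dev at ${offset_mb}MiB"
        }
        ```

        A driver loop would call this with an advancing offset and a `sleep 300` between calls, and send mail on any line that is not OK; that part is left out because it depends on the local setup. On a real block device, GNU dd's `iflag=direct` can be added so the reads bypass the page cache.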

    • by NIN1385 (760712)
      Yeah, best advice for any data storage:

      BACKUP, then back that up... then back that up... then back that up offsite.
    • Re:Why? (Score:5, Insightful)

      by gregmac (629064) on Tuesday April 03, 2012 @01:44PM (#39562379) Homepage

      Even if your storage passes the test, it could fail the next day. What you should be doing is designing your storage to gracefully handle failure, like RAID 5 with spares.

      And then what you should test is that it actually notifies you when something does fail, so you know about it and can fix it. You can also test how long it takes to rebuild the array after replacing a disk, and how much performance degradation there is while that is happening.

    • Re:Why? (Score:4, Insightful)

      by windcask (1795642) on Tuesday April 03, 2012 @02:19PM (#39562817) Homepage Journal

      Even if your storage passes the test, it could fail the next day. What you should be doing is designing your storage to gracefully handle failure, like RAID 5 with spares.

      All RAID 5 does is move the single point of failure from the disk itself to the RAID controller, which could also fail at any time. This is why a truly effective solution is virtual machine redundancy with seamless failover and a rigorous backup schedule.

      • by DarkOx (621550)

        That is not the only solution. There are plenty of multi-path, Multi-controller SAN solutions out there. You can install more than one HBA in your hosts.

        However! Once you are talking about controllers or HBAs as failure points, you need to be rethinking your architecture. Disk failures are pretty common, but it's very unlikely that silicon is going to up and die on you. If you can't tolerate those rare events, you really need to be looking at some cluster / application layer redundancy.

        You will never eliminate all

      • by Lumpy (12016)

        It's more fun than that.

        Two storage vaults with 12 Hard drives each.
        Running a nice Raid 5 or raid 6 or even a raid 50 or 60.

        Knock loose ONE of those USCSI connectors going to a drive cage.

        RAID is toast. I don't care WHAT RAID level you are running, none of them can withstand a loss of 50% of the drives.

        RAID is NOT a backup. It's high availability, mitigating the part with the highest failure rate: the hard drives.

  • Hard Drive Testing (Score:2, Informative)

    by Anonymous Coward

    In previous jobs, I've used the system of:
    Full Format, Verify, Erase, then a Drive fitness test.
    If there are errors in media, the Format, verify or erase will pick it up, then the fitness test to check the hardware.
    Hitachi has a Drive Fitness test program
    I have also used hddllf (hddguru.com)

    • by sirsnork (530512)

      ALL the drive manufacturers have drive testing software, WD's is called Data Lifeguard.

      Download the software from the drive manufacturer, run extended test which will do a full disk scan for bad sectors. Then do their full write test.

      If it passes those, it's good to go... until it fails

  • It's a joke. I've seen drives work fine for years with it showing imminent drive failure and I've seen drives die instantly with no warning given whatsoever.

    There is no perfect tool that I can name; each drive manufacturer makes their own, and there are numerous third party tools out there as well. My best advice is to have them all and have them handy. One I use quite a bit is HDD Regenerator, a pretty thorough utility, but it takes some time to run.
    • by NIN1385 (760712)
      Edit: HDD Regen is more aimed at repair just fyi.
    • Re:S.M.A.R.T. (Score:5, Informative)

      by DigiShaman (671371) on Tuesday April 03, 2012 @01:49PM (#39562445) Homepage

      S.M.A.R.T is a joke, but not in implementation. It's a joke because most HDD failures occur on the logic board. It's a known fix in data recovery services to simply swap out the PCB for another of the same vintage make/model/firmware rev. Though I have run tools such as HD Tune to view out-of-spec metrics and benchmarks. For example, I once had a user who reported that her workstation was running extremely slowly. I suspected the drive was at fault and the graphs proved it, but technically it wasn't a failure. S.M.A.R.T would have flagged it if it was mechanical, but it wouldn't have if it was a controller issue. Now that may have changed with newer drives, but that's been my overall experience.

      • by NIN1385 (760712)
        True, although most IT departments and/or computer shops don't have a spare board for every make and model of HDD. We tried a few swaps at a shop I used to work at and it was successful I think one time, and that was pure luck that we had that exact PCB for that drive.
    • by ethan0 (746390)

      SMART is good for telling you when your drives do have problems that need addressing. it's not so great for giving you assurance that your drives do not have problems - consider a positive smart result to be more of an "I don't know" than a "good". you should generally assume your drives can fail at any time. I don't think there's any way to reliably predict the sudden death of a drive.

  • by rs79 (71822) <hostmaster@open-rsc.org> on Tuesday April 03, 2012 @01:37PM (#39562255) Homepage

    don't use consumer drives if you're concerned.

    see also http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/archive/disk_failures.pdf [googleusercontent.com]

    The Goog wrote a nice paper on hard drives.

    • Re:scsi (Score:4, Insightful)

      by Galactic Dominator (944134) on Tuesday April 03, 2012 @02:03PM (#39562627)

      Perhaps an honest mistake, but the link is broken. Also, evidence has shown SATA drives to be more reliable than commercial/enterprise-grade drives. Only buy those if you don't like your money, or if there is some clear advantage. That supposed advantage is not reliability, unless some sort of rapid replacement mechanism comes with the drive. Although replacement isn't reliability in my book.

      http://lwn.net/Articles/237924/ [lwn.net]

      • by CastrTroy (595695)
        I wonder if this is at all due to the higher RPM of commercial/enterprise drives. Consumer drives usually top out at 7200 RPM, while I've seen enterprise drives that go as high as 15000 RPM. That difference is probably enough to account for a significant reduction in drive life. If the extra spin speed is really necessary, go for the faster drives, but in many cases the faster speed won't get you much except decreased lifespan.
  • by jader3rd (2222716) on Tuesday April 03, 2012 @01:38PM (#39562299)
    Jet Stress [microsoft.com] does a good job of running the storage media through a lot of work.
  • The usual (Score:5, Informative)

    by macemoneta (154740) on Tuesday April 03, 2012 @01:39PM (#39562309) Homepage

    All I usually do is:

    1. smartctl -AH
    Get an initial baseline report.

    2. mke2fs -c -c
    Perform a read/write test on the drive.

    3. smartctl -AH
    Get a final report to compare to the initial report.

    If the drive remains healthy, and error counters aren't incrementing between the smartctl reports, it's good to go.
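
    The before/after comparison in steps 1 and 3 can be sketched as below, assuming both reports were saved to files and use the usual ATA attribute table (raw value in column 10). The function name smart_diff and the file names are made up for illustration.

    ```shell
    # Hypothetical helper: given two saved `smartctl -A` reports, list every
    # attribute whose raw value (column 10 of the usual ATA table) changed.
    smart_diff() {
        awk 'NR == FNR { if ($1 ~ /^[0-9]+$/) before[$2] = $10; next }
             $1 ~ /^[0-9]+$/ && ($2 in before) && before[$2] != $10 {
                 print $2 ": " before[$2] " -> " $10
             }' "$1" "$2"
    }

    # Usage (device name is an assumption; needs root):
    #   smartctl -AH /dev/sdX > before.txt
    #   ... run the read/write test ...
    #   smartctl -AH /dev/sdX > after.txt
    #   smart_diff before.txt after.txt
    ```

    Some attributes, such as power-on hours and temperature, are expected to change during a long test; it is the error and reallocation counters that should stay put.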

    • I recently ran mke2fs -c -c on a 2TB USB3 drive. It took 48 hours to complete. As drives get bigger, this is going to take even longer.
      • A 2-day burn-in is minimal for new hardware. Also, as drives get bigger, they also get faster: IDE, SATA1, SATA2, SATA3, Thunderbolt, ...

  • This is what I use (Score:4, Interesting)

    by Wolfrider (856) <kingneutronNO@SPAMyahoo.com> on Tuesday April 03, 2012 @01:43PM (#39562363) Homepage Journal

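    root ~/bin # cat scandisk
    #!/bin/bash

    # RW scan of HD. Usage: scandisk sdX (device name without /dev/)
    [ -n "$1" ] || { echo "usage: $0 <device, e.g. sdb>" >&2; exit 1; }
    argg="/dev/$1"

    # if IDE (old kernels)
    hdparm -c1 -d1 -u1 "$argg"

    # Speedup I/O - also good for USB disks
    blockdev --setra 16384 "$argg"
    blockdev --getra "$argg"

    # Non-destructive read-write scan; -c is the number of blocks tested at a time
    #time badblocks -f -c 20480 -n -s -v "$argg"
    #time badblocks -f -c 16384 -n -s -v "$argg"
    time badblocks -f -c 10240 -n -s -v "$argg"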

    ---------

    Note that this reads existing content on the drive, writes a randomized pattern, reads it back, and writes the original content back. With modern high-capacity over-500GB drives, you should plan on leaving this running overnight. You can do this from pretty much any linux livecd, AFAIK. If running your own distro, you can monitor the disk I/O with ' iostat -k 5 '.

    From ' man badblocks '
    -n Use non-destructive read-write mode. By default only a non-destructive read-only test is done. This option must not be combined with the -w option, as they are mutually exclusive.

  • There are many free tools for doing surface scans of a hard drive that test for bad sectors. Usually it's bad sectors that cause hard drives to fail, and that's all you can really test anyway. Hard drives will fail, it's just a matter of time; that's why you need redundancy. Other than testing, keeping the system cool and dust-free is all you can do.
  • Hard Disk Sentinel (Score:3, Insightful)

    by prestonmichaelh (773400) on Tuesday April 03, 2012 @01:46PM (#39562413)
    Hard Disk Sentinel: http://www.hdsentinel.com/ [hdsentinel.com] is a great tool. They even have a free Linux client. What it adds over plain SMART is that it takes the SMART data and weights the attributes according to how strongly they indicate failure, then gives you a score of 0-100 (100 being great, 0 being dead) for how healthy the drive is. We use this extensively and have created Nagios scripts that monitor the output. If a drive has a score of 65 or higher, I will generally continue using it (pretty much all my setups are RAID 10 or RAID 6). If the score starts dropping rapidly (a few points every day, even if it started high) or gets below 65 or so, I go ahead and replace it. It has helped out a bunch.

    Even with that, using the SMART data in a smart way still only predicts about 30% of failures. The other 70% will come out of nowhere. That is why it is best to assume all drives are suspect and can die at any time, and never to allow a single drive to hold the sole copy of anything.
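
    The replacement rule above could be wired into a Nagios-style check roughly like this. It is a sketch, not our actual script: the function name is made up, the scores are passed in as plain numbers however you obtain them, and the 3-point daily drop is an assumed stand-in for "a few points every day". The exit codes follow the usual plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

    ```shell
    # Hypothetical Nagios-style check built on the rule of thumb above.
    # Arguments: today's health score and yesterday's score (both 0-100).
    # The 65 floor comes from the post; the 3-point drop is an assumption.
    check_health_score() {
        score=$1 previous=$2
        if [ "$score" -lt 65 ]; then
            echo "CRITICAL - health score $score is below 65"
            return 2
        fi
        if [ $(( previous - score )) -ge 3 ]; then
            echo "WARNING - health score fell from $previous to $score"
            return 1
        fi
        echo "OK - health score $score"
        return 0
    }
    ```

    For example, `check_health_score 80 90` reports a warning even though 80 is well above the floor, since a fast drop is treated as a reason to look at the drive.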
  • When it comes to media, even with SMART your drives will work 'till they die, and there's no way to predict that with a test.

    Given that, your best option is to ensure that the drives are performing as expected. I've found many a faulty drive with IOZONE.

    http://www.iozone.org/ [iozone.org]

  • old timers look here (Score:2, Interesting)

    by vlm (69642)

    OK so that was the noob version of the question.

    I have a question for the old timers. has anyone ever implemented something like:
    1) log the time and temp
    2) do a run of bonnie++ or a huge dd command
    3) log the time and temp
    4) Repeat above about ten times
    5) numerical differentiation of time and temp and also any "overtemps"

    In theory run from a cold or lukewarm start that could detect a drive drawing "too much" current or otherwise being F'd up, or cooling fan malfunction
    I'm specifically looking for rate of te
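
Step 5 of the procedure above (numerical differentiation of the logged time and temperature, plus flagging "overtemps") might look like the following. This is only a sketch under assumptions: the function name is made up, the samples are whatever (seconds, degrees C) pairs you logged around each bonnie++ or dd run, and the 45 C limit is an arbitrary default rather than anything the poster specified.

```python
def temp_rates(samples, max_temp_c=45.0):
    """Differentiate a temperature log and collect over-temperature samples.

    samples: list of (seconds, temp_c) tuples in time order.
    Returns (rates, overtemps): rates in degrees C per minute between
    consecutive samples, and the samples at or above max_temp_c.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        # Forward difference, scaled from per-second to per-minute.
        rates.append((c1 - c0) / dt * 60.0)
    overtemps = [(t, c) for t, c in samples if c >= max_temp_c]
    return rates, overtemps
```

Comparing the rates of a new drive against known-good drives of the same model in the same chassis would be one way to spot the outlier; what counts as "too fast" is something you would have to calibrate yourself.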

    • by na1led (1030470)
      Seems like a big waste of time to me.
  • by HockeyPuck (141947) on Tuesday April 03, 2012 @01:51PM (#39562473)

    I manage a team that oversees PB of disk, both within an enterprise array and internal to the server. For testing the arrays, since there's GB of cache in front of the disks, I can only rely on the vendor to do the appropriate post installation testing to make sure there are no DOA disks. For internal disks, as others have mentioned you could run IOMeter for days without a problem and then the very next day it's dead. Unlike memory, disks have moving parts that can fail much easier than chips. However, with proper precautions like RAID, single disk failures can be avoided.

    The bigger problem is having a double disk failure. This is due to the amount of time required to rebuild the failed disk. Back when disks were 100GB this was a "relatively" quick process. However, in some of my arrays with 3TB drives in them, it can take much longer to replace the drive. Even to the point whereby having hotspares has been considered to be not worth it as my array vendor will have a new disk in the array within 4hrs. With what an enterprise disk costs from the array vendor (not Frys), it can start to add up.
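
To put rough numbers on the rebuild window mentioned above, here is a back-of-the-envelope sketch. The 100 MB/s sustained rebuild rate is purely an illustrative assumption (it is not from the post), and the result is a lower bound: it only counts writing every byte of the replacement disk once, and a loaded array will typically rebuild slower than that.

```python
def rebuild_hours(capacity_bytes, rebuild_bytes_per_sec):
    """Lower bound on rebuild time: every byte of the new disk is written once."""
    return capacity_bytes / rebuild_bytes_per_sec / 3600.0

RATE = 100e6  # assumed sustained rebuild rate, 100 MB/s; not from the post

for label, size in (("100 GB", 100e9), ("3 TB", 3e12)):
    print(f"{label}: {rebuild_hours(size, RATE):.1f} h")
```

At that assumed rate the 100 GB disk rebuilds in well under an hour while the 3 TB disk needs more than eight, and the whole time the array is exposed to a second failure.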

    • by na1led (1030470)
      It depends on how critical your data is. There are many different types of Raids, like Raid 6, 10, or 50, and if you have a good storage unit with redundant controllers and a couple of hot spare, you should be all set. The chances of a total failure of multiple drives/controllers is very slim, and that's what nightly backups are for anyway. We use a Dell PowerVault MD 3220 - Dual Controllers, Dual Power Supplies, Raid 50 with 2 hot spares. Chances of losing data from Hard Drive failure is like winning the
    • by CAIMLAS (41445)

      If you've got 3TB drives in use in RAID, you're a fool for not running double parity. Like you said, the time required is just too long; you need to be able to survive a 2-disk failure.

  • by Mondragon (3537) on Tuesday April 03, 2012 @01:53PM (#39562493) Homepage

    Not completely related to how to test, but...

    In 2007 Google reported that for a sample of 100k drives, only 60% of their drives with failures had ever encountered any SMART errors. Also, NetApp has reported a significant number of drives with temporary failures, such that they can be placed back into a pool after being taken offline for a period of time and wiped. Google also had a lot of other interesting things to say (such as that heat has no noticeable effect on hard drive life under 45C, that load is unrelated to failure rates, and that if a drive doesn't fail after 3 months, it's very unlikely to fail until the 2-3 year timeframe).

    You can find the Google paper here: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf [googleusercontent.com]

    A few other notes that you can find from storage vendor tech notes if you own their arrays:
      * Enterprise-level SAS drives aren't any more reliable than consumer SATA drives
        - But they do have considerably different firmwares that assume they will be placed in an array, and thus have a completely different self-healing scheme than consumer-level drives (generally resulting in higher performance in failure scenarios)
      * RAID 5 is a really bad idea - correlated failures are much more likely than the math would indicate, especially with the rebuild times involved with today's huge drives
      * You have a lot more filesystem options that might not even make sense to use with a RAID system, like ZFS, as well as other mechanisms for distributing your data at a layer higher than the filesystem

    Ultimately the reality is that regardless of the testing you put them under, hard drives will fail, and you need to design your production system around this fact. You *should* burn them in with constant read/write cycles for a couple days in order to identify those drives which are essentially DOA, but you shouldn't assume any drive that passes that process won't die tomorrow.

  • I mirror data and test it periodically with rsync using the dry-run (-n) and checksum options (-c) to do a full comparison. I usually have more confidence in a new disk after I've done this a few times.
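
The same idea as the rsync dry-run with checksums above (read every byte on both sides and report what differs, without changing anything) can be sketched in plain Python. This is a stand-in, not a reimplementation of rsync: the function names are made up, SHA-256 is an arbitrary choice of checksum, and the comparison is one-way, so files that exist only on the mirror are not reported.

```python
import hashlib
import os

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def compare_trees(src, dst):
    """Return relative paths under src that are missing or differ under dst."""
    mismatches = []
    for root, _dirs, files in os.walk(src):
        for name in files:
            src_path = os.path.join(root, name)
            rel = os.path.relpath(src_path, src)
            dst_path = os.path.join(dst, rel)
            # A missing file or a different digest both count as a mismatch.
            if not os.path.isfile(dst_path) or file_digest(src_path) != file_digest(dst_path):
                mismatches.append(rel)
    return sorted(mismatches)
```

An empty result after a full pass means every source file was read back intact from the mirror, which is the confidence-building part; anything listed is worth a closer look at both disks.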

  • I have a favorite boulder that has served my burn-in testing needs pretty well. Would you like a photo so you can chisel your own? I added some LED bling to mine.

  • I mean, really, someone working at slashdot doesn't know this? This is about as basic a question as it gets when it comes to hardware.

    Raid 5, a solid backup scheme, and a storage closet full of replacement drives. There is no good way to test HDDs.

    Due entirely to the fact that they are a WEAR item it is only possible to decide which brand you trust the most.

    Other than that, if it's a big job and a lot of HDDs are going to be bought you could take 10 of each candidate drive and run them through spinrite till

  • The storage team where I work use a program called IOmeter. It runs on Windows and Linux (and I think other platforms as well) and is open source. They were using it for stress testing SAN storage but I think it would work for locally attached storage as well.
  • by Jumperalex (185007) on Tuesday April 03, 2012 @02:02PM (#39562617)

    http://lime-technology.com/forum/index.php?topic=2817.0 [lime-technology.com] ... the main feature of the script is
    1. gets a SMART report
    2. pre-reads the entire disk
    3. writes zeros to the entire disk
    4. sets the special signature recognized by unRAID
    5. verifies the signature
    6. post-reads the entire disk
    7. optionally repeats the process for additional cycles (if you specified the "-c NN" option, where NN = a number from 1 to 20, default is to run 1 cycle)
    8. gets a final SMART report
    9. compares the SMART reports alerting you of differences.

    Check it out. Its "original" purpose was to set the drive to all "0's" for easy insertion into a parity array (read: parity drive does not need to be updated if the new drive is all zeros) but it has also shown great utility as a stress test / burn-in tool to detect infant mortality and "force the issue" as far as satisfying the criteria needed for an RMA (read: sufficient reallocated block count)

    If your skill level is enough to adapt the script to your own environment then great, otherwise UnRaid Basic is free and allows 3 drives in the array, which should allow you to simultaneously pre-clear three drives. You might even be able to pre-clear more than that (up to available hardware slots) since you aren't technically dealing with the array at that point, but with enumerated hardware that the script has access to, which should be everything on the disk. Hardware requirements are minimal and it runs from flash.
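
The last step of the list above (compare the SMART reports from before and after the run and flag differences) is easy to sketch once the reports have been parsed. This is not the preclear script's actual code: the function name is hypothetical, the dictionaries stand for raw attribute values you have already extracted from the two reports (the parsing is not shown), and the watched attribute names are the ones smartctl commonly prints, which can vary by drive.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

def smart_diff(before, after, watched=WATCHED):
    """Return {attribute: (before, after)} for watched raw values that changed.

    before, after: dicts mapping attribute name to raw integer value.
    Attributes missing from a report are treated as 0 (an assumption).
    """
    changes = {}
    for name in watched:
        old, new = before.get(name, 0), after.get(name, 0)
        if old != new:
            changes[name] = (old, new)
    return changes
```

For example, `smart_diff({"Reallocated_Sector_Ct": 0}, {"Reallocated_Sector_Ct": 8})` reports that attribute as changed from 0 to 8, which is exactly the kind of movement during burn-in that would make you hold a drive back.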

  • A good Storage Unit will do a good job at maintaining the Hard Drives you purchased and keeping them safe. They can also handle problems with drives to prevent data loss, and notify you when a drive is about to fail. A good SAN or DAS is what I would purchase. We purchased a Dell Powervault DAS, and have been very happy with it, I never worry about Hard Drives failing because I know the DAS will take care of it. Some companies like Dell and EMC will know if you have a bad hard drive, and ship you a new one
  • we use HP servers and HP ships a suite of software to install on the server along with the OS. they monitor the hardware and warn you of any problems. unless you like doing things the hard way, this was solved years and years ago

    i have a bad hard drive i call HP, send them a log file and in 2 hours i have a new one delivered

  • Testing hardware by exercising it is like testing matches to see if they are good.
  • http://www.heise.de/download/h2testw.html [heise.de] - switchable to English of course.
    While it is primarily advertised for flash media these days (and indispensable since there have been numerous forgeries or DOAs at least on the European market lately), it evolved as an HDD tester in the first place.

    On Linux in particular, a combination of dd and smartctl (before&after writing the entire disk, as well as for self-tests) may come in handy too, of course.
  • It takes a while, but if you really want to be sure of your hardware (as sure as you can be, at least.)

    Check the SMART status. If there are any re-allocated sectors, make note of the number.

    Run badblocks with the -w switch against the drive (from a Linux live cd of your choice, for example)

    That should completely read/write test the drive 4 times with multiple patterns. There should be no errors reported. This test will take longer than overnight on modern drives.

    Check the SMART data again. Be wary if th

  • Ears (Score:4, Informative)

    by Maximum Prophet (716608) on Tuesday April 03, 2012 @02:37PM (#39563075)
    Most everything above is good, but don't overlook the obvious. Spin the drive up in a quiet room and listen to it. If it sounds different from all the other drives like it, there's a good chance something is wrong.

    I replaced the drive in my TiVo. The 1st replacement was so much louder, I swapped the original back, then put the new drive in a test rig. It started getting bad sectors in a few days. RMA'd it to Seagate, and the new one was much quieter.
  • Send your drives to me, postage paid. I'll test them for you for no charge, and send them back to you before the warranty expires.
  • Or any tried and tested methods for testing storage media?
    #2 pencil, Scantron sheet and the test.

    Works for me every time.

  • badblocks (Score:5, Interesting)

    by Janek Kozicki (722688) on Tuesday April 03, 2012 @02:56PM (#39563371) Journal

    badblocks -c 10240 -s -w -t random -v /dev/sda1

    that's my standard test for all HDDs
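
Note that -w is the destructive write-mode test, so it wipes whatever is on the target. For anyone curious what a write-then-verify pass with a random pattern boils down to, here is a toy version that works on a scratch file instead of a raw device. It is a sketch of the idea only, not how badblocks is implemented: the function names are made up, the pattern comes from Python's seeded PRNG, and the read-back may well be served from the OS cache rather than the platters.

```python
import os
import random

def _block(rng, n):
    """n pseudo-random bytes from rng (reproducible for a given seed)."""
    return rng.getrandbits(n * 8).to_bytes(n, "little")

def pattern_test(path, total_bytes, block_size=1 << 20, seed=0):
    """Write seeded random blocks to path, read them back, return bad block indexes."""
    rng = random.Random(seed)
    with open(path, "wb") as f:
        remaining = total_bytes
        while remaining > 0:
            n = min(block_size, remaining)
            f.write(_block(rng, n))
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # push the data out of Python and the OS buffers
    # Re-seed so the verify pass regenerates exactly the same pattern.
    rng = random.Random(seed)
    bad = []
    with open(path, "rb") as f:
        index, remaining = 0, total_bytes
        while remaining > 0:
            n = min(block_size, remaining)
            if f.read(n) != _block(rng, n):
                bad.append(index)
            index += 1
            remaining -= n
    return bad
```

An empty list means every block read back as written. For real burn-in, stick with badblocks against the device itself and check the SMART counters afterwards.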

  • while label_visible(media):
            apply_lighter_fluid(media)
            ignite(media)
            wait_while_burning(media)
    return media_is_bad

  • My suggestions (Score:5, Informative)

    by Liquid-Gecka (319494) on Tuesday April 03, 2012 @03:53PM (#39564219)

    Speaking as somebody that has done hardware qualifications and burn-in development at very large scale for companies you have heard of, let me tell you the tools I use:

    fio: The _BEST_ tool for raw drive performance and burn-in testing. A couple of hours of random access will ensure the drive head can dance, then a full block-by-block walk-through with checksum verification will ensure that all blocks are readable and writable. I usually do 2 or 3 passes here. You can tell fio to reject drives that do not perform to a minimum standard, which is very useful for finding functional yet not quite up-to-speed drives. The statistics produced here are awesome as well: something like 70 stats per device per test.
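
The "reject drives that do not perform to a minimum standard" idea can also be applied after the fact to whatever throughput numbers your benchmark produced. The sketch below does not use fio's own options or output format: the function name, the dictionary of per-device MB/s figures, and the rule "slower than 80% of the batch median" are all assumptions chosen for illustration.

```python
from statistics import median

def slow_drives(throughput_mb_s, min_fraction=0.8):
    """Return device names whose throughput is below min_fraction of the batch median.

    throughput_mb_s: dict mapping device name to measured MB/s
    (hypothetical numbers collected from your own benchmark runs).
    """
    if not throughput_mb_s:
        return []
    cutoff = median(throughput_mb_s.values()) * min_fraction
    return sorted(dev for dev, mb_s in throughput_mb_s.items() if mb_s < cutoff)
```

Comparing against the median of identical drives in the same batch avoids hard-coding a spec-sheet number, but it only works when the batch is mostly healthy; a fixed minimum is the safer rule for small batches.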

    stressapptest: This is Google's burn-in tool and virtually the only one I have ever found that supports NUMA on modern dual-socket machines. This is IMPORTANT, as it's easy to ignore issues that come up with the link between the CPUs. The various testing modes give you the ability to tear the machine to pieces, which is awesome. Stressapptest is also the most power-hungry test I have ever seen, including the Intel power testing suite that you have to jump through hoops to get.

    Pair this with a pass of memtest and you get a really, really nice burn-in system that can brutalize the hardware and give you scriptable systems for detecting failure.

  • by NoobixCube (1133473) on Tuesday April 03, 2012 @04:34PM (#39564745) Journal

    It may not be sophisticated, but MHDD is what I use at work (among a couple of other tools). Other tools are more reliable in different circumstances, but my first stop is always MHDD, because it will give me a comprehensive R/W delay test on a disk. Extremely practical for a workshop, perhaps not practical for a data centre.
