Calculating the Mean Time Between Failures? 100
Blue Booger asks: "I was looking over some Fibre Channel hard drives and noticed that the Mean Time Between Failures was rated at 1.2 million hours. I thought that was pretty high, and figured it up to be close to 137 YEARS!! I went to check some regular IDE drives just for comparison, and they were rated at 500,000 hours (57 years). Now, as I understand it, this is supposed to be the average time that you can expect the drive to last before failing. I rarely have an IDE drive last more than 4 years, and my record is 10 years, so what is the deal? BTW, that is 57 years running 24 hours a day...the MTBF is rated as power-on time. Here you can find Western Digital's glossary that defines the term MTBF (pdf). Here you can find a spec sheet on one of their 20GB IDE drives. I checked, and Seagate also lists similar MTBFs. How the heck are they coming up with these numbers?"
Duty Cycle (Score:4, Informative)
Re:Duty Cycle (Score:2)
Re:Duty Cycle (Score:3, Informative)
Re:Duty Cycle (Score:1)
Re:Duty Cycle (Score:2)
Re:Duty Cycle (Score:3, Informative)
No, I'm not kidding. Some heads rolled over that...
Do it like the pros. (Score:1, Funny)
Like, my hard drive has an MTBF of 300,000 hours.
Re:Do it like the pros. (Score:2)
not just drives... (Score:4, Interesting)
Something else is going on . . . (Score:1)
Either you just happened to get a bad batch, or you've got environmental problems. Make sure there is sufficient air flow around the units, and check the power harmonics on the circuit. Most consumer-grade (read: cheap) electronics use crappy power supplies, which cause harmonics on the power line. One or two isn't a big
Simple, it's called "lies" (Score:5, Funny)
You are wrong (Score:4, Informative)
If you have twenty drives with a twenty-year MTBF (Mean Time Between Failures) each, then you have one failure per year on average. That's basic statistics always fighting against you.
As a sidenote (Score:1)
Re:As a sidenote (Score:5, Funny)
Shouldn't that be "The views expressed hereafter are not necessarily those of MENSA, of which I am only a member." I would think proper grammar usage would be a prerequisite for being a MENSA member.
Re:As a sidenote (Score:2)
What of the pomposity of foisting membership in a group of self-righteous egotists on a public forum? Over-compensation, anyone?
The only problem with her scheme is that Mensa membership is easy to come by. Some people just find it an undesirable association. But if she feels she has to emphasize her membership because perhaps her posts can't stand on their own me
Re:As a sidenote (Score:1)
Re:As a sidenote (Score:2)
Maybe it was just that particular chapter but I got the distinct impression that they were a bunch of egotistical losers!
Re:As a sidenote (Score:4, Funny)
<input type="radio" name="gift" value="money" disabled>
<input type="radio" name="gift" value="penis size" disabled>
<input type="radio" name="gift" value="ability to nitpick trivia" checked>
Sidestepping sidenotes (Score:1)
As for the language bit, a lot of really intelligent people are totally dyslexic when it comes to grammar and spelling, since it makes no sense anyway IMHO
No, you are wrong (Score:4, Insightful)
Re:No, you are wrong (Score:1)
Is there some special definition for MTBF that changes how "mean time between" is interpreted?
Re:No, you are wrong (Score:1)
Re:No, you are wrong (Score:2, Informative)
I would have called your average "mean time to failure" vs. "mean time between failures."
But as other posters mentioned, most of these stats are marketing bunk no matter how they're computed!
Re:No, you are wrong (Score:2)
Re:No, you are wrong (Score:1)
Re:No, you are wrong (Score:2)
Re:No, you are wrong (Score:1)
I would have called your average "mean time to failure" vs. "mean time between failures."
Or better yet, "mean time before failure," thus preserving the acronym.
Re:No, you are wrong (Score:1)
Re:No, you are wrong (Score:1)
If by some special definition you mean a simple linear multiplication (or division, depending on your point of view), then yes. Anthony Dipierro was probably mistakenly thinking of a decibel or some other logarithmically scaled unit system.
Yes, I am right. (Score:1)
If I have twenty drives, each of which is estimated to fail once in a twenty-year period, then in such a twenty-year period every one of those disks is estimated to fail once. That is twenty failures in a twenty-year period on average, id est one failure per year. It is actually a matter of very simple mathematics.
No, you're wrong... (Score:1)
If I have twenty drives, each of which is estimated to fail once in a twenty-year period, then in such a twenty-year period every one of those disks is estimated to fail once.
No, you're wrong. The average time to failure is 20 years, not the maximum.
The average of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, and 20 is 10.5.
Re:No, you're wrong... (Score:2)
That might be valid when you work in a datacenter that replaces x number of drives every year.
On average, if the MTBF of those drives is 20 years, one drive in such a group will fail every year.
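A quick simulation backs this up (a sketch in Python, assuming exponentially distributed lifetimes and that each failed drive is immediately replaced; the function name and parameters are mine):

```python
import random

def failures_per_year(n_drives=20, mtbf_years=20.0, years=10_000, seed=1):
    """Simulate a pool of drives where each failure is immediately replaced.

    With exponential lifetimes, each drive fails at rate 1/MTBF, so a pool
    of 20 drives with a 20-year MTBF sees about 20/20 = 1 failure per year.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_drives):
        t = 0.0
        while t < years:
            t += rng.expovariate(1.0 / mtbf_years)  # lifetime of this unit
            if t < years:
                failures += 1  # swap in a replacement and keep going
    return failures / years

print(round(failures_per_year(), 2))  # close to 1.0
```

Over a long enough horizon the pool-wide failure rate converges on one per year, exactly as claimed.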
Re:Yes, I am right. (Score:2, Informative)
If they all failed within 20 years, how would the average disk have failed in 20 years???
20 drives/2
mod parent up - he's right (Score:2)
A quick thought experiment should make it obvious that for a series of numbers (e.g. the number of years between failures) where the highest number is n, the mean of those numbers could never be n (unless every number is n).
Re:mod parent up - he's right (Score:2)
Re:mod parent up - he's right (Score:1)
Hmm... Actually I just thought of something. If you have 20 hard drives, and one fails each year, but you fix it immediately after failure and then put it into service again, then the MTBF would be 20 years.
I guess that's why the term is the mean time between failures rather than the mean time before failure. Back when the term was invented it probably made economic sense to fix a hard drive when it broke. Nowadays we're more likely to just throw it away.
Re:mod parent up - he's right (Score:2)
Re:mod parent up - he's right (Score:1)
Yeah, but the total operating time for each drive is 20 years if you fix them and they last. What I'm saying is, one drive breaks after 1 year, you fix it and it lasts the next 19, one drive breaks after 2 years, you fix it and it lasts the next 18, etc.
Alternatively, you could have 19 of the drives last 20 years, and the other one break once a year.
Actually, you wouldn't even have to fix them, if you replace them. If you keep 20 drives for 20 years, and one breaks every year, if you replace the one th
Re:mod parent up - he's right (Score:2)
Re:mod parent up - he's right (Score:1)
if it fails after a year, and you fix it and it lasts another 19 years, MTBF is (1+19)/2 = 10. It's failed twice on you.
No, I never said it failed a second time. I said it failed after a year, then it lasted another 19 years. Then you stopped the experiment.
Re:mod parent up - he's right (Score:2)
I think you are right in this case, even with that consideration.
Since every drive failed once within a 20-year period, then we have all of our data points on the lower half of the normal curve (with the mean of 20 being in the middle), and none above. Thus, the mean for this part of the experiment would actually be around 10 years. However, if we were to continue the experiment out to 40 years or longer, and gain more samples, then it might balance out, but those repaired drives would have to be
Re:mod parent up - he's right (Score:1)
Re:mod parent up - he's right (Score:2)
Re:mod parent up - he's right (Score:1)
That's the difference between "statistical inference" and what people who have had a probability class tend to do.
I don't know if it matters as much whether or not the person has had a probability class as whether or not the person is trying to legally boost MTBF ratings.
Sure, you can get better figures using different methods, but those better figures most likely will be lower, so why bother if you want to sell your product.
Re:mod parent up - he's right (Score:2)
Re:mod parent up - he's right (Score:3, Informative)
Whatever your math may say, the Industry's standard is the one you disagree with. If they stick 1000 drives in an array, run them for 1 year, and only a single drive fails during that year, the MTBF for that model of drive is 500 years.
Re:mod parent up - he's right (Score:2)
The original post that started this thread (from mensa babe) stated "run 20 drives of MTBF 20 years, then one fails each year"; a poster replied to say they were incorrect, since it would mean the drives had an MTBF of 10.5 years. And I replied to say the correctee was right (which they are).
Your example is for 1000 drives, running for 1 year, 1 failure. Going by the same math which I used to back up the aforementioned correctee, i.e. MTBF = Sigma(runtime)/failures, which is
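For what it's worth, that Sigma(runtime)/failures formula is easy to sanity-check on both scenarios (a sketch; crediting each failed drive its full runtime is a simplifying assumption of mine):

```python
def mtbf(total_runtime_years, failures):
    """MTBF as total accumulated runtime divided by observed failures."""
    return total_runtime_years / failures

# The grandparent's scenario: 1000 drives run for 1 year, 1 failure
# (crediting the failed drive a full year of runtime for simplicity):
print(mtbf(1000 * 1, 1))   # 1000.0 years, not 500

# The scenario upthread: 20 drives run for 20 years, 20 failures total
# (one per year, each failed drive repaired so the pool stays at 20):
print(mtbf(20 * 20, 20))   # 20.0 years
```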
No, You are wrong (Score:1, Informative)
Re:You are wrong (Score:2)
Second, manufacturer-specified MTBF has nothing to do with a simple, observational mean. It is a calculated statistic based on a statistical sample and extrapolated over a normal distribution. In real-life terms, it is close to meaningless.
Basically, it does not mean "Mean Time Between observed Failur
Re:You are wrong (Score:2)
If you are assuming that a drive is repaired and placed back into service, and then fails again after another random 1-20 year period, then you would be correct.
I think the problem is that most of the geeks think in terms of practicality, and treat the situation as MTTF ("Mean-Time-To-Failure") instead of MTBF, since we all know that MTBF is BS anyway.
As a result, I clarify the statement above, myself assuming "simple MTTF", not "simple MTBF".
Here's a wild-ass guess (Score:5, Insightful)
For example, say they shipped 2 million drives last year and each ran 2,080 hours (8 hours a day, 5 days a week, 52 weeks), roughly 4 billion hours total. Out of those 2 million units, they got 3,466 returns. So the average MTBF was 1.2 million hours.
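Checking that arithmetic (a quick sketch; the shipment and return figures are the parent's hypothetical numbers, not anything from a manufacturer):

```python
drives = 2_000_000
hours_per_drive = 8 * 5 * 52     # 8 hours/day, 5 days/week, 52 weeks = 2080
returns = 3466

total_hours = drives * hours_per_drive   # 4,160,000,000 -- about 4 billion
mtbf_hours = total_hours / returns
print(f"{mtbf_hours:,.0f} hours")        # roughly 1.2 million
```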
Re:Here's a wild-ass guess (Score:4, Interesting)
Look at the definition (Score:4, Interesting)
Unfortunately, that equation doesn't take into account the fact that some equipment degrades over time; if a product is very reliable for 1,000 hours, and less reliable after that, just double the sample size (maybe triple for statistics), and see what you get.
Real reliability calculations are much more difficult than just what users think MTBF means...
Re:Look at the definition (Score:2)
MTBF might have a little bit of value as a relative measure -- i.e. perhaps drives with an order of magnitude higher MTBF will last longer. It's a lot less useful as an absolute measure.
Lots of equipment (not just hard drives) has some sort of estimated lifetime. The best the manufacturer can ever do is estimate under semi-realistic conditions and extrapolate.
Labs (Score:4, Insightful)
MTBF is probably determined by taking a bunch of drives and putting them into PERFECT conditions that NEVER exist in the real world. They run them in a way that, although it tests all functionality, really doesn't provide true conditions for drives (i.e. the head always reading/writing up and down the disk, probably never seeking; disks always spinning; etc.). Something that drives never do in real life. Statistics...statistics...statistics...(speeling too
Marketing BS... (Score:4, Insightful)
Somebody ought to sue them for deceptive advertising.
Ack!! The irony is too much. (Score:2)
float y = (float) 1/2 * x;
Bad Example (Score:2)
Enter The Matrix... (Score:3, Funny)
I prefer to measure time from the emergence of one integral anomaly to the next.
MTBF... (Score:4, Insightful)
MTBF calculation and estimation (Score:5, Informative)
Okay, first of all: "mean time between failures" is obviously a statistical measure -- it is an average over a large number of individual items. In most electronic components (including light bulbs!) the statistical distribution of the time between failures is the exponential distribution [wolfram.com], which has the odd property that it's "memory-less" -- it doesn't matter how long since the last failure it's been, the mean time to the next failure will still be the same. A consequence of this is that if the MTBF is 10,000 hours, the probability of failure in any particular hour would be 1/10,000th. So, if you set up 10,000 components, all running simultaneously, you'd expect one of them to fail within the first hour; conversely, if you ran them for 1000 hours, and 998 of them failed, you could be fairly certain that the MTBF would be around 10,000 hours.
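That memoryless property can be verified straight from the exponential survival function S(t) = e^(-t/MTBF) (a minimal sketch; the 10,000-hour figure is the one used above):

```python
import math

MTBF = 10_000.0  # hours

def survival(t):
    """P(component still alive at time t) for an exponential lifetime."""
    return math.exp(-t / MTBF)

# Memorylessness: P(T > s + t | T > s) == P(T > t), no matter how long
# the component has already survived (s drops out of the ratio).
s, t = 5_000.0, 1_000.0
conditional = survival(s + t) / survival(s)
print(conditional, survival(t))  # the two values agree
```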
Note, by the way, that this is only true when the failure time distribution is exponential -- so it works for electronic components, but not for, say, bicycles and cars and roller skates, which are more likely to fail the older they get.
This has an obvious problem, of course: if the MTBF is high, it can take forever to test. Consider, for example, something I worked on for NASA some years ago: trying to prove that a fly-by-wire system will have a mean time between failures of 1e10 hours. (This is about the same failure rate as the airframe, which is how they came up with the number.) 1e10 hours is about 1.141 million years, by the way.
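For the curious, the hours-to-years conversion checks out:

```python
HOURS_PER_YEAR = 24 * 365.25   # 8,766 hours in an average year
years = 1e10 / HOURS_PER_YEAR
print(f"{years / 1e6:.3f} million years")  # about 1.141
```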
(Pop quiz: if MTBF is a million years, how do you explain the occasional airframe failure, say, TWA 800? Hint: It doesn't require any foul play.)
At that point, you've got a couple of choices: first, you can make a lot of copies and run them simultaneously. Relatively easy for $50 disks, hard for billion dollar 747s.
Second, you can make the estimate by computation and modeling which is what you do for web systems. Conceptually, it's pretty simple to do this, although it can be a kind of pain in the ass.
The third way, which is new and cool, is by Bayesian estimation of failure rates. This method lets you make increasingly accurate estimates of the failure rate based on short experiments. I don't have time to go into it, but there are some good sources available on the web. [google.com]
Re:MTBF calculation and estimation (Score:3, Informative)
Re:MTBF calculation and estimation (Score:2)
Sadly, on some online geek test recently, I not only scored high, I scored #4 among everyone who had ever taken the test.
Then I couldn't decide whether to be embarrassed at being that much of a geek, or at only scoring fourth.
Re:MTBF calculation and estimation (Score:2)
Re:MTBF calculation and estimation (Score:2)
Re:MTBF calculation and estimation (Score:2)
One quibble I have (more with the HD manufacturers, not crmartin) is that HD have mechanical components (spindle, actuator, etc.) that are subject to wear. As a result, MTBF calculations that are appropriate for solid state electronic equipment not subject to physical wear are likely inappropriate for H
Re:MTBF calculation and estimation (Score:3, Informative)
Mechanical items tend to fail due to wearout, i.e. they become more likely to fail as time goes on.
Electronics follow a "Bathtub" curve. A high initial rate, that rapidly drops to a VERY low rate. It stays at that LOW rate of failure for MANY hours, and then the rate increases rapidly during "wearout" - sort of like the cross section of a bathtub - hence the name
The whole concept of "Burn In" -
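The three regimes of that bathtub curve are often modeled (my assumption here, not something the parent specified) with a Weibull hazard rate h(t) = (k/λ)(t/λ)^(k-1): shape k < 1 gives the falling infant-mortality regime, k = 1 the flat bottom of the tub, and k > 1 the rising wearout regime. A sketch, with arbitrary time units:

```python
def weibull_hazard(t, shape, scale=1.0):
    """Instantaneous failure rate h(t) for a Weibull-distributed lifetime."""
    return (shape / scale) * (t / scale) ** (shape - 1)

for k, phase in [(0.5, "infant mortality"), (1.0, "useful life"), (3.0, "wearout")]:
    early, late = weibull_hazard(0.1, k), weibull_hazard(2.0, k)
    trend = "falling" if late < early else ("flat" if late == early else "rising")
    print(f"k={k}: hazard is {trend} ({phase})")
```

This is also why "burn in" works for electronics: running past the infant-mortality region (k < 1) weeds out the units that would have failed early.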
Re:MTBF calculation and estimation (Score:2)
Re:MTBF calculation and estimation (Score:2)
Buy the product, use and abuse it during the original warranty period, and it'll break if it's going to.
Re:MTBF calculation and estimation (Score:2)
The point is this: the MTBF is computed for the Markovian part of the total life. The value for MTBF is computed as 1 / failure-rate IF AND ONLY IF the time distribution of failures is Markovian -- otherwise it's a more complicated function. The useful life is the length of time ov
failure. (Score:1)
Let me try (please reply if I am in the right direction):
1. This one-in-a-million-years figure does not count when the airplane is burning, or some other component has failed.
2. The airplane has lots of components. Suppose that if a door fails this could lead to failure of the airframe. If the plane has 3 doors the chance goes down to 4 failures in a million years. LOTS of comp
Re:failure. (Score:2)
By the way, your supposition about answer (1) is correct. It's just a definitional thing: we're really talking about proximate and root causes. You don't count
Why is hardware always mean? (Score:5, Funny)
Re:Why is hardware always mean? (Score:1)
The *MEAN* time isn't until after it fails.
It all depends on the distribution... (Score:3, Informative)
I went to check some regular IDE drives just for comparison, and they were rated at 500,000 hours (57 years). Now, as I understand it, this is supposed to be the average time that you can expect the drive to last before failures. I rarely have an IDE drive last more than 4 years, and my record is 10 years, so what is the deal?
Let's say I have a drive that has a 99% chance of failing after 10 years, and a 1% chance of failing after 4710 years. The MTBF is 57 years.
In fact, with the proper distribution (think 2^n) you could have an infinite MTBF, but still have a 99% chance of failure within 10 years. See for example the St. Petersburg paradox [stanford.edu].
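Both claims check out numerically (a sketch; the two-point lifetimes are from the parent post, and the 2^n distribution is just one way to get an infinite mean):

```python
# Two-point distribution from the parent: 99% fail at 10 years, 1% at 4710.
mtbf = 0.99 * 10 + 0.01 * 4710
print(round(mtbf))  # 57 years, despite a 99% chance of dying at year 10

# St. Petersburg-style lifetimes: fail at year 2**n with probability 2**-n.
# Each term contributes 2**n * 2**-n = 1, so the partial sums of the mean
# grow without bound: the MTBF is infinite.
partial_mean = sum((2 ** n) * (2 ** -n) for n in range(1, 51))
p_fail_by_8_years = sum(2 ** -n for n in range(1, 4))  # fail at year 2, 4, or 8
print(partial_mean, p_fail_by_8_years)  # 50 terms give 50.0; 0.875
```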
Re:It all depends on the distribution... (Score:2)
You've come to the right place. (Score:3, Funny)
Design Lifetime (Score:3, Interesting)
All too simple (Score:2)
This means that they hook a whole bunch of drives up and run them for a while. Add the total hours of drive operation up and divide by the number of failures. Their estimate of
MTBF = infinity (Score:1)
What MTBF Really means... (Score:1)
Where:
failures = actual number of drives RETURNED to the manufacturer
drives = total number of production drives built
hours = actual failure point (about 5 years)
Common misconceptions... (Score:2)
As far as the high MTBFs mentioned by the submitter, I can think of at least two perfectly valid methods that accurately determine long MTBFs:
First, you have the theoretical MTBF. This is where you look at the chan