Calculating the Mean Time Between Failures? 100
Blue Booger asks: "I was looking over some Fibre Channel hard drives and noticed that the Mean Time Between Failures was rated at 1.2 million hours. I thought that was pretty high, and figured it up to be close to 137 YEARS!! I went to check some regular IDE drives just for comparison, and they were rated at 500,000 hours (57 years). Now, as I understand it, this is supposed to be the average time that you can expect the drive to last before failing. I rarely have an IDE drive last more than 4 years, and my record is 10 years, so what is the deal? BTW, that is 57 years running 24 hours a day...the MTBF is rated as power-on time. Here you can find Western Digital's glossary that defines the term MTBF (pdf). Here you can find a spec sheet on one of their 20GB IDE drives. I checked, and Seagate also lists similar MTBFs. How the heck are they coming up with these numbers?"
Duty Cycle (Score:4, Informative)
Re:Duty Cycle (Score:2)
Re:Duty Cycle (Score:3, Informative)
Re:Duty Cycle (Score:1)
Re:Duty Cycle (Score:2)
Re:Duty Cycle (Score:3, Informative)
No, I'm not kidding. Some heads rolled over that...
Do it like the pros. (Score:1, Funny)
Like, my hard drive has an MTBF of 300,000 hours.
Re:Do it like the pros. (Score:2)
not just drives... (Score:4, Interesting)
Something else is going on . . . (Score:1)
Either you just happened to get a bad batch, or you've got environmental problems. Make sure there is sufficient air flow around the units, and check the power harmonics on the circuit. Most consumer-grade (read: cheap) electronics use crappy power supplies, which cause harmonics on the power line. One or two isn't a big
Simple, it's called "lies" (Score:5, Funny)
You are wrong (Score:4, Informative)
If you have twenty drives with a twenty-year MTBF (Mean Time Between Failures) each, then you have one failure per year on average. That's basic statistics always fighting against you.
As a sidenote (Score:1)
Re:As a sidenote (Score:5, Funny)
Shouldn't that be "The views expressed hereafter are not necessarily those of MENSA, of which I am only a member." I would think proper grammar usage would be a prerequisite for being a MENSA member.
Re:As a sidenote (Score:2)
What of the pomposity of foisting membership in a group of self-righteous egotists on a public forum? Over-compensation, anyone?
The only problem with her scheme is that Mensa membership is easy to come by. Some people just find it an undesirable association. But if she feels she has to emphasize her membership because perhaps her posts can't stand on their own me
Re:As a sidenote (Score:1)
Re:As a sidenote (Score:2)
Maybe it was just that particular chapter but I got the distinct impression that they were a bunch of egotistical losers!
Re:As a sidenote (Score:4, Funny)
<input type="radio" name="gift" value="money" disabled>
<input type="radio" name="gift" value="penis size" disabled>
<input type="radio" name="gift" value="ability to nitpick trivia" checked>
Sidestepping sidenotes (Score:1)
As for the language bit, a lot of really intelligent people are totally dyslexic when it comes to grammar and spelling, since it makes no sense anyway IMHO
No, you are wrong (Score:4, Insightful)
Re:No, you are wrong (Score:1)
Is there some special definition for MTBF that changes how "mean time between" is interpreted?
Re:No, you are wrong (Score:1)
Re:No, you are wrong (Score:2, Informative)
I would have called your average "mean time to failure" vs. "mean time between failures."
But as other posters mentioned, most of these stats are marketing bunk no matter how they're computed!
Re:No, you are wrong (Score:2)
Re:No, you are wrong (Score:1)
Re:No, you are wrong (Score:2)
Re:No, you are wrong (Score:1)
I would have called your average "mean time to failure" vs. "mean time between failures."
Or better yet, "mean time before failure," thus preserving the acronym.
Re:No, you are wrong (Score:1)
Re:No, you are wrong (Score:1)
If by some special definition you mean a simple linear multiplication (or division, depending on your point of view), then yes. Anthony Dipierro was probably mistakenly thinking of a decibel or some other logarithmically scaled unit system.
Yes, I am right. (Score:1)
If I have twenty drives, each of which is estimated to fail once in a twenty-year period, then in such a twenty-year period every one of those disks is estimated to fail once. That is twenty failures in a twenty-year period on average, id est one failure per year. It is actually a matter of very simple mathematics.
No, you're wrong... (Score:1)
If I have twenty drives, each of which is estimated to fail once in a twenty-year period, then in such a twenty-year period every one of those disks is estimated to fail once.
No, you're wrong. The average time to failure is 20 years, not the maximum.
The average of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, and 20 is 10.5.
Re:No, you're wrong... (Score:2)
That might be valid when you work in a datacenter that replaces x number of drives every year.
On average, if the MTBF of those drives is 20 years, one drive in such a group will fail every year.
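A quick simulation backs this up (a sketch in Python, assuming exponentially distributed lifetimes and that each failed drive is immediately replaced; the function name and parameters are mine):

```python
import random

def failures_per_year(n_drives=20, mtbf_years=20.0, years=10_000, seed=1):
    """Simulate a pool of drives where each failure is immediately replaced.

    With exponential lifetimes, each drive fails at rate 1/MTBF, so a pool
    of 20 drives with a 20-year MTBF sees about 20/20 = 1 failure per year.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_drives):
        t = 0.0
        while t < years:
            t += rng.expovariate(1.0 / mtbf_years)  # lifetime of this unit
            if t < years:
                failures += 1  # swap in a replacement and keep going
    return failures / years

print(round(failures_per_year(), 2))  # close to 1.0
```

Over a long enough horizon the pool-wide failure rate converges on one per year, exactly as claimed.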
Re:Yes, I am right. (Score:2, Informative)
If they all failed within 20 years, how would the average disk have failed in 20 years???
20 drives/2
mod parent up - he's right (Score:2)
A quick thought experiment should make it obvious that for a series of numbers (e.g. the number of years between failures) where the highest number is n, the mean of those numbers could never be n (unless every number is n).
Re:mod parent up - he's right (Score:2)
Re:mod parent up - he's right (Score:1)
Hmm... Actually I just thought of something. If you have 20 hard drives, and one fails each year, but you fix it immediately after failure and then put it into service again, then the MTBF would be 20 years.
I guess that's why the term is the mean time between failures rather than the mean time before failure. Back when the term was invented it probably made economic sense to fix a hard drive when it broke. Nowadays we're more likely to just throw it away.
Re:mod parent up - he's right (Score:2)
Re:mod parent up - he's right (Score:1)
Yeah, but the total operating time for each drive is 20 years if you fix them and they last. What I'm saying is, one drive breaks after 1 year, you fix it and it lasts the next 19, one drive breaks after 2 years, you fix it and it lasts the next 18, etc.
Alternatively, you could have 19 of the drives last 20 years, and the other one break once a year.
Actually, you wouldn't even have to fix them, if you replace them. If you keep 20 drives for 20 years, and one breaks every year, if you replace the one th
Re:mod parent up - he's right (Score:2)
Re:mod parent up - he's right (Score:1)
if it fails after a year, and you fix it and it lasts another 19 years, MTBF is (1+19)/2 = 10. It's failed twice on you.
No, I never said it failed a second time. I said it failed after a year, then it lasted another 19 years. Then you stopped the experiment.
Re:mod parent up - he's right (Score:2)
I think you are right in this case, even with that consideration.
Since every drive failed once within a 20-year period, then we have all of our data points on the lower half of the normal curve (with the mean of 20 being in the middle), and none above. Thus, the mean for this part of the experiment would actually be around 10 years. However, if we were to continue the experiment out to 40 years or longer, and gain more samples, then it might balance out, but those repaired drives would have to be
Re:mod parent up - he's right (Score:1)
Re:mod parent up - he's right (Score:2)
Re:mod parent up - he's right (Score:1)
That's the difference between "statistical inference" and what people who have had a probability class tend to do.
I don't know if it matters as much whether or not the person has had a probability class as whether or not the person is trying to legally boost MTBF ratings.
Sure, you can get better figures using different methods, but those better figures most likely will be lower, so why bother if you want to sell your product.
Re:mod parent up - he's right (Score:2)
Re:mod parent up - he's right (Score:3, Informative)
Whatever your math may say, the Industry's standard is the one you disagree with. If they stick 1000 drives in an array, run them for 1 year, and only a single drive fails during that year, the MTBF for that model of drive is 500 years.
Re:mod parent up - he's right (Score:2)
The original post that started this thread (from mensa babe) stated "run 20 drives of MTBF 20 years, then one fails each year"; a poster replied to say they were incorrect, since it would mean the drives had an MTBF of 10.5 years. And I replied to say the correctee was right (which they are).
Your example is for 1000 drives, running for 1 year, 1 failure. Going by the same math which I used to back up the aforementioned correctee, i.e. MTBF = Sigma(runtime)/failures, which is
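For what it's worth, that Sigma(runtime)/failures formula is easy to sanity-check on both scenarios (a sketch; crediting each failed drive its full runtime is a simplifying assumption of mine):

```python
def mtbf(total_runtime_years, failures):
    """MTBF as total accumulated runtime divided by observed failures."""
    return total_runtime_years / failures

# The grandparent's scenario: 1000 drives run for 1 year, 1 failure
# (crediting the failed drive a full year of runtime for simplicity):
print(mtbf(1000 * 1, 1))   # 1000.0 years, not 500

# The scenario upthread: 20 drives run for 20 years, 20 failures total
# (one per year, each failed drive repaired so the pool stays at 20):
print(mtbf(20 * 20, 20))   # 20.0 years
```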
No, You are wrong (Score:1, Informative)
Re:You are wrong (Score:2)
Second, manufacturer-specified MTBF has nothing to do with a simple, observational mean. It is a calculated statistic based on a statistical sample and extrapolated over a normal distribution. In real-life terms, it is close to meaningless.
Basically, it does not mean "Mean Time Between observed Failur
Re:You are wrong (Score:2)
If you are assuming that a drive is repaired and placed back into service, and then fails again after another random 1-20 year period, then you would be correct.
I think the problem is that most of the geeks think in terms of practicality, and treat the situation as MTTF ("Mean-Time-To-Failure") instead of MTBF, since we all know that MTBF is BS anyway.
As a result, I clarify the statement above, myself assuming "simple MTTF", not "simple MTBF".
Here's a wild-ass guess (Score:5, Insightful)
For example, say they shipped 2 million drives last year and each ran 2,080 hours (8 hours a day, 5 days a week, 52 weeks), roughly 4 billion hours total. Out of those 2 million units, they got 3,466 returns. So the average MTBF was 1.2 million hours.
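Checking that arithmetic (a quick sketch; the shipment and return figures are the parent's hypothetical numbers, not anything from a manufacturer):

```python
drives = 2_000_000
hours_per_drive = 8 * 5 * 52     # 8 hours/day, 5 days/week, 52 weeks = 2080
returns = 3466

total_hours = drives * hours_per_drive   # 4,160,000,000 -- about 4 billion
mtbf_hours = total_hours / returns
print(f"{mtbf_hours:,.0f} hours")        # roughly 1.2 million
```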
Re:Here's a wild-ass guess (Score:4, Interesting)
Look at the definition (Score:4, Interesting)
Unfortunately, that equation doesn't take into account the fact that some equipment degrades over time; if a product is very reliable for 1,000 hours, and less reliable after that, just double the sample size (maybe triple for statistics), and see what you get.
Real reliability calculations are much more difficult than just what users think MTBF means...
Re:Look at the definition (Score:2)
MTBF might have a little bit of value as a relative measure -- i.e. perhaps drives with an order of magnitude higher MTBF will last longer. It's a lot less useful as an absolute measure.
Lots of equipment (not just hard drives) has some sort of estimated lifetime. The best the manufacturer can ever do is estimate under semi-realistic conditions and extrapolate.
Labs (Score:4, Insightful)
MTBF is probably determined by taking a bunch of drives and putting them into PERFECT conditions that NEVER exist in the real world. They run them in a way that, although it tests all functionality, really doesn't provide true conditions for drives (i.e. the head always reading/writing up and down the disk, probably never seeking; disks always spinning; etc.). Something that drives never do in real life. Statistics...statistics...statistics...(speeling too
Marketing BS... (Score:4, Insightful)
Somebody ought to sue them for deceptive advertising.
Ack!! The irony is too much. (Score:2)
float y = (float) 1/2 * x;
Bad Example (Score:2)
Enter The Matrix... (Score:3, Funny)
I prefer to measure time from the emergence of one integral anomaly to the next.
MTBF... (Score:4, Insightful)
MTBF calculation and estimation (Score:5, Informative)
Okay, first of all: "mean time between failures" is obviously a statistical measure -- it is an average over a large number of individual items. In most electronic components (including light bulbs!) the statistical distribution of the time between failures is the exponential distribution [wolfram.com], which has the odd property that it's "memory-less" -- it doesn't matter how long since the last failure it's been, the mean time to the next failure will still be the same. A consequence of this is that if the MTBF is 10,000 hours, the probability of failure in any particular hour would be 1/10,000th. So, if you set up 10,000 components, all running simultaneously, you'd expect one of them to fail within the first hour; conversely, if you ran them for 1000 hours, and 998 of them failed, you could be fairly certain that the MTBF would be around 10,000 hours.
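That memoryless property can be verified straight from the exponential survival function S(t) = e^(-t/MTBF) (a minimal sketch; the 10,000-hour figure is the one used above):

```python
import math

MTBF = 10_000.0  # hours

def survival(t):
    """P(component still alive at time t) for an exponential lifetime."""
    return math.exp(-t / MTBF)

# Memorylessness: P(T > s + t | T > s) == P(T > t), no matter how long
# the component has already survived (s drops out of the ratio).
s, t = 5_000.0, 1_000.0
conditional = survival(s + t) / survival(s)
print(conditional, survival(t))  # the two values agree
```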
Note, by the way, that this is only true when the failure time distribution is exponential -- so it works for electronic components, but not for, say, bicycles and cars and roller skates, which are more likely to fail the older they get.
This has an obvious problem, of course: if the MTBF is high, it can take forever to test. Consider, for example, something I worked on for NASA some years ago: trying to prove that a fly-by-wire system will have a mean time between failures of 1e10 hours. (This is about the same failure rate as the airframe, which is how they came up with the number.) 1e10 hours is about 1.141 million years, by the way.
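For the curious, the hours-to-years conversion checks out:

```python
HOURS_PER_YEAR = 24 * 365.25   # 8,766 hours in an average year
years = 1e10 / HOURS_PER_YEAR
print(f"{years / 1e6:.3f} million years")  # about 1.141
```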
(Pop quiz: if MTBF is a million years, how do you explain the occasional airframe failure, say, TWA 800? Hint: It doesn't require any foul play.)
At that point, you've got a couple of choices: first, you can make a lot of copies and run them simultaneously. Relatively easy for $50 disks, hard for billion dollar 747s.
Second, you can make the estimate by computation and modeling which is what you do for web systems. Conceptually, it's pretty simple to do this, although it can be a kind of pain in the ass.
The third way, which is new and cool, is by Bayesian estimation of failure rates. This method lets you make increasingly accurate estimates of the failure rate based on short experiments. I don't have time to go into it, but there are some good sources available on the web. [google.com]
Re:MTBF calculation and estimation (Score:3, Informative)
Re:MTBF calculation and estimation (Score:2)
Sadly, on some online geek test recently, I not only scored high, I scored #4 among everyone who had ever taken the test.
Then I couldn't decide whether to be embarrassed at being that much of a geek, or at only scoring fourth.
Re:MTBF calculation and estimation (Score:2)
Re:MTBF calculation and estimation (Score:2)
Re:MTBF calculation and estimation (Score:2)
One quibble I have (more with the HD manufacturers, not crmartin) is that HD have mechanical components (spindle, actuator, etc.) that are subject to wear. As a result, MTBF calculations that are appropriate for solid state electronic equipment not subject to physical wear are likely inappropriate for H
Re:MTBF calculation and estimation (Score:3, Informative)
Mechanical items tend to fail due to wearout, i.e. they become more likely to fail as time goes on.
Electronics follow a "Bathtub" curve. A high initial rate, that rapidly drops to a VERY low rate. It stays at that LOW rate of failure for MANY hours, and then the rate increases rapidly during "wearout" - sort of like the cross section of a bathtub - hence the name
The whole concept of "Burn In" -
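The three regimes of that bathtub curve are often modeled (my assumption here, not something the parent specified) with a Weibull hazard rate h(t) = (k/λ)(t/λ)^(k-1): shape k < 1 gives the falling infant-mortality regime, k = 1 the flat bottom of the tub, and k > 1 the rising wearout regime. A sketch, with arbitrary time units:

```python
def weibull_hazard(t, shape, scale=1.0):
    """Instantaneous failure rate h(t) for a Weibull-distributed lifetime."""
    return (shape / scale) * (t / scale) ** (shape - 1)

for k, phase in [(0.5, "infant mortality"), (1.0, "useful life"), (3.0, "wearout")]:
    early, late = weibull_hazard(0.1, k), weibull_hazard(2.0, k)
    trend = "falling" if late < early else ("flat" if late == early else "rising")
    print(f"k={k}: hazard is {trend} ({phase})")
```

This is also why "burn in" works for electronics: running past the infant-mortality region (k < 1) weeds out the units that would have failed early.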
Re:MTBF calculation and estimation (Score:2)
Re:MTBF calculation and estimation (Score:2)
Buy the product, use and abuse it during the original warranty period, and it'll break if it's going to.
Re:MTBF calculation and estimation (Score:2)
The point is this: the MTBF is computed for the Markovian part of the total life. The value for MTBF is computed as 1 / failure-rate IF AND ONLY IF the time distribution of failures is Markovian -- otherwise it's a more complicated function. The useful life is the length of time ov
failure. (Score:1)
Let me try (please reply if I am in the right direction):
1. This one-in-a-million-years figure does not count when the airplane is burning, or some other component has failed.
2. The airplane has lots of components. Suppose that if a door fails this could lead to failure of the airframe. If the plane has 3 doors the chance goes down to 4 failures in a million years. LOTS of comp
Re:failure. (Score:2)
By the way, your supposition about answer (1) is correct. It's just a definitional thing: we're really talking about proximate and root causes. You don't count
Why is hardware always mean? (Score:5, Funny)
Re:Why is hardware always mean? (Score:1)
The *MEAN* time isn't until after it fails.
It all depends on the distribution... (Score:3, Informative)
I went to check some regular IDE drives just for comparison, and they were rated at 500,000 hours (57 years). Now, as I understand it, this is supposed to be the average time that you can expect the drive to last before failures. I rarely have an IDE drive last more than 4 years, and my record is 10 years, so what is the deal?
Let's say I have a drive that has a 99% chance of failing after 10 years, and a 1% chance of failing after 4710 years. The MTBF is 57 years.
In fact, with the proper distribution (think 2^n) you could have an infinite MTBF, but still have a 99% chance of failure within 10 years. See for example the St. Petersburg paradox [stanford.edu].
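Both claims check out numerically (a sketch; the two-point lifetimes are from the parent post, and the 2^n distribution is just one way to get an infinite mean):

```python
# Two-point distribution from the parent: 99% fail at 10 years, 1% at 4710.
mtbf = 0.99 * 10 + 0.01 * 4710
print(round(mtbf))  # 57 years, despite a 99% chance of dying at year 10

# St. Petersburg-style lifetimes: fail at year 2**n with probability 2**-n.
# Each term contributes 2**n * 2**-n = 1, so the partial sums of the mean
# grow without bound: the MTBF is infinite.
partial_mean = sum((2 ** n) * (2 ** -n) for n in range(1, 51))
p_fail_by_8_years = sum(2 ** -n for n in range(1, 4))  # fail at year 2, 4, or 8
print(partial_mean, p_fail_by_8_years)  # 50 terms give 50.0; 0.875
```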
Re:It all depends on the distribution... (Score:2)
You've come to the right place. (Score:3, Funny)
Design Lifetime (Score:3, Interesting)
All too simple (Score:2)
This means that they hook a whole bunch of drives up and run them for a while. Add the total hours of drive operation up and divide by the number of failures. Their estimate of
MTBF = infinity (Score:1)
What MTBF Really means... (Score:1)
Where:
failures = actual number of drives RETURNED to the manufacturer
drives = total number of production drives built
hours = actual failure point (about 5 years)
Common misconceptions... (Score:2)
As far as the high MTBFs mentioned by the submitter, I can think of at least two perfectly valid methods that accurately determine long MTBFs:
First, you have the theoretical MTBF. This is where you look at the chan