
Manufacturers' MTTF Underestimates HDD Failure Rates

by Navin Maini on 4 March 2007, 14:39

A study conducted by Carnegie Mellon University concludes that hard disk drive failure rates are up to fifteen times higher than indicated by the MTTF information provided by manufacturers.

Using a sample of roughly 100,000 drives, the study, which was presented last month at the 5th USENIX Conference on File and Storage Technologies in San Jose, also found that SATA drives were no less reliable than the more expensive and faster Fibre Channel varieties.

Computerworld reports that a further study presented at the same conference, which used drive samples from Google-run data centres, found that drive temperature appeared to have little effect on drive reliability.

The Carnegie Mellon University study notes that manufacturer data sheets for the drives sampled indicated MTTF (Mean Time To Failure) values of between 1 and 1.5 million hours, from which it follows that the worst-case annual failure rate should be about 0.88%. The study found, however, that the drives sampled, which came from large production systems, Internet service sites and so on, had annual failure rates of somewhere between 2 and 4%. Some systems showed failure rates of an astounding 13%.
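The 0.88% figure is consistent with dividing the number of hours in a year by the quoted MTTF. The short Python sketch below reproduces that arithmetic and compares it with the observed rates quoted above. It assumes a drive that is powered on around the clock (8,760 hours a year) and a constant failure rate; the function names are illustrative and do not come from the study.

```python
# Illustrative sketch: convert a data-sheet MTTF into an implied annual
# failure rate, assuming 24/7 operation and a constant failure rate.
# These assumptions and all names here are ours, not the study's.

HOURS_PER_YEAR = 24 * 365  # 8,760 power-on hours per year (assumed)


def annual_failure_rate(mttf_hours: float) -> float:
    """Return the implied annual failure rate as a fraction (0.01 == 1%)."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be a positive number of hours")
    return HOURS_PER_YEAR / mttf_hours


def times_higher(observed_rate: float, mttf_hours: float) -> float:
    """Return how many times the observed rate exceeds the data-sheet rate."""
    return observed_rate / annual_failure_rate(mttf_hours)


if __name__ == "__main__":
    # Data-sheet MTTF values quoted in the article.
    for mttf in (1_000_000, 1_500_000):
        print(f"MTTF {mttf:,} h -> {annual_failure_rate(mttf):.2%} per year")

    # Observed annual rates quoted in the article: 2%, 4% and 13%.
    for observed in (0.02, 0.04, 0.13):
        ratio = times_higher(observed, 1_000_000)
        print(f"Observed {observed:.0%} is {ratio:.1f}x the worst-case figure")
```

Under these assumptions, an MTTF of 1 million hours works out to roughly 0.88% a year and 1.5 million hours to roughly 0.58%, so the 13% seen on some systems is close to fifteen times the worst-case data-sheet figure.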

Commenting on the study, co-author Garth Gibson, an associate professor of computer science, stressed that its goal was to help manufacturers improve not only the design of drives but also the testing processes used. He made clear that he had no vendor-specific material and that the study did not necessarily track actual drive failures but rather customer-diagnosed failures, where the customer felt the drive in question required replacement. Lastly, he stated that helping users to distinguish between the best and worst vendors was not a goal of the study.

Mr. Gibson went on to echo a view held by analysts and vendors that perhaps as many as 50% of storage drives returned by customers had no failures, and that failures in general could occur for a multitude of reasons, ranging from the extreme environments to which a drive may be subjected, to random read-write or otherwise intensive operations that could cause premature wear and tear to the mechanical components within the drive.

Of the drive vendors asked to comment on the study, several declined. A spokesperson from Seagate, based in California, responded via e-mail that 'The conditions that surround true drive failures are complicated and require a detailed failure analysis to determine what the failure mechanisms were'. Echoing the points above, the spokesperson went on to state that 'It is important to not only understand the kind of drive being used, but the system or environment in which it was placed and its workload'.

Not everyone was surprised by the study's results, however. Ashish Nadkarni, a principal consultant at Massachusetts-based storage services provider GlassHouse Technologies Inc., expressed no surprise at the replacement rates quoted, citing the distinct differences between the environment in which drive vendors test drives and the dust, noise and vibration that may be present in a data centre.

Mr. Nadkarni elaborated that, in his view, the overall quality of drives has been falling over time due to price competition within the industry. He suggested that customers track disk drive records and press vendors to review their internal testing procedures.

HEXUS Forums :: 7 Comments

I never expected a million hours out of my drive anyway.
I have first-hand experience of that: when I was using an under-powered PSU about 3 years ago, I had a drive failure every 2 months (out of half a dozen drives); after replacing the PSU I get one drive failure a year (out of a dozen).

That operating temperature has no effect is quite interesting. Does that mean I can run my hard drives at 55°C and they will not age faster than at 40°C?
Not surprising really. Working in a computer repair shop, we regularly see drives failing after 3-4 years... sooner in laptops. But on the other hand, we see machines with Windows 98 on still going strong...

BUT the best way to keep your drive alive IMO is to make sure you've got enough RAM - drives die much, much earlier if they're constantly acting as your virtual memory. This can be shown with the equation:

256mb + xp + Norton = dead hard drive

There was a particular batch of eMachines from 2002/3 that shipped with XP and 128MB!! Those poor suckers died after 18 months.

Oh, and I would avoid Maxtor :D those suckers drop like mayflies


Something tells me the MTBF testing method involves testing in an air-conditioned room with the drive powered on but only writing or reading once a day, with a special casing to dampen its vibrations. They are probably also read bedtime stories and fed only the finest conditioned power.
arthurleung
That operating temperature has no effect is quite interesting. Does that mean I can run my hard drives at 55°C and they will not age faster than at 40°C?
No, no - operating temperature has a limited effect, but within reason.

And frankly, if you're even getting a hard drive failure a year, you've got some serious problems. Or you keep buying Maxtors.

All 3 of my Western Digitals from 2001 are still working now, and until around September they were used almost daily (first at home, and now in a RAID0 array in my machine at work, on 24/7).

funnelhead
BUT the best way to keep your drive alive IMO is to make sure you've got enough RAM - drives die much, much earlier if they're constantly acting as your virtual memory. This can be shown with the equation:

256mb + xp + Norton = dead hard drive
:D :D

You could even take out the “256mb” and “XP” and that would be an equally valid equation.