Google’s Drive Study

by jesse

I saw this post on Slashdot the other day - it's a paper called "Failure Trends in a Large Disk Drive Population". It's a good read for anyone in the storage business - hell, it's a good read for anyone interested in computers. In section 5, under conclusions, they state:

In this study we report on the failure characteristics of consumer-grade disk drives. To our knowledge, the study is unprecedented in that it uses a much larger population size than has been previously reported and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime. Such analysis is made possible by a new highly parallel health data collection and analysis infrastructure, and by the sheer size of our computing deployment. One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels. Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population. Although our data do not allow us to conclude that there is no such correlation, it provides strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment.

Those two points (temperature and utilization) are interesting. In some of the labs I've worked in, an astonishing number of drives die regularly. The manufacturer/distributor excuse has always been "heat issues" or "use cases". Admittedly, the temperature range Google capped at was 50 Celsius (122 Fahrenheit). In a rack with densely stacked servers (1-2U machines, rack filled), with those machines running at 75% CPU load and above, non-stop disk I/O (read, write, delete/format) and constant machine power cycles, the temperature inside the racks could spike far past the 122 mark, at which point the failure trend Google charts starts to spike again.

Of course, in the labs I've been in, we were using these as test bed machines - total/high reliability was not something direly important for the simple fact that these machines were disposable.

Even with that in mind: you should always assume your disk drives are going to fail sooner than you expect. Given the MTBF on a large enough pool of disks not configured in a "smart" configuration (e.g. RAID, arrays, etc.), some drive in that pool is going to be failing nearly all the time. I'm not talking about consumer-use patterns (although I just had a drive go south on my laptop) - I'm talking about datacenter/IT/etc. use cases.

The Google paper is a good reference case, but you should remember that all use patterns are different. An application, test or system that really puts the disks to use can cause drive failures much earlier than you (or any paper) might assume. A good chunk of the "storage industry" realized this long ago - this is why companies (cough) work on software applications and intelligent hardware "wrappers" (arrays, RAID, etc.) to work around the basic assumption that in a large enough pool of drives, you're going to have near-constant drive failure. People might disagree with the prices or methodology, but the fact remains that the basic assumption is true.
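To put a rough number on "near constant", here's a back-of-the-envelope sketch. It assumes a constant failure rate (the exponential model that datasheet MTBF figures imply), which real drive populations don't strictly follow, so treat it as a floor on your pessimism rather than a prediction. The 1,000,000-hour MTBF and the 10,000-drive pool are made-up illustrative inputs, not numbers from the paper, and the function names are my own.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annual_failure_rate(mtbf_hours: float) -> float:
    """Chance that one drive fails within a year.

    Assumes a constant failure rate (exponential lifetime model),
    which is a simplification, not a property of real drives.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(pool_size: int, mtbf_hours: float) -> float:
    """Expected number of first failures per year across the pool."""
    return pool_size * annual_failure_rate(mtbf_hours)


def hours_between_pool_failures(pool_size: int, mtbf_hours: float) -> float:
    """Mean time between failures anywhere in the pool.

    With independent drives and a constant rate, the pool fails
    pool_size times as often as a single drive does.
    """
    return mtbf_hours / pool_size


if __name__ == "__main__":
    mtbf = 1_000_000  # hypothetical datasheet MTBF, in hours
    pool = 10_000     # hypothetical number of drives
    print("per-drive annual failure rate:", annual_failure_rate(mtbf))
    print("expected failures per year:", expected_failures_per_year(pool, mtbf))
    print("hours between pool failures:", hours_between_pool_failures(pool, mtbf))
```

With those made-up inputs, a single drive looks nearly immortal on paper, yet the pool as a whole loses a drive roughly every 100 hours, or about every four days. Plug in a more pessimistic MTBF and it gets ugly fast.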

Of course, that reasoning holds for any piece of hardware in the typical data center. Apply too much heat/load to a pool of machines and your failure rate is going to be high unless the machines were designed with high reliability in mind (which normally indicates RAID/Fiber/etc. storage).

In any case, the paper is a good read. I've gone and started rambling. If you're looking for some tools to test drives/filesystems in general, I'd take a look at the standard Bonnie/Bonnie++ and other tools, but also take a look at Rugg (built in Python), and remember that it's important to stress a drive below the filesystem layer. Typically, this means raw-writing to the device - if your job is to test drive speed/reliability or test the reliability of disk drivers for your operating system, that's a step you can't forget.
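For a feel of what raw-writing looks like, here's a minimal sketch in Python (Unix only, since it leans on os.pwrite and os.pread). It writes a known per-block pattern at fixed offsets and reads it back to find blocks that don't match. This is not how Bonnie++ or Rugg work internally; the function names and the pattern scheme are my own, and pointing it at a real device node (something like /dev/sdX, as root) will destroy whatever is on that device. It also goes through the OS page cache, so an immediate read-back may never touch the platters; a real test would use O_DIRECT with aligned buffers or drop caches between the write and verify passes.

```python
import hashlib
import os

BLOCK_SIZE = 1024 * 1024  # 1 MiB per block; an arbitrary choice


def pattern_block(index: int, size: int = BLOCK_SIZE) -> bytes:
    """Build a deterministic pattern that is different for every block,
    so a block written to the wrong offset shows up as a mismatch."""
    seed = hashlib.sha256(index.to_bytes(8, "little")).digest()
    return (seed * (size // len(seed) + 1))[:size]


def write_pattern(path: str, num_blocks: int, block_size: int = BLOCK_SIZE) -> None:
    """Write the pattern to path (a device node or an existing file)."""
    fd = os.open(path, os.O_WRONLY)
    try:
        for i in range(num_blocks):
            written = os.pwrite(fd, pattern_block(i, block_size), i * block_size)
            if written != block_size:
                raise OSError(f"short write on block {i}: {written} bytes")
        os.fsync(fd)  # ask the OS to push the data to the device
    finally:
        os.close(fd)


def verify_pattern(path: str, num_blocks: int, block_size: int = BLOCK_SIZE) -> list:
    """Read every block back and return the indices that do not match."""
    bad_blocks = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for i in range(num_blocks):
            data = os.pread(fd, block_size, i * block_size)
            if data != pattern_block(i, block_size):
                bad_blocks.append(i)
    finally:
        os.close(fd)
    return bad_blocks
```

Try it against a scratch file first: create the file, call write_pattern and then verify_pattern on it, and you should get an empty list back. Only swap in a device path once you're sure the disk is one you can afford to wipe.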

Update: StorageMojo has a more detailed breakdown.