Google’s Drive Study

February 19th, 2007


I saw this post on Slashdot the other day: it’s a paper called “Failure Trends in a Large Disk Drive Population”. It’s a good read for anyone in the storage business; hell, it’s a good read for anyone interested in computers. In section 5, under conclusions, they state:

In this study we report on the failure characteristics of consumer-grade disk drives. To our knowledge, the study is unprecedented in that it uses a much larger population size than has been previously reported and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime. Such analysis is made possible by a new highly parallel health data collection and analysis infrastructure, and by the sheer size of our computing deployment.
One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels. Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population. Although our data do not allow us to conclude that there is no such correlation, it provides strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment.

These two points are interesting. In some of the labs I’ve worked in, an astonishing number of drives die regularly. The manufacturer/distributor excuse has always been “heat issues” or “use cases”. Admittedly, the temperature range Google capped at was 50 °C (122 °F). In a rack densely stacked with servers (1-2U machines, rack filled), with those machines running at 75% CPU load or higher, non-stop disk I/O (read, write, delete/format) and constant power cycles, the temperature inside the rack could spike far past the 122 °F mark, at which point the failure trend Google reports starts to climb again.

Of course, in the labs I’ve been in, we were using these as test bed machines; high reliability was not critically important, for the simple fact that these machines were disposable.

Even with that in mind: you should always assume your disk drives are going to fail sooner than you expect. A per-drive MTBF does not help much on a large enough pool of disks not configured in a “smart” configuration (e.g. RAID, arrays, etc.): with enough drives, something is nearly always failing. I’m not talking about consumer-use patterns (although I just had a drive go south on my laptop); I’m talking about datacenter/IT/etc. use cases.

The Google paper is a good reference case, but you should remember that all use patterns are different. An application, test or system that really puts the disks to use can cause drive failures much earlier than you (or any paper) might assume. A good chunk of the “storage industry” realized this long ago; this is why companies (cough) work on software applications and intelligent hardware “wrappers” (arrays, RAID, etc.) to work around the basic assumption that in a large enough pool of drives, you’re going to have near-constant drive failure. People might disagree with the prices or methodology, but the fact remains that the basic assumption is true.
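To put rough numbers on the “near-constant drive failure” assumption, here is a minimal back-of-the-envelope sketch. The 3% annualized failure rate and the pool sizes are illustrative values I picked, not figures from the paper, and the model assumes drives fail independently at a constant rate, which real pools do not strictly do.

```python
def expected_failures_per_year(num_drives: int, afr: float) -> float:
    """Expected number of drive failures per year in a pool.

    `afr` is an annualized failure rate as a fraction (0.03 == 3%).
    """
    return num_drives * afr


def prob_any_failure(num_drives: int, afr: float, days: float) -> float:
    """Probability that at least one drive in the pool fails within `days`.

    Assumes independent failures at a constant rate (a simplification).
    """
    # Chance that a single drive survives the window.
    per_drive_survival = (1.0 - afr) ** (days / 365.0)
    # The pool sees no failure only if every drive survives.
    return 1.0 - per_drive_survival ** num_drives


if __name__ == "__main__":
    assumed_afr = 0.03  # hypothetical value, for illustration only
    for pool_size in (10, 1_000, 100_000):
        per_year = expected_failures_per_year(pool_size, assumed_afr)
        any_today = prob_any_failure(pool_size, assumed_afr, days=1)
        print(f"{pool_size:>7} drives: ~{per_year:.1f} failures/year, "
              f"P(at least one failure today) = {any_today:.4f}")
```

Under these assumptions, a small pool rarely sees a failure on any given day, while a very large pool sees one almost every day, which is the case the arrays and RAID “wrappers” are built for.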

Of course, that reasoning can be held for any piece of hardware in the typical data center. Apply too much heat/load to a pool of machines and your failure rate is going to be high unless the machines were designed with high reliability in mind (which normally indicates RAID/Fiber/etc. storage).

In any case, the paper is a good read. I’ve gone and started rambling. If you’re looking for some tools to test drives/filesystems in general, I’d take a look at the standard Bonnie/Bonnie++ and other tools, but also take a look at Rugg (built in Python). Also remember that it’s important to stress a drive below the filesystem layer. Typically, this means raw-writing to the device. If your job is to test drive speed/reliability, or to test the reliability of drive drivers for your operating system, that’s a step you can’t forget.
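As a rough idea of what raw-writing looks like, here is a minimal Python sketch that writes a known pattern block by block and reads it back. The function name, block size and block count are my own choices for illustration; this is not how Bonnie++ or Rugg work. Be careful: pointing it at a real device node (something like /dev/sdX) destroys whatever is on that device, so try it on a scratch file first.

```python
import os


def write_and_verify(path: str, block_size: int = 4096, num_blocks: int = 256) -> int:
    """Write a known pattern to `path`, read it back, and return the
    number of blocks that did not match.

    WARNING: on a real device node this overwrites existing data.
    `block_size` is assumed to be a multiple of 8.
    """
    def pattern(index: int) -> bytes:
        # Each block repeats its own index, so misplaced writes show up too.
        return index.to_bytes(8, "big") * (block_size // 8)

    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        for i in range(num_blocks):
            os.lseek(fd, i * block_size, os.SEEK_SET)
            written = os.write(fd, pattern(i))
            if written != block_size:
                raise OSError(f"short write at block {i}: {written} bytes")
        os.fsync(fd)  # ask the OS to flush buffered writes to the device

        mismatches = 0
        for i in range(num_blocks):
            os.lseek(fd, i * block_size, os.SEEK_SET)
            if os.read(fd, block_size) != pattern(i):
                mismatches += 1
        return mismatches
    finally:
        os.close(fd)
```

One caveat with this sketch: the read-back can be served from the operating system's page cache rather than the platters, so a serious test would bypass the cache (on Linux, for example, by opening with O_DIRECT and aligned buffers) and run far more data for far longer.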

Update: StorageMojo has a more detailed breakdown.


You are currently reading Google’s Drive Study at jessenoller.com.
