Hacker News

I really want to applaud backblaze for publishing these reports and stats. Too many companies closely guard this information that really helps the larger community. Based on the previous blogs from backblaze, when I built out our new hadoop cluster, I purchased 1450 Hitachi drives. I plan to gather our failure rates and publish them as backblaze does. Thanks for blazing the path!


Yev from Backblaze -> Thanks! That really is one of our goals with these updates: for others to join us and start sharing this data. It makes a lot more sense if everyone is doing it; then we can start comparing environments, and all sorts of fun stuff!


For the Seagate 7200.14, have you checked to see if you have the latest firmware on all of them? http://knowledge.seagate.com/articles/en_US/FAQ/223651en

Same for the 7200.11: http://knowledge.seagate.com/articles/en_US/FAQ/207951en


Yup, we have top men in charge of keeping an eye on firmware; if we see a need for an update, we'll do it.


Have you been able to reach out to anyone at Seagate in their drive FA group to see if they'd be interested in you sending back samples of the failed drives?


We do have occasional talks with all hard drive manufacturers, but we can't say what they're about :)


I assume (as mentioned in the article) they get sent back for warranty RMA.


Yes, we try to RMA all the drives that are still under warranty.


> I purchased 1450 Hitachi drives.

Isn't using very similar drives a problem because their failure rates are not statistically independent, so there is a high probability that they will all fail at around the same time?
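
The concern can be sketched numerically. A toy Monte Carlo simulation (hypothetical failure rates, not Backblaze's numbers) shows how a shared batch defect that correlates failures raises the chance a 5-disk RAID5 loses two or more drives, compared to fully independent failures:

```python
import random

def raid5_loss_prob(p_disk=0.03, p_batch_defect=0.01, trials=100_000, seed=42):
    """Estimate P(>=2 of 5 drives fail) under two models.

    Independent model: each drive fails on its own with p_disk.
    Correlated model: with probability p_batch_defect the whole batch
    shares a flaw that jumps every drive's failure probability to 0.5.
    All numbers here are made up for illustration.
    """
    rng = random.Random(seed)
    indep_losses = corr_losses = 0
    for _ in range(trials):
        # Independent failures: RAID5 only survives a single loss.
        if sum(rng.random() < p_disk for _ in range(5)) >= 2:
            indep_losses += 1
        # Correlated failures via a shared batch defect.
        p = 0.5 if rng.random() < p_batch_defect else p_disk
        if sum(rng.random() < p for _ in range(5)) >= 2:
            corr_losses += 1
    return indep_losses / trials, corr_losses / trials
```

Even a 1% chance of a bad batch roughly doubles the array-loss probability in this toy model, which is why mixing batches (or vendors) matters.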


I've seen it. A RAID5 with 5 similar disks failed over a weekend. All but one drive were dead. Of course the disk failures also prevented the daily backup run, and all hell broke loose on Monday morning. The exact cause of death is unknown. The important fact is that the array was configured on Thursday, so I'd guess simultaneous death is most likely when drives are brand new and from the same batch.


I saw that once in an array that had been on 24/7 for years. One drive was failing, so they shut the array down to replace it (perhaps hot swap was not an option?) and almost all of the rest of the drives did not come back up. Basically the heads stuck to the platters, a.k.a. "stiction".

I would guess in your scenario something like that happened. Or perhaps trying to migrate tons of data to a new disk caused an issue. Just seems unlikely otherwise.


In practice the only times I've ever heard of something like that biting were serious firmware bugs (e.g. a counter overflow causing drives to power-cycle after 45 days of uptime; source: I work for a major CDN with a six-figure drive count).

There are some statistical properties I recall that make mechanical failures spread out somewhat nicely over time; maybe someone with a background in stats can elaborate.


Yes, there is a concern about getting a bad batch, or having several drives fail all at once. Interestingly enough, while opening those 1450 anti-static cases and checking their manufacture dates, we found they came from separate batches (some with more dust on the cases than others), so we have some heterogeneity among them. HDFS with its various replication factors, plus our backups/DR for high-value data, takes care of the rest.
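
To give a rough sense of why replication helps here, a back-of-the-envelope sketch (my own simplification: replicas placed uniformly at random, ignoring HDFS's rack-aware placement, and all numbers hypothetical): with n drives, f simultaneous failures, and replication factor r, a given block is lost only if all r of its replicas landed on failed drives, i.e. C(f, r) / C(n, r).

```python
from math import comb

def block_loss_prob(n_drives, n_failed, replication):
    """Probability a block lost all its replicas, assuming uniform
    random replica placement (a simplification of real HDFS, which
    is rack-aware). Hypothetical model, not measured data."""
    if n_failed < replication:
        return 0.0  # fewer failures than replicas: no block can be lost
    return comb(n_failed, replication) / comb(n_drives, replication)

# e.g. 1450 drives, 10 failing at once, replication factor 3
p = block_loss_prob(1450, 10, 3)
```

Even ten simultaneous failures out of 1450 drives leave the per-block loss probability tiny under this model, which is the intuition behind leaning on replication plus backups rather than per-batch drive selection alone.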


Damn, how much did that set you back? How much space? SSDs or HDDs? Very curious. What kind of datasets are you analyzing with that much space?



