Hacker News

I really want to applaud backblaze for publishing these reports and stats. Too many companies closely guard this information that really helps the larger community. Based on the previous blogs from backblaze, when I built out our new hadoop cluster, I purchased 1450 Hitachi drives. I plan to gather our failure rates and publish them as backblaze does. Thanks for blazing the path!


Yev from Backblaze -> Thanks! That really is one of our goals with these updates: for others to join us and start sharing this data. It makes a lot more sense if everyone is doing it; then we can start comparing environments, and all sorts of fun stuff!


For the Seagate 7200.14, have you checked to see if you have the latest firmware on all of them? http://knowledge.seagate.com/articles/en_US/FAQ/223651en

Same for the 7200.11: http://knowledge.seagate.com/articles/en_US/FAQ/207951en


Yup, we have top men in charge of keeping an eye on firmware; if we see a need for an update, we'll do it.


Have you been able to reach out to anyone at Seagate in their drive FA group to see if they'd be interested in you sending back samples of the failed drives?


We do have occasional talks with all hard drive manufacturers, but we can't say what they're about :)


I assume (as mentioned in the article) they get sent back for warranty RMA.


Yes, we try to RMA all the drives that are still under warranty.


> I purchased 1450 Hitachi drives.

Isn't using very similar drives a problem because their failure rates are not statistically independent, so there is a high probability that they will all fail at around the same time?
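
The concern can be sketched numerically. A toy Monte Carlo simulation (hypothetical failure rates, not Backblaze's numbers) shows how a shared batch defect that correlates failures raises the chance a 5-disk RAID5 loses two or more drives, compared to fully independent failures:

```python
import random

def raid5_loss_prob(p_disk=0.03, p_batch_defect=0.01, trials=100_000, seed=42):
    """Estimate P(>=2 of 5 drives fail) under two models.

    Independent model: each drive fails on its own with p_disk.
    Correlated model: with probability p_batch_defect the whole batch
    shares a flaw that jumps every drive's failure probability to 0.5.
    All numbers here are made up for illustration.
    """
    rng = random.Random(seed)
    indep_losses = corr_losses = 0
    for _ in range(trials):
        # Independent failures: RAID5 only survives a single loss.
        if sum(rng.random() < p_disk for _ in range(5)) >= 2:
            indep_losses += 1
        # Correlated failures via a shared batch defect.
        p = 0.5 if rng.random() < p_batch_defect else p_disk
        if sum(rng.random() < p for _ in range(5)) >= 2:
            corr_losses += 1
    return indep_losses / trials, corr_losses / trials
```

Even a 1% chance of a bad batch roughly doubles the array-loss probability in this toy model, which is why mixing batches (or vendors) matters.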


I've seen it. A RAID5 with 5 similar disks failed over a weekend. All but one drive were dead. Of course the disk failures also prevented the daily backup run, and all hell broke loose on Monday morning. The exact cause of death is unknown. The important fact is that the array was configured on Thursday, so I'd guess simultaneous death is most likely when drives are brand new and from the same batch.


I saw that once in an array that had been on 24/7 for years. One drive was failing, so they shut the array down to replace it (perhaps hot swap was not an option?) and almost all of the rest of the drives did not come back up. Basically the heads stuck to the platters, a.k.a. "stiction".

I would guess in your scenario something like that happened. Or perhaps trying to migrate tons of data to a new disk caused an issue. Just seems unlikely otherwise.


In practice the only times I've ever heard of something like that biting were serious firmware bugs (e.g. a counter overflow causing drives to power-cycle after 45 days of uptime; source: I work for a major CDN with a six-figure drive count).

There are some statistical properties I recall that make mechanical failures spread out somewhat nicely over time; maybe someone with a background in stats can elaborate.


Yes, there is a concern about getting a bad batch, or having several drives fail all at once. Interestingly enough, while opening those 1450 anti-static cases and checking their manufacture dates, we found they came from separate batches (some with more dust on the cases than others), so we have some heterogeneity among them. HDFS with its various replication factors, plus our backups/DR for high-value data, takes care of the rest.
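
To give a rough sense of why replication helps here, a back-of-the-envelope sketch (my own simplification: replicas placed uniformly at random, ignoring HDFS's rack-aware placement, and all numbers hypothetical): with n drives, f simultaneous failures, and replication factor r, a given block is lost only if all r of its replicas landed on failed drives, i.e. C(f, r) / C(n, r).

```python
from math import comb

def block_loss_prob(n_drives, n_failed, replication):
    """Probability a block lost all its replicas, assuming uniform
    random replica placement (a simplification of real HDFS, which
    is rack-aware). Hypothetical model, not measured data."""
    if n_failed < replication:
        return 0.0  # fewer failures than replicas: no block can be lost
    return comb(n_failed, replication) / comb(n_drives, replication)

# e.g. 1450 drives, 10 failing at once, replication factor 3
p = block_loss_prob(1450, 10, 3)
```

Even ten simultaneous failures out of 1450 drives leave the per-block loss probability tiny under this model, which is the intuition behind leaning on replication plus backups rather than per-batch drive selection alone.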


Damn, how much did that set you back? How much space? SSDs or HDDs? Very curious. What kind of datasets are you analyzing with that much space?



