RAID – not just smoke and mirrors

My final talk at 2013 was about “md” software RAID.

Slides are here and video is here (mp4).

One takeaway, mainly from conversations afterwards, is a perception that it is not that uncommon for drives to fail in a way that causes them to return the wrong data without reporting an error.  Thus using a checksum per block, or 3-drive RAID1 with voting, or RAID6 with P/Q checks on every read, might actually be a good idea.  It is sad that such drives are not extremely uncommon, but it seems that this might be the reality.
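To make the "3-drive RAID1 with voting" idea concrete, here is a minimal sketch of what a voting read could look like. It is only an illustration and not md code: the function name `vote_read` and its return shape are made up for this example, and it assumes the same block has already been read from every mirror into memory.

```python
from collections import Counter


def vote_read(copies):
    """Majority-vote over one block as read from each mirror.

    copies: list of bytes objects, one per drive, in drive order.
    Returns (data, suspects):
      data     - the content held by a strict majority, or None if
                 no strict majority exists
      suspects - indices of drives that disagree with the majority
                 (all drives if there is no majority)
    """
    counts = Counter(copies)
    winner, votes = counts.most_common(1)[0]
    if votes * 2 <= len(copies):
        # No strict majority: we cannot tell which copy is right.
        return None, list(range(len(copies)))
    suspects = [i for i, c in enumerate(copies) if c != winner]
    return winner, suspects


# Drive 2 silently returned wrong data; the other two outvote it.
data, suspects = vote_read([b"good", b"good", b"bad!"])
```

With only two mirrors a disagreement gives no majority, which is why voting needs at least three copies. The sketch also only says which drive is suspect; what to do with that drive is a separate policy question.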

What does one do when one finds such a drive?  Fixing the “error” and continuing quietly seems like a mistake.  Kicking the drive from the array is probably right, but might be too harsh.  Stopping all IO and waiting for operator assistance is tempting… but crazy.

I wonder…

