My final talk at Linux.conf.au 2013 was about “md” software RAID.
One takeaway, mainly from conversations afterwards, is that there is a perception that it is not that uncommon for drives to fail in a way that causes them to return the wrong data without reporting any error. Thus using a checksum per block, or 3-drive RAID1 with voting, or RAID6 with P/Q checks on every read might actually be a good idea. It is sad that such drives are not extremely uncommon, but it seems that it might be a reality.
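To make the "3-drive RAID1 with voting" idea concrete, here is a minimal sketch in Python. It is not md code and does not reflect how md reads from mirrors; the function name `vote_read` and its interface are made up purely for illustration. It assumes the same block has already been read from every mirror and simply picks the majority copy, reporting which mirrors disagreed.

```python
from collections import Counter


def vote_read(copies):
    """Pick the majority copy of one block read from every mirror.

    copies: list of bytes objects, one per mirror, in drive order.
    Returns (data, suspects), where suspects lists the indices of the
    mirrors whose copy differed from the majority.
    Raises ValueError when no copy has a strict majority.
    (Hypothetical helper for illustration; not part of md.)
    """
    winner, votes = Counter(copies).most_common(1)[0]
    if votes * 2 <= len(copies):
        # e.g. three mirrors returning three different blocks
        raise ValueError("no majority among mirrors")
    suspects = [i for i, c in enumerate(copies) if c != winner]
    return winner, suspects


if __name__ == "__main__":
    data, suspects = vote_read([b"good", b"good", b"g00d"])
    print(data, suspects)
```

With three mirrors a single silently corrupting drive is outvoted and identified by its index. With only two mirrors a mismatch can be detected but not attributed to either drive, which is why the voting variant needs the third copy.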
What does one do when one finds such a drive? Fixing the “error” and continuing quietly seems like a mistake. Kicking the drive from the array is probably right, but might be too harsh. Stopping all IO and waiting for operator assistance is tempting… but crazy.