My final talk at Linux.conf.au 2013 was about “md” software RAID.
Slides are here and video is here (mp4).
One takeaway, mainly from conversations afterwards, is that there is a perception that it is not that uncommon for drives to fail in a way that causes them to return the wrong data without reporting an error. Thus using a checksum per block, or 3-drive RAID1 with voting, or RAID6 with P/Q checks on every read, might actually be a good idea. It is sad that such drives are not extremely uncommon, but it seems that it might be a reality.
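To make the "3-drive RAID1 with voting" idea concrete, here is a minimal sketch in Python. It is only an illustration and not md code: `vote_read` is a hypothetical helper, and it assumes the same block has already been read from every mirror into memory.

```python
from collections import Counter


def vote_read(copies):
    """Pick the block contents that a strict majority of mirrors agree on.

    copies: list of bytes objects, one per mirror, all for the same block.
    Returns (data, bad) where bad lists the indices of mirrors that
    disagreed with the majority.
    Raises ValueError if no strict majority exists.
    """
    counts = Counter(copies)
    winner, votes = counts.most_common(1)[0]
    # A strict majority is required; a tie cannot tell us who is wrong.
    if votes * 2 <= len(copies):
        raise ValueError("no majority among mirror copies")
    bad = [i for i, c in enumerate(copies) if c != winner]
    return winner, bad
```

With three mirrors, a single drive silently returning wrong data is outvoted and identified by its index. With only two mirrors, a mismatch can be detected but not attributed to either drive, which is why voting needs at least three copies.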
What does one do when one finds such a drive? Fixing the “error” and continuing quietly seems like a mistake. Kicking the drive from the array is probably right, but might be too harsh. Stopping all IO and waiting for operator assistance is tempting… but crazy.
I wonder…
Hi Neil, might want to update the video link to https://mirror.linux.org.au/pub/linux.conf.au/2013/mp4/RAID_is_more_than_parity_and_mirrors.mp4