[Lustre-discuss] software raid

Oleg Drokin green at whamcloud.com
Mon Mar 28 14:56:42 PDT 2011


Hello!

On Mar 28, 2011, at 4:43 PM, Lundgren, Andrew wrote:
> 
> When you reboot a machine that has a failed disk in the array (degraded), the array will not start by default in a degraded state.  If you have LVMs on top of your raid arrays, they will also not start.  You will need to log into the machine, manually force start the array in a degraded state and then manually start the LVM on top of the SW raid array.

I am with you on everything but this point.
In my experience Linux SW RAID does start when the array is degraded, unless you have --no-degraded set as a default mdadm option, of course.
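For reference, bringing a stopped degraded array up by hand usually looks something like this (the md and member device names are only placeholders here):

    mdadm --assemble --run /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1   # start even with a member missing
    mdadm --assemble --scan --run    # or, for arrays already listed in mdadm.conf
    cat /proc/mdstat                 # should now show the array as active, degraded
    vgchange -ay                     # then activate any LVM volume groups sitting on top
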
There is a subtle case where it does behave strangely, and I see it on just one of my nodes: all devices claim they were stopped cleanly, yet they disagree about the number of events processed.
In this case the array still starts in degraded mode, but the one disk with the outlying event counter is kicked from the array and is not rebuilt until you manually re-add it.
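You can spot that situation by comparing the per-device event counters and then putting the kicked disk back by hand, roughly like so (device names again just for illustration; if --re-add is refused, a plain --add will trigger a full rebuild instead):

    mdadm --examine /dev/sd[abc]1 | egrep 'dev|Events'   # compare event counts across members
    mdadm --detail /dev/md0                              # shows which member got kicked
    mdadm /dev/md0 --re-add /dev/sdc1                    # put it back into the array
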
I have seen it only with RAID5 so far, and the theory is that the disk controller (or the disks themselves?) in that particular node is bad and does not flush its cache when asked and on power off.
Of course, if you miss this degraded state and don't re-add anything, there is a chance that on the next reboot the two remaining disks will get out of sync as well, and then the array will fail to start completely.
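If it does come to that, a forced assembly from the freshest members can sometimes still bring the array back, though I would treat it as a last resort and verify the data afterwards (again just an illustration with made-up device names):

    mdadm --stop /dev/md0
    mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1   # ignore the event count mismatch
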
Surprisingly, what totally fixed this issue for me was enabling write-intent bitmaps (of course, if you don't want the negative performance impact of those, you need to set them up on a separate device).
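Adding a bitmap to an existing array is just a --grow operation, roughly like this (device and file names made up; an external bitmap file must live on a filesystem that is not on this array):

    mdadm --grow /dev/md0 --bitmap=internal         # internal bitmap, simplest but costs some write performance
    mdadm --grow /dev/md0 --bitmap=none             # drop it again if needed
    mdadm --grow /dev/md0 --bitmap=/var/md0.bitmap  # external bitmap kept on a separate device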

Bye,
    Oleg

