[Lustre-discuss] software raid

Lundgren, Andrew Andrew.Lundgren at Level3.com
Mon Mar 28 15:17:53 PDT 2011


We have also had a few kernel panics at the same time as a failed disk. I don't know which came first, but anecdotally it seems that we might be seeing an occasional kernel panic with a disk failure on swraid. That is still just FUD, though, so don't put stock in it unless you see it yourself.

-----Original Message-----
From: Oleg Drokin [mailto:green at whamcloud.com] 
Sent: Monday, March 28, 2011 3:57 PM
To: Lundgren, Andrew
Cc: Brian O'Connor; lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] software raid

Hello!

On Mar 28, 2011, at 4:43 PM, Lundgren, Andrew wrote:
> 
> When you reboot a machine that has a failed disk in the array (degraded), the array will not start by default in a degraded state.  If you have LVM volumes on top of your RAID arrays, they will not start either.  You will need to log into the machine, manually force-start the array in a degraded state, and then manually start the LVM volumes on top of the SW RAID array.

I am with you on everything but this point.
In my experience Linux SW RAID does start when the array is degraded. Unless you have --no-degraded as a default mdadm option, of course.
There is a subtle case where it does behave strangely, and I see it on just one of my nodes: all devices claim they were stopped cleanly, yet they disagree about the number of events processed.
In this case the array still starts in degraded mode, but the one disk that has the outlying event counter is kicked from the array and is not rebuilt until you manually re-add it.
I have seen it only with RAID5 so far, and the theory is that a disk controller (or the disks themselves?) in that particular node is bad and does not flush its cache when asked and on power off.
Of course, if you miss this degraded state and don't re-add anything, there is a chance that on the next reboot the two remaining disks will get out of sync as well, and then the array will fail to start completely.
Surprisingly, what totally fixed this issue for me was enabling bitmaps (of course, if you don't want the negative performance impact of those, you need to set them up on a separate device).
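
Since the risk above comes from missing the degraded state, here is a minimal Python sketch of one way to spot it. It scans /proc/mdstat-style text for arrays whose member count or [UU_] status map shows a missing device. The function name degraded_arrays and the SAMPLE text are made up for illustration, and the parsing assumes the usual "[n/m] [UU_]" status line layout, so treat it as a starting point rather than a finished monitoring tool.

```python
import re

# Illustrative /proc/mdstat contents (made up for this sketch):
# md0 is a 3-disk RAID5 with one member missing, md1 is healthy.
SAMPLE = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[0] sdc1[1]
      1953260544 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
      bitmap: 0/8 pages [0KB], 65536KB chunk

md1 : active raid1 sde1[0] sdf1[1]
      976630464 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: (expected_members, active_members)} for degraded arrays."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        # Array header lines look like "md0 : active raid5 ..."
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status lines end with e.g. "[3/2] [UU_]": expected/active counts
        # followed by a per-slot map where "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected = int(status.group(1))
            active = int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded[current] = (expected, active)
    return degraded


if __name__ == "__main__":
    try:
        with open("/proc/mdstat") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the illustrative sample off-Linux
    for name, (expected, active) in degraded_arrays(text).items():
        print(f"{name}: degraded, {active} of {expected} members active")
```

Run from cron or a monitoring check, something like this would flag the array while it is still running on the remaining disks, so the kicked disk can be re-added before the next reboot.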

Bye,
    Oleg


