[Lustre-discuss] slow recovery when MDS failed over

Brock Palen brockp at umich.edu
Mon Aug 18 20:07:47 PDT 2008


Something appeared to be messed up.  We rebuilt the filesystem and
now we can't reproduce the problem.
Thanks for looking into it.

I am doing some failover testing right now; see my other emails.  Now
that I have the MGS seen as two hosts, failover is quite snappy for a
known failover, i.e. a reboot of the active MDS; heartbeat does what it
should.

Recovery from yanking power (ipmitool chassis power reset) takes a
little longer but is still quite fast.
I am much happier with Lustre failover than I was a few days ago.
Chalk it up to my own growing pains.

Thanks again for looking into this.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985



On Aug 18, 2008, at 11:02 PM, Andreas Dilger wrote:
> On Aug 07, 2008  12:06 -0400, Brock Palen wrote:
>> When the MDS came up on the new server by heartbeat it went into
>> recovery as expected.  The MDS now has been in recovery for 1.5
>> hours.  I don't think this is normal.
>>
>> What would cause this?  I know that having a client go down (the reset
>> above) while the MDS is down, but before recovery, will cause recovery
>> to time out, but 1.5 hours is an unacceptable time to wait for the file
>> system to come back.
>
> The recovery should time out in about 5 minutes if the clients do not
> reply.  Something is definitely wrong.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
>



