[Lustre-discuss] slow recovery when MDS failed over
Brock Palen
brockp at umich.edu
Mon Aug 18 20:07:47 PDT 2008
Something appeared to be messed up. We rebuilt the filesystem and
now we can't reproduce the problem.
Thanks for looking into it.
I am doing some failover testing right now; see my other emails. Now
that I have the MGS seen as two hosts, failover is quite snappy for a
known failover, i.e. a reboot of the active MDS; heartbeat does what it
should. Recovery from yanking power (ipmitool chassis power reset) takes
a little longer but is still quite fast.
I am much happier with Lustre failover than I was a few days ago. These
were my own personal growing pains.
Thanks again for looking into this.
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
On Aug 18, 2008, at 11:02 PM, Andreas Dilger wrote:
> On Aug 07, 2008 12:06 -0400, Brock Palen wrote:
>> When the MDS came up on the new server by heartbeat it went into
>> recovery as expected. The MDS now has been in recovery for 1.5
>> hours. I don't think this is normal.
>>
>> What would cause this? I know that having a client go down (the reset
>> above) while the MDS is down but before recovery will cause recovery
>> to time out, but 1.5 hours is an unacceptable time to wait for the
>> file system to come back.
>
> The recovery should time out in about 5 minutes if the clients do not
> reply. Something is definitely wrong.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.