[Lustre-discuss] slow recovery when MDS failed over

Brock Palen brockp at umich.edu
Thu Aug 7 09:06:18 PDT 2008


In doing some testing with our new hardware I did the following:

I rebooted the active MDS server, and it failed over to the second one as  
expected.  While this was happening, a client was reset.

When heartbeat brought the MDS up on the new server, it went into  
recovery as expected.  The MDS has now been in recovery for 1.5  
hours.  I don't think this is normal.

What would cause this?  I know that having a client go down (the reset  
above) while the MDS is down, but before recovery starts, will cause  
recovery to time out, but 1.5 hours is an unacceptable time to wait for  
the file system to come back.

This is a stock 1.6.5.1 install.

cat recovery_status

status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 117
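
For anyone watching this file during a failover test, here is a minimal Python sketch that parses the output above into fields. The function names and the "stalled" heuristic are hypothetical, made up for illustration; they are not part of Lustre and do not reflect how the MDS itself decides when recovery ends.

```python
# Sample text copied from the recovery_status output in this post.
SAMPLE = """status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 117
"""


def parse_recovery_status(text):
    """Split 'key: value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip blank lines and lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def looks_stalled(fields):
    """Assumed heuristic: still RECOVERING, no time left, no clients connected."""
    connected = fields.get("connected_clients", "0/0").split("/")[0]
    return (
        fields.get("status") == "RECOVERING"
        and fields.get("time_remaining") == "0"
        and connected == "0"
    )


if __name__ == "__main__":
    fields = parse_recovery_status(SAMPLE)
    for name in ("status", "time_remaining", "connected_clients"):
        print(name, "=", fields[name])
    print("looks stalled:", looks_stalled(fields))
```

Values are kept as strings on purpose, since a field such as replayed_requests can hold "0/??" rather than a number.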

Did I somehow wedge the file system?



Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985





