[Lustre-discuss] slow recovery when MDS failed over
Brock Palen
brockp at umich.edu
Thu Aug 7 09:06:18 PDT 2008
While testing our new hardware, I did the following:
I rebooted the active MDS server, and it failed over to the second one
as expected. While this was happening, a client was reset.
When heartbeat brought the MDS up on the new server, it went into
recovery as expected. The MDS has now been in recovery for 1.5
hours. I don't think this is normal.
What would cause this? I know that having a client go down (the reset
above) while the MDS is down, but before recovery starts, will cause
recovery to time out, but 1.5 hours is an unacceptable time to wait for
the file system to come back.
This is a stock 1.6.5.1 install.
cat recovery_status
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 117
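For anyone who wants to watch this from a script, here is a minimal Python sketch that parses the output above into a dict. The names parse_recovery_status and looks_stalled are hypothetical, and the "stalled" check is only my own heuristic based on the values shown here (RECOVERING, no time remaining, not all clients connected), not a condition defined by Lustre.

```python
# Sample text copied from the recovery_status output above.
SAMPLE = """\
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 117
"""


def _to_int(text):
    """Return text as an int when possible, otherwise the original string."""
    try:
        return int(text)
    except ValueError:
        return text


def parse_recovery_status(text):
    """Parse 'key: value' lines; 'a/b' values become (a, b) tuples."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'key: value'
        key, value = key.strip(), value.strip()
        if "/" in value:
            done, _, total = value.partition("/")
            status[key] = (_to_int(done), _to_int(total))
        else:
            status[key] = _to_int(value)
    return status


def looks_stalled(status):
    """Hypothetical heuristic: recovering, timer at zero, clients missing."""
    connected, expected = status.get("connected_clients", (0, 0))
    return (
        status.get("status") == "RECOVERING"
        and status.get("time_remaining") == 0
        and isinstance(connected, int)
        and isinstance(expected, int)
        and connected < expected
    )


if __name__ == "__main__":
    parsed = parse_recovery_status(SAMPLE)
    print(parsed["connected_clients"], looks_stalled(parsed))
```

In practice the text would be read from the recovery_status file instead of the SAMPLE string.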
Did I somehow wedge the file system?
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985