[Lustre-discuss] Lustre recovery when clients go away
Brock Palen
brockp at umich.edu
Thu Jul 31 07:30:04 PDT 2008
One of our OSSes died with a panic last night. Between the time it was
found (no failover) and the time it was restarted, two clients had also
died (nodes crashed by user OOM).
Because of this, the OSTs are now waiting for 626 clients to recover
when only 624 are up. The 624 recover in about 15 minutes, but the
OSTs on that OSS hang waiting for the last two, which are dead and
not coming back. Note that the MDS reports only 624 clients.
Is there a way to tell the OSTs to go ahead and evict those two
clients and finish recovering? Also, "time remaining" has been 0
since it was booted. How long will the OSTs wait before they let
operations continue?
Is there any rule for speeding up recovery? The OSS that crashed sees
very little CPU/disk/network traffic while recovery is going on, so
any way to speed it up, even if it results in a higher load, would be
great to know.
status: RECOVERING
recovery_start: 1217509142
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 175342162

status: RECOVERING
recovery_start: 1217509144
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 193097794
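
To make the mismatch in the status output above easier to spot across many OSTs, here is a minimal Python sketch that splits concatenated status dumps into one record per OST and reports how many clients each is still waiting for. It assumes the text has exactly the "key: value" form shown above, with each dump starting at a "status:" line. The names parse_recovery_status and missing_clients are hypothetical helpers, not part of Lustre.

```python
# Sketch only: assumes "key: value" lines, one dump per "status:" line.
SAMPLE = """\
status: RECOVERING
recovery_start: 1217509142
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 175342162
status: RECOVERING
recovery_start: 1217509144
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 193097794
"""


def parse_recovery_status(text):
    """Return one dict per status dump found in the text."""
    blocks = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and anything without a colon
        key, value = key.strip(), value.strip()
        if key == "status":
            blocks.append({})  # a "status" line opens a new dump
        if blocks:
            blocks[-1][key] = value
    return blocks


def missing_clients(block):
    """Number of expected clients that have not reconnected."""
    connected, expected = (int(n) for n in block["connected_clients"].split("/"))
    return expected - connected


if __name__ == "__main__":
    for i, block in enumerate(parse_recovery_status(SAMPLE)):
        print(i, block["status"], "missing clients:", missing_clients(block))
```

Run against the two dumps above, each record shows two clients still outstanding, which matches the two nodes that crashed.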
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985