[Lustre-discuss] Lustre recovery when clients go away

Brock Palen brockp at umich.edu
Thu Jul 31 07:30:04 PDT 2008


One of our OSSs died with a panic last night.  Between the time it was
found (no failover) and the time it was restarted, two clients also
died (nodes crashed by a user OOM).

Because of this, the OSTs are now waiting for 626 clients to recover
when only 624 are up.  The 624 recover in about 15 minutes, but
the OSTs on that OSS hang waiting for the last two, which are dead and
not coming back.  Note that the MDS reports only 624 clients.

Is there a way to tell the OSTs to go ahead and evict those two
clients and finish recovering?  Also, "time remaining" has been 0
since the OSS was booted.  How long will the OSTs wait before they let
operations continue?

Is there any rule of thumb for speeding up recovery?  The OSS that
crashed sees very little CPU/disk/network traffic while recovery is
going on, so any way to speed it up, even if it results in a higher
load, would be great to know.

status: RECOVERING
recovery_start: 1217509142
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 175342162

status: RECOVERING
recovery_start: 1217509144
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 193097794
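
For anyone who wants to watch this across many OSTs, here is a minimal
sketch that parses recovery status text like the output above and reports
how many clients each OST is still waiting for. It assumes the text has
already been collected into a string; the function names
(parse_recovery_status, missing_clients) are made up for illustration and
are not part of Lustre.

```python
def parse_recovery_status(text):
    """Split concatenated recovery status dumps into one dict per OST."""
    blocks = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "status":
            # Assumption: every per-OST dump begins with a "status" line.
            current = {}
            blocks.append(current)
        if current is not None:
            current[key] = value
    return blocks


def missing_clients(block):
    """Return how many expected clients have not reconnected."""
    connected, expected = (
        int(n) for n in block["connected_clients"].split("/")
    )
    return expected - connected


# Sample input copied from the two status dumps in this message.
sample = """\
status: RECOVERING
recovery_start: 1217509142
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 175342162
status: RECOVERING
recovery_start: 1217509144
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 193097794
"""

for index, block in enumerate(parse_recovery_status(sample)):
    print(index, block["status"], "missing clients:", missing_clients(block))
```

This only summarizes what the OSTs already report; it does not evict
clients or change how long recovery waits.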



Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985





