[Lustre-discuss] Lustre recovery when clients go away

Andreas Dilger adilger at sun.com
Thu Jul 31 10:55:09 PDT 2008


On Jul 31, 2008  10:30 -0400, Brock Palen wrote:
> One of our OSS's died with a panic last night.  Between when it was  
> found (no failover) and restarted two clients had died also.  (nodes  
> crashed by user OOM).
> 
> Because of this the OST's now are looking for 626  clients to recover  
> when only 624 are up.  So the 624 recover in about 15 minutes, but  
> the OST's on that OSS hang waiting for the last two that are dead and  
> not coming back.  Note the MDS reports only 624 clients.
> 
> Is there a way to tell the OSTs to go ahead and evict those two
> clients and finish recovering?  Also "time remaining" has been 0
> since it was booted.  How long will the OSTs wait before they let
> operations continue?
> 
> Is there any way to speed up recovery?  The OSS that crashed sees
> very little CPU/disk/network traffic while recovery is going on, so
> any way to speed it up, even if it results in a higher load, would be
> great to know.
> 
> status: RECOVERING
> recovery_start: 1217509142
> time remaining: 0
> connected_clients: 624/626
> completed_clients: 624/626
> replayed_requests: 0/??
> queued_requests: 0
> next_transno: 175342162
> status: RECOVERING
> recovery_start: 1217509144
> time remaining: 0
> connected_clients: 624/626
> completed_clients: 624/626
> replayed_requests: 0/??
> queued_requests: 0
> next_transno: 193097794

The recovery should time out after about 5 minutes (with default 100s
timeouts).  Recovery proceeds only as fast as clients connect and submit
RPCs for replay.  If all clients connect, recovery finishes as soon as
every client reports completion.

Are you saying the system is still stuck in recovery after more than 5
or 10 minutes?
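
One rough way to answer that from the status output quoted above is
sketched below in Python.  The helper names (parse_recovery_status,
missing_clients, is_overdue) are made up for illustration and are not
part of Lustre, and the 300-second window is only an assumption taken
from the "about 5 minutes" figure, not a value read from the server.

```python
def parse_recovery_status(text):
    """Return one dict per target; a new record starts at each 'status:' line."""
    records = []
    for line in text.splitlines():
        line = line.lstrip("> ").strip()  # tolerate mail-quoted output
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "status":
            records.append({})
        if records:
            records[-1][key] = value
    return records


def missing_clients(record):
    """Number of expected clients that have not reconnected."""
    connected, expected = (int(n) for n in record["connected_clients"].split("/"))
    return expected - connected


def is_overdue(record, now, expected_window=300):
    """True if the target is still RECOVERING past the assumed window (seconds)."""
    elapsed = now - int(record["recovery_start"])
    return record["status"] == "RECOVERING" and elapsed > expected_window


# Two records trimmed from the output quoted in this thread.
SAMPLE = """\
status: RECOVERING
recovery_start: 1217509142
connected_clients: 624/626
completed_clients: 624/626
status: RECOVERING
recovery_start: 1217509144
connected_clients: 624/626
completed_clients: 624/626
"""

if __name__ == "__main__":
    import time

    for rec in parse_recovery_status(SAMPLE):
        print(rec["recovery_start"], missing_clients(rec),
              is_overdue(rec, now=int(time.time())))
```

On a live server the text would come from each target's recovery_status
file rather than from a pasted sample.  A target that is still
RECOVERING with missing clients well past the window is the "stuck"
case being asked about.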

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



