[Lustre-discuss] Too many client eviction

Andreas Dilger adilger at whamcloud.com
Tue May 3 09:42:17 PDT 2011


I don't think ldlm_timeout and obd_timeout have much effect when adaptive timeouts (AT) are enabled. I believe LLNL has some adjusted AT tunables that might help you (increased at_min, etc.).

Hopefully Chris or someone at LLNL can comment. I think those settings were also documented in Bugzilla, though I don't know the bug number.

Cheers, Andreas

On 2011-05-03, at 6:59 AM, DEGREMONT Aurelien <aurelien.degremont at cea.fr> wrote:

> Hello
> 
> We often see some of our Lustre clients being evicted for no apparent
> reason (the clients seem healthy). All of this is on Lustre 2.0, with
> adaptive timeouts enabled.
> 
> The pattern is always the same:
> 
> 1 - A server complains about a client:
> ### lock callback timer expired... after 25315s...
> (nothing on client)
> 
> (few seconds later)
> 
> 2 - The client receives -107 in reply to an obd_ping for this target
> (server says "@@@processing error 107")
> 
> 3 - The client realizes its connection was lost.
> It notices that it was evicted.
> It reconnects.
> 
> (Just to be sure:) when a client is evicted, all in-flight I/O is lost
> and no recovery is done for it, correct?
> 
> We are thinking of increasing the timeout to give clients more time to
> answer the LDLM lock revocation
> (maybe the client is just too loaded).
> - Is changing ldlm_timeout enough to do so?
> - Do we also need to change obd_timeout accordingly? Is there a risk of
> triggering new timeouts (cascading timeouts) if we change only ldlm_timeout?
> 
> Any feedback in this area is welcome.
> 
> Thank you
> 
> Aurélien Degrémont
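
To check whether evictions really follow the pattern described above, one could scan the server logs for the lock callback messages and pull out the reported wait times. What follows is only a minimal Python sketch: the regular expression is an assumption based solely on the log fragment quoted in this thread, and the function names and the one-hour threshold are hypothetical choices for illustration. For scale, the 25315s quoted above is roughly 7 hours.

```python
import re

# Assumed pattern, based only on the fragment quoted in this thread
# ("### lock callback timer expired... after 25315s..."); real server
# logs may word this message differently.
EXPIRED_RE = re.compile(r"lock callback timer expired.*?after (\d+)s")


def expired_waits(lines):
    """Return the 'after Ns' values (in seconds) found in matching lines."""
    waits = []
    for line in lines:
        match = EXPIRED_RE.search(line)
        if match:
            waits.append(int(match.group(1)))
    return waits


def suspicious(waits, threshold_s=3600):
    """Keep only waits longer than threshold_s (a hypothetical cutoff)."""
    return [w for w in waits if w > threshold_s]


if __name__ == "__main__":
    # Hypothetical sample lines, not real server output.
    sample = [
        "### lock callback timer expired... after 25315s...",
        "some unrelated message",
        "### lock callback timer expired after 101s",
    ]
    waits = expired_waits(sample)
    print(waits)
    print(suspicious(waits))
```

Feeding it the lines of a real server log (for example, read from a file) would show whether the very long waits are isolated events or a recurring pattern, which may help decide whether a tunable change is worth trying at all.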


