[Lustre-discuss] Too many client eviction

Tue May 3 12:41:21 PDT 2011

On May 3, 2011, at 10:09 AM, DEGREMONT Aurelien wrote:

> Correct me if I'm wrong, but the Lustre manual says that the client 
> adapts its timeout, but not the server. My understanding is that 
> server->client RPCs still use the old mechanism, especially in our 
> case, where it seems the server is revoking a client lock (is ldlm_timeout 
> used for that?) and the client did not respond.

Server and client cooperate for the adaptive timeouts.  I don't remember which bug the ORNL settings were in, maybe 14071; bugzilla's not responding at the moment.  But a big question here is why the callback timer ran for 25315 seconds. That is well beyond anything at_max should allow...
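
To make that comparison concrete, here is a minimal Python sketch that pulls the elapsed time out of a message shaped like the one quoted further down and checks it against at_max. It assumes at_max is 600 seconds (the stock default, as far as I know; substitute whatever your servers actually use), and the function names are made up for illustration. The regular expression only relies on the "lock callback timer expired ... after Ns" wording from the original report, not on the full console message format.

```python
import re

# Assumption: 600 s is the stock at_max; replace it with the value
# actually configured on your servers.
ASSUMED_AT_MAX = 600

# Matches the elapsed time in messages shaped like the one quoted in this thread.
EXPIRED_RE = re.compile(r"lock callback timer expired.*?after (\d+)s")


def callback_wait_seconds(line):
    """Return the reported wait in seconds, or None if the line does not match."""
    match = EXPIRED_RE.search(line)
    return int(match.group(1)) if match else None


def exceeds_at_max(line, at_max=ASSUMED_AT_MAX):
    """Return True when the reported wait is longer than at_max."""
    waited = callback_wait_seconds(line)
    return waited is not None and waited > at_max


if __name__ == "__main__":
    sample = "### lock callback timer expired... after 25315s..."
    print(callback_wait_seconds(sample), exceeds_at_max(sample))
```

With the assumed 600-second ceiling, 25315 seconds is roughly 42 times larger than the longest timeout AT should ever pick, which is why the number looks suspicious rather than merely slow.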

> 
> I forgot to say that LNET routers are also involved in some cases.
> 
> Thank you
> 
> Aurélien
> 
> Andreas Dilger wrote:
>> I don't think ldlm_timeout and obd_timeout have much effect when AT is enabled. I believe that LLNL has some adjusted tunables for AT that might help you (increased at_min, etc).
>> 
>> Hopefully Chris or someone at LLNL can comment. I think they were also documented in bugzilla, though I don't know the bug number. 
>> 
>> Cheers, Andreas
>> 
>> On 2011-05-03, at 6:59 AM, DEGREMONT Aurelien <aurelien.degremont at cea.fr> wrote:
>> 
>> 
>>> Hello
>>> 
>>> We often see some of our Lustre clients being wrongly evicted (the clients 
>>> seem healthy).
>>> The pattern is always the same:
>>> 
>>> All of this is on Lustre 2.0, with adaptive timeouts enabled.
>>> 
>>> 1 - A server complains about a client :
>>> ### lock callback timer expired... after 25315s...
>>> (nothing on client)
>>> 
>>> (few seconds later)
>>> 
>>> 2 - The client receives -107 in reply to an obd_ping for this target
>>> (server says "@@@processing error 107")
>>> 
>>> 3 - The client realizes its connection was lost.
>>> The client notices it was evicted.
>>> It reconnects.
>>> 
>>> (To be sure) When a client is evicted, all in-flight I/O is lost and no 
>>> recovery will be done for it, correct?
>>> 
>>> We are thinking of increasing the timeout to give clients more time to 
>>> answer the ldlm revocation.
>>> (maybe the client is just too loaded)
>>> - Is changing ldlm_timeout enough to do so?
>>> - Do we also need to change obd_timeout accordingly? Is there a risk 
>>> of triggering new timeouts if we change only ldlm_timeout (cascading timeouts)?
>>> 
>>> Any feedback in this area is welcome.
>>> 
>>> Thank you
>>> 
>>> Aurélien Degrémont
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>> 
> 