[Lustre-discuss] Too many client evictions

Andreas Dilger adilger at whamcloud.com
Tue May 3 13:05:22 PDT 2011


On May 3, 2011, at 13:41, Nathan Rutman wrote:
> On May 3, 2011, at 10:09 AM, DEGREMONT Aurelien wrote:
>> Correct me if I'm wrong, but when I look at the Lustre manual, it says
>> that the client adapts its timeout, but not the server. My understanding
>> is that server->client RPCs still use the old mechanism, especially in
>> our case, where it seems the server is revoking a client lock (is
>> ldlm_timeout used for that?) and the client did not respond.
> 
> Server and client cooperate on the adaptive timeouts.  I don't remember which bug the ORNL settings were in, maybe 14071; bugzilla is not responding at the moment.  But a big question here is why 25315 seconds for a callback - that's well beyond anything at_max should allow...

I assume the 25315s comes from a bug (fixed in 1.8.5, I think; not sure whether the fix was ported to 2.x) that calculated the wrong time when printing this error message for LDLM lock timeouts.
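
For what it's worth, something along these lines shows what the AT machinery currently believes, which makes genuinely out-of-range values like 25315s easy to spot. The parameter names are as documented in the manual; the exact proc paths differ a bit between 1.8 and 2.x, so treat this as a sketch rather than a recipe:

    # global timeout tunables (run on clients and servers)
    lctl get_param at_min at_max at_history timeout ldlm_timeout

    # per-import AT estimates on a client: what it thinks each target needs
    lctl get_param -n osc.*.timeouts mdc.*.timeouts

    # per-service estimates on an OSS/MDS, if the "timeouts" files are present
    lctl get_param -n ost.OSS.*.timeouts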

>> I forgot to say that we have LNET routers also involved for some cases.

If there are routers, they can cause RPCs from the server to the client to be dropped, and the client will then be evicted for unresponsiveness even though it is not at fault.  At one time Johann was working on a patch for (or at least investigating) the ability to have servers resend RPCs before evicting clients.  The tricky part is that you don't want to send 2 RPCs each with 1/2 the timeout interval, since that may reduce stability instead of increasing it.
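
When routers are suspected, a quick sanity check of the LNET path is worthwhile before blaming the client. Something like the following is a reasonable sketch (the NIDs are placeholders, and command output formats vary by release):

    # list the configured routes and whether the routers are considered up
    lctl show_route

    # check LNET-level reachability of a suspect router and of the evicted client
    lctl ping 10.0.0.1@o2ib
    lctl ping 172.16.0.20@tcp

    # router-checker settings are LNET module parameters (names may vary by version)
    cat /sys/module/lnet/parameters/live_router_check_interval
    cat /sys/module/lnet/parameters/dead_router_check_interval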

I think the bugzilla bug was called "limited server-side resend" or similar, filed by me several years ago.
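
On the client side, it is also worth confirming which import actually got evicted and what state it ended up in. A rough sketch (the proc layout and output format vary between versions, so adjust the grep as needed):

    # import state for each target as seen by the client; look for the target
    # named in the server-side eviction message
    lctl get_param osc.*.import mdc.*.import | grep -E 'target|state'

    # kernel log on the client around the time of the eviction
    dmesg | grep -i evict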

>>> Andreas Dilger wrote:
>>> I don't think ldlm_timeout and obd_timeout have much effect when AT is enabled. I believe that LLNL has some adjusted tunables for AT that might help you (increased at_min, etc).
>>> 
>>> Hopefully Chris or someone at LLNL can comment. I think they were also documented in bugzilla, though I don't know the bug number. 
>>> 
>>> Cheers, Andreas
>>> 
>>> On 2011-05-03, at 6:59 AM, DEGREMONT Aurelien <aurelien.degremont at cea.fr> wrote:
>>> 
>>> 
>>>> Hello
>>>> 
>>>> We often see some of our Lustre clients being evicted spuriously (the
>>>> clients seem healthy).
>>>> The pattern is always the same:
>>>> 
>>>> All of this is on Lustre 2.0, with adaptive timeouts enabled.
>>>> 
>>>> 1 - A server complains about a client:
>>>> ### lock callback timer expired... after 25315s...
>>>> (nothing on the client)
>>>> 
>>>> (a few seconds later)
>>>> 
>>>> 2 - The client receives -107 in response to an obd_ping for this target
>>>> (the server says "@@@processing error 107")
>>>> 
>>>> 3 - The client realizes its connection was lost.
>>>> The client notices it was evicted.
>>>> It reconnects.
>>>> 
>>>> (Just to be sure) When a client is evicted, all in-flight I/O is lost and
>>>> no recovery will be done for it?
>>>> 
>>>> We are thinking of increasing the timeout to give clients more time to
>>>> answer the LDLM revocation (maybe the client is just too loaded).
>>>> - Is ldlm_timeout enough to do so?
>>>> - Do we also need to change obd_timeout accordingly? Is there a risk of
>>>> triggering new timeouts (cascading timeouts) if we only change ldlm_timeout?
>>>> 
>>>> Any feedback in this area is welcome.
>>>> 
>>>> Thank you
>>>> 
>>>> Aurélien Degrémont
> 


Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.





