[Lustre-discuss] Too many client eviction
Nathan Rutman
nrutman at gmail.com
Tue May 3 12:41:21 PDT 2011
On May 3, 2011, at 10:09 AM, DEGREMONT Aurelien wrote:
> Correct me if I'm wrong, but the Lustre manual says that the client
> adapts its timeout, but not the server. I understood that server->client
> RPCs still use the old mechanism, especially in our case, where it seems
> the server is revoking a client lock (is ldlm_timeout used for that?)
> and the client did not respond.
The server and client cooperate on the adaptive timeouts. I don't remember which bug the ORNL settings were in, maybe 14071; bugzilla's not responding at the moment. But a big question here is why 25315 seconds for a callback? That's well beyond anything at_max should allow...
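As a rough illustration of that sanity check, here is a minimal Python sketch that pulls the elapsed time out of a "lock callback timer expired" message and compares it with at_max. The message shape is taken from the snippet quoted in this thread; the at_max value of 600 seconds is an assumption (substitute whatever your servers are actually configured with), and the function names are hypothetical.

```python
import re

# Assumed at_max in seconds; substitute the value configured on your servers.
ASSUMED_AT_MAX = 600

# Matches the elapsed time in messages shaped like the one quoted in this
# thread: "lock callback timer expired... after 25315s..."
_EXPIRED_RE = re.compile(r"lock callback timer expired.*?after (\d+)s")


def callback_wait_seconds(line):
    """Return the seconds reported in a lock-callback expiry line, or None."""
    match = _EXPIRED_RE.search(line)
    return int(match.group(1)) if match else None


def exceeds_at_max(line, at_max=ASSUMED_AT_MAX):
    """True if the line reports a wait longer than at_max, False otherwise."""
    waited = callback_wait_seconds(line)
    return waited is not None and waited > at_max


if __name__ == "__main__":
    sample = "### lock callback timer expired... after 25315s..."
    print(callback_wait_seconds(sample), exceeds_at_max(sample))
```

Running something like this over the server logs would at least show whether the 25315s case is a one-off or whether callbacks routinely wait longer than at_max.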
>
> I forgot to say that LNET routers are also involved in some cases.
>
> Thank you
>
> Aurélien
>
> Andreas Dilger wrote:
>> I don't think ldlm_timeout and obd_timeout have much effect when AT is enabled. I believe that LLNL has some adjusted tunables for AT that might help you (increased at_min, etc.).
>>
>> Hopefully Chris or someone at LLNL can comment. I think they were also documented in bugzilla, though I don't know the bug number.
>>
>> Cheers, Andreas
>>
>> On 2011-05-03, at 6:59 AM, DEGREMONT Aurelien <aurelien.degremont at cea.fr> wrote:
>>
>>
>>> Hello
>>>
>>> We often see some of our Lustre clients being evicted wrongly (the
>>> clients seem healthy).
>>> The pattern is always the same:
>>>
>>> All of this on Lustre 2.0, with adaptive timeouts enabled
>>>
>>> 1 - A server complains about a client:
>>> ### lock callback timer expired... after 25315s...
>>> (nothing on client)
>>>
>>> (few seconds later)
>>>
>>> 2 - The client receives -107 in reply to an obd_ping for this target
>>> (server says "@@@processing error 107")
>>>
>>> 3 - The client realizes its connection was lost.
>>> The client notices it was evicted.
>>> It reconnects.
>>>
>>> (To be sure) When a client is evicted, all in-flight I/O is lost and
>>> no recovery will be done for it, correct?
>>>
>>> We are thinking of increasing the timeout to give clients more time to
>>> answer the ldlm revocation.
>>> (maybe they are just too loaded)
>>> - Is ldlm_timeout enough to do so?
>>> - Do we also need to change obd_timeout accordingly? Is there a risk
>>> of triggering new timeouts if we change only ldlm_timeout (cascading timeouts)?
>>>
>>> Any feedback in this area is welcome.
>>>
>>> Thank you
>>>
>>> Aurélien Degrémont
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>
>