[Lustre-discuss] Too many client eviction

DEGREMONT Aurelien aurelien.degremont at cea.fr
Wed May 4 04:37:14 PDT 2011


Hello

Andreas Dilger wrote:
> On May 3, 2011, at 13:41, Nathan Rutman wrote:
>   
>> On May 3, 2011, at 10:09 AM, DEGREMONT Aurelien wrote:
>>     
>>> Correct me if I'm wrong, but when I looked at the Lustre manual, it said 
>>> that the client adapts its timeout, but not the server. I understood 
>>> that server->client RPCs still use the old mechanism, especially in our 
>>> case, where it seems the server is revoking a client lock (is ldlm_timeout 
>>> used for that?) and the client did not respond.
>>>       
>> Server and client cooperate together for the adaptive timeouts.  I don't remember which bug the ORNL settings were in, maybe 14071, bugzilla's not responding at the moment.  But a big question here is why 25315 seconds for a callback - that's well beyond anything at_max should allow...
>>     
>
> I assume that the 25315s is from a bug (fixed in 1.8.5 I think, not sure if it was ported to 2.x) that calculated the wrong time when printing this error message for LDLM lock timeouts.
>   
I did not find the bug for that.
>>> I forgot to say that we have LNET routers also involved for some cases.
>>>       
> If there are routers they can cause dropped RPCs from the server to the client, and the client will be evicted for unresponsiveness even though it is not at fault.  At one time Johann was working on a patch (or at least investigating) the ability to have servers resend RPCs before evicting clients.  The tricky part is that you don't want to send 2 RPCs each with 1/2 the timeout interval, since that may reduce stability instead of increasing it.
>   
How can I track those dropped RPCs on the routers?
Is this expected behaviour? How can I protect my filesystem from 
it? Increasing the timeout will not change anything if the 
client and server do not resend their RPCs.

> I think the bugzilla bug was called "limited server-side resend" or similar, filed by me several years ago.
>   
I did not find that one either :)

Aurélien


