[Lustre-discuss] Too many client eviction

DEGREMONT Aurelien aurelien.degremont at cea.fr
Wed May 4 07:05:56 PDT 2011


Johann Lombardi wrote:
> On Wed, May 04, 2011 at 01:37:14PM +0200, DEGREMONT Aurelien wrote:
>   
>>> I assume that the 25315s is from a bug
>>>       
> BTW, do you see this problem with both extent & inodebits locks?
>   
Yes, both, but more often on the MDS.
>> How can I track those dropped RPCs on routers?
>>     
>
> I don't think routers can drop RPCs w/o a good reason. It is just that a router failure can lead to packet loss and given that servers don't resend local callbacks, this can result in client evictions.
>   
Currently I do not see any issue with the routers.
The logs are very quiet and the load is very low. Nothing looks like a 
router failure.
If LNET decides to drop a packet for some buggy reason, I would expect 
it to at least say something in the kernel log ("omg I've dropped 2 
packets, please expect evictions" :))

>> if client/server do not re-send their RPC.
>>     
> To be clear, clients go through a disconnect/reconnect cycle and eventually resend RPCs.
>   
I'm not sure I clearly understand what happens there.
If a client does not respond to a server AST, it will be evicted by the 
server. The server does not seem to send a message to tell it so (why 
bother, as the client seems unresponsive or dead anyway?).
The client realizes at the next obd_ping that the connection does not 
exist anymore (rc=-107, ENOTCONN).
Then it tries to reconnect, and at that time the server tells it that 
it has really been evicted. The client says "in progress operation will 
fail". AFAIK, this means dropping all locks and all dirty pages. Async 
I/Os are lost. The connection status becomes EVICTED. I/O during this 
window will receive -108, ESHUTDOWN (the kernel log says @@@ 
IMP_INVALID, see ptlrpc_import_delay_req()).
Then the client reconnects, but some I/Os were lost, and the user 
program could have experienced errors from I/O syscalls.

This is not the same as a connection timeout, where the client will try 
a failover and do a disconnect/recovery cycle, and everything is ok.
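
To make the sequence I have in mind concrete, here is a toy Python model 
of the client side as I understand it. This is only a sketch, not Lustre 
code: the class, the method names and the state names are made up for 
illustration, the behaviour while disconnected is my own simplification, 
and the only values taken from what I actually observe are the two 
return codes (-107 and -108).

```python
from enum import Enum

# Linux errno values as seen in the client logs (rc=-107, rc=-108).
ENOTCONN = 107
ESHUTDOWN = 108


class ImportState(Enum):
    # Hypothetical names for this toy model only.
    FULL = "FULL"        # connected, normal operation
    DISCONN = "DISCONN"  # client noticed the connection is gone
    EVICTED = "EVICTED"  # server confirmed the eviction


class ToyClient:
    """Toy model of the eviction sequence described above."""

    def __init__(self):
        self.state = ImportState.FULL
        self.locks = {"lock-a", "lock-b"}
        self.dirty_pages = 3
        self.evicted_on_server = False

    def server_evicts(self):
        # The AST went unanswered: the server evicts the client
        # without sending it any notification.
        self.evicted_on_server = True

    def obd_ping(self):
        # The client only learns something is wrong at the next ping.
        if self.evicted_on_server and self.state is ImportState.FULL:
            self.state = ImportState.DISCONN
            return -ENOTCONN
        return 0

    def reconnect(self):
        if self.evicted_on_server:
            # The server now tells the client it was evicted:
            # locks and dirty pages are dropped.
            self.state = ImportState.EVICTED
            self.locks.clear()
            self.dirty_pages = 0
        else:
            self.state = ImportState.FULL

    def io(self):
        if self.state is ImportState.EVICTED:
            return -ESHUTDOWN
        if self.state is ImportState.DISCONN:
            # Toy assumption: the request is held until reconnect.
            return None
        return 0

    def finish_recovery(self):
        # The client is connected again, but the lost I/O stays lost.
        self.evicted_on_server = False
        self.state = ImportState.FULL


client = ToyClient()
client.server_evicts()
print(client.obd_ping())   # -ENOTCONN
client.reconnect()
print(client.io())         # -ESHUTDOWN while EVICTED
client.finish_recovery()
print(client.io(), client.dirty_pages)
```

The point of the model is that, after an eviction, the client ends up 
connected again but without its locks and dirty pages, which is what 
makes it different from the timeout case.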

Is this correct?

> That's bug 3622. Fanyong also used to work on a patch, see http://review.whamcloud.com/#change,125.
>   
This looks very interesting as it seems to match our issue. 
Unfortunately, there has been no news for 2 months.



Aurélien



