[Lustre-discuss] Too many client eviction

Andreas Dilger adilger at whamcloud.com
Sat May 7 20:29:06 PDT 2011


Aurelien, now that I think about it, it may be that the LNET errors are turned off by default. You should check whether the "neterr" debug flag is on. Otherwise LNET errors are not printed to the console by default.
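
A minimal sketch of that check, assuming the console debug mask is exposed as a whitespace-separated list of flag names. The path /proc/sys/lnet/printk and the helper names flag_enabled() and neterr_on_console() are assumptions for illustration and may not match your Lustre version:

```python
from pathlib import Path

# Assumed location of the console debug mask; this may differ between
# Lustre versions, so adjust it to wherever your system exposes the mask.
PRINTK_PATH = Path("/proc/sys/lnet/printk")


def flag_enabled(mask_text: str, flag: str = "neterr") -> bool:
    """Return True if `flag` appears in a whitespace-separated debug mask."""
    return flag in mask_text.split()


def neterr_on_console(path: Path = PRINTK_PATH) -> bool:
    """Read the mask file and report whether "neterr" is set in it.

    Raises OSError if the (assumed) path does not exist or is unreadable.
    """
    return flag_enabled(path.read_text())
```

Run neterr_on_console() on the routers and servers. If it returns False, LNET network errors would not reach the console, which could explain the silent logs.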

Cheers, Andreas

On 2011-05-04, at 8:05 AM, DEGREMONT Aurelien <aurelien.degremont at cea.fr> wrote:

> Johann Lombardi wrote:
>> On Wed, May 04, 2011 at 01:37:14PM +0200, DEGREMONT Aurelien wrote:
>>  
>>>> I assume that the 25315s is from a bug
>>>>      
>> BTW, do you see this problem with both extent & inodebits locks?
>>  
> Yes, both, but more often on the MDS.
>>> How can I track those dropped RPCs on routers?
>>>    
>> 
>> I don't think routers can drop RPCs w/o a good reason. It is just that a router failure can lead to packet loss and given that servers don't resend local callbacks, this can result in client evictions.
>>  
> Currently I do not see any issue with the routers.
> The logs are very quiet and the load is very low. Nothing looks like a router failure.
> If LNET decides to drop packets for some buggy reason, I would expect it to at least say something in the kernel log ("omg I've dropped 2 packets, please expect evictions" :))
> 
>>> if client/server do not re-send their RPC.
>>>    
>> To be clear, clients go through a disconnect/reconnect cycle and eventually resend RPCs.
>>  
> I'm not sure I understand clearly what happens there.
> If the client does not respond to a server AST, it is evicted by the server. The server does not seem to send a message to tell it so (why bother, as the client seems unresponsive or dead anyway?).
> The client realizes at the next obd_ping that the connection no longer exists (rc=-107 ENOTCONN).
> Then it tries to reconnect, and at that point the server tells it that it really has been evicted. The client says "in progress operation will fail". AFAIK, this means dropping all locks and all dirty pages. Async I/O is lost. The connection status becomes EVICTED. I/O during this window receives -108, ESHUTDOWN (the kernel log says @@@ IMP_INVALID, see ptlrpc_import_delay_req()).
> Then the client reconnects, but some I/O was lost, and the user program may have seen errors from I/O syscalls.
> 
> This is not the same as a connection timeout, where the client tries a failover and goes through a disconnect/recovery cycle, and everything is OK.
> 
> Is this correct?
> 
>> That's bug 3622. Fanyong also used to work on a patch, see http://review.whamcloud.com/#change,125.
>>  
> This looks very interesting, as it seems to match our issue. Unfortunately, there has been no news for 2 months.
> 
> 
> 
> Aurélien
> 


