[Lustre-discuss] Too many client eviction

Johann Lombardi johann at whamcloud.com
Wed May 4 05:21:47 PDT 2011


On Wed, May 04, 2011 at 01:37:14PM +0200, DEGREMONT Aurelien wrote:
> > I assume that the 25315s is from a bug

BTW, do you see this problem with both extent & inodebits locks?

> (fixed in 1.8.5 I think, not sure if it was ported to 2.x) that calculated the wrong time when printing this error message for LDLM lock timeouts.
> >
> I did not find the bug for that.

I think Andreas was referring to bug 17887. However, you should already have the patch applied, since it landed in 2.0.0.

> > If there are routers they can cause dropped RPCs from the server to the client, and the client will be evicted for unresponsiveness even though it is not at fault.  At one time Johann was working on a patch (or at least investigating) the ability to have servers resend RPCs before evicting clients.  The tricky part is that you don't want to send 2 RPCs each with 1/2 the timeout interval, since that may reduce stability instead of increasing it.
> >
> How can I track those dropped RPCs on routers?

I don't think routers can drop RPCs without a good reason. It is just that a router failure can lead to packet loss and, given that servers don't resend local callbacks, this can result in client evictions.

> Is this an expected behaviour?

Well, let's call this a known problem we would like to address at some point.

> How could I protect my filesystem from that? If I increase the timeout
> this won't change anything

Right, tweaking timeouts cannot help here.

> if client/server do not re-send their RPC.

To be clear, clients go through a disconnect/reconnect cycle and eventually resend RPCs.
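
To make the timeout argument concrete, here is a toy model in Python. It is only a sketch and not Lustre code: the function name, its parameters and the rule that each send is judged against its own timeout are all assumptions made for illustration.

```python
def client_evicted(attempt_timeouts, dropped, client_delay):
    """Toy model of a server callback with optional resends (hypothetical).

    attempt_timeouts: seconds the server waits after each send
    dropped:          per-send flags, True if that send was lost in transit
    client_delay:     seconds the client needs to answer a callback it receives
    Returns True if no send is answered in time, i.e. the client is evicted.
    """
    for timeout, lost in zip(attempt_timeouts, dropped):
        # A send only counts if it arrived and the client answered in time.
        if not lost and client_delay <= timeout:
            return False
    return True


# Single send that gets lost: evicted whatever the timeout is.
print(client_evicted([100], [True], 1))    # True
print(client_evicted([1000], [True], 1))   # True

# One resend, each send with half the timeout: survives a single loss...
print(client_evicted([50, 50], [True, False], 1))    # False

# ...but a slow, healthy client that was fine before is now evicted.
print(client_evicted([100], [False], 80))            # False
print(client_evicted([50, 50], [False, False], 80))  # True
```

The first two calls show why a longer timeout does not help when the callback itself is lost, and the last two show the trade-off Andreas mentioned with two sends at half the timeout each.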

> > I think the bugzilla bug was called "limited server-side resend" or similar, filed by me several years ago.
> >
> Did not find either :)

That's bug 3622. Fanyong also used to work on a patch, see http://review.whamcloud.com/#change,125.

HTH

Cheers,
Johann


