[Lustre-discuss] Too many client eviction
Andreas Dilger
adilger at whamcloud.com
Tue May 3 09:42:17 PDT 2011
I don't think ldlm_timeout and obd_timeout have much effect when AT (adaptive timeouts) is enabled. I believe that LLNL has some adjusted tunables for AT that might help you (increased at_min, etc.).
Hopefully Chris or someone at LLNL can comment. I think they were also documented in bugzilla, though I don't know the bug number.
Cheers, Andreas
On 2011-05-03, at 6:59 AM, DEGREMONT Aurelien <aurelien.degremont at cea.fr> wrote:
> Hello
>
> We often see some of our Lustre clients being evicted for no apparent
> reason (the clients seem healthy).
> The pattern is always the same:
>
> All of this is on Lustre 2.0, with adaptive timeouts enabled.
>
> 1 - A server complains about a client :
> ### lock callback timer expired... after 25315s...
> (nothing on client)
>
> (few seconds later)
>
> 2 - The client receives -107 in reply to an obd_ping for this target
> (server says "@@@processing error 107")
>
> 3 - The client realizes its connection was lost.
> Client notices it was evicted.
> It reconnects.
>
> (Just to be sure) When a client is evicted, all in-flight I/O is lost
> and no recovery will be done for it?
>
> We are thinking of increasing the timeout to give clients more time to
> answer the ldlm revocation.
> (maybe they are just too loaded)
> - Is changing ldlm_timeout enough to do so?
> - Do we also need to change obd_timeout accordingly? Is there a risk
> of triggering new timeouts (cascading timeouts) if we change only ldlm_timeout?
>
> Any feedback in this area is welcome.
>
> Thank you
>
> Aurélien Degrémont
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
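
The three-step pattern described in the quoted message (a lock callback timer expiring on the server, followed a few seconds later by error 107 on an obd_ping) could be checked for mechanically. The following is only a minimal sketch, not a Lustre tool: the function name, the (timestamp, message) input shape, and the 60-second window are assumptions made for illustration, and the two substrings are simply the fragments quoted in the message. Real log lines carry more context and would need their own parsing.

```python
from typing import Iterable, List, Tuple

# Substrings taken from the fragments quoted in the message above.
CALLBACK_EXPIRED = "lock callback timer expired"
PING_ERROR = "processing error 107"


def find_eviction_pairs(
    events: Iterable[Tuple[float, str]], window: float = 60.0
) -> List[Tuple[float, float]]:
    """Pair each callback-timer expiry with the next error-107 message.

    `events` is a time-ordered iterable of (timestamp_seconds, message)
    tuples (a hypothetical, already-parsed form of the server log).
    A pair is reported only if the error follows the expiry within
    `window` seconds (an assumed value, not a Lustre default).
    """
    pairs: List[Tuple[float, float]] = []
    last_expiry = None
    for ts, msg in events:
        if CALLBACK_EXPIRED in msg:
            # Remember the most recent expiry seen so far.
            last_expiry = ts
        elif PING_ERROR in msg and last_expiry is not None:
            if ts - last_expiry <= window:
                pairs.append((last_expiry, ts))
            # Each expiry is matched at most once.
            last_expiry = None
    return pairs
```

Fed a list of parsed server messages, the function returns the timestamps of each expiry/error pair, which gives a rough count of how often the pattern occurs and how far apart the two events are.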