[Lustre-devel] hiding non-fatal communications errors
Peter Braam
Peter.Braam at Sun.COM
Wed Jun 4 14:17:19 PDT 2008
Andreas has been suggesting re-transmission of these callback (aka AST) RPCs
for years. If we think it through carefully, it might be a simple solution.
Peter
On 6/4/08 6:25 AM, "Eric Barton" <eeb at sun.com> wrote:
> Something for recovery experts...
>
> Communications may time out for non-fatal reasons, e.g.:
>
> 1. Adaptive timeouts were too aggressive (e.g. if server load has
> suddenly become extreme).
>
> 2. An LNET router has failed but one or more of its peers hasn't
> detected this yet.
>
> When a Lustre client times out an RPC it sent to a server, it (a) allows
> pending signals to be delivered (i.e. you can now ^C the process doing
> the I/O) and (b) tries to reconnect and/or fail over. If it reconnects
> and confirms that the server has not rebooted, the RPC is resent and
> may now succeed.
>
> This should work for all "normal" RPCs (i.e. all RPCs apart from ldlm
> callbacks, or ASTs), since the server knows whether it actually processed
> the RPC or not and can handle the resent request appropriately.
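The resend path described above can be sketched as a toy model. This is only an illustration, not Lustre code: the Server class, the xid request id, the boot_id field and the drop_replies knob are all hypothetical names invented for the example. It assumes the server keeps the reply for each request id, so a resent request is answered from the saved reply instead of being executed twice.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy server that remembers replies by request id (hypothetical)."""
    boot_id: int = 1
    replies: dict = field(default_factory=dict)
    executed: int = 0  # how many requests were actually processed

    def handle(self, xid, op):
        # A resent request is recognised by its id and answered from the
        # saved reply rather than being executed a second time.
        if xid in self.replies:
            return self.replies[xid]
        self.executed += 1
        reply = f"done:{op}"
        self.replies[xid] = reply
        return reply


def send_with_resend(server, xid, op, drop_replies=0, max_attempts=3):
    """Send an RPC; on a timeout, reconnect and resend.

    drop_replies simulates timeouts: the first drop_replies replies are
    lost in transit even though the server processed the request.
    """
    known_boot = server.boot_id
    for attempt in range(1, max_attempts + 1):
        reply = server.handle(xid, op)
        if attempt <= drop_replies:
            # Reply lost: the client times out and "reconnects".
            if server.boot_id != known_boot:
                raise RuntimeError("server rebooted; full recovery needed")
            continue  # same server instance, so resend the same request
        return reply, attempt
    raise TimeoutError(f"no reply after {max_attempts} attempts")


server = Server()
reply, attempts = send_with_resend(server, xid=42, op="write", drop_replies=1)
```

Here the first reply is lost, the resend succeeds on the second attempt, and the server still executes the operation only once.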
>
> However, I think there is a problem if the RPC is an ldlm callback. In
> this case, the Lustre server sends the RPC to the Lustre client and
> AFAIK the request is not resent if it times out. If the request is a
> blocking AST, the Lustre client isn't notified to clean its cache and
> cancel locks - and it risks being evicted.
>
> How should this be handled?
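A minimal sketch of the re-transmission idea for callbacks, under stated assumptions: the function name, the send callable and the max_attempts limit are hypothetical and do not correspond to any existing Lustre interface. The server retries the blocking AST a bounded number of times and only falls back to eviction once every attempt has timed out.

```python
def deliver_blocking_ast(send, max_attempts=3):
    """Try to deliver a blocking AST, retransmitting on timeout.

    send is a hypothetical callable that takes the attempt number and
    returns True if the client acknowledged the callback, or False if
    the attempt timed out.
    """
    for attempt in range(1, max_attempts + 1):
        if send(attempt):
            return "acked", attempt
    # Every attempt timed out: only now is the client a candidate
    # for eviction.
    return "evict", max_attempts


# A transient failure (first attempt lost) no longer leads to eviction.
transient = deliver_blocking_ast(lambda attempt: attempt >= 2)

# A client that never answers is still evicted after the retries.
dead = deliver_blocking_ast(lambda attempt: False)
```

The retry bound matters because other clients are waiting on the lock: an unbounded retry loop would trade a spurious eviction for an indefinite stall.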