[Lustre-devel] hiding non-fatal communications errors

Thu Jun 5 09:42:21 PDT 2008

On Jun 4, 2008, at 21:12 , Oleg Drokin wrote:

> Hello!
>
> On Jun 4, 2008, at 6:20 PM, Andreas Dilger wrote:
>
>> I suspect this could be adapted to allowing a fixed number of retries for
>> server-originated RPCs also.  In the case of LDLM blocking callbacks sent
>> to a client, a resend is currently harmless (either the client is already
>> processing the callback, or the lock was cancelled).
>
> We need to be careful here and decide on a good strategy for when to resend.
> E.g. a recent case at ORNL (even if a bit pathological): they pound through
> thousands of clients to 4 OSSes via 2 routers. That creates request waiting
> lists on the OSSes well into the tens of thousands. When we block on a lock
> and send a blocking AST to the client, it quickly turns around and puts in
> its data... at the end of our list, which takes hundreds of seconds to drain
> (more than obd_timeout, obviously). No matter how much you resend, it won't
> help.

This looks like the poster child for adaptive timeouts, although we
might need some version of the early margin update patch on
15501.  Have you tried enabling AT?
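
To put rough numbers on the scenario quoted above, here is a small sketch.
Everything in it is an assumption made for illustration: the queue depth,
the service rate, the 100-second timeout, and the AdaptiveTimeout class,
which is a toy estimator and not Lustre's actual AT code. It only shows why
resending on a fixed timer cannot help once the queue wait exceeds the
timeout, while a timeout derived from observed service times can.

```python
from collections import deque


def queue_wait_seconds(queue_depth, requests_per_sec):
    """Time for a newly queued request to reach the head of a FIFO queue."""
    return queue_depth / requests_per_sec


def expired_attempts(wait, timeout):
    """Number of fixed-timeout periods that elapse before the request is served."""
    return int(wait // timeout)


class AdaptiveTimeout:
    """Toy estimator: worst recently observed service time times a margin."""

    def __init__(self, initial, margin=1.5, window=8):
        self.samples = deque([initial], maxlen=window)
        self.margin = margin

    def observe(self, service_time):
        self.samples.append(service_time)

    def current(self):
        return max(self.samples) * self.margin


# Hypothetical numbers: 35000 queued requests drained at 100 requests/sec.
wait = queue_wait_seconds(queue_depth=35000, requests_per_sec=100)
fixed_timeout = 100.0  # assumed fixed timeout, in seconds

# Every resend lands behind the same backlog, so each attempt times out.
print(expired_attempts(wait, fixed_timeout))

# Once the long service time has been observed, the timeout grows past it.
at = AdaptiveTimeout(initial=fixed_timeout)
at.observe(wait)
print(at.current() > wait)
```

With these made-up numbers the fixed timer expires several times before the
request could possibly be serviced, which is the "no matter how much you
resend" case.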

>
> Now a good argument is that before we kill such clients (or do any sort of
> resend), perhaps it makes sense to check the incoming queue to see if there
> is anything from them? On the other hand, that would be like half of a
> request scheduler, probably, and with such queues it would take ages, I
> guess.
> BTW, the AT code changes lock waiting from obd_timeout to obd_timeout/2
> (when AT is disabled); why is that? All this is bug 15332.
>
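
The "check the incoming queue before evicting" idea quoted above might look
roughly like the sketch below. It is not taken from the Lustre code: the
IndexedQueue class, should_evict(), and the client names are all
hypothetical. It assumes the server keeps a per-client count next to the
FIFO queue, so the check does not have to scan tens of thousands of entries.

```python
from collections import deque


class IndexedQueue:
    """FIFO request queue with a per-client pending count (hypothetical)."""

    def __init__(self):
        self._queue = deque()
        self._pending = {}

    def push(self, client_id, payload):
        self._queue.append((client_id, payload))
        self._pending[client_id] = self._pending.get(client_id, 0) + 1

    def pop(self):
        client_id, payload = self._queue.popleft()
        remaining = self._pending[client_id] - 1
        if remaining:
            self._pending[client_id] = remaining
        else:
            del self._pending[client_id]
        return client_id, payload

    def has_pending(self, client_id):
        # Dictionary lookup, independent of the queue length.
        return client_id in self._pending


def should_evict(queue, client_id, callback_timed_out):
    """Evict only if the callback timed out and nothing from the client is queued."""
    return callback_timed_out and not queue.has_pending(client_id)


q = IndexedQueue()
q.push("client-a", "write")
print(should_evict(q, "client-a", True))  # request still queued
print(should_evict(q, "client-b", True))  # nothing queued from this client
```

The cost is extra bookkeeping on every enqueue and dequeue, which is part of
the "half of a request scheduler" concern.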

Maybe that was done to discourage people from disabling AT?
Seriously, though, I don't know why that was changed. Perhaps it was
done on b1_6 before AT landed?

robert

> Or was the resend meant just for the initial RPC where we do not get a
> confirmation soon? Yes, there it makes sense to retry soon, but the case
> above still needs to be considered, since currently we do not retry
> writeouts either, which has just as bad an effect on dirty client caches,
> and of course all of the above is very true in such cases too.
> Also, without the lnet patch in 15332, where small messages are prioritized
> on routers, it is way too easy to time out an AST response because of router
> congestion, and no amount of resending would help then.
>
> Bye,
>     Oleg