[Lustre-devel] hiding non-fatal communications errors
Oleg.Drokin at Sun.COM
Wed Jun 4 21:12:14 PDT 2008
On Jun 4, 2008, at 6:20 PM, Andreas Dilger wrote:
> I suspect this could be adapted to allowing a fixed number of retries
> for server-originated RPCs also. In the case of LDLM blocking callbacks
> to a client, a resend is currently harmless (either the client is
> processing the callback, or the lock was cancelled).
We need to be careful here and decide on a good strategy on when to
E.g. a recent case at ORNL (even if a bit pathological) is that they pound
thousands of clients to 4 OSSes via 2 routers. That creates request
lists on OSSes well into tens of thousands. When we block on a lock
blocking AST to the client, it quickly turns around and puts in his
at the end of our list that takes hundreds of seconds (more than
obviously). No matter how much you resend, it won't help.
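
A back-of-the-envelope sketch of the queueing argument above, in Python. The queue depth, service rate, and timeout are made-up illustrative values, not measurements from the ORNL case, and the function name is hypothetical. It assumes a simple FIFO queue drained at a constant rate.

```python
def queue_wait_seconds(queue_depth: int, service_rate_per_sec: float) -> float:
    """Time a request appended to the tail of a FIFO queue waits before service."""
    if service_rate_per_sec <= 0:
        raise ValueError("service rate must be positive")
    return queue_depth / service_rate_per_sec


# Hypothetical values, for illustration only.
QUEUE_DEPTH = 30_000    # requests already queued on one OSS
SERVICE_RATE = 100.0    # requests the OSS completes per second
LOCK_TIMEOUT = 100.0    # seconds the server waits for the lock to be released

wait = queue_wait_seconds(QUEUE_DEPTH, SERVICE_RATE)
print(f"tail-of-queue wait: {wait:.0f}s, lock timeout: {LOCK_TIMEOUT:.0f}s")

# Resending the blocking AST does not move the client's queued request
# forward, so the wait is the same however many times the AST is resent.
print("served before timeout:", wait <= LOCK_TIMEOUT)
```

Under these assumed numbers the wait is several times the timeout, which is the point being made: the delay sits in the server's own request queue, not in the delivery of the callback.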
Now a good argument is before we kill such clients (or do any sort of
perhaps it makes sense to check incoming queue to see if there is
On the other hand that would be like half of request scheduler,
with such queues, it would take ages, I guess.
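
A minimal sketch of the "check the incoming queue first" idea, in Python. It assumes the check is for any pending request from the same client, which the text above does not spell out. The `Request` type and both function names are hypothetical and do not correspond to Lustre code.

```python
from collections import deque
from typing import Deque, NamedTuple


class Request(NamedTuple):
    client_id: str   # hypothetical identifier of the sending client
    opcode: str      # hypothetical request type, e.g. "write" or "cancel"


def client_has_queued_request(queue: Deque[Request], client_id: str) -> bool:
    """Linear scan of the incoming queue for anything sent by this client."""
    return any(req.client_id == client_id for req in queue)


def should_evict(queue: Deque[Request], client_id: str, timed_out: bool) -> bool:
    """Evict only if the client timed out and has nothing waiting in the queue."""
    return timed_out and not client_has_queued_request(queue, client_id)


incoming: Deque[Request] = deque(
    [Request("client-a", "write"), Request("client-b", "cancel")]
)
print(should_evict(incoming, "client-b", timed_out=True))  # client-b has a request queued
print(should_evict(incoming, "client-c", timed_out=True))  # client-c has nothing queued
```

The scan is linear in the queue length, so with queues of tens of thousands of requests it is not free. A per-client index would avoid the scan, but that starts to look like part of a request scheduler.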
BTW, AT code changes lock waiting from obd_timeout to obd_timeout/2,
why is that?
(when AT is disabled). All this is bug 15332.
Or was the resend meant just for initial RPC where we do not get a
soon? Yes, there it makes sense to retry soon, but the case above still
needs to be considered, since currently we do not retry writeouts either, which
much of a bad effect on dirty client caches, and of course all of the above
is very true in such cases too.
Also, without the LNET patch in bug 15332, where small messages are
prioritized, it is way too easy to time out an AST response because of
router congestion, and no amount of resending would help then.