[Lustre-devel] server-side resending & bulk transfer

Fri Feb 5 09:12:51 PST 2010

Johann,

cc-ing lustre-devel.

Yes, the server could retry the bulk if it times out and this
will be safe for the client since its bulk buffer is auto-unlinked,
so only 1 bulk PUT/GET can match it.  But if the problem happens
on the way back to the server rather than the way out to the client,
you're hosed since the bulk has completed from the client's POV.
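
A toy model of the two cases, as a sketch only. The names below
(BulkBuffer, server_get, retry_once) are hypothetical and are not the
LNET API. It assumes nothing more than a client buffer that unlinks
itself on its first match, and models the write case, where the server
GETs the data from the client.

```python
class BulkBuffer:
    """Hypothetical stand-in for a client bulk buffer posted with auto-unlink."""

    def __init__(self, payload):
        self.payload = bytes(payload)
        self.linked = True       # still available for matching
        self.completed = False   # client's view: has the bulk completed?

    def match(self):
        """Return the payload on the first match, None for any later attempt."""
        if not self.linked:
            return None          # already unlinked: a duplicate cannot match
        self.linked = False      # auto-unlink on first match
        self.completed = True
        return self.payload


def server_get(buf, lose_request=False, lose_reply=False):
    """One bulk GET attempt; returns the data the server received, or None."""
    if lose_request:
        return None              # lost on the way out: buffer is untouched
    data = buf.match()
    if data is None:
        return None              # nothing to match, the GET is dropped
    if lose_reply:
        return None              # lost on the way back: client thinks it is done
    return data


def retry_once(buf, **first_attempt_loss):
    """Server times out on the first attempt, then retries exactly once."""
    first = server_get(buf, **first_attempt_loss)
    if first is not None:
        return first
    return server_get(buf)


# Loss on the way out: the retry matches the still-linked buffer.
out_buf = BulkBuffer(b"data")
out_result = retry_once(out_buf, lose_request=True)

# Loss on the way back: the buffer is gone, so the retry can never succeed.
back_buf = BulkBuffer(b"data")
back_result = retry_once(back_buf, lose_reply=True)
```

In this model out_result is the payload, while back_result is None even
though back_buf.completed is True, which is the case where only a
client retry from scratch can recover.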

This should be an exceptional circumstance - i.e. a router has
actually failed - so I think it's better just to stick with the
client retrying from scratch rather than tying down a server thread
until it has decided whether there was a router failure or the
client really crashed.

Roll on the health network! :)

    Cheers,
              Eric

> -----Original Message-----
> From: Johann Lombardi [mailto:johann at sun.com]
> Sent: 05 February 2010 4:35 PM
> To: lustre-tech-leads at sun.com
> Subject: server-side resending & bulk transfer
> 
> Hi,
> 
> As you know, the most important part of server-side resending is to resend
> lock callbacks, since the loss of such a message ends in a client eviction
> (except for glimpses, which are resent indefinitely, causing other problems).
> 
> That being said, another aspect is losing a message during bulk transfer, and
> more particularly the start bulk signal issued by LNET.
> Unlike lock callback RPCs, losing the start bulk signal is not fatal since
> the bulk transfer will time out on the server side, the request will be
> dropped and the client will resend after reconnection. This is indeed harmless,
> but still causes a slowdown which, according to LLNL, could be avoided if we
> try to resend the start bulk signal (bug 21714). Brian Behlendorf's
> proposal is to resend the start bulk signal after the first l_wait_event()
> timeout in ost_brw_write(). However, we don't know if this is safe to do,
> e.g. how does the client react if it receives duplicated start bulk signals?
> 
> Johann