[Lustre-devel] hiding non-fatal communications errors

Nathaniel Rutman Nathan.Rutman at Sun.COM
Thu Jun 19 13:24:42 PDT 2008


Eric Barton wrote:
> Oleg's comments about congestion and the ORNL discussions I've been
> involved in are effectively presenting arguments for allowing
> expedited communications.  This is possible but comes at a cost.
>   
> The "proper" implementation effectively holds an uncongested network
> in reserve for expedited communications.  That's a high price to pay
> because it pretty well means doubling up all the LNET state - twice
> the number of queues/sockets/queuepairs/connections.  That's
> unavoidable since we're using these structures for backpressure and
> once they're "full" you can only bypass with an additional connection.
>   
That's assuming network congestion is the cause of the lock timeout.
What if the server disk is busy doing who knows what, and the client's
cache flush RPCs are all sitting on the server in the request queue just
waiting for some disk time?  Furthermore, assume that a bunch of other
clients are all doing the same thing, so that we can't simply prioritize
this client's RPCs over everybody else's.

I think the method suggested by Oleg has the most potential in this 
case: "sniff" the incoming RPCs to see if they are cache flushes, and do 
not decide to evict those clients until after those RPCs have been 
processed.  As mentioned, we already do sniff the incoming reqs to check 
adaptive timeout deadlines (ptlrpc_server_handle_req_in).
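
Something like the following is what I have in mind for the sniffing
step.  This is purely illustrative: req_is_cache_flush() and the
exp_flush_pending flag are invented names, and the opcode set is a
guess; only lustre_msg_get_opc(), OST_WRITE and LDLM_CANCEL are
existing symbols.

    /* Hypothetical sketch.  Assumes the request header has already
     * been unpacked in ptlrpc_server_handle_req_in(). */
    static int req_is_cache_flush(struct ptlrpc_request *req)
    {
            __u32 opc = lustre_msg_get_opc(req->rq_reqmsg);

            /* guessing: dirty-page writeback and lock cancels are the
             * RPCs a client sends while flushing under a blocking AST */
            return opc == OST_WRITE || opc == LDLM_CANCEL;
    }

    static void server_sniff_req(struct ptlrpc_request *req)
    {
            if (req_is_cache_flush(req))
                    /* invented flag: the eviction path would check this
                     * and hold off until the flush RPCs are processed */
                    req->rq_export->exp_flush_pending = 1;
    }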

One further thing I would like to do is respond to "easy" RPCs 
immediately (in a reserved thread).  "Easy" would certainly include 
pings, and maybe other RPCs that require no disk access.  This would 
allow us to free up LNET buffers and other resources, avoid evicting 
clients "we haven't heard from in X seconds" (although I just realized 
we could fix that right now in ptlrpc_server_handle_req_in), and more 
quickly determine network and server loading remotely.
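
Roughly what I'm picturing for the reserved thread -- again just a
sketch under stated assumptions: easy_req_dequeue(),
normal_req_enqueue() and srv_easy_stopping are made-up names, while
lustre_msg_get_opc(), OBD_PING and target_send_reply() are the real
bits.

    /* Sketch of a dedicated "easy" handler thread that answers
     * no-disk RPCs as soon as they arrive, before they ever reach
     * the main request queue. */
    static int ptlrpc_easy_thread(void *arg)
    {
            struct ptlrpc_service *svc = arg;

            while (!svc->srv_easy_stopping) {
                    struct ptlrpc_request *req = easy_req_dequeue(svc);

                    if (req == NULL)
                            continue;

                    if (lustre_msg_get_opc(req->rq_reqmsg) == OBD_PING) {
                            /* no disk access: reply right away, which
                             * frees the LNET buffer and refreshes the
                             * "last heard from" time for this client */
                            target_send_reply(req, 0, 0);
                    } else {
                            /* anything that might touch the disk goes
                             * back to the ordinary service queue */
                            normal_req_enqueue(svc, req);
                    }
            }
            return 0;
    }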



