[Lustre-discuss] write RPC & congestion

Wed Aug 18 19:01:23 PDT 2010

On 2010-08-17, at 14:15, burlen wrote:
> I have some question about Lustre RPC and the sequence of events that 
> occur during large concurrent write() involving many processes and large 
> data size per process.  I understand there is a mechanism of flow 
> control by credits, but I'm a little unclear on how it works in general 
> after reading the "networking & io protocol" white paper.

There are different levels of flow control.  There is one at the LNET level, which keeps low-level messages from overwhelming the server and avoids stalling small/reply messages at the back of a deep queue of requests.

> Is it true that a write() RPC transfers data in chunks of at least 1MB 
> and at most (max_pages_per_rpc*page_size) Bytes, where page_size=2^16 ? 
> I can use the bounds to estimate the number of RPCs issued per MB of 
> data to write?

Currently, 1MB is the largest bulk IO size, and is the typical size used by clients for all IO.
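
For the estimate asked about above, a rough sketch in Python might look like the following.  It assumes that 1MB means 2^20 bytes and that every bulk RPC is completely full, so the result is a lower bound; small or unaligned writes will need more RPCs.  The name estimate_write_rpcs is made up for illustration and is not part of Lustre.

```python
BULK_RPC_BYTES = 1 << 20  # assumption: "1MB" bulk IO size taken as 2**20 bytes


def estimate_write_rpcs(nbytes, rpc_bytes=BULK_RPC_BYTES):
    """Lower bound on the bulk write RPCs needed to move nbytes of data.

    Assumes every RPC is full; partial or unaligned writes need more.
    """
    if nbytes < 0:
        raise ValueError("nbytes must be non-negative")
    # Integer ceiling division avoids floating-point rounding.
    return (nbytes + rpc_bytes - 1) // rpc_bytes


if __name__ == "__main__":
    print(estimate_write_rpcs(64 * BULK_RPC_BYTES))  # 64
    print(estimate_write_rpcs(BULK_RPC_BYTES + 1))   # 2
```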

> About how many concurrent incoming write() RPC per OSS service thread 
> can a single server handle before it stops responding to incoming RPCs ?

The server can handle tens of thousands of write _requests_, but note that since Lustre has always been designed as an RDMA-capable protocol, the request is relatively small (a few hundred bytes) and does not contain any of the DATA.

When one of the server threads is ready to process a read/write request, it will get or put the data from/to the buffers that the client has already prepared.  The number of currently active IO requests is exactly the number of active service threads (up to 512 by default).

> What happens to an RPC when the server is too busy to handle it, is it 
> even issued by the client ? Does the client have to poll and/or resend 
> the RPC ? Does the process of polling for flow control credits add 
> significant network/server congestion ?

The clients limit the number of concurrent RPC requests, by default to 8 per OST.  The LNET level message credits will also limit the number of in-flight messages in case there is e.g. an LNET router between the client and server.
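
To put numbers on these limits, here is a back-of-the-envelope sketch in Python.  It takes the defaults quoted in this mail (8 RPCs in flight per OST, 1MB per bulk RPC, up to 512 service threads) and assumes the worst case where every client keeps its full window open to every OST on one OSS.  The function names and the example cluster sizes are invented for illustration only.

```python
RPCS_IN_FLIGHT = 8      # default per-OST client limit quoted above
RPC_BYTES = 1 << 20     # assumption: 1MB bulk RPC taken as 2**20 bytes
SERVICE_THREADS = 512   # default upper bound on service threads quoted above


def client_inflight_bytes(n_osts, rpcs_in_flight=RPCS_IN_FLIGHT,
                          rpc_bytes=RPC_BYTES):
    """Most bulk data one client can have outstanding across n_osts."""
    return n_osts * rpcs_in_flight * rpc_bytes


def oss_request_load(n_clients, osts_on_oss, rpcs_in_flight=RPCS_IN_FLIGHT,
                     service_threads=SERVICE_THREADS):
    """Return (active, queued) requests on one OSS in the worst case.

    Only requests owned by a service thread move data; the rest are
    small request messages waiting in the queue.
    """
    outstanding = n_clients * osts_on_oss * rpcs_in_flight
    active = min(outstanding, service_threads)
    return active, outstanding - active


if __name__ == "__main__":
    print(client_inflight_bytes(16) // (1 << 20))  # 128 (MiB)
    print(oss_request_load(1000, 4))               # (512, 31488)
```

With 1000 hypothetical clients and 4 OSTs on the OSS, about 31 thousand requests sit queued, but each is only a few hundred bytes, so the queue itself is small compared to the bulk data.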

The client will almost never time out a request, as it is informed how long requests are currently taking to process and will wait patiently for its earlier requests to finish processing.  If the client is going to time out a request (based on an earlier request timeout that is about to be exceeded) the server will inform it to continue waiting and give it a new processing time estimate (unless of course the server is non-functional or so overwhelmed that it can't even do that).
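
The deadline-extension behaviour described above can be modelled with a small Python sketch.  This is only a toy model, not the actual ptlrpc logic: times are plain numbers in seconds, and wait_for_reply and its arguments are hypothetical names.  Each "keep waiting" message from the server is represented as a pair (time it arrives, new deadline).

```python
def wait_for_reply(initial_timeout, completion_time, early_replies):
    """Toy model of a client waiting on one request.

    initial_timeout: client's first deadline, in seconds after sending.
    completion_time: when the server actually finishes the request.
    early_replies:   (arrival_time, new_deadline) pairs, sorted by
                     arrival_time, sent by the server to extend the wait.
    Returns "completed" or "timed out".
    """
    deadline = initial_timeout
    for arrival_time, new_deadline in early_replies:
        # An extension only helps if it arrives before the current
        # deadline expires and before the request has completed.
        if arrival_time >= deadline or arrival_time >= completion_time:
            break
        deadline = max(deadline, new_deadline)
    return "completed" if completion_time <= deadline else "timed out"


if __name__ == "__main__":
    # Server extends the deadline twice, so the slow request completes.
    print(wait_for_reply(100, 250, [(90, 200), (190, 300)]))  # completed
    # Server too overwhelmed to send any extension: the client times out.
    print(wait_for_reply(100, 250, []))                       # timed out
```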

> Is it likely that a large number of RPCs/flow control credit requests 
> will induce enough network congestion so that clients' RPCs time out ? 
> How does the client handle such a timeout ?

Since the flow control credits are bounded, and are returned to the peer as earlier requests complete, there is no additional traffic due to this.  However, considering that HPC clusters are distributed denial-of-service engines, it is always possible to overwhelm the server under some conditions.  In case of a client RPC timeout (hundreds of seconds under load), the client will resend the request and/or try to contact the backup server until one responds.
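
The resend/failover behaviour can be sketched in Python as a loop over the primary and backup servers.  This is an illustration under assumptions, not the real client code: send_with_failover, the send callback and the server names are all hypothetical, and max_attempts exists only so the example terminates, whereas the real client keeps trying until some server responds.

```python
import itertools


def send_with_failover(request, servers, send, max_attempts=6):
    """Resend request, rotating through servers, until one replies.

    send(server, request) is a caller-supplied function that returns a
    reply or raises TimeoutError.  Returns (server, reply).
    """
    attempts = zip(range(max_attempts), itertools.cycle(servers))
    for _, server in attempts:
        try:
            return server, send(server, request)
        except TimeoutError:
            continue  # request timed out: try the next server in turn
    raise TimeoutError("no server responded after %d attempts" % max_attempts)


if __name__ == "__main__":
    def fake_send(server, request):
        # Pretend the primary is down and only the backup answers.
        if server == "oss-primary":
            raise TimeoutError(server)
        return "ack:" + request

    print(send_with_failover("write-42", ["oss-primary", "oss-backup"],
                             fake_send))  # ('oss-backup', 'ack:write-42')
```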

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.