[Lustre-discuss] write RPC & congestion

burlen burlen.loring at gmail.com
Sun Aug 22 10:58:51 PDT 2010


Andreas Dilger wrote:
> On 2010-08-17, at 14:15, burlen wrote:
>   
>> I have some questions about Lustre RPC and the sequence of events that 
>> occur during large concurrent write()s involving many processes and a 
>> large data size per process.  I understand there is a mechanism of flow 
>> control by credits, but I'm a little unclear on how it works in general 
>> after reading the "networking & io protocol" white paper.
>>     
>
> There are different levels of flow control.  One operates at the LNET level: it keeps low-level messages from overwhelming the server and avoids stalling small/reply messages at the back of a deep queue of requests.
>
>   
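
To check my own understanding of the credit mechanism, here is a toy
model I put together (purely illustrative, not LNET code; all names
are invented):

# Toy model of peer-credit flow control: a sender spends one credit
# per message, queues when credits run out, and a completion returns
# the credit, which releases the next queued send.  No polling needed.
from collections import deque

class Peer(object):
    def __init__(self, credits):
        self.credits = credits       # e.g. the LNET peer credit limit
        self.queued = deque()        # sends waiting for a credit
        self.in_flight = set()

    def send(self, msg):
        if self.credits > 0:
            self.credits -= 1
            self.in_flight.add(msg)
        else:
            self.queued.append(msg)  # wait; no extra traffic generated

    def complete(self, msg):
        self.in_flight.remove(msg)
        self.credits += 1            # credit comes back with completion
        if self.queued:
            self.send(self.queued.popleft())

peer = Peer(credits=8)
for i in range(10):
    peer.send("req%d" % i)           # req8, req9 queue locally
peer.complete("req0")                # returns a credit; req8 goes out
print("in flight: %d, queued: %d" % (len(peer.in_flight), len(peer.queued)))
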
>> Is it true that a write() RPC transfers data in chunks of at least 1MB 
>> and at most (max_pages_per_rpc*page_size) bytes, where page_size=2^16? 
>> Can I use these bounds to estimate the number of RPCs issued per MB of 
>> data to write?
>>     
>
> Currently, 1MB is the largest bulk IO size, and is the typical size used by clients for all IO.
>
>   
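
That lets me do the arithmetic I was after.  A back-of-the-envelope
count (my own numbers; max_pages_per_rpc and the page size vary by
platform, so the defaults below are only illustrative):

# Estimate write RPCs for a given amount of dirty data, assuming the
# 1MB bulk IO limit described above.  Parameter values are examples.
BULK_LIMIT = 1 << 20                       # 1MB largest bulk IO per RPC

def write_rpcs(nbytes, max_pages_per_rpc=256, page_size=4096):
    rpc_payload = min(BULK_LIMIT, max_pages_per_rpc * page_size)
    return -(-nbytes // rpc_payload)       # ceiling division

print(write_rpcs(64 << 20))                # 64MB of data -> 64 RPCs
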
>> About how many concurrent incoming write() RPCs per OSS service thread 
>> can a single server handle before it stops responding to incoming RPCs?
>>     
>
> The server can handle tens of thousands of write _requests_, but note that since Lustre has always been designed as an RDMA-capable protocol, the request itself is relatively small (a few hundred bytes) and does not contain any of the DATA.
>
> When one of the server threads is ready to process a read/write request, it will get or put the data from/to the buffers that the client has already prepared.  The number of currently active IO requests is exactly the number of active service threads (up to 512 by default).
>
>   
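
If it is useful to anyone, the service thread tunables can be read out
of /proc; these paths are from the 1.8-era layout and may differ on
other versions:

# Print the OSS IO service thread tunables (Lustre 1.8-era /proc
# layout; adjust the base path if your release differs).
import os

base = "/proc/fs/lustre/ost/OSS/ost_io"
for name in ("threads_min", "threads_max", "threads_started"):
    path = os.path.join(base, name)
    if os.path.exists(path):
        print("%s = %s" % (name, open(path).read().strip()))
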
>> What happens to an RPC when the server is too busy to handle it?  Is it 
>> even issued by the client?  Does the client have to poll and/or resend 
>> the RPC?  Does the process of polling for flow control credits add 
>> significant network/server congestion?
>>     
>
> The clients limit the number of concurrent RPC requests, by default to 8 per OST.  The LNET-level message credits will also limit the number of in-flight messages in case there is, e.g., an LNET router between the client and server.
>
> The client will almost never time out a request: it is informed of how long requests are currently taking to process, and it will wait patiently for its earlier requests to finish.  If a request is about to exceed its timeout, the server will inform the client that it should continue waiting and give it a new processing-time estimate (unless, of course, the server is non-functional or so overwhelmed that it can't even do that).
>
>   
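
That per-OST cap is visible from the client side; on a 1.8-era system
I would read it like this (again, the /proc layout may differ on other
versions):

# Show max_rpcs_in_flight for each OSC device on a client (Lustre
# 1.8-era /proc layout).
import glob

for path in glob.glob("/proc/fs/lustre/osc/*/max_rpcs_in_flight"):
    target = path.split("/")[-2]           # e.g. fsname-OST0000-osc-...
    print("%s = %s" % (target, open(path).read().strip()))
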
>> Is it likely that a large number of RPCs/flow control credit requests 
>> will induce enough network congestion that clients' RPCs time out? 
>> How does the client handle such a timeout?
>>     
>
> Since the flow control credits are bounded and are returned to the peer as earlier requests complete, there is no additional traffic due to this.  However, considering that HPC clusters are distributed denial-of-service engines, it is always possible to overwhelm the server under some conditions.  In the case of a client RPC timeout (hundreds of seconds under load), the client will resend the request and/or try to contact the backup server until one responds.
>   
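
So my reading of the recovery path, as a toy loop (invented names,
just to make sure I follow; this is not Lustre client code):

# Toy resend/failover loop: on timeout, resend and alternate between
# the primary and backup server until one of them responds.
import itertools

def send_with_failover(request, servers, send):
    for server in itertools.cycle(servers):
        reply = send(server, request)      # None models a timeout
        if reply is not None:
            return reply

replies = iter([None, None, "ok"])         # primary times out twice
print(send_with_failover("write", ["oss-a", "oss-b"],
                         lambda s, r: next(replies)))
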
Thank you for your help.

Is my understanding correct?

A single RPC request will initiate an RDMA transfer of at most 
"max_pages_per_rpc" pages, where the page unit is the Lustre page size 
(65536 bytes).  Each RDMA transfer is executed in 1MB chunks.  On a given 
client, if there are more than "max_pages_per_rpc" pages of data available 
to transfer, multiple RPCs are issued and multiple RDMAs are initiated.
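
In code form, my mental model is the following (the numbers are only
illustrative, and whether this matches reality is exactly what I am
asking):

# My mental model: dirty data is carved into RPCs of at most
# max_pages_per_rpc pages, and each RPC's bulk transfer moves in 1MB
# chunks.  Illustrative values only.
PAGE_SIZE = 65536                          # page unit as I understand it
MAX_PAGES_PER_RPC = 256
CHUNK = 1 << 20                            # 1MB RDMA chunk

def carve(nbytes):
    rpc_bytes = MAX_PAGES_PER_RPC * PAGE_SIZE
    rpcs = -(-nbytes // rpc_bytes)         # ceiling division
    chunks_per_rpc = -(-min(nbytes, rpc_bytes) // CHUNK)
    return rpcs, chunks_per_rpc

print(carve(128 << 20))                    # -> (8, 16)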

Would it be correct to say: the purpose of the "max_pages_per_rpc" 
parameter is to let the servers even out the individual progress of 
concurrent clients that have a lot of data to move, and to apportion the 
available bandwidth more fairly amongst concurrently writing clients?
