[Lustre-discuss] write RPC & congestion

Mon Aug 23 11:10:15 PDT 2010

On 2010-08-22, at 11:58, burlen wrote:
> Andreas Dilger wrote:
>> Currently, 1MB is the largest bulk IO size, and is the typical size used by clients for all IO.
> 
> Is my understanding correct?
> 
> A single RPC request will initiate an RDMA transfer of at most "max_pages_per_rpc" pages, where the page unit is the Lustre page size of 65536 bytes. Each RDMA transfer is executed in 1MB chunks. On a given client, if there are more than "max_pages_per_rpc" pages of data available to transfer, multiple RPCs are issued and multiple RDMAs are initiated.

No, max_pages_per_rpc is scaled down proportionately on systems with a large PAGE_SIZE, so the RPC size in bytes stays the same.  This is because a node doesn't know the PAGE_SIZE of its peer.
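
As a rough illustration of that scaling (a sketch only, not Lustre code; the function name is made up, and the 1MB cap is the bulk IO limit mentioned above):

```python
# Illustrative only: how a fixed 1MB bulk IO limit translates into a
# page count for different PAGE_SIZE values. Not taken from Lustre source.
MAX_BULK_BYTES = 1 << 20  # 1MB, the largest bulk IO size noted above


def pages_per_rpc(page_size: int, max_bulk_bytes: int = MAX_BULK_BYTES) -> int:
    """Return how many pages of `page_size` bytes fit in one bulk IO."""
    if page_size <= 0 or max_bulk_bytes % page_size != 0:
        raise ValueError("page_size must be positive and divide the bulk size evenly")
    return max_bulk_bytes // page_size


if __name__ == "__main__":
    for size in (4096, 16384, 65536):
        print(f"PAGE_SIZE={size:>6} -> {pages_per_rpc(size)} pages per RPC")
```

So a client with 64KB pages would end up with 16 pages per RPC rather than 256, and the request is still 1MB on the wire.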

There is a patch in Bugzilla that does what you propose: it submits larger IO request RPCs and does multiple 1MB RDMA transfers per request.  However, this showed a performance _loss_ in some cases (in particular shared-file IO), and the reason for this regression was never diagnosed.
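
To make the idea concrete, here is a minimal sketch of carving one larger request into 1MB transfers. It is not the patch itself; the function name and the simple back-to-back chunking (no alignment handling) are assumptions for illustration:

```python
from typing import Iterator, Tuple

RDMA_CHUNK = 1 << 20  # 1MB per RDMA transfer, as described above


def split_into_transfers(offset: int, length: int,
                         chunk: int = RDMA_CHUNK) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pieces of at most `chunk` bytes covering the request."""
    if length < 0 or chunk <= 0:
        raise ValueError("length must be non-negative and chunk must be positive")
    end = offset + length
    while offset < end:
        piece = min(chunk, end - offset)  # the last piece may be shorter
        yield offset, piece
        offset += piece


if __name__ == "__main__":
    # A hypothetical 4MB request becomes four 1MB transfers.
    for off, size in split_into_transfers(0, 4 << 20):
        print(f"transfer at offset {off:>8}, {size} bytes")
```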

> Would it be correct to say: The purpose of the "max_pages_per_rpc" parameter is to enable the servers to even out the individual progress of concurrent clients with a lot of data to move and more fairly apportion the available bandwidth amongst concurrently writing clients?

Yes, partly.  The more important factor is max_rpcs_in_flight, which limits the number of requests that a client can submit to each server at one time.
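
A back-of-the-envelope sketch of what that limit means for the amount of data a single client can have outstanding. The function name is made up, and the value 8 for max_rpcs_in_flight is only an example; check the setting on your own clients:

```python
RPC_BYTES = 1 << 20  # 1MB per bulk RPC, as noted earlier in the thread


def max_bytes_in_flight(max_rpcs_in_flight: int,
                        rpc_bytes: int = RPC_BYTES,
                        num_servers: int = 1) -> int:
    """Upper bound on bytes one client can have outstanding across `num_servers`."""
    if max_rpcs_in_flight < 0 or rpc_bytes < 0 or num_servers < 0:
        raise ValueError("arguments must be non-negative")
    # The limit applies per server, so the bound grows with the server count.
    return max_rpcs_in_flight * rpc_bytes * num_servers


if __name__ == "__main__":
    per_server = max_bytes_in_flight(8)
    four_servers = max_bytes_in_flight(8, num_servers=4)
    print(f"one server:   {per_server >> 20} MB outstanding at most")
    print(f"four servers: {four_servers >> 20} MB outstanding at most")
```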

There was a research paper on dynamically adjusting max_rpcs_in_flight that showed performance improvements when few clients are active, and we'd like to include that code in Lustre when it is ready.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.