[Lustre-discuss] hacking max_sectors

Andreas Dilger adilger at sun.com
Fri Aug 28 10:00:43 PDT 2009


On Aug 28, 2009  01:28 -0400, Robin Humble wrote:
> on these 1TB SATA RAID6 8+2's and external journals, with 1 client
> writing to 1 OST, with 2.6.18-128.1.14.el5 + lustre1.8.1 + blkdev/md
> patches from https://bugzilla.lustre.org/show_bug.cgi?id=20533 so that
> 128k chunk md gets 1M i/o's and 64k chunk md gets 512k i/o's then ->
> 
> client max_rpcs_in_flight 8
>  md chunk    write (MB/s)    read (MB/s)
>      64k       185            345
>     128k       235            390
> 
> so 128k chunks are 10-30% quicker than 64k in this particular setup on
> big streaming i/o tests (1G of 1M lmdd's).

Hmm, that is too bad.  I would have hoped that there was minimal difference
between the smaller and the larger chunk size, given that they are still
doing 1MB writes to disk and the data + parity amount is the same.  It
would be interesting to see what the performance is if you change the RPC
size to simulate clients doing smaller IOs:

	lctl set_param osc.*.max_pages_per_rpc={128,64}

Depending on how well-behaved your applications are, this could make
a noticeable difference in "real world" application performance.  You
could also check the brw_stats "pages per bulk r/w" histogram on an
existing filesystem that has been running for a while in order to see
the actual IO sizes.  Granted, without your patch it will be maxed out
at 128 pages, but if a significant fraction of the IO is below that
you may still be better off with the smaller chunk size.
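Concretely, that would look something like the following (a sketch; it
assumes 4 KB pages, so 128 pages = 512 KB RPCs and 64 pages = 256 KB):

```shell
# On the client: cap bulk RPCs at 128 pages (512 KB with 4 KB pages),
# rerun the streaming test, then repeat with 64 pages (256 KB)
lctl set_param osc.*.max_pages_per_rpc=128
lctl set_param osc.*.max_pages_per_rpc=64

# On the OSS: dump the per-OST histograms, including "pages per bulk
# r/w", to see the IO sizes the server is actually receiving
lctl get_param obdfilter.*.brw_stats
```

Note that brw_stats is cumulative since mount, so writing into the
/proc brw_stats file to clear the counters first should give you a
cleaner sample.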

> BTW, whilst we're on the topic - what does this part of brw_stats
> mean?
>                              read      |     write
>   disk fragmented I/Os   ios   % cum % |  ios   % cum %
>   1:                    5742 100 100   | 103186 100 100
> 
> this is for the 128k chunk case, where the rest of brw_stats says I'm
> seeing 1M rpc's and 1M i/o's, but I'm not sure what '1' disk fragmented
> i/o's means - should it be 0? or does '1' mean unfragmented?

That means the read/write request was submitted to disk in a single
fragment, which is ideal.  On my system there is also a small number
of read requests with "0" fragments; those are reads of a hole, or
reads at EOF, which return no data at all.

> sorry for packing too many questions into one email, but these slowish
> SATA disks seem to need lots of rpc's in flight for good performance.
> 32 max_dirty_mb (the default) and 32 max_rpcs_in_flight seems a good
> magic combo. with that I get:
> 
> client max_rpcs_in_flight 32
>  md chunk    write (MB/s)    read (MB/s)
>      64k       275            450
>     128k       395            480
> 
> which is a lot faster...
> with a heavier load of 20 clients hammering 4 OSS's each with 4 R6 8+2
> OSTs I still see about a 10% advantage for clients with 32 rpcs.

Interesting.  We haven't tuned this recently except for the WAN case,
but I guess disk and network bandwidth is increasing enough that it
simply takes more 1MB RPCs to keep the pipe full.  We've also tested
4MB RPCs (bug 16900 has a patch), but that gave us mixed performance
results in our environment.  You could give it a try if you are
interested and report the results here.
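If you do try the larger RPCs, the knob is the same one once the bug
16900 patch is applied on both clients and servers (again assuming
4 KB pages, so 1024 pages = 4 MB; stock 1.8 caps this at 256 pages):

```shell
# Only meaningful with the bug 16900 patch applied
lctl set_param osc.*.max_pages_per_rpc=1024
```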

> is there a down side to running clients with max_rpcs_in_flight 32 ?
> the initial production machine will be ~1500 clients and ~25 OSS's.

For 1500 clients it shouldn't be an issue, though it can make for
longer latency for some operations.  In the past we were also limited
by the number of request buffers on the server, but that is dynamic
these days and flow-controlled, and we've tested with up to 26000
clients on a single filesystem (192 OSSes).
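For reference, the combination you found can be set per-client with
set_param, or made persistent for all clients from the MGS with
conf_param (1.8-era syntax; "lustre" below is a placeholder fsname):

```shell
# Per-client, lost on remount
lctl set_param osc.*.max_rpcs_in_flight=32
lctl set_param osc.*.max_dirty_mb=32

# On the MGS, persistent for all clients of filesystem "lustre"
lctl conf_param lustre.osc.max_rpcs_in_flight=32
lctl conf_param lustre.osc.max_dirty_mb=32
```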

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



