[Lustre-discuss] hacking max_sectors

Robin Humble robin.humble+lustre at anu.edu.au
Thu Aug 27 22:28:56 PDT 2009


On Wed, Aug 26, 2009 at 04:11:12AM -0600, Andreas Dilger wrote:
>On Aug 26, 2009  00:46 -0400, Robin Humble wrote:
>> with the patch, 1M i/o's are being fed to md (according to brw_stats),
>> and performance is a little better for RAID6 8+2 with 128k chunks, and
>> a bit worse for RAID6 8+2 with 64k chunks (which are curiously now fed
>> half 512k and half 1M i/o's by Lustre).
>This was the other question I'd asked internally.  If the array is
>formatted with 64kB chunks then 512k IOs shouldn't cause any read-modify-
>write operations and (in theory) give the same performance as 1M IOs on
>a 128kB chunksize array.  What is the relative performance of the
>64kB and 128kB configurations?

On these 1TB SATA RAID6 8+2 OSTs with external journals, with 1 client
writing to 1 OST, running 2.6.18-128.1.14.el5 + Lustre 1.8.1 + the
blkdev/md patches from https://bugzilla.lustre.org/show_bug.cgi?id=20533
(so that 128k chunk md gets 1M I/Os and 64k chunk md gets 512k I/Os),
I see:

client max_rpcs_in_flight 8
 md chunk    write (MB/s)    read (MB/s)
     64k       185            345
    128k       235            390

So 128k chunks are 10-30% quicker than 64k in this particular setup on
big streaming I/O tests (1 GB of 1 MB lmdd I/Os). Having said that,
1.6.7.2 servers do better than 1.8.1 on some configs (I haven't had
time to figure out why), but the trend of 128k chunks being faster than
64k chunks remains. Also, if the I/O load were messier and involved
smaller I/Os then 64k chunks might claw something back - probably not
enough to close the gap though.
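
For concreteness, the test and the block-layer check look roughly like
this (a minimal sketch: the mount point, file name and device name are
placeholders, and the lmdd options follow lmbench's dd-like syntax):

  # streaming write test: 1 GB in 1 MB I/Os to a file on the OST under test
  lmdd of=/mnt/lustre/testfile bs=1m count=1024

  # what the block layer will accept per request,
  # before/after the bug 20533 patches
  cat /sys/block/sdb/queue/max_hw_sectors_kb
  cat /sys/block/sdb/queue/max_sectors_kb

  # raise the per-request limit towards the hardware limit if needed
  echo 1024 > /sys/block/sdb/queue/max_sectors_kb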

BTW, whilst we're on the topic - what does this part of brw_stats
mean?
                             read      |     write
  disk fragmented I/Os   ios   % cum % |  ios   % cum %
  1:                    5742 100 100   | 103186 100 100

This is for the 128k chunk case, where the rest of brw_stats says I'm
seeing 1M RPCs and 1M disk I/Os, but I'm not sure what a row of '1'
under 'disk fragmented I/Os' means - should it be 0, or does '1' mean
the I/Os are unfragmented?
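
For anyone following along, the brw_stats output above comes from the
per-OST files on the OSS - a minimal sketch, assuming the usual
1.8-era locations:

  # on an OSS: per-OST I/O size and fragmentation histograms
  cat /proc/fs/lustre/obdfilter/*/brw_stats
  # or equivalently via lctl
  lctl get_param obdfilter.*.brw_stats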

Sorry for packing too many questions into one email, but these slowish
SATA disks seem to need a lot of RPCs in flight for good performance.
32 max_dirty_mb (the default) and 32 max_rpcs_in_flight seem to be a
good combination. With that I get:

client max_rpcs_in_flight 32
 md chunk    write (MB/s)    read (MB/s)
     64k       275            450
    128k       395            480

which is a lot faster...
With a heavier load of 20 clients hammering 4 OSSes, each with 4 RAID6
8+2 OSTs, I still see about a 10% advantage for clients with 32 RPCs
in flight.
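
For reference, this is roughly how those client-side tunables are set
(a minimal sketch using the standard osc parameters; the wildcard hits
every OSC, and plain set_param does not persist across remounts):

  # on each client: more RPCs in flight per OSC;
  # max_dirty_mb is left at its default of 32
  lctl set_param osc.*.max_rpcs_in_flight=32
  lctl set_param osc.*.max_dirty_mb=32

  # verify
  lctl get_param osc.*.max_rpcs_in_flight osc.*.max_dirty_mb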

Is there a downside to running clients with max_rpcs_in_flight 32?
The initial production machine will be ~1500 clients and ~25 OSSes.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility


