[Lustre-discuss] Using brw_stats to diagnose lustre performance
Andreas Dilger
andreas.dilger at oracle.com
Tue Jun 15 14:08:09 PDT 2010
Also setting the max RPC size on the client to be 768kB would avoid
the need for each RPC to generate 2 IO requests.

It is possible with newer tune2fs to set the RAID stripe size, and the
allocator (mballoc) will use that size. There is a bug open to
transfer this "optimal" size to the client, but it hasn't gotten much
attention since most sites are set up with a 1MB stripe size.
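As a sketch of that client-side setting, assuming 4kB pages and the
usual osc parameter name (the RPC size is expressed in pages, so
verify against your Lustre version):

```shell
# 768 kB / 4 kB per page = 192 pages per RPC
lctl set_param osc.*.max_pages_per_rpc=192
```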
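For the 6 * 128kB geometry discussed below in this thread, assuming
4kB filesystem blocks (the device name is a placeholder), the tune2fs
step might look roughly like:

```shell
# stride       = chunk size / block size = 128kB / 4kB = 32 blocks
# stripe_width = data disks * stride     = 6 * 32      = 192 blocks
tune2fs -E stride=32,stripe_width=192 /dev/sdX
```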
Cheers, Andreas
On 2010-06-15, at 14:19, Kevin Van Maren <kevin.van.maren at oracle.com>
wrote:
> Life is much easier with a 1MB (or 512KB) native RAID stripe size.
>
>
> It looks like most IOs are being broken into 2 pieces. See
> https://bugzilla.lustre.org/show_bug.cgi?id=22850
> for a few tweaks that would help get IOs > 512KB to disk. See also
> Bug 9945
>
> But you are also seeing IOs "combined" into pieces that are between 1
> and 2 RAID stripes, so set /sys/block/sd*/queue/max_sectors_kb to
> 768 so that the IO scheduler does not "help" too much.
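A minimal sketch of that setting, assuming the OSTs sit on /dev/sd*
devices (adjust the glob to match the actual hardware):

```shell
# Cap block-layer requests at one RAID stripe (768kB) so the
# scheduler does not merge I/Os across stripe boundaries
for q in /sys/block/sd*/queue/max_sectors_kb; do
    echo 768 > "$q"
done
```

Note this does not persist across reboots; it would need to be
reapplied from an init script or udev rule.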
>
> There are mkfs options to tell ldiskfs your native RAID stripe size.
> You probably also want to change the client stripe size (lfs
> setstripe) to be an integral multiple of the RAID size (i.e., not
> the default 1MB).
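For illustration, with hypothetical device, fsname, MGS and directory
names (mkfs.lustre option syntax varies across Lustre versions, so
verify before use):

```shell
# Pass the RAID geometry (6 data disks * 128kB chunks) to ldiskfs
mkfs.lustre --ost --fsname=testfs --mgsnode=mgs@o2ib \
    --mkfsoptions="-E stride=32,stripe_width=192" /dev/sdX

# Stripe files in 3MB units: 4 * 768kB, an integral multiple of the
# RAID stripe, rather than the default 1MB
lfs setstripe -s 3145728 /mnt/testfs/output
```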
>
> Also note that those are power-of-2 buckets, so your 768KB chunks
> aren't
> going to be listed as "768".
>
> Kevin
>
>
> mark wrote:
>> Hi Everyone,
>>
>> I'm trying to diagnose some performance concerns we are having about
>> our Lustre deployment. It seems to be a fairly multifaceted problem
>> involving how ifort does buffered writes along with how we have
>> Lustre set up.
>>
>> What I've identified so far is that our RAID stripe size on the OSTs
>> is 768KB (6 * 128KB chunks) and the partitions are not being mounted
>> with -o stripe. We have 2 LUNs per controller and each virtual disk
>> has 2 partitions, with the 2nd one being the Lustre file system. It
>> is possible the partitions are not aligned. Most of the client-side
>> settings are at default (i.e., 8 RPCs in flight, 32MB dirty cache
>> per OST, etc). The journals are on separate SSDs. Our OSSes are
>> probably oversubscribed.
>>
>> What we've noticed is that with certain apps we get *really* bad
>> performance to the OSTs. As bad as 500-800KB/s to one OST. The best
>> performance I've seen to an OST is around 300MB/s, with 500MB/s being
>> more or less the upper bound limited by IB.
>>
>> Right now I'm trying to verify that fragmentation is happening like I
>> would expect given the configuration mentioned above. I just learned
>> about brw_stats, so I tried examining it for one of our OSTs (it
>> looks like Lustre must have been restarted recently, given how
>> little data there is):
>>
>>                               read        |       write
>> disk fragmented I/Os    ios   %   cum %   |    ios   %   cum %
>> 1:                        0   0       0   |    215   9       9
>> 2:                        0   0       0   |   2004  89      98
>> 3:                        0   0       0   |     22   0      99
>> 4:                        0   0       0   |      2   0      99
>> 5:                        0   0       0   |      5   0      99
>> 6:                        0   0       0   |      2   0      99
>> 7:                        1 100     100   |      1   0     100
>>
>>                               read        |       write
>> disk I/O size           ios   %   cum %   |    ios   %   cum %
>> 4K:                       3  42      42   |     17   0       0
>> 8K:                       0   0      42   |     17   0       0
>> 16K:                      0   0      42   |     22   0       1
>> 32K:                      0   0      42   |     73   1       2
>> 64K:                      1  14      57   |    292   6       9
>> 128K:                     0   0      57   |    385   8      18
>> 256K:                     3  42     100   |     88   2      20
>> 512K:                     0   0     100   |   1229  28      48
>> 1M:                       0   0     100   |   2218  51     100
>>
>> My questions are:
>>
>> 1) Does a "disk fragmented I/Os" count of "1" mean that those I/Os
>> were fragmented, or would that be "0"?
>>
>> 2) Does the disk I/O size mean what Lustre actually wrote or what it
>> wanted to write? What does that number mean in the context of our
>> 768KB stripe size, since it lists so many I/Os at 1M?
>>
>> Thanks,
>> Mark
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>