[Lustre-discuss] Using brw_stats to diagnose lustre performance

Kevin Van Maren kevin.van.maren at oracle.com
Tue Jun 15 14:58:59 PDT 2010


Mark Nelson wrote:
> Hi Kevin and Andreas,
>
> Thank you both for the excellent information!  At this point I doubt 
> I'll be able to configure the raid arrays for a 1MB stripe size (As 
> much as I would like to).  How can I change the max RPC size to 768KB 
> on the client?

Set max_pages_per_rpc (drop it from 256 to 192), the same way you would set 
max_rpcs_in_flight, with something like:
# lctl conf_param lustre.osc.max_pages_per_rpc=192
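(192 pages x 4KB = 768KB, versus the default 256 pages x 4KB = 1MB.)  If you
want to try it on a running client first, without making it persistent,
something like this should also work (the osc.* wildcard hits every OSC on
that client):
# lctl set_param osc.*.max_pages_per_rpc=192
# lctl get_param osc.*.max_pages_per_rpc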

BTW, a blatant plug, but see:
http://www.oracle.com/us/support/systems/advanced-customer-services/readiness-service-lustre-ds-077261.pdf


> So far on my list:
>
> - work with tune2fs to set good stripe parameters for the FS.
> - mount with -o stripe=N (is this needed if tune2fs sets defaults?)
> - examine alignment of lustre partitions (sketched below).
> - set /sys/block/sd*/queue/max_sectors_kb to 768.
> - set the client stripe size to 768KB.
> - change the max RPC size on the clients to 768KB (not sure how yet).
> - Upgrade to 1.8.4 to get benefits of patches mentioned in bug #22850.
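>
> For the alignment check, one way (the device name below is just an example)
> is to print the partition table in sectors and confirm that the Lustre
> partition starts on a multiple of the 768KB raid stripe, i.e. of 1536
> 512-byte sectors:
> # parted /dev/sdb unit s print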
>
> I am also considering (rough commands sketched after this list):
> - Increasing RPCs in flight.
> - Increasing dirty Cache size.
> - Disabling lnet debugging.
> - Changing OST service thread count.
> - Checking out MDS configuration and raid.
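>
> Rough commands for those, as far as I understand the syntax (the values
> are just placeholders):
> # lctl conf_param lustre.osc.max_rpcs_in_flight=32
> # lctl conf_param lustre.osc.max_dirty_mb=64
> # lctl set_param debug=0        (turn off most lnet/lustre debugging)
> plus the OST service thread count as a module option on the OSSes, e.g.
> "options ost oss_num_threads=256" in modprobe.conf.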
>
> Anything I'm missing?
>
> Thanks,
> Mark
>
> Andreas Dilger wrote:
>> Also setting the max RPC size on the client to be 768kB would avoid 
>> the need for each RPC to generate 2 IO requests.
>>
>> It is possible with newer tune2fs to set the RAID stripe size, and the 
>> allocator (mballoc) will use that size. There is a bug open to 
>> transfer this "optimal" size to the client, but it hasn't gotten much 
>> attention since most sites are set up with a 1MB stripe size.
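>>
>> For a 768KB stripe (6 data disks x 128KB chunks) that would be roughly
>> (exact option names vary a bit between e2fsprogs versions, see tune2fs(8);
>> the device name is just a placeholder):
>> # tune2fs -E stride=32,stripe_width=192 /dev/sdXN
>> i.e. stride = 128KB chunk / 4KB block = 32, and stripe_width = 6 x 32 =
>> 192, both in filesystem blocks.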
>>
>> Cheers, Andreas
>>
>> On 2010-06-15, at 14:19, Kevin Van Maren <kevin.van.maren at oracle.com> 
>> wrote:
>>
>>> Life is much easier with a 1MB (or 512KB) native raid stripe size.
>>>
>>>
>>> It looks like most IOs are being broken into 2 pieces.  See
>>> https://bugzilla.lustre.org/show_bug.cgi?id=22850
>>> for a few tweaks that would help get IOs > 512KB to disk.  See also 
>>> Bug 9945.
>>>
>>> But you are also seeing IOs "combined" into pieces that are between 1
>>> and 2 raid stripes, so set
>>> /sys/block/sd*/queue/max_sectors_kb to 768, so that the IO scheduler
>>> does not "help" too much.
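>>>
>>> For example (assuming sd* matches exactly the OST disks), something like:
>>> # for f in /sys/block/sd*/queue/max_sectors_kb; do echo 768 > $f; done
>>> Note that this does not persist across reboots, so it belongs in a boot
>>> script on the OSSes.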
>>>
>>> There are mkfs options to tell ldiskfs your native raid stripe size.
>>> You probably also want to change
>>> the client stripe size (lfs setstripe) to be an integral multiple of
>>> the raid size (i.e., not the default 1MB).
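>>>
>>> For example, with the 1.8 lfs syntax (the directory and stripe count are
>>> just placeholders):
>>> # lfs setstripe -s 768k -c -1 /mnt/lustre/somedir
>>> sets new files under that directory to a 768KB stripe size, striped
>>> across all OSTs.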
>>>
>>> Also note that those are power-of-2 buckets, so your 768KB chunks 
>>> aren't
>>> going to be listed as "768".
>>>
>>> Kevin
>>>
>>>
>>> mark wrote:
>>>> Hi Everyone,
>>>>
>>>> I'm trying to diagnose some performance concerns we are having 
>>>> about our
>>>> lustre deployment.  It seems to be a fairly multifaceted problem
>>>> involving how ifort does buffered writes along with how we have Lustre
>>>> set up.
>>>>
>>>> What I've identified so far is that our raid stripe size on the 
>>>> OSTs is
>>>> 768KB (6 * 128KB chunks) and the partitions are not being mounted with
>>>> -o stripe.  We have 2 LUNs per controller and each virtual disk has 2
>>>> partitions, with the 2nd one being the Lustre file system.  It is
>>>> possible the partitions are not aligned.  Most of the client side
>>>> settings are at the defaults (i.e. 8 RPCs in flight, 32MB dirty cache
>>>> per OST,
>>>> etc).  The journals are on separate SSDs.  Our OSSes are probably
>>>> oversubscribed.
>>>>
>>>> What we've noticed is that with certain apps we get *really* bad
>>>> performance to the OSTs.  As bad as 500-800KB/s to one OST.  The best
>>>> performance I've seen to an OST is around 300MB/s, with 500MB/s being
>>>> more or less the upper bound limited by IB.
>>>>
>>>> Right now I'm trying to verify that fragmentation is happening like I
>>>> would expect given the configuration mentioned above.  I just learned
>>>> about brw_stats, so I tried examining it for one of our OSTs (it looks
>>>> like Lustre must have been restarted recently, given how little data
>>>> there is):
>>>>
>>>>                             read       |      write
>>>> disk fragmented I/Os    ios   % cum %  |   ios   % cum %
>>>> 1:                        0   0    0   |   215   9    9
>>>> 2:                        0   0    0   |  2004  89   98
>>>> 3:                        0   0    0   |    22   0   99
>>>> 4:                        0   0    0   |     2   0   99
>>>> 5:                        0   0    0   |     5   0   99
>>>> 6:                        0   0    0   |     2   0   99
>>>> 7:                        1 100  100   |     1   0  100
>>>>
>>>>                             read       |      write
>>>> disk I/O size           ios   % cum %  |   ios   % cum %
>>>> 4K:                       3  42   42   |    17   0    0
>>>> 8K:                       0   0   42   |    17   0    0
>>>> 16K:                      0   0   42   |    22   0    1
>>>> 32K:                      0   0   42   |    73   1    2
>>>> 64K:                      1  14   57   |   292   6    9
>>>> 128K:                     0   0   57   |   385   8   18
>>>> 256K:                     3  42  100   |    88   2   20
>>>> 512K:                     0   0  100   |  1229  28   48
>>>> 1M:                       0   0  100   |  2218  51  100
>>>>
>>>> My questions are:
>>>>
>>>> 1) Does a disk fragmentation count of "1" mean that those IOs were
>>>> fragmented, or would that be "0"?
>>>>
>>>> 2) Does the disk I/O size mean what Lustre actually wrote or what it
>>>> wanted to write?  What does that number mean in the context of our
>>>> 768KB stripe size, since it lists so many I/Os at 1M?
>>>>
>>>> Thanks,
>>>> Mark
>>>>
>>>
>
>



