[Lustre-discuss] Using brw_stats to diagnose lustre performance

Mark Nelson mark at msi.umn.edu
Tue Jun 15 14:55:55 PDT 2010


Hi Kevin and Andreas,

Thank you both for the excellent information!  At this point I doubt 
I'll be able to configure the raid arrays for a 1MB stripe size (as much 
as I would like to).  How can I change the max RPC size to 768KB on the 
client?
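
My best guess so far, assuming the osc "max_pages_per_rpc" tunable is
the right knob for this (please correct me if it isn't), is something
along these lines on each client:

    # 192 pages * 4KB = 768KB per RPC (the default of 256 pages gives 1MB)
    lctl set_param osc.*.max_pages_per_rpc=192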

So far on my list (rough command sketches below):

- work with tune2fs to set good stripe parameters for the FS.
- mount with -o stripe=N (is this needed if tune2fs sets defaults?)
- examine alignment of lustre partitions.
- set /sys/block/sd*/queue/max_sectors_kb to 768.
- set the client stripe size to 768KB.
- change the max RPC size on the clients to 768KB (my guess above; please confirm).
- Upgrade to 1.8.4 to get benefits of patches mentioned in bug #22850.
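
For the items above, this is roughly what I have in mind (untested
sketch; /dev/sdX and the paths are placeholders, and the numbers assume
6 data disks * 128KB chunks with 4KB blocks):

    # ldiskfs RAID hints: stride = 128KB / 4KB = 32 blocks,
    # stripe width = 6 * 32 = 192 blocks (the exact -E option spelling
    # can differ a little between e2fsprogs versions)
    tune2fs -E stride=32,stripe_width=192 /dev/sdX

    # the same value could also go into the OST mount options (in fs
    # blocks), e.g. -o stripe=192, if the tune2fs hint alone isn't enough

    # keep the IO scheduler from merging requests past one RAID stripe
    echo 768 > /sys/block/sdX/queue/max_sectors_kb

    # make new files use a 768KB Lustre stripe size instead of the
    # default 1MB (applies to files created under the given directory)
    lfs setstripe -s 768k /mnt/lustre/somedir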

I am also considering (rough notes below):
- Increasing RPCs in flight.
- Increasing dirty cache size.
- Disabling lnet debugging.
- Changing OST service thread count.
- Checking out MDS configuration and raid.
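
In command form, my rough notes for most of those (the specific values
are just examples, and I'm assuming lctl set_param works for all of
them here):

    # more RPCs in flight and more dirty cache per OSC on the clients
    # (our current defaults are 8 and 32MB)
    lctl set_param osc.*.max_rpcs_in_flight=32
    lctl set_param osc.*.max_dirty_mb=128

    # disable lnet/lustre debug logging
    lctl set_param debug=0

    # OSS service thread count is a server-side module option, e.g.
    #   options ost oss_num_threads=256   (in modprobe.conf)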

Anything I'm missing?

Thanks,
Mark

Andreas Dilger wrote:
> Also setting the max RPC size on the client to be 768kB would avoid the 
> need for each RPC to generate 2 IO requests.
> 
> It is possible with newer tune2fs to set the RAID stripe size and the 
> allocator (mballoc) will use that size. There is a bug open to transfer 
> this "optimal" size to the client, but it hasn't gotten much attention 
> since most sites are set up with 1MB stripe size.
> 
> Cheers, Andreas
> 
> On 2010-06-15, at 14:19, Kevin Van Maren <kevin.van.maren at oracle.com> 
> wrote:
> 
>> Life is much easier with a 1MB (or 512KB) native raid stripe size.
>>
>>
>> It looks like most IOs are being broken into 2 pieces.  See
>> https://bugzilla.lustre.org/show_bug.cgi?id=22850
>> for a few tweaks that would help get IOs > 512KB to disk.  See also 
>> Bug 9945
>>
>> But you are also seeing IOs "combined" into pieces that are between 1
>> and 2 raid stripes, so set
>> /sys/block/sd*/queue/max_sectors_kb to 768, so that the IO scheduler
>> does not "help" too much.
>>
>> There are mkfs options to tell ldiskfs your native raid stripe size.
>> You probably also want to change
>> the client stripe size (lfs setstripe) to be an integral multiple of the
>> raid size (ie, not the default 1MB).
>>
>> Also note that those are power-of-2 buckets, so your 768KB chunks aren't
>> going to be listed as "768".
>>
>> Kevin
>>
>>
>> mark wrote:
>>> Hi Everyone,
>>>
>>> I'm trying to diagnose some performance concerns we are having about our
>>> lustre deployment.  It seems to be a fairly multifaceted problem
>>> involving how ifort does buffered writes along with how we have lustre
>>> set up.
>>>
>>> What I've identified so far is that our raid stripe size on the OSTs is
>>> 768KB (6 * 128KB chunks) and the partitions are not being mounted with
>>> -o stripe.  We have 2 luns per controller and each virtual disk has 2
>>> partitions with the 2nd one being the lustre file system.  It is
>>> possible the partitions are not aligned.  Most of the client side
>>> settings are at default (ie 8 rpcs in flight, 32MB dirty cache per OST,
>>> etc).  The journals are on separate SSDs.  Our OSSes are probably
>>> oversubscribed.
>>>
>>> What we've noticed is that with certain apps we get *really* bad
>>> performance to the OSTs.  As bad as 500-800KB/s to one OST.  The best
>>> performance I've seen to an OST is around 300MB/s, with 500MB/s being
>>> more or less the upper bound limited by IB.
>>>
>>> Right now I'm trying to verify that fragmentation is happening like I
>>> would expect given the configuration mentioned above.  I just learned
>>> about brw_stats, so I tried examining it for one of our OSTs (it looks
>>> like lustre must have been restarted recently, given how little data there is):
>>>
>>>                             read        |      write
>>> disk fragmented I/Os    ios   %  cum %  |  ios   %  cum %
>>> 1:                        0   0      0  |  215   9      9
>>> 2:                        0   0      0  | 2004  89     98
>>> 3:                        0   0      0  |   22   0     99
>>> 4:                        0   0      0  |    2   0     99
>>> 5:                        0   0      0  |    5   0     99
>>> 6:                        0   0      0  |    2   0     99
>>> 7:                        1 100    100  |    1   0    100
>>>
>>>                             read        |      write
>>> disk I/O size           ios   %  cum %  |  ios   %  cum %
>>> 4K:                       3  42     42  |   17   0      0
>>> 8K:                       0   0     42  |   17   0      0
>>> 16K:                      0   0     42  |   22   0      1
>>> 32K:                      0   0     42  |   73   1      2
>>> 64K:                      1  14     57  |  292   6      9
>>> 128K:                     0   0     57  |  385   8     18
>>> 256K:                     3  42    100  |   88   2     20
>>> 512K:                     0   0    100  | 1229  28     48
>>> 1M:                       0   0    100  | 2218  51    100
>>>
>>> My questions are:
>>>
>>> 1) Does a disk fragmentation count of "1" mean that those I/Os were
>>> fragmented, or would that be "0"?
>>>
>>> 2) Does the disk I/O size mean what lustre actually wrote or what it
>>> wanted to write?  What does that number mean in the context of our 768KB
>>> stripe size since it lists so many I/Os at 1M?
>>>
>>> Thanks,
>>> Mark
>>>
>>


-- 
Mark Nelson, Lead Software Developer
Minnesota Supercomputing Institute
Phone: (612)626-4479
Email: mark at msi.umn.edu


