[Lustre-discuss] Using brw_stats to diagnose lustre performance
Andreas Dilger
andreas.dilger at oracle.com
Tue Jun 15 14:08:09 PDT 2010
Also setting the max RPC size on the client to be 768kB would avoid
the need for each RPC to generate 2 IO requests.

It is possible with newer tune2fs to set the RAID stripe size, and the
allocator (mballoc) will use that size. There is a bug open to
transfer this "optimal" size to the client, but it hasn't gotten much
attention since most sites are set up with a 1MB stripe size.
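As a sketch of that client-side setting, assuming 4kB pages and the
usual osc parameter name (the RPC size is expressed in pages, so
verify against your Lustre version):

```shell
# 768 kB / 4 kB per page = 192 pages per RPC
lctl set_param osc.*.max_pages_per_rpc=192
```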
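For the 6 * 128kB geometry discussed below in this thread, assuming
4kB filesystem blocks (the device name is a placeholder), the tune2fs
step might look roughly like:

```shell
# stride       = chunk size / block size = 128kB / 4kB = 32 blocks
# stripe_width = data disks * stride     = 6 * 32      = 192 blocks
tune2fs -E stride=32,stripe_width=192 /dev/sdX
```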
Cheers, Andreas
On 2010-06-15, at 14:19, Kevin Van Maren <kevin.van.maren at oracle.com>
wrote:
> Life is much easier with a 1MB (or 512KB) native RAID stripe size.
>
>
> It looks like most IOs are being broken into 2 pieces. See
> https://bugzilla.lustre.org/show_bug.cgi?id=22850
> for a few tweaks that would help get IOs > 512KB to disk. See also
> Bug 9945
>
> But you are also seeing IOs "combined" into pieces that are between 1
> and 2 RAID stripes, so set /sys/block/sd*/queue/max_sectors_kb to
> 768 so that the IO scheduler does not "help" too much.
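A minimal sketch of that setting, assuming the OSTs sit on /dev/sd*
devices (adjust the glob to match the actual hardware):

```shell
# Cap block-layer requests at one RAID stripe (768kB) so the
# scheduler does not merge I/Os across stripe boundaries
for q in /sys/block/sd*/queue/max_sectors_kb; do
    echo 768 > "$q"
done
```

Note this does not persist across reboots; it would need to be
reapplied from an init script or udev rule.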
>
> There are mkfs options to tell ldiskfs your native RAID stripe size.
> You probably also want to change the client stripe size (lfs
> setstripe) to be an integral multiple of the RAID size (i.e., not
> the default 1MB).
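For illustration, with hypothetical device, fsname, MGS and directory
names (mkfs.lustre option syntax varies across Lustre versions, so
verify before use):

```shell
# Pass the RAID geometry (6 data disks * 128kB chunks) to ldiskfs
mkfs.lustre --ost --fsname=testfs --mgsnode=mgs@o2ib \
    --mkfsoptions="-E stride=32,stripe_width=192" /dev/sdX

# Stripe files in 3MB units: 4 * 768kB, an integral multiple of the
# RAID stripe, rather than the default 1MB
lfs setstripe -s 3145728 /mnt/testfs/output
```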
>
> Also note that those are power-of-2 buckets, so your 768KB chunks
> aren't
> going to be listed as "768".
>
> Kevin
>
>
> mark wrote:
>> Hi Everyone,
>>
>> I'm trying to diagnose some performance concerns we are having about
>> our Lustre deployment. It seems to be a fairly multifaceted problem
>> involving how ifort does buffered writes along with how we have
>> Lustre set up.
>>
>> What I've identified so far is that our RAID stripe size on the OSTs
>> is 768KB (6 * 128KB chunks) and the partitions are not being mounted
>> with -o stripe. We have 2 LUNs per controller and each virtual disk
>> has 2 partitions, with the 2nd one being the Lustre file system. It
>> is possible the partitions are not aligned. Most of the client-side
>> settings are at default (i.e., 8 RPCs in flight, 32MB dirty cache
>> per OST, etc). The journals are on separate SSDs. Our OSSes are
>> probably oversubscribed.
>>
>> What we've noticed is that with certain apps we get *really* bad
>> performance to the OSTs. As bad as 500-800KB/s to one OST. The best
>> performance I've seen to an OST is around 300MB/s, with 500MB/s being
>> more or less the upper bound limited by IB.
>>
>> Right now I'm trying to verify that fragmentation is happening like I
>> would expect given the configuration mentioned above. I just learned
>> about brw_stats, so I tried examining it for one of our OSTs (it
>> looks like Lustre must have been restarted recently, given how
>> little data there is):
>>
>>                               read        |       write
>> disk fragmented I/Os    ios   %   cum %   |    ios   %   cum %
>> 1:                        0   0       0   |    215   9       9
>> 2:                        0   0       0   |   2004  89      98
>> 3:                        0   0       0   |     22   0      99
>> 4:                        0   0       0   |      2   0      99
>> 5:                        0   0       0   |      5   0      99
>> 6:                        0   0       0   |      2   0      99
>> 7:                        1 100     100   |      1   0     100
>>
>>                               read        |       write
>> disk I/O size           ios   %   cum %   |    ios   %   cum %
>> 4K:                       3  42      42   |     17   0       0
>> 8K:                       0   0      42   |     17   0       0
>> 16K:                      0   0      42   |     22   0       1
>> 32K:                      0   0      42   |     73   1       2
>> 64K:                      1  14      57   |    292   6       9
>> 128K:                     0   0      57   |    385   8      18
>> 256K:                     3  42     100   |     88   2      20
>> 512K:                     0   0     100   |   1229  28      48
>> 1M:                       0   0     100   |   2218  51     100
>>
>> My questions are:
>>
>> 1) Does a "disk fragmented I/Os" count of "1" mean that those I/Os
>> were fragmented, or would that be "0"?
>>
>> 2) Does the disk I/O size mean what Lustre actually wrote or what it
>> wanted to write? What does that number mean in the context of our
>> 768KB stripe size, since it lists so many I/Os at 1M?
>>
>> Thanks,
>> Mark
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>