[Lustre-discuss] Using brw_stats to diagnose lustre performance
Kevin Van Maren
kevin.van.maren at oracle.com
Tue Jun 15 13:19:46 PDT 2010
Life is much easier with a 1MB (or 512KB) native raid stripe size.
It looks like most IOs are being broken into 2 pieces. See
https://bugzilla.lustre.org/show_bug.cgi?id=22850
for a few tweaks that would help get IOs > 512KB to disk. See also bug 9945.
You are also seeing IOs "combined" into pieces that are between 1
and 2 raid stripes, so set
/sys/block/sd*/queue/max_sectors_kb to 768 so that the IO scheduler
does not "help" too much.
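A sketch of applying that setting (device names are illustrative -- restrict the glob to your actual OST disks, and note the value does not persist across reboots, so re-apply it from rc.local or a udev rule):

```shell
# Cap merged block-layer I/Os at one full raid stripe (768KB) so the
# scheduler stops combining requests past a stripe boundary.
for q in /sys/block/sd*/queue/max_sectors_kb; do
    echo 768 > "$q"
done
```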
There are mkfs options to tell ldiskfs your native raid stripe size.
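For a 6-data-disk array with 128KB chunks and 4KB filesystem blocks, the ldiskfs (ext4) extended options would look roughly like this; the device, fsname, and index below are made up for illustration:

```shell
# stride       = chunk size / block size = 128KB / 4KB = 32 blocks
# stripe-width = stride * data disks     = 32 * 6      = 192 blocks
mkfs.lustre --ost --fsname=testfs --index=0 \
    --mkfsoptions="-E stride=32,stripe-width=192" /dev/sdb2
```

On an already-formatted OST, the stripe hint can instead be supplied at mount time with the ext4 mount option -o stripe=192.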
You probably also want to change
the client stripe size (lfs setstripe) to be an integral multiple of the
raid stripe size (i.e., not the default 1MB).
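A hypothetical example, picking 3MB (4 x 768KB) as the client stripe size on a directory so new files inherit it (the path is illustrative; older lfs versions take -s for the stripe size where newer ones use -S):

```shell
# Stripe size 3MB = 4 full raid stripes; stripe count -1 = use all OSTs.
lfs setstripe -s 3M -c -1 /mnt/lustre/output
```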
Also note that those are power-of-2 buckets, so your 768KB IOs aren't
going to be listed as "768"; they will land in the 1M bucket.
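For reference, brw_stats can be dumped per OST with lctl, and (assuming the usual behavior of writing to the stats file) cleared before a test run so the histogram only reflects that run:

```shell
# Dump the histogram for every OST on this OSS...
lctl get_param obdfilter.*.brw_stats
# ...and zero it before the next test run.
lctl set_param obdfilter.*.brw_stats=0
```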
Kevin
mark wrote:
> Hi Everyone,
>
> I'm trying to diagnose some performance concerns we are having about our
> lustre deployment. It seems to be a fairly multifaceted problem
> involving how ifort does buffered writes along with how we have lustre
> setup.
>
> What I've identified so far is that our raid stripe size on the OSTs is
> 768KB (6 * 128KB chunks) and the partitions are not being mounted with
> -o stripe. We have 2 luns per controller and each virtual disk has 2
> partitions with the 2nd one being the lustre file system. It is
> possible the partitions are not aligned. Most of the client side
> settings are at default (ie 8 rpcs in flight, 32MB dirty cache per OST,
> etc). The journals are on separate SSDs. Our OSSes are probably
> oversubscribed.
>
> What we've noticed is that with certain apps we get *really* bad
> performance to the OSTs. As bad as 500-800KB/s to one OST. The best
> performance I've seen to an OST is around 300MB/s, with 500MB/s being
> more or less the upper bound limited by IB.
>
> Right now I'm trying to verify that fragmentation is happening like I
> would expect given the configuration mentioned above. I just learned
> about brw_stats, so I tried examining it for one of our OSTs (it looks
> like Lustre must have been restarted recently, given how little data there is):
>
> disk fragmented I/Os   read: ios % cum % | write: ios % cum %
> 1: 0 0 0 | 215 9 9
> 2: 0 0 0 | 2004 89 98
> 3: 0 0 0 | 22 0 99
> 4: 0 0 0 | 2 0 99
> 5: 0 0 0 | 5 0 99
> 6: 0 0 0 | 2 0 99
> 7: 1 100 100 | 1 0 100
>
> disk I/O size          read: ios % cum % | write: ios % cum %
> 4K: 3 42 42 | 17 0 0
> 8K: 0 0 42 | 17 0 0
> 16K: 0 0 42 | 22 0 1
> 32K: 0 0 42 | 73 1 2
> 64K: 1 14 57 | 292 6 9
> 128K: 0 0 57 | 385 8 18
> 256K: 3 42 100 | 88 2 20
> 512K: 0 0 100 | 1229 28 48
> 1M: 0 0 100 | 2218 51 100
>
> My questions are:
>
> 1) Does a disk fragmentation count of "1" mean that the IO was
> fragmented, or would that be "0"?
>
> 2) Does the disk I/O size mean what lustre actually wrote or what it
> wanted to write? What does that number mean in the context of our 768KB
> stripe size since it lists so many I/Os at 1M?
>
> Thanks,
> Mark
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>