[Lustre-discuss] Using brw_stats to diagnose lustre performance

Kevin Van Maren kevin.van.maren at oracle.com
Tue Jun 15 13:19:46 PDT 2010


Life is much easier with a 1MB (or 512KB) native raid stripe size.


It looks like most IOs are being broken into 2 pieces.  See
https://bugzilla.lustre.org/show_bug.cgi?id=22850
for a few tweaks that would help get IOs > 512KB to disk.  See also Bug 9945.

But you are also seeing IOs "combined" into pieces that span between 1
and 2 raid stripes, so set /sys/block/sd*/queue/max_sectors_kb to 768 so
that the IO scheduler does not "help" too much.

There are mkfs options to tell ldiskfs your native raid stripe size.
You probably also want to change the client stripe size (lfs setstripe)
to an integral multiple of the raid stripe size (i.e., not the default 1MB).
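
Roughly along these lines -- the numbers assume your 6-data-disk,
128KB-chunk layout and a 4KB ldiskfs block size, and the device names,
NIDs, and paths are placeholders, so double-check everything against your
actual geometry before touching a formatted OST:

  # At format time, pass the raid geometry to ldiskfs:
  #   stride       = 128KB chunk / 4KB block  = 32
  #   stripe_width = 32 * 6 data disks        = 192  (= 768KB)
  mkfs.lustre --ost --mgsnode=<mgs_nid> \
      --mkfsoptions="-E stride=32,stripe_width=192" /dev/<ost_device>

  # tune2fs -E stride=32,stripe_width=192 /dev/<ost_device> should set the
  # same hints on an already-formatted OST without reformatting.

  # On a client, set the Lustre stripe size for new files in a directory
  # to an integral multiple of the 768KB raid stripe, e.g. 3MB (4 x 768KB):
  lfs setstripe -s 3M <directory_on_lustre>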

Also note that those are power-of-2 buckets, so your 768KB chunks aren't 
going to be listed as "768".
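
To see whether any of this moves more IOs into the top bucket, clear
brw_stats on the OSS, re-run the workload, and read it again.  This
assumes the 1.8-style /proc path, and the OST name is a placeholder:

  # Writing to the file clears the counters (as far as I recall).
  echo 0 > /proc/fs/lustre/obdfilter/<fsname>-OST0000/brw_stats
  # ... run the workload, then:
  cat /proc/fs/lustre/obdfilter/<fsname>-OST0000/brw_stats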

Kevin


mark wrote:
> Hi Everyone,
>
> I'm trying to diagnose some performance concerns we are having about our 
> lustre deployment.  It seems to be a fairly multifaceted problem 
> involving how ifort does buffered writes along with how we have lustre 
> set up.
>
> What I've identified so far is that our raid stripe size on the OSTs is 
> 768KB (6 * 128KB chunks) and the partitions are not being mounted with 
> -o strip.  We have 2 luns per controller and each virtual disk has 2 
> partitions with the 2nd one being the lustre file system.  It is 
> possible the partitions are not aligned.  Most of the client-side 
> settings are at default (i.e., 8 RPCs in flight, 32MB dirty cache per OST, 
> etc.).  The journals are on separate SSDs.  Our OSSes are probably 
> oversubscribed.
>
> What we've noticed is that with certain apps we get *really* bad 
> performance to the OSTs.  As bad as 500-800KB/s to one OST.  The best 
> performance I've seen to an OST is around 300MB/s, with 500MB/s being 
> more or less the upper bound limited by IB.
>
> Right now I'm trying to verify that fragmentation is happening like I 
> would expect given the configuration mentioned above.  I just learned 
> about brw_stats, so I tried examining it for one of our OSTs (judging by 
> how little data there is, lustre must have been restarted recently):
>
>                            read      |     write
> disk fragmented I/Os   ios   % cum % |  ios   % cum %
> 1:		         0   0   0   |  215   9   9
> 2:		         0   0   0   | 2004  89  98
> 3:		         0   0   0   |   22   0  99
> 4:		         0   0   0   |    2   0  99
> 5:		         0   0   0   |    5   0  99
> 6:		         0   0   0   |    2   0  99
> 7:		         1 100 100   |    1   0 100
>
>                            read      |     write
> disk I/O size          ios   % cum % |  ios   % cum %
> 4K:		         3  42  42   |   17   0   0
> 8K:		         0   0  42   |   17   0   0
> 16K:		         0   0  42   |   22   0   1
> 32K:		         0   0  42   |   73   1   2
> 64K:		         1  14  57   |  292   6   9
> 128K:		         0   0  57   |  385   8  18
> 256K:		         3  42 100   |   88   2  20
> 512K:		         0   0 100   | 1229  28  48
> 1M:		         0   0 100   | 2218  51 100
>
> My questions are:
>
> 1) Does a disk fragmentation count of "1" mean that those IOs were 
> fragmented, or would that be "0"?
>
> 2) Does the disk I/O size mean what lustre actually wrote or what it 
> wanted to write?  What does that number mean in the context of our 768KB 
> stripe size, given that so many I/Os are listed at 1M?
>
> Thanks,
> Mark
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>   



