[Lustre-discuss] Using brw_stats to diagnose lustre performance

mark mark at msi.umn.edu
Mon Jun 14 14:20:36 PDT 2010


Hi Everyone,

I'm trying to diagnose some performance concerns we are having with our 
Lustre deployment.  It seems to be a fairly multifaceted problem 
involving how ifort does buffered writes along with how we have Lustre 
set up.

What I've identified so far is that our RAID stripe size on the OSTs is 
768KB (6 * 128KB chunks) and the partitions are not being mounted with 
-o stripe.  We have 2 LUNs per controller, and each virtual disk has 2 
partitions, with the 2nd one being the Lustre file system.  It is 
possible the partitions are not aligned.  Most of the client-side 
settings are at their defaults (i.e. 8 RPCs in flight, 32MB of dirty 
cache per OST, etc.).  The journals are on separate SSDs.  Our OSSes 
are probably oversubscribed.
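
In case it's useful, here is the rough check I'm planning to run for 
partition alignment against the 768KB stripe.  It's only a sketch: the 
device/partition names are placeholders for our OST virtual disks, and 
it assumes sysfs reports the partition start in 512-byte sectors.

#!/usr/bin/env python
# Rough partition alignment check against our 768KB full stripe
# (6 data disks * 128KB chunk).  Device/partition names are
# placeholders; substitute the real OST virtual disks.

STRIPE_BYTES = 6 * 128 * 1024      # 768KB full-stripe width
SECTOR_BYTES = 512                 # sysfs 'start' is in 512-byte sectors

def check_alignment(disk, part):
    # /sys/block/<disk>/<part>/start holds the partition start sector
    f = open("/sys/block/%s/%s/start" % (disk, part))
    start_sector = int(f.read().strip())
    f.close()
    offset = start_sector * SECTOR_BYTES
    remainder = offset % STRIPE_BYTES
    if remainder:
        print("%s: starts at byte %d, misaligned by %d bytes"
              % (part, offset, remainder))
    else:
        print("%s: starts at byte %d, stripe aligned" % (part, offset))

if __name__ == "__main__":
    # hypothetical names for the 2nd partition on each OST LUN
    for disk, part in [("sdb", "sdb2"), ("sdc", "sdc2")]:
        check_alignment(disk, part)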

What we've noticed is that with certain apps we get *really* bad 
performance to the OSTs, as bad as 500-800KB/s to a single OST.  The 
best performance I've seen to an OST is around 300MB/s, with 500MB/s 
being more or less the upper bound imposed by IB.

Right now I'm trying to verify that fragmentation is happening the way I 
would expect given the configuration mentioned above.  I just learned 
about brw_stats, so I tried examining it for one of our OSTs (judging by 
how little data there is, it looks like Lustre must have been restarted 
recently):

                          read       |      write
disk fragmented I/Os   ios   % cum % |  ios   % cum %
1:		         0   0   0   |  215   9   9
2:		         0   0   0   | 2004  89  98
3:		         0   0   0   |   22   0  99
4:		         0   0   0   |    2   0  99
5:		         0   0   0   |    5   0  99
6:		         0   0   0   |    2   0  99
7:		         1 100 100   |    1   0 100

disk I/O size          ios   % cum % |  ios   % cum %
4K:		         3  42  42   |   17   0   0
8K:		         0   0  42   |   17   0   0
16K:		         0   0  42   |   22   0   1
32K:		         0   0  42   |   73   1   2
64K:		         1  14  57   |  292   6   9
128K:		         0   0  57   |  385   8  18
256K:		         3  42 100   |   88   2  20
512K:		         0   0 100   | 1229  28  48
1M:		         0   0 100   | 2218  51 100
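
To compare OSTs more easily, I hacked up a small parser for the 
write-side "disk I/O size" histogram.  Again just a sketch: it assumes 
the /proc/fs/lustre/obdfilter/*/brw_stats path we have on our OSSes and 
the column layout above (read counts left of the '|', write counts to 
the right).

#!/usr/bin/env python
# Summarize the write side of the "disk I/O size" histogram from
# brw_stats for every OST on this OSS.  Rows look like
# "1M:  ios  %  cum%  |  ios  %  cum%", reads left of '|', writes right.

import glob

def disk_io_size_writes(path):
    # Return {bucket: write_ios} from the "disk I/O size" section.
    sizes = {}
    in_section = False
    for line in open(path):
        if line.startswith("disk I/O size"):
            in_section = True
            continue
        if in_section:
            if ":" not in line:        # blank line ends the section
                break
            bucket, rest = line.split(":", 1)
            writes = rest.split("|")[1]
            sizes[bucket.strip()] = int(writes.split()[0])
    return sizes

for path in glob.glob("/proc/fs/lustre/obdfilter/*/brw_stats"):
    sizes = disk_io_size_writes(path)
    total = sum(sizes.values())
    if total:
        print("%s: %d write I/Os, %.1f%% at 1M"
              % (path, total, 100.0 * sizes.get("1M", 0) / total))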

My questions are:

1) Does a disk fragmented I/O count of "1" mean that those I/Os were 
fragmented, or would that be "0"?

2) Does the disk I/O size mean what Lustre actually wrote or what it 
wanted to write?  What does that number mean in the context of our 768KB 
stripe size, given that it lists so many I/Os at 1M?  (Rough math on 
that below.)
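
For what it's worth, here is the back-of-the-envelope math behind 
question 2.  It's only a sketch, assuming the writes land back-to-back 
starting on a stripe boundary, with 1MB being the typical bulk write:

# How back-to-back 1MB writes land on a 768KB (6 x 128KB chunk) RAID
# stripe, assuming the first write starts on a stripe boundary.  The
# pattern repeats every 3MB (lcm of 1MB and 768KB).

STRIPE = 768 * 1024          # full-stripe width
IO     = 1024 * 1024         # 1M bucket from brw_stats

for i in range(3):
    start, end = i * IO, (i + 1) * IO
    pieces = []
    pos = start
    while pos < end:
        boundary = (pos // STRIPE + 1) * STRIPE   # next stripe boundary
        chunk_end = min(end, boundary)
        if pos % STRIPE == 0 and chunk_end == boundary:
            kind = "full stripe"
        else:
            kind = "partial stripe"               # not a full-stripe write
        pieces.append("%dKB %s" % ((chunk_end - pos) // 1024, kind))
        pos = chunk_end
    print("1MB write #%d -> %s" % (i, ", ".join(pieces)))

If I have that right, every 1MB write straddles a stripe boundary, and 
one write in three touches nothing but partial stripes, which (if these 
are parity RAID LUNs) would mean read-modify-write cycles and would be 
consistent with the fragmentation I was expecting.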

Thanks,
Mark


