[Lustre-discuss] Using brw_stats to diagnose lustre performance

Bernd Schubert bs_lists at aakef.fastmail.fm
Wed Jun 16 14:40:20 PDT 2010


On Tuesday 15 June 2010, Kevin Van Maren wrote:
> Life is much easier with a 1MB (or 512KB) native raid stripe size.
> 
> 
> It looks like most IOs are being broken into 2 pieces.  See
> https://bugzilla.lustre.org/show_bug.cgi?id=22850
> for a few tweaks that would help get IOs > 512KB to disk.  See also Bug

I played with a similar patch (the blkdev defines) some time ago, but didn't
notice any performance improvement on a DDN S2A9900. Before increasing those
values I saw IOs of up to 7M; after doubling MAX_HW_SEGMENTS and
MAX_PHYS_SEGMENTS the maximum IO size doubled to 14M (a sketch of that kind of
change follows the table below). Unfortunately, more IOs at sizes in between
the "magic" good IO sizes also showed up (magic good here meaning whole
megabytes: 1, 2, 3, ..., 14 MB), e.g. lots of 1008 KB or 2032 KB IOs. Example
numbers from a production system:

Length        Port 1            Port 2            Port 3            Port 4
 Kbytes    Reads   Writes    Reads   Writes    Reads   Writes    Reads   Writes

 >  960     1DCD     2EEB     1E44     3532     1431     1D7E     14FB     2284
 >  976     1ACD     34AC     1A0F     48EB     12E2     24AE     11E1     257F
 >  992     1D46     3787     1CA7     51EB     144C     2E9B     1354     3A62
 > 1008    100A5    11B5C    10391    13765     A9B8     FBED     9E9A     D457
 > 1024   BFD41D  111F3C4   BFBE47  11A110D   8C316B   C95178   8E5A9F   C83850
 > 1040      583      625      538      6C3      3F3      513      413      337

...

 > 2032      551     1260      50D     136B      3E4     1218      3C8      BA1
 > 2048    41B85    FDB21    3B8D1   101857    31088    B78E0    2C4A5    92F48
 > 2064       FB       20      108       24       BE       19       C7       10
 > 2080       E3       2F       E6       37       AA       44       C7       1B

...

 > 7152       55      6C7       58      80C       60      70D       3F      3B4
 > 7168     449F     E335     417C     E743     3332     AB34     3686     A568
 > 7184       29        1       14        2       19        1       14        0
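
For illustration, a minimal sketch of that kind of blkdev-defines change,
assuming a 2.6.18-era include/linux/blkdev.h; the exact defines and values
touched by the patches in bug 22850 may differ. With the stock limit of 128
segments and 4 KB pages, a fully page-scattered request tops out at 512 KB,
which is why the bug is about getting IOs larger than 512 KB to disk:

    /*
     * include/linux/blkdev.h -- sketch only, not the actual bug 22850 patch.
     * Stock kernels define both limits as 128, which caps a single request
     * at 128 * 4 KB = 512 KB of scattered pages; doubling them is roughly
     * the experiment described above (max IOs went from 7M to 14M).
     */
    #define MAX_PHYS_SEGMENTS 256   /* stock value: 128 */
    #define MAX_HW_SEGMENTS   256   /* stock value: 128 */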


I don't think it matters for any storage system whether the maximum IO size is
7M or 14M, but those in-between sizes are rather annoying. And from the
brw_stats output I *sometimes* have no idea how they can happen. On the
particular system these numbers are from, users mostly don't do streaming
writes, so there the reason is clear.
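
To put rough numbers on those in-between sizes, here is a small, hypothetical
helper (not an existing DDN or Lustre tool) that sums the hex counters from a
table like the one above and reports the share of IOs that are not whole
megabytes; it assumes the table is fed on stdin as pasted, with header and
"..." lines simply skipped:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        char line[512];
        unsigned long long aligned = 0, odd = 0;

        while (fgets(line, sizeof(line), stdin)) {
            char *p = line, *end;

            /* skip leading spaces and the '>' bucket/quote markers */
            while (*p == ' ' || *p == '\t' || *p == '>')
                p++;

            /* first column: IO length in KB (decimal) */
            unsigned long len_kb = strtoul(p, &end, 10);
            if (end == p)
                continue;   /* header, blank or "..." line */

            /* remaining columns: per-port read/write counters (hex) */
            unsigned long long count = 0;
            for (p = end; ; p = end) {
                unsigned long long v = strtoull(p, &end, 16);
                if (end == p)
                    break;
                count += v;
            }

            if (len_kb % 1024 == 0)
                aligned += count;
            else
                odd += count;
        }

        printf("whole-MB IOs: %llu   odd-sized IOs: %llu (%.2f%% odd)\n",
               aligned, odd,
               aligned + odd ? 100.0 * odd / (aligned + odd) : 0.0);
        return 0;
    }

Decoded that way, the port 1 write column above already shows about 72.5k IOs
(0x11B5C) at 1008 KB next to roughly 18M (0x111F3C4) at 1024 KB.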


After tuning the FZJ system (Kevin should know that system), the SLES11 kernel
with chained scatter-gather lists (so the blkdev patch is mostly not required
anymore) can do IO sizes of up to 12MB. Unfortunately, quite a few 1008 KB IOs
showed up again without an obvious reason, even during my streaming writes
with obdecho.


Cheers,
Bernd


-- 
Bernd Schubert
DataDirect Networks


