[Lustre-discuss] HW RAID - fragmented I/O
Kevin Van Maren
kevin.van.maren at oracle.com
Fri Jun 10 05:38:40 PDT 2011
It's possible there is another issue, but are you sure you (or Red Hat)
are not setting CONFIG_SCSI_MPT2SAS_MAX_SGE in your .config? If that is
set, it would keep MPT2SAS_SG_DEPTH from being defined as 256 even with
your patch. I don't have a machine using this driver to check.
You could put a #warning in the code to see whether you hit the non-256
code path when building, or printk the max_sgl_entries value in
_base_allocate_memory_pools.
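
Something like this (an untested sketch; double-check the placement
against your actual source) would flag it at build time and at module
load. In mpt2sas_base.h, next to the MPT2SAS_SG_DEPTH block:

#ifdef CONFIG_SCSI_MPT2SAS_MAX_SGE
#warning CONFIG_SCSI_MPT2SAS_MAX_SGE is set in .config and overrides the 256 default
#endif

And in _base_allocate_memory_pools() in mpt2sas_base.c:

/* max_sgl_entries is the driver's file-scope module parameter */
printk(KERN_INFO "%s: max_sgl_entries = %d, MPT2SAS_SG_DEPTH = %d\n",
       __func__, max_sgl_entries, MPT2SAS_SG_DEPTH);
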
Kevin
Wojciech Turek wrote:
> Hi Kevin,
>
> Thanks for the very helpful answer. I tried your suggestion and
> recompiled the mpt2sas driver with the following changes:
>
> --- mpt2sas_base.h 2010-01-16 20:57:30.000000000 +0000
> +++ new_mpt2sas_base.h 2011-06-10 12:53:35.000000000 +0100
> @@ -83,13 +83,13 @@
> #ifdef CONFIG_SCSI_MPT2SAS_MAX_SGE
> #if CONFIG_SCSI_MPT2SAS_MAX_SGE < 16
> #define MPT2SAS_SG_DEPTH 16
> -#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 128
> -#define MPT2SAS_SG_DEPTH 128
> +#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 256
> +#define MPT2SAS_SG_DEPTH 256
> #else
> #define MPT2SAS_SG_DEPTH CONFIG_SCSI_MPT2SAS_MAX_SGE
> #endif
> #else
> -#define MPT2SAS_SG_DEPTH 128 /* MAX_HW_SEGMENTS */
> +#define MPT2SAS_SG_DEPTH 256 /* MAX_HW_SEGMENTS */
> #endif
>
> #if defined(TARGET_MODE)
>
> However, I can still see that almost 50% of writes and slightly over
> 50% of reads fall into the 512KB bucket.
> I am using device-mapper-multipath to manage active/passive paths; do
> you think that could have something to do with the I/O fragmentation?
>
> Best regards,
>
> Wojciech
>
> On 8 June 2011 17:30, Kevin Van Maren <kevin.van.maren at oracle.com> wrote:
>
> Yep, with 1.8.5 the problem is most likely in the (mpt2sas)
> driver, not in the rest of the kernel. Driver limits are not
> normally noticed by (non-Lustre) people, because the default
> kernel limits IO to 512KB.
>
> You may want to see Bug 22850 for the changes required, e.g., for
> the Emulex/lpfc driver.
>
> Glancing at the stock RHEL5 kernel, it looks like the issue is
> MPT2SAS_SG_DEPTH, which is limited to 128. This appears to be set
> to match the default kernel limit, but it is possible there is
> also a driver/HW limit. You should be able to increase that to
> 256 and see if it works...
>
>
> Also note that the size buckets are power-of-2, so a "1MB" entry
> is any IO > 512KB and <= 1MB.
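>
> In other words (a toy illustration only, not the actual brw_stats
> code), each I/O is counted in the next power-of-2 bucket at or
> above its size:
>
> /* Round an I/O size in KB up to its power-of-2 bucket. */
> static unsigned int brw_bucket_kb(unsigned int io_kb)
> {
>         unsigned int bucket = 4;    /* smallest bucket, 4KB */
>
>         while (bucket < io_kb)
>                 bucket <<= 1;       /* e.g. 600KB lands in 1024KB */
>         return bucket;
> }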
>
> If you can't get the driver to reliably do full 1MB IOs, change to
> a 64KB chunk and set max_sectors_kb to 512: with 8+2 RAID6 that
> makes 8 data disks x 64KB = one 512KB full stripe. This will help
> ensure you get aligned, full-stripe writes.
>
> Kevin
>
>
>
> Wojciech Turek wrote:
>
> I am setting up a new Lustre filesystem using LSI Engenio based
> disk enclosures with integrated dual RAID controllers. I configured
> the disks into 8+2 RAID6 groups using a 128KB segment size (chunk
> size), so a full stripe is 8 x 128KB = 1MB. This hardware uses the
> mpt2sas kernel module on the Linux host side. I use the whole block
> device for an OST (to avoid any alignment issues). When running
> sgpdd-survey I see high throughput numbers (~3GB/s write, ~5GB/s
> read), and the controller stats show that the number of IOPS equals
> the number of MB/s. However, as soon as I put ldiskfs on the OSTs,
> obdfilter shows slower results (~2GB/s write, ~2GB/s read) and the
> controller stats show more than double the IOPS relative to MB/s.
> Looking at the output of iostat -m -x 1 and brw_stats I can see
> that a large number of I/O operations are smaller than 1MB, mostly
> 512KB. I know that some work was done on optimising the kernel
> block device layer to process 1MB I/O requests and that those
> changes were committed to Lustre 1.8.5. Thus I guess this I/O
> chopping happens below the Lustre stack, maybe in the mpt2sas
> driver?
>
> I am hoping that someone in the Lustre community can shed some
> light on my problem.
>
> In my setup I use:
> Lustre 1.8.5
> CentOS-5.5
>
> Some parameters I tuned from defaults in CentOS:
> deadline I/O scheduler
>
> max_hw_sectors_kb=4096
> max_sectors_kb=1024
>