[Lustre-discuss] HW RAID - fragmented I/O
Kevin Van Maren
kevin.van.maren at oracle.com
Fri Jun 10 05:38:40 PDT 2011
It's possible there is another issue, but are you sure you (or Red Hat)
are not setting CONFIG_SCSI_MPT2SAS_MAX_SGE in your .config? If that is
set, it would keep MPT2SAS_SG_DEPTH from being defined as 256 even with
your patch. I don't have a machine using this driver to check.
You could put a #warning in the code to see whether you hit the non-256
code path when building, or printk the max_sgl_entries value in
_base_allocate_memory_pools.
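
Something like this (an untested sketch; double-check the placement
against your actual source) would flag it at build time and at module
load. In mpt2sas_base.h, next to the MPT2SAS_SG_DEPTH block:

#ifdef CONFIG_SCSI_MPT2SAS_MAX_SGE
#warning CONFIG_SCSI_MPT2SAS_MAX_SGE is set in .config and overrides the 256 default
#endif

And in _base_allocate_memory_pools() in mpt2sas_base.c:

/* max_sgl_entries is the driver's file-scope module parameter */
printk(KERN_INFO "%s: max_sgl_entries = %d, MPT2SAS_SG_DEPTH = %d\n",
       __func__, max_sgl_entries, MPT2SAS_SG_DEPTH);
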
Kevin
Wojciech Turek wrote:
> Hi Kevin,
>
> Thanks for the very helpful answer. I tried your suggestion and
> recompiled the mpt2sas driver with the following changes:
>
> --- mpt2sas_base.h 2010-01-16 20:57:30.000000000 +0000
> +++ new_mpt2sas_base.h 2011-06-10 12:53:35.000000000 +0100
> @@ -83,13 +83,13 @@
> #ifdef CONFIG_SCSI_MPT2SAS_MAX_SGE
> #if CONFIG_SCSI_MPT2SAS_MAX_SGE < 16
> #define MPT2SAS_SG_DEPTH 16
> -#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 128
> -#define MPT2SAS_SG_DEPTH 128
> +#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 256
> +#define MPT2SAS_SG_DEPTH 256
> #else
> #define MPT2SAS_SG_DEPTH CONFIG_SCSI_MPT2SAS_MAX_SGE
> #endif
> #else
> -#define MPT2SAS_SG_DEPTH 128 /* MAX_HW_SEGMENTS */
> +#define MPT2SAS_SG_DEPTH 256 /* MAX_HW_SEGMENTS */
> #endif
>
> #if defined(TARGET_MODE)
>
> However, I can still see that almost 50% of writes and slightly over
> 50% of reads fall into the 512KB bucket.
> I am using device-mapper-multipath to manage active/passive paths; do
> you think that could have something to do with the I/O fragmentation?
>
> Best regards,
>
> Wojciech
>
> On 8 June 2011 17:30, Kevin Van Maren <kevin.van.maren at oracle.com> wrote:
>
> Yep, with 1.8.5 the problem is most likely in the (mpt2sas)
> driver, not in the rest of the kernel. Driver limits are not
> normally noticed by (non-Lustre) people, because the default
> kernel limits IO to 512KB.
>
> You may want to see Bug 22850 for the changes required, e.g., for
> the Emulex/lpfc driver.
>
> Glancing at the stock RHEL5 kernel, it looks like the issue is
> MPT2SAS_SG_DEPTH, which is limited to 128. This appears to be set
> to match the default kernel limit, but it is possible there is
> also a driver/HW limit. You should be able to increase that to
> 256 and see if it works...
>
>
> Also note that the size buckets are power-of-2, so a "1MB" entry
> is any IO > 512KB and <= 1MB.
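>
> In other words (a toy illustration only, not the actual brw_stats
> code), each I/O is counted in the next power-of-2 bucket at or
> above its size:
>
> /* Round an I/O size in KB up to its power-of-2 bucket. */
> static unsigned int brw_bucket_kb(unsigned int io_kb)
> {
>         unsigned int bucket = 4;    /* smallest bucket, 4KB */
>
>         while (bucket < io_kb)
>                 bucket <<= 1;       /* e.g. 600KB lands in 1024KB */
>         return bucket;
> }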
>
> If you can't get the driver to reliably do full 1MB IOs, change to
> a 64KB chunk and set max_sectors_kb to 512: with 8+2 RAID6 that
> makes 8 data disks x 64KB = one 512KB full stripe. This will help
> ensure you get aligned, full-stripe writes.
>
> Kevin
>
>
>
> Wojciech Turek wrote:
>
> I am setting up a new Lustre filesystem using LSI Engenio based
> disk enclosures with integrated dual RAID controllers. I configured
> the disks into 8+2 RAID6 groups using a 128KB segment size (chunk
> size), so a full stripe is 8 x 128KB = 1MB. This hardware uses the
> mpt2sas kernel module on the Linux host side. I use the whole block
> device for an OST (to avoid any alignment issues). When running
> sgpdd-survey I see high throughput numbers (~3GB/s write, ~5GB/s
> read), and the controller stats show that the number of IOPS equals
> the number of MB/s. However, as soon as I put ldiskfs on the OSTs,
> obdfilter shows slower results (~2GB/s write, ~2GB/s read) and the
> controller stats show more than double the IOPS relative to MB/s.
> Looking at the output of iostat -m -x 1 and brw_stats I can see
> that a large number of I/O operations are smaller than 1MB, mostly
> 512KB. I know that some work was done on optimising the kernel
> block device layer to process 1MB I/O requests and that those
> changes were committed to Lustre 1.8.5. Thus I guess this I/O
> chopping happens below the Lustre stack, maybe in the mpt2sas
> driver?
>
> I am hoping that someone in the Lustre community can shed some
> light on my problem.
>
> In my setup I use:
> Lustre 1.8.5
> CentOS-5.5
>
> Some parameters I tuned from defaults in CentOS:
> deadline I/O scheduler
>
> max_hw_sectors_kb=4096
> max_sectors_kb=1024
>