Hi Galen,<br><br>I have tried your suggestion and mounted the OSTs directly on the /dev/sd<x> block devices, bypassing DM-Multipath, but that didn't help; the I/O is still being fragmented.<br><br>
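For reference, this is roughly how I am checking the block layer ceilings on the raw sd devices (only a sketch of the checks; sdb below is just a placeholder for one of the OST LUNs):<br>
<br>
# soft ceiling on the request size issued to the device, in KB (needs to be 1024 for 1MB I/O)<br>
cat /sys/block/sdb/queue/max_sectors_kb<br>
# hardware/driver ceiling reported for the device<br>
cat /sys/block/sdb/queue/max_hw_sectors_kb<br>
# current I/O scheduler (deadline in our setup)<br>
cat /sys/block/sdb/queue/scheduler<br>
<br>
Best regards,<br><br>Wojciech<br><br><div class="gmail_quote">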
On 10 June 2011 14:25, Shipman, Galen M. <span dir="ltr"><<a href="mailto:gshipman@ornl.gov">gshipman@ornl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
Wojciech,<br>
<br>
We have seen similar issues with DM-Multipath. Can you experiment with going straight to the block device without DM-Multipath?<br>
<br>
Thanks,<br>
<br>
Galen<br>
<div><div></div><div class="h5"><br>
On Jun 10, 2011, at 8:00 AM, Wojciech Turek wrote:<br>
<br>
> Hi Kevin,<br>
><br>
> Thanks for the very helpful answer. I tried your suggestion and recompiled the<br>
> mpt2sas driver with the following changes:<br>
><br>
> --- mpt2sas_base.h 2010-01-16 20:57:30.000000000 +0000<br>
> +++ new_mpt2sas_base.h 2011-06-10 12:53:35.000000000 +0100<br>
> @@ -83,13 +83,13 @@<br>
> #ifdef CONFIG_SCSI_MPT2SAS_MAX_SGE<br>
> #if CONFIG_SCSI_MPT2SAS_MAX_SGE < 16<br>
> #define MPT2SAS_SG_DEPTH 16<br>
> -#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 128<br>
> -#define MPT2SAS_SG_DEPTH 128<br>
> +#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 256<br>
> +#define MPT2SAS_SG_DEPTH 256<br>
> #else<br>
> #define MPT2SAS_SG_DEPTH CONFIG_SCSI_MPT2SAS_MAX_SGE<br>
> #endif<br>
> #else<br>
> -#define MPT2SAS_SG_DEPTH 128 /* MAX_HW_SEGMENTS */<br>
> +#define MPT2SAS_SG_DEPTH 256 /* MAX_HW_SEGMENTS */<br>
> #endif<br>
><br>
> #if defined(TARGET_MODE)<br>
><br>
> However, I can still see that almost 50% of writes and slightly over 50% of reads<br>
> fall into the 512K I/O bucket.<br>
> I am using device-mapper-multipath to manage the active/passive paths; do you<br>
> think that could have something to do with the I/O fragmentation?<br>
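><br>
> One way I can try to narrow down whether the splitting happens in the multipath layer or below it (just a rough sketch; dm-0 and sdb below are placeholders for one multipath map and one of its active paths):<br>
><br>
> # show which sd devices sit behind each multipath map<br>
> multipath -ll<br>
> # compare average request sizes on the dm device and on the underlying path;<br>
> # avgrq-sz is reported in 512-byte sectors, so 2048 = 1MB and 1024 = 512KB<br>
> iostat -x -m /dev/dm-0 /dev/sdb 5<br>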
><br>
> Best regards,<br>
><br>
> Wojciech<br>
><br>
> On 8 June 2011 17:30, Kevin Van Maren <<a href="mailto:kevin.van.maren@oracle.com">kevin.van.maren@oracle.com</a>> wrote:<br>
><br>
>> Yep, with 1.8.5 the problem is most likely in the (mpt2sas) driver, not in<br>
>> the rest of the kernel. Driver limits are not normally noticed by<br>
>> (non-Lustre) people, because the default kernel limits IO to 512KB.<br>
>><br>
>> You may want to see Bug 22850 for the changes required, e.g., for the Emulex/lpfc<br>
>> driver.<br>
>><br>
>> Glancing at the stock RHEL5 kernel, it looks like the issue is<br>
>> MPT2SAS_SG_DEPTH, which is limited to 128. This appears to be set to match<br>
>> the default kernel limit, but it is possible there is also a driver/HW<br>
>> limit. You should be able to increase that to 256 and see if it works...<br>
>><br>
>><br>
>> Also note that the size buckets are power-of-2, so a "1MB" entry is any IO > 512KB and <= 1MB.<br>
>><br>
>> If you can't get the driver to reliably do full 1MB IOs, change to a 64KB<br>
>> chunk and set max_sectors_kb to 512. This will help ensure you get aligned,<br>
>> full-stripe writes.<br>
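>><br>
>> A rough sketch of that fallback (sdb is a placeholder for an OST device; with a 64KB segment, 8 data disks x 64KB = 512KB per full stripe, so capping requests at 512KB keeps them full-stripe aligned):<br>
>><br>
>> # cap the request size the kernel issues to the device at 512KB<br>
>> echo 512 > /sys/block/sdb/queue/max_sectors_kb<br>
>> # verify the new ceiling<br>
>> cat /sys/block/sdb/queue/max_sectors_kb<br>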
>><br>
>> Kevin<br>
>><br>
>><br>
>><br>
>> Wojciech Turek wrote:<br>
>><br>
>>> I am setting up a new Lustre filesystem using LSI Engenio based disk<br>
>>> enclosures with integrated dual RAID controllers. I configured the disks<br>
>>> into 8+2 RAID6 groups using a 128KB segment size (chunk size). This<br>
>>> hardware uses the mpt2sas kernel module on the Linux host side. I use the<br>
>>> whole block device for an OST (to avoid any alignment issues). When<br>
>>> running sgpdd-survey I can see high throughput numbers (~3GB/s write,<br>
>>> ~5GB/s read), and the controller stats show that the number of IOPS equals the number<br>
>>> of MB/s. However, as soon as I put ldiskfs on the OSTs, obdfilter shows<br>
>>> slower results (~2GB/s write, ~2GB/s read) and the controller stats show<br>
>>> more than double the IOPS compared to MB/s. Looking at the output from iostat -m -x 1<br>
>>> and brw_stats I can see that a large number of I/O operations are<br>
>>> smaller than 1MB, mostly 512KB. I know that there was some work done<br>
>>> on optimising the kernel block device layer to process 1MB I/O<br>
>>> requests and that those changes were committed to Lustre 1.8.5. Thus I<br>
>>> guess this I/O chopping happens below the Lustre stack, maybe in the<br>
>>> mpt2sas driver?<br>
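>>><br>
>>> For context, my back-of-the-envelope arithmetic on the stripe geometry (not measured, just reasoning from the RAID layout):<br>
>>><br>
>>> # 8 data disks x 128KB segment = 1024KB per full stripe, so only 1MB-aligned<br>
>>> # writes avoid a RAID6 read-modify-write; a 512KB request touches half a stripe,<br>
>>> # which would be consistent with the controller showing ~2x more IOPS than MB/s<br>
>>> echo $((8 * 128))KB<br>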
>>><br>
>>> I am hoping that someone in the Lustre community can shed some light on<br>
>>> my problem.<br>
>>><br>
>>> In my setup I use:<br>
>>> Lustre 1.8.5<br>
>>> CentOS-5.5<br>
>>><br>
>>> Some parameters I tuned from defaults in CentOS:<br>
>>> deadline I/O scheduler<br>
>>><br>
>>> max_hw_sectors_kb=4096<br>
>>> max_sectors_kb=1024<br>
>>><br>
>>><br>
>>> brw_stats output<br>
>>> --<br>
>>><br>
>>> find /proc/fs/lustre/obdfilter/ -name "testfs-OST*" | while read ost;<br>
>>> do cat $ost/brw_stats ; done | grep "disk I/O size" -A9<br>
>>><br>
>>> disk I/O size ios % cum % | ios % cum %<br>
>>> 4K: 206 0 0 | 521 0 0<br>
>>> 8K: 224 0 0 | 595 0 1<br>
>>> 16K: 105 0 1 | 479 0 1<br>
>>> 32K: 140 0 1 | 1108 1 3<br>
>>> 64K: 231 0 1 | 1470 1 4<br>
>>> 128K: 536 1 2 | 2259 2 7<br>
>>> 256K: 1762 3 6 | 5644 6 14<br>
>>> 512K: 31574 64 71 | 30431 35 50<br>
>>> 1M: 14200 28 100 | 42143 49 100<br>
>>> --<br>
>>> disk I/O size ios % cum % | ios % cum %<br>
>>> 4K: 187 0 0 | 457 0 0<br>
>>> 8K: 244 0 0 | 598 0 1<br>
>>> 16K: 109 0 1 | 481 0 1<br>
>>> 32K: 129 0 1 | 1100 1 3<br>
>>> 64K: 222 0 1 | 1408 1 4<br>
>>> 128K: 514 1 2 | 2291 2 7<br>
>>> 256K: 1718 3 6 | 5652 6 14<br>
>>> 512K: 32222 65 72 | 29810 35 49<br>
>>> 1M: 13654 27 100 | 42202 50 100<br>
>>> --<br>
>>> disk I/O size ios % cum % | ios % cum %<br>
>>> 4K: 196 0 0 | 551 0 0<br>
>>> 8K: 206 0 0 | 551 0 1<br>
>>> 16K: 79 0 0 | 513 0 1<br>
>>> 32K: 136 0 1 | 1048 1 3<br>
>>> 64K: 232 0 1 | 1278 1 4<br>
>>> 128K: 540 1 2 | 2172 2 7<br>
>>> 256K: 1681 3 6 | 5679 6 13<br>
>>> 512K: 31842 64 71 | 31705 37 51<br>
>>> 1M: 14077 28 100 | 41789 48 100<br>
>>> --<br>
>>> disk I/O size ios % cum % | ios % cum %<br>
>>> 4K: 190 0 0 | 486 0 0<br>
>>> 8K: 200 0 0 | 547 0 1<br>
>>> 16K: 93 0 0 | 448 0 1<br>
>>> 32K: 141 0 1 | 1029 1 3<br>
>>> 64K: 240 0 1 | 1283 1 4<br>
>>> 128K: 558 1 2 | 2125 2 7<br>
>>> 256K: 1716 3 6 | 5400 6 13<br>
>>> 512K: 31476 64 70 | 29029 35 48<br>
>>> 1M: 14366 29 100 | 42454 51 100<br>
>>> --<br>
>>> disk I/O size ios % cum % | ios % cum %<br>
>>> 4K: 209 0 0 | 511 0 0<br>
>>> 8K: 195 0 0 | 621 0 1<br>
>>> 16K: 79 0 0 | 558 0 1<br>
>>> 32K: 134 0 1 | 1135 1 3<br>
>>> 64K: 245 0 1 | 1390 1 4<br>
>>> 128K: 509 1 2 | 2219 2 7<br>
>>> 256K: 1715 3 6 | 5687 6 14<br>
>>> 512K: 31784 64 71 | 31172 36 50<br>
>>> 1M: 14112 28 100 | 41719 49 100<br>
>>> --<br>
>>> disk I/O size ios % cum % | ios % cum %<br>
>>> 4K: 201 0 0 | 500 0 0<br>
>>> 8K: 241 0 0 | 604 0 1<br>
>>> 16K: 82 0 1 | 584 0 1<br>
>>> 32K: 130 0 1 | 1092 1 3<br>
>>> 64K: 230 0 1 | 1331 1 4<br>
>>> 128K: 547 1 2 | 2253 2 7<br>
>>> 256K: 1695 3 6 | 5634 6 14<br>
>>> 512K: 31501 64 70 | 31836 37 51<br>
>>> 1M: 14343 29 100 | 41517 48 100<br>
>>><br>
>>><br>
>>><br>
>><br>
>><br>
</div></div><div class="im">> _______________________________________________<br>
> Lustre-discuss mailing list<br>
> <a href="mailto:Lustre-discuss@lists.lustre.org">Lustre-discuss@lists.lustre.org</a><br>
</div>> <a href="http://lists.lustre.org/mailman/listinfo/lustre-discuss" target="_blank">http://lists.lustre.org/mailman/listinfo/lustre-discuss</a><br>
<br>
</blockquote></div><br><br>