[Lustre-discuss] HW RAID - fragmented I/O
Wojciech Turek
wjt27 at cam.ac.uk
Mon Jun 13 08:41:10 PDT 2011
Hi Galen,
I have tried your suggestion and mounted the OSTs directly on the /dev/sd<x>
devices, but that didn't help and I/O is still being fragmented.
Best regards,
Wojciech
On 10 June 2011 14:25, Shipman, Galen M. <gshipman at ornl.gov> wrote:
> Wojciech,
>
> We have seen similar issues with DM-Multipath. Can you experiment with
> going straight to the block device without DM-Multipath?
>
> Thanks,
>
> Galen
>
> On Jun 10, 2011, at 8:00 AM, Wojciech Turek wrote:
>
> > Hi Kevin,
> >
> > Thanks for the very helpful answer. I tried your suggestion and recompiled
> > the mpt2sas driver with the following changes:
> >
> > --- mpt2sas_base.h 2010-01-16 20:57:30.000000000 +0000
> > +++ new_mpt2sas_base.h 2011-06-10 12:53:35.000000000 +0100
> > @@ -83,13 +83,13 @@
> > #ifdef CONFIG_SCSI_MPT2SAS_MAX_SGE
> > #if CONFIG_SCSI_MPT2SAS_MAX_SGE < 16
> > #define MPT2SAS_SG_DEPTH 16
> > -#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 128
> > -#define MPT2SAS_SG_DEPTH 128
> > +#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 256
> > +#define MPT2SAS_SG_DEPTH 256
> > #else
> > #define MPT2SAS_SG_DEPTH CONFIG_SCSI_MPT2SAS_MAX_SGE
> > #endif
> > #else
> > -#define MPT2SAS_SG_DEPTH 128 /* MAX_HW_SEGMENTS */
> > +#define MPT2SAS_SG_DEPTH 256 /* MAX_HW_SEGMENTS */
> > #endif
> >
> > #if defined(TARGET_MODE)
> >
> > However, I can still see that almost 50% of writes and slightly over 50%
> > of reads are 512K I/Os or smaller.
> > I am using device-mapper-multipath to manage active/passive paths. Do you
> > think that could have something to do with the I/O fragmentation?
> >
> > Best regards,
> >
> > Wojciech
> >
> > On 8 June 2011 17:30, Kevin Van Maren <kevin.van.maren at oracle.com> wrote:
> >
> >> Yep, with 1.8.5 the problem is most likely in the (mpt2sas) driver, not
> >> in the rest of the kernel. Driver limits are not normally noticed by
> >> (non-Lustre) people, because the default kernel limits IO to 512KB.
> >>
> >> You may want to see Bug 22850 for the changes required, e.g., for the
> >> Emulex/lpfc driver.
> >>
> >> Glancing at the stock RHEL5 kernel, it looks like the issue is
> >> MPT2SAS_SG_DEPTH, which is limited to 128. This appears to be set to
> >> match the default kernel limit, but it is possible there is also a
> >> driver/HW limit. You should be able to increase that to 256 and see if
> >> it works...
> >>
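The arithmetic behind the 128 versus 256 suggestion can be sketched roughly as below. This assumes 4KB pages and the worst case where each scatter-gather entry maps one physically non-contiguous page; the helper name max_io_kb is made up for illustration and is not part of the driver.

```python
# Rough sketch: largest request a given scatter-gather depth can carry.
# Assumption: 4 KB pages, and every SG entry maps exactly one page
# (worst case, no physically contiguous pages merged into one entry).
PAGE_SIZE_KB = 4


def max_io_kb(sg_depth, page_kb=PAGE_SIZE_KB):
    """Return the worst-case maximum request size in KB for an SG depth."""
    return sg_depth * page_kb


if __name__ == "__main__":
    for depth in (128, 256):
        print(f"SG depth {depth}: up to {max_io_kb(depth)} KB per request")
```

Under these assumptions a depth of 128 caps requests at 512KB, which would line up with the 512K bucket dominating brw_stats, while 256 leaves room for a full 1MB request.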
> >>
> >> Also note that the size buckets are power-of-2, so a "1MB" entry is any
> >> IO > 512KB and <= 1MB.
> >>
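The power-of-2 bucketing described here can be illustrated with a small sketch. The function brw_bucket is hypothetical (it is not Lustre code), and it assumes the smallest bucket is 4K, as in the brw_stats output quoted in this thread.

```python
# Illustrative only: map an I/O size in bytes to a power-of-2 bucket label,
# where each label is the inclusive upper bound of its bucket.
# Assumption: the smallest bucket is 4K, matching the quoted brw_stats output.
def brw_bucket(io_bytes):
    size = 4096
    while size < io_bytes:
        size *= 2
    if size >= 1 << 20:
        return f"{size >> 20}M"
    return f"{size >> 10}K"


if __name__ == "__main__":
    for nbytes in (4096, 300 * 1024, 512 * 1024, 512 * 1024 + 1, 1 << 20):
        print(nbytes, "->", brw_bucket(nbytes))
```

So an I/O of exactly 512KB lands in the "512K" row, while anything even one byte larger, up to 1MB, is counted under "1M".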
> >> If you can't get the driver to reliably do full 1MB IOs, change to a
> >> 64KB chunk and set max_sectors_kb to 512. This will help ensure you get
> >> aligned, full-stripe writes.
> >>
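As a quick check of the chunk-size reasoning, here is a minimal sketch for the 8+2 RAID6 layout described in the original post (8 data disks per group). The helper names are made up for illustration.

```python
# Sketch: full-stripe size for a RAID group and whether a request size
# covers whole stripes. 8 data disks comes from the 8+2 RAID6 layout
# mentioned in the original post; the function names are hypothetical.
def full_stripe_kb(data_disks, chunk_kb):
    """Data written across all data disks in one stripe, in KB."""
    return data_disks * chunk_kb


def is_full_stripe(io_kb, data_disks, chunk_kb):
    """True if an aligned request of io_kb covers a whole number of stripes."""
    return io_kb % full_stripe_kb(data_disks, chunk_kb) == 0


if __name__ == "__main__":
    for chunk in (128, 64):
        print(f"chunk {chunk} KB -> full stripe {full_stripe_kb(8, chunk)} KB")
```

With a 128KB chunk the full stripe is 1MB, so a 512KB request is only a partial stripe; with a 64KB chunk the full stripe is 512KB, which matches max_sectors_kb=512.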
> >> Kevin
> >>
> >>
> >>
> >> Wojciech Turek wrote:
> >>
> >>> I am setting up a new lustre filesystem using LSI engenio based disk
> >>> enclosures with integrated dual RAID controllers. I configured disks
> >>> into 8+2 RAID6 groups using 128kb segment size (chunk size). This
> >>> hardware uses mpt2sas kernel module on the Linux host side. I use the
> >>> whole block device for an OST (to avoid any alignment issues). When
> >>> running sgpdd-survey I can see high throughput numbers (~3GB/s write,
> >>> 5GB/s read). Also, controller stats show that number of IOPS = number
> >>> of MB/s. However, as soon as I put ldiskfs on the OSTs, obdfilter shows
> >>> slower results (~2GB/s write, ~2GB/s read) and controller stats show
> >>> more than double the IOPS than MB/s. Looking at output from iostat -m -x 1
> >>> and brw_stats I can see that a large number of I/O operations are
> >>> smaller than 1MB, mostly 512KB. I know that there was some work done
> >>> on optimising the kernel block device layer to process 1MB I/O
> >>> requests and that those changes were committed to Lustre 1.8.5. Thus I
> >>> guess this I/O chopping happens below the Lustre stack, maybe in the
> >>> mpt2sas driver?
> >>>
> >>> I am hoping that someone in the Lustre community can shed some light on
> >>> my problem.
> >>>
> >>> In my setup I use:
> >>> Lustre 1.8.5
> >>> CentOS-5.5
> >>>
> >>> Some parameters I tuned from defaults in CentOS:
> >>> deadline I/O scheduler
> >>>
> >>> max_hw_sectors_kb=4096
> >>> max_sectors_kb=1024
> >>>
> >>>
> >>> brw_stats output
> >>> --
> >>>
> >>> find /proc/fs/lustre/obdfilter/ -name "testfs-OST*" | while read -r ost;
> >>> do cat "$ost/brw_stats"; done | grep "disk I/O size" -A9
> >>>
> >>> disk I/O size ios % cum % | ios % cum %
> >>> 4K: 206 0 0 | 521 0 0
> >>> 8K: 224 0 0 | 595 0 1
> >>> 16K: 105 0 1 | 479 0 1
> >>> 32K: 140 0 1 | 1108 1 3
> >>> 64K: 231 0 1 | 1470 1 4
> >>> 128K: 536 1 2 | 2259 2 7
> >>> 256K: 1762 3 6 | 5644 6 14
> >>> 512K: 31574 64 71 | 30431 35 50
> >>> 1M: 14200 28 100 | 42143 49 100
> >>> --
> >>> disk I/O size ios % cum % | ios % cum %
> >>> 4K: 187 0 0 | 457 0 0
> >>> 8K: 244 0 0 | 598 0 1
> >>> 16K: 109 0 1 | 481 0 1
> >>> 32K: 129 0 1 | 1100 1 3
> >>> 64K: 222 0 1 | 1408 1 4
> >>> 128K: 514 1 2 | 2291 2 7
> >>> 256K: 1718 3 6 | 5652 6 14
> >>> 512K: 32222 65 72 | 29810 35 49
> >>> 1M: 13654 27 100 | 42202 50 100
> >>> --
> >>> disk I/O size ios % cum % | ios % cum %
> >>> 4K: 196 0 0 | 551 0 0
> >>> 8K: 206 0 0 | 551 0 1
> >>> 16K: 79 0 0 | 513 0 1
> >>> 32K: 136 0 1 | 1048 1 3
> >>> 64K: 232 0 1 | 1278 1 4
> >>> 128K: 540 1 2 | 2172 2 7
> >>> 256K: 1681 3 6 | 5679 6 13
> >>> 512K: 31842 64 71 | 31705 37 51
> >>> 1M: 14077 28 100 | 41789 48 100
> >>> --
> >>> disk I/O size ios % cum % | ios % cum %
> >>> 4K: 190 0 0 | 486 0 0
> >>> 8K: 200 0 0 | 547 0 1
> >>> 16K: 93 0 0 | 448 0 1
> >>> 32K: 141 0 1 | 1029 1 3
> >>> 64K: 240 0 1 | 1283 1 4
> >>> 128K: 558 1 2 | 2125 2 7
> >>> 256K: 1716 3 6 | 5400 6 13
> >>> 512K: 31476 64 70 | 29029 35 48
> >>> 1M: 14366 29 100 | 42454 51 100
> >>> --
> >>> disk I/O size ios % cum % | ios % cum %
> >>> 4K: 209 0 0 | 511 0 0
> >>> 8K: 195 0 0 | 621 0 1
> >>> 16K: 79 0 0 | 558 0 1
> >>> 32K: 134 0 1 | 1135 1 3
> >>> 64K: 245 0 1 | 1390 1 4
> >>> 128K: 509 1 2 | 2219 2 7
> >>> 256K: 1715 3 6 | 5687 6 14
> >>> 512K: 31784 64 71 | 31172 36 50
> >>> 1M: 14112 28 100 | 41719 49 100
> >>> --
> >>> disk I/O size ios % cum % | ios % cum %
> >>> 4K: 201 0 0 | 500 0 0
> >>> 8K: 241 0 0 | 604 0 1
> >>> 16K: 82 0 1 | 584 0 1
> >>> 32K: 130 0 1 | 1092 1 3
> >>> 64K: 230 0 1 | 1331 1 4
> >>> 128K: 547 1 2 | 2253 2 7
> >>> 256K: 1695 3 6 | 5634 6 14
> >>> 512K: 31501 64 70 | 31836 37 51
> >>> 1M: 14343 29 100 | 41517 48 100
> >>>
> >>>
> >>>
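To put a number on the fragmentation, the "disk I/O size" section can be summarised with a short script. This is only a sketch: it embeds the first OST's table from the output above as sample text, and it assumes the usual brw_stats layout where the left-hand columns are reads and the right-hand columns are writes (the header row naming them is not shown in the post).

```python
# Sketch: share of I/Os that fall below the 1M bucket in a brw_stats
# "disk I/O size" table. Sample data is the first OST's table from this
# thread. Assumption: left columns are reads, right columns are writes.
SAMPLE = """\
disk I/O size ios % cum % | ios % cum %
4K: 206 0 0 | 521 0 0
8K: 224 0 0 | 595 0 1
16K: 105 0 1 | 479 0 1
32K: 140 0 1 | 1108 1 3
64K: 231 0 1 | 1470 1 4
128K: 536 1 2 | 2259 2 7
256K: 1762 3 6 | 5644 6 14
512K: 31574 64 71 | 30431 35 50
1M: 14200 28 100 | 42143 49 100
"""


def parse_brw_sizes(text):
    """Return a list of (bucket_label, left_ios, right_ios) tuples."""
    rows = []
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or "|" not in rest:
            continue  # skip the header and anything that is not a data row
        left, right = rest.split("|")
        rows.append((label.strip(), int(left.split()[0]), int(right.split()[0])))
    return rows


def share_below(rows, top_label, column):
    """Fraction of I/Os in `column` (1=left, 2=right) outside `top_label`."""
    total = sum(row[column] for row in rows)
    below = sum(row[column] for row in rows if row[0] != top_label)
    return below / total


if __name__ == "__main__":
    rows = parse_brw_sizes(SAMPLE)
    print(f"left column below 1M:  {share_below(rows, '1M', 1):.1%}")
    print(f"right column below 1M: {share_below(rows, '1M', 2):.1%}")
```

For this sample the result agrees with the "cum %" column of the 512K row: roughly 71% of the left-hand I/Os and 50% of the right-hand I/Os never reach the 1M bucket.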
> >>
> >>
> > _______________________________________________
> > Lustre-discuss mailing list
> > Lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>