[lustre-discuss] Error Lustre/multipath/storage
Angelo
acrribeiro at gmail.com
Tue Mar 29 06:12:53 PDT 2016
Thank you very much, Nate.
It works.
I set the "max_sectors" parameter to "4096":
# cat /etc/modprobe.d/mpt2sas.conf
options mpt2sas max_sectors=4096
And the bonnie++ tests were successfully executed.
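For anyone else hitting this: the module option only takes effect when mpt2sas is (re)loaded, so as Nate noted the initrd has to be rebuilt as well. Roughly (the exact rebuild command depends on the distribution; dracut is shown as an example):

# dracut -f
# reboot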
Regards,
Angelo
2016-03-28 19:55 GMT-03:00 Nate Pearlstein <darknater at darknater.org>:
> I thought I responded to the entire list but only sent to Angelo,
>
> Very likely, Lustre on the OSS nodes is setting max_sectors_kb all the way
> up to max_hw_sectors_kb, and this value ends up being too large for the SAS
> HBA. You should set max_sectors for your mpt2sas to something smaller, like
> 4096, and rebuild the initrd; this will put a better limit on
> max_hw_sectors_kb for the IS5600 LUNs…
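> A quick way to check whether the limit stuck is to look at the queue limits
> of one of the LUN paths (sdi here is just taken from the log below):
>
> # cat /sys/block/sdi/queue/max_hw_sectors_kb
> # cat /sys/block/sdi/queue/max_sectors_kb
>
> With max_sectors=4096 (512-byte sectors), max_hw_sectors_kb should come out
> at 2048 or lower.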
>
>
> > On Mar 28, 2016, at 6:51 PM, Dilger, Andreas <andreas.dilger at intel.com> wrote:
> >
> > On 2016/03/28, 08:01, "lustre-discuss on behalf of Angelo Cavalcanti"
> > <lustre-discuss-bounces at lists.lustre.org on behalf of acrribeiro at gmail.com> wrote:
> >
> >
> > Dear all,
> >
> > We're having trouble with a lustre 2.5.3 implementation. This is our
> setup:
> >
> >
> > * One server for MGS/MDS/MDT. The MDT is served from a RAID-6 backed
> > partition of 2TB (what type of disk?).
> >
> > Note that using RAID-6 for the MDT storage will significantly hurt your
> > metadata performance, since this will incur a lot of read-modify-write
> > overhead when doing 4KB metadata block updates.
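> > (For a single 4KB update that typically means reading the old data block
> > and both parity blocks, recomputing parity, and writing all three back, so
> > one small write turns into roughly six disk I/Os.)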
> >
> > Cheers, Andreas
> > --
> > Andreas Dilger
> > Lustre Principal Architect
> > Intel High Performance Data Division
> >
> >
> > * Two OSS/OST servers in an active/active HA setup with Pacemaker. Both
> > are connected to the storage via SAS.
> >
> >
> > * One SGI InfiniteStorage IS5600 with two RAID-6 backed volume groups.
> > Each group has two volumes, and each volume has 15TB of capacity.
> >
> >
> > The volumes are recognized by the OSSs as multipath devices; each volume
> > has 4 paths. Volumes were created with a GPT partition table and a single
> > partition.
> >
> >
> > Volume partitions were then formatted as OSTs with the following command:
> >
> >
> > # mkfs.lustre --replace --reformat --ost \
> >     --mkfsoptions=" -E stride=128,stripe_width=1024" \
> >     --mountfsoptions="errors=remount-ro,extents,mballoc" \
> >     --fsname=lustre1 --mgsnode=10.149.0.153@o2ib1 --index=0 \
> >     --servicenode=10.149.0.151@o2ib1 --servicenode=10.149.0.152@o2ib1 \
> >     /dev/mapper/360080e500029eaec0000012656951fcap1
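> > (stride and stripe_width are given in 4KB filesystem blocks, so stride=128
> > corresponds to a 512KB RAID chunk and stripe_width=1024 to a 4MB full
> > stripe, i.e. 8 data disks per RAID-6 group, assuming the options match the
> > array geometry.)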
> >
> >
> > Testing with bonnie++ on a client with the command below:
> >
> > $ ./bonnie++-1.03e/bonnie++ -m lustre1 -d /mnt/lustre -s 128G:1024k -n 0
> -f -b -u vhpc
> >
> >
> > No problem creating files inside the lustre mount point, but *rewriting*
> the same files results in the errors below:
> >
> >
> > Mar 18 17:46:13 oss01 multipathd: 8:128: mark as failed
> >
> > Mar 18 17:46:13 oss01 multipathd: 360080e500029eaec0000012656951fca:
> remaining active paths: 3
> >
> > Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
> >
> > Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> >
> > Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00
> 06 d8 22 00 20 00 00
> >
> > Mar 18 17:46:13 oss01 kernel: __ratelimit: 109 callbacks suppressed
> >
> > Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path
> 8:128.
> >
> > Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
> >
> > Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> >
> > Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00
> 07 18 22 00 18 00 00
> >
> > Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path
> 8:192.
> >
> > Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
> >
> > Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> >
> > Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00
> 06 d8 22 00 20 00 00
> >
> > Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Unhandled error code
> >
> > Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> >
> > Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] CDB: Read(10): 28 00 00
> 07 18 22 00 18 00 00
> >
> > Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path
> 8:64.
> >
> > Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
> >
> > Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> >
> > Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00
> 07 18 22 00 18 00 00
> >
> > Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:0.
> >
> > Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
> >
> > Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> >
> > Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00
> 06 d8 22 00 20 00 00
> >
> > Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi
> - rdac checker reports path is up
> >
> > Mar 18 17:46:14 oss01 multipathd: 8:128: reinstated
> >
> > Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
> remaining active paths: 4
> >
> > Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
> >
> > Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> >
> > Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00
> 07 18 22 00 18 00 00
> >
> > Mar 18 17:46:14 oss01 kernel: device-mapper: multipath: Failing path
> 8:128.
> >
> > Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
> >
> > Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> >
> > Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00
> 06 d8 22 00 20 00 00
> >
> > Mar 18 17:46:14 oss01 multipathd: 8:128: mark as failed
> >
> > Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
> remaining active paths: 3
> >
> > Mar 18 17:46:14 oss01 multipathd: 8:192: mark as failed
> >
> > Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
> remaining active paths: 2
> >
> > Mar 18 17:46:14 oss01 multipathd: 8:0: mark as failed
> >
> > Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
> remaining active paths: 1
> >
> > Mar 18 17:46:14 oss01 multipathd: 8:64: mark as failed
> >
> > Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
> Entering recovery mode: max_retries=30
> >
> > Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
> remaining active paths: 0
> >
> > Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
> Entering recovery mode: max_retries=30
> >
> > Mar 18 17:46:19 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi
> - rdac checker reports path is up
> >
> >
> > The multipath configuration (/etc/multipath.conf) is below, and it is
> > correct according to the vendor (SGI).
> >
> >
> > defaults {
> >
> > user_friendly_names no
> >
> > }
> >
> >
> > blacklist {
> >
> > wwid "*"
> >
> > }
> >
> >
> > blacklist_exceptions {
> >
> > wwid "360080e500029eaec0000012656951fca"
> >
> > wwid "360080e500029eaec0000012956951fcb"
> >
> > wwid "360080e500029eaec0000012c56951fcb"
> >
> > wwid "360080e500029eaec0000012f56951fcb"
> >
> > }
> >
> >
> > devices {
> >
> > device {
> >
> > vendor "SGI"
> >
> > product "IS.*"
> >
> > product_blacklist "Universal Xport"
> >
> > getuid_callout "/lib/udev/scsi_id --whitelisted
> --device=/dev/%n"
> >
> > prio "rdac"
> >
> > features "2 pg_init_retries 50"
> >
> > hardware_handler "1 rdac"
> >
> > path_grouping_policy "group_by_prio"
> >
> > failback "immediate"
> >
> > rr_weight "uniform"
> >
> > no_path_retry 30
> >
> > retain_attached_hw_handler "yes"
> >
> > detect_prio "yes"
> >
> > #rr_min_io 1000
> >
> > path_checker "rdac"
> >
> > #selector "round-robin 0"
> >
> > #polling_interval 10
> >
> > }
> >
> > }
> >
> >
> >
> > multipaths {
> >
> > multipath {
> >
> > wwid "360080e500029eaec0000012656951fca"
> >
> > }
> >
> > multipath {
> >
> > wwid "360080e500029eaec0000012956951fcb"
> >
> > }
> >
> > multipath {
> >
> > wwid "360080e500029eaec0000012c56951fcb"
> >
> > }
> >
> > multipath {
> >
> > wwid "360080e500029eaec0000012f56951fcb"
> >
> > }
> >
> > }
> >
> >
> > Many combinations of OST formatting options were tried, with internal and
> > external journaling… but the same errors persist.
> >
> >
> > The same bonnie++ tests were repeated on all volumes of the storage using
> > plain ext4, and all were successful.
> >
> >
> > Regards,
> >
> > Angelo
> > _______________________________________________
> > lustre-discuss mailing list
> > lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>