[lustre-discuss] Error Lustre/multipath/storage
Stu Midgley
sdm900 at gmail.com
Mon Mar 28 08:25:23 PDT 2016
This is what our /etc/multipath.conf looks like:
defaults {
    find_multipaths yes
    user_friendly_names yes
    queue_without_daemon yes
}
blacklist {
}
devices {
    device {
        vendor "(NETAPP|LSI|ENGENIO)"
        product "INF-01-00"
        product_blacklist "Universal Xport"
        path_grouping_policy "group_by_prio"
        path_checker "rdac"
        features "2 pg_init_retries 50"
        hardware_handler "1 rdac"
        prio "rdac"
        failback "immediate"
        rr_weight "uniform"
        no_path_retry 30
        retain_attached_hw_handler "yes"
        detect_prio "yes"
    }
}
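
For completeness, after editing /etc/multipath.conf the running maps have to be
reloaded before the new settings take effect. A minimal sketch, assuming an
EL6-era device-mapper-multipath (service and command names may differ on other
distros):

# show the merged runtime configuration multipathd is actually using
multipathd -k'show config'

# reload the maps with the new settings
multipath -r

# confirm all paths per LUN are back in an active state
multipath -ll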
On Mon, Mar 28, 2016 at 11:23 PM, Stu Midgley <sdm900 at gmail.com> wrote:
> Upgrade your IS5600 firmware. We were seeing this until we upgraded to
> the latest NetApp firmware.
>
> On Mon, Mar 28, 2016 at 10:30 PM, Ben Evans <bevans at cray.com> wrote:
>> You're getting multipathing errors, which means it's most likely not a
>> filesystem-level issue. See if you can get the logs from the storage array
>> as well, there might be some detail there as to what is happening.
>>
>> Can you check your logs and determine if it's a single connection that is
>> always failing? If so, can you try replacing the cable and see if that
>> clears it up? Next would be checking to make sure that the source and
>> destination SAS ports are good.
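>>
>> (As an aside, a quick way to see whether the failures cluster on one path is
>> to count them per device in the logs and compare against the path layout; the
>> grep pattern below just matches the "Failing path" lines from the log further
>> down and assumes the default /var/log/messages location:)
>>
>> # count how many times each path (major:minor, with trailing dot) was failed
>> grep 'multipath: Failing path' /var/log/messages | awk '{print $NF}' | sort | uniq -c
>>
>> # map those major:minor numbers back to sd devices, controllers and path groups
>> multipath -ll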
>>
>> -Ben Evans
>>
>> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of
>> Angelo Cavalcanti <acrribeiro at gmail.com>
>> Date: Monday, March 28, 2016 at 10:01 AM
>> To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
>> Subject: [lustre-discuss] Error Lustre/multipath/storage
>>
>> Dear all,
>>
>> We're having trouble with a Lustre 2.5.3 implementation. This is our setup:
>>
>>
>> One server for MGS/MDS/MDT. The MDT is served from a RAID-6 backed partition
>> of 2 TB (what type of disk?)
>>
>>
>> Two OSS/OST nodes in an active/active HA configuration with Pacemaker. Both
>> are connected to the storage via SAS.
>>
>>
>> One SGI InfiniteStorage IS5600 with two RAID-6 backed volume groups. Each
>> group has two volumes, each with 15 TB capacity.
>>
>>
>> The volumes are recognized by the OSSs as multipath devices; each volume has
>> 4 paths. The volumes were created with a GPT partition table and a single
>> partition.
>>
>>
>> Volume partitions were then formatted as OSTs with the following command:
>>
>>
>> # mkfs.lustre --replace --reformat --ost \
>>       --mkfsoptions="-E stride=128,stripe_width=1024" \
>>       --mountfsoptions="errors=remount-ro,extents,mballoc" \
>>       --fsname=lustre1 --mgsnode=10.149.0.153@o2ib1 --index=0 \
>>       --servicenode=10.149.0.151@o2ib1 --servicenode=10.149.0.152@o2ib1 \
>>       /dev/mapper/360080e500029eaec0000012656951fcap1
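>>
>> (To double-check what actually landed on the OST, the parameters can be read
>> back without modifying anything; the device path is the same one as above:)
>>
>> # show the Lustre target parameters recorded on the OST
>> tunefs.lustre --dryrun /dev/mapper/360080e500029eaec0000012656951fcap1
>>
>> # show the RAID stride / stripe width stored in the ldiskfs superblock
>> dumpe2fs -h /dev/mapper/360080e500029eaec0000012656951fcap1 | grep -i 'stride\|stripe'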
>>
>>
>> Testing with bonnie++ on a client with the command below:
>>
>> $ ./bonnie++-1.03e/bonnie++ -m lustre1 -d /mnt/lustre -s 128G:1024k -n 0 -f
>> -b -u vhpc
>>
>>
>> No problems creating files inside the Lustre mount point, but *rewriting* the
>> same files results in the errors below:
>>
>>
>> Mar 18 17:46:13 oss01 multipathd: 8:128: mark as failed
>>
>> Mar 18 17:46:13 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> remaining active paths: 3
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06
>> d8 22 00 20 00 00
>>
>> Mar 18 17:46:13 oss01 kernel: __ratelimit: 109 callbacks suppressed
>>
>> Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:128.
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 07
>> 18 22 00 18 00 00
>>
>> Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:192.
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 06
>> d8 22 00 20 00 00
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Unhandled error code
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] CDB: Read(10): 28 00 00 07
>> 18 22 00 18 00 00
>>
>> Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:64.
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 07
>> 18 22 00 18 00 00
>>
>> Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:0.
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 06
>> d8 22 00 20 00 00
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi -
>> rdac checker reports path is up
>>
>> Mar 18 17:46:14 oss01 multipathd: 8:128: reinstated
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> remaining active paths: 4
>>
>> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
>>
>> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 07
>> 18 22 00 18 00 00
>>
>> Mar 18 17:46:14 oss01 kernel: device-mapper: multipath: Failing path 8:128.
>>
>> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
>>
>> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06
>> d8 22 00 20 00 00
>>
>> Mar 18 17:46:14 oss01 multipathd: 8:128: mark as failed
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> remaining active paths: 3
>>
>> Mar 18 17:46:14 oss01 multipathd: 8:192: mark as failed
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> remaining active paths: 2
>>
>> Mar 18 17:46:14 oss01 multipathd: 8:0: mark as failed
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> remaining active paths: 1
>>
>> Mar 18 17:46:14 oss01 multipathd: 8:64: mark as failed
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> Entering recovery mode: max_retries=30
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> remaining active paths: 0
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> Entering recovery mode: max_retries=30
>>
>> Mar 18 17:46:19 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi -
>> rdac checker reports path is up
>>
>>
>> The multipath configuration (/etc/multipath.conf) is below; it is correct
>> according to the vendor (SGI).
>>
>>
>> defaults {
>>     user_friendly_names no
>> }
>>
>> blacklist {
>>     wwid "*"
>> }
>>
>> blacklist_exceptions {
>>     wwid "360080e500029eaec0000012656951fca"
>>     wwid "360080e500029eaec0000012956951fcb"
>>     wwid "360080e500029eaec0000012c56951fcb"
>>     wwid "360080e500029eaec0000012f56951fcb"
>> }
>>
>> devices {
>>     device {
>>         vendor "SGI"
>>         product "IS.*"
>>         product_blacklist "Universal Xport"
>>         getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
>>         prio "rdac"
>>         features "2 pg_init_retries 50"
>>         hardware_handler "1 rdac"
>>         path_grouping_policy "group_by_prio"
>>         failback "immediate"
>>         rr_weight "uniform"
>>         no_path_retry 30
>>         retain_attached_hw_handler "yes"
>>         detect_prio "yes"
>>         #rr_min_io 1000
>>         path_checker "rdac"
>>         #selector "round-robin 0"
>>         #polling_interval 10
>>     }
>> }
>>
>> multipaths {
>>     multipath {
>>         wwid "360080e500029eaec0000012656951fca"
>>     }
>>     multipath {
>>         wwid "360080e500029eaec0000012956951fcb"
>>     }
>>     multipath {
>>         wwid "360080e500029eaec0000012c56951fcb"
>>     }
>>     multipath {
>>         wwid "360080e500029eaec0000012f56951fcb"
>>     }
>> }
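>>
>> (For reference, while bonnie++ is running the per-path state can be watched
>> live, which helps correlate the failures above with the rewrite phase:)
>>
>> # one-second live view of every path's checker state
>> watch -n 1 "multipathd -k'show paths'"
>>
>> # and the kernel's view of the active multipath table and its features
>> dmsetup table 360080e500029eaec0000012656951fca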
>>
>>
>> Many combinations of OST formatting options were tried, with both internal
>> and external journaling, but the same errors persist.
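>>
>> (As an illustration, an external-journal variant of the format command looks
>> roughly like the sketch below; /dev/sdX is only a placeholder for a separate
>> journal device, not a device from this setup:)
>>
>> # create a dedicated external journal (placeholder device)
>> mke2fs -O journal_dev -b 4096 /dev/sdX
>>
>> # point the OST's ldiskfs at it when formatting
>> mkfs.lustre --replace --reformat --ost \
>>     --mkfsoptions="-J device=/dev/sdX -E stride=128,stripe_width=1024" \
>>     --fsname=lustre1 --mgsnode=10.149.0.153@o2ib1 --index=0 \
>>     --servicenode=10.149.0.151@o2ib1 --servicenode=10.149.0.152@o2ib1 \
>>     /dev/mapper/360080e500029eaec0000012656951fcap1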
>>
>>
>> The same bonnie++ tests were repeated on all volumes of the storage using
>> plain ext4, and all of them completed successfully.
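>>
>> (That is, formatting the same multipath partition with plain ext4, mounting it
>> locally on the OSS, and rerunning the same bonnie++ command, roughly as below;
>> /mnt/test and the machine label are placeholders:)
>>
>> mkfs.ext4 -E stride=128,stripe_width=1024 /dev/mapper/360080e500029eaec0000012656951fcap1
>> mount /dev/mapper/360080e500029eaec0000012656951fcap1 /mnt/test
>> ./bonnie++-1.03e/bonnie++ -m ext4test -d /mnt/test -s 128G:1024k -n 0 -f -b -u vhpc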
>>
>>
>> Regards,
>>
>> Angelo
>>
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>
>
>
> --
> Dr Stuart Midgley
> sdm900 at sdm900.com
--
Dr Stuart Midgley
sdm900 at sdm900.com