[lustre-discuss] Error Lustre/multipath/storage

Stu Midgley sdm900 at gmail.com
Mon Mar 28 08:25:23 PDT 2016


This is what our multipath.conf looks like:

defaults {
        find_multipaths yes
        user_friendly_names yes
        queue_without_daemon yes
}

blacklist {
}

devices {
        device {
                vendor "(NETAPP|LSI|ENGENIO)"
                product "INF-01-00"
                product_blacklist "Universal Xport"
                path_grouping_policy "group_by_prio"
                path_checker "rdac"
                features "2 pg_init_retries 50"
                hardware_handler "1 rdac"
                prio "rdac"
                failback "immediate"
                rr_weight "uniform"
                no_path_retry 30
                retain_attached_hw_handler "yes"
                detect_prio "yes"
        }
}
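
After changing the config, reloading multipathd and listing the maps should show
all four paths per LUN grouped and active (standard multipath-tools commands,
nothing specific to this array):

# multipathd -k"reconfigure"
# multipath -ll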

On Mon, Mar 28, 2016 at 11:23 PM, Stu Midgley <sdm900 at gmail.com> wrote:
> upgrade your IS5600 firmware.  We were seeing this till we upgraded to
> the latest NetApp firmware.
>
> On Mon, Mar 28, 2016 at 10:30 PM, Ben Evans <bevans at cray.com> wrote:
>> You're getting multipathing errors, which means it's most likely not a
>> filesystem-level issue.  See if you can get the logs from the storage array
>> as well, there might be some detail there as to what is happening.
>>
>> Can you check your logs and determine if it's a single connection that is
>> always failing?  If so, can you try replacing the cable and see if that
>> clears it up?  Next would be checking to make sure that the source and
>> destination SAS ports are good.
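>>
>> For example (hypothetical commands; adjust the log path for your distribution),
>> counting which paths and sd devices show up in the errors should make a single
>> bad link stand out:
>>
>> # grep "Failing path" /var/log/messages | awk '{print $NF}' | sort | uniq -c
>> # grep "Unhandled error code" /var/log/messages | grep -o '\[sd[a-z]*\]' | sort | uniq -c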
>>
>> -Ben Evans
>>
>> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of
>> Angelo Cavalcanti <acrribeiro at gmail.com>
>> Date: Monday, March 28, 2016 at 10:01 AM
>> To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
>> Subject: [lustre-discuss] Error Lustre/multipath/storage
>>
>> Dear all,
>>
>> We're having trouble with a lustre 2.5.3 implementation. This is our setup:
>>
>>
>> One server for MGS/MDS/MDT. The MDT is served from a 2 TB RAID-6 backed
>> partition (what type of disk?).
>>
>>
>> Two OSS/OST servers in an active/active HA configuration with Pacemaker. Both
>> are connected to the storage via SAS.
>>
>>
>> One SGI InfiniteStorage IS5600 with two RAID-6 backed volume groups. Each
>> group has two volumes; each volume has 15 TB of capacity.
>>
>>
>> Volumes are recognized by the OSSs as multipath devices; each volume has 4
>> paths. Volumes were created with a GPT partition table and a single
>> partition.
>>
>>
>> Volume partitions were then formatted as OSTs with the following command:
>>
>>
>> # mkfs.lustre --replace --reformat --ost --mkfsoptions=" -E
>> stride=128,stripe_width=1024"
>> --mountfsoptions="errors=remount-ro,extents,mballoc" --fsname=lustre1
>> --mgsnode=10.149.0.153 at o2ib1 --index=0 --servicenode=10.149.0.151 at o2ib1
>> --servicenode=10.149.0.152 at o2ib1
>> /dev/mapper/360080e500029eaec0000012656951fcap1
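>>
>> (For reference, the resulting OST parameters can be inspected without modifying
>> anything using standard Lustre tooling, e.g.:)
>>
>> # tunefs.lustre --dryrun /dev/mapper/360080e500029eaec0000012656951fcap1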
>>
>>
>> Testing with bonnie++ on a client with the command below:
>>
>> $ ./bonnie++-1.03e/bonnie++ -m lustre1 -d /mnt/lustre -s 128G:1024k -n 0 -f
>> -b -u vhpc
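>>
>> (A simpler way to reproduce the rewrite pattern outside bonnie++, with a
>> hypothetical file name and size, is to overwrite an existing file in place
>> with dd:)
>>
>> $ dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=4096
>> $ dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=4096 conv=notrunc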
>>
>>
>> There is no problem creating files inside the Lustre mount point, but *rewriting*
>> the same files results in the errors below:
>>
>>
>> Mar 18 17:46:13 oss01 multipathd: 8:128: mark as failed
>>
>> Mar 18 17:46:13 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> remaining active paths: 3
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06
>> d8 22 00 20 00 00
>>
>> Mar 18 17:46:13 oss01 kernel: __ratelimit: 109 callbacks suppressed
>>
>> Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:128.
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 07
>> 18 22 00 18 00 00
>>
>> Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:192.
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 06
>> d8 22 00 20 00 00
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Unhandled error code
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] CDB: Read(10): 28 00 00 07
>> 18 22 00 18 00 00
>>
>> Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:64.
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 07
>> 18 22 00 18 00 00
>>
>> Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:0.
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 06
>> d8 22 00 20 00 00
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi -
>> rdac checker reports path is up
>>
>> Mar 18 17:46:14 oss01 multipathd: 8:128: reinstated
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> remaining active paths: 4
>>
>> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
>>
>> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 07
>> 18 22 00 18 00 00
>>
>> Mar 18 17:46:14 oss01 kernel: device-mapper: multipath: Failing path 8:128.
>>
>> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
>>
>> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result:
>> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
>>
>> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06
>> d8 22 00 20 00 00
>>
>> Mar 18 17:46:14 oss01 multipathd: 8:128: mark as failed
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> remaining active paths: 3
>>
>> Mar 18 17:46:14 oss01 multipathd: 8:192: mark as failed
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> remaining active paths: 2
>>
>> Mar 18 17:46:14 oss01 multipathd: 8:0: mark as failed
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> remaining active paths: 1
>>
>> Mar 18 17:46:14 oss01 multipathd: 8:64: mark as failed
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> Entering recovery mode: max_retries=30
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> remaining active paths: 0
>>
>> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca:
>> Entering recovery mode: max_retries=30
>>
>> Mar 18 17:46:19 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi -
>> rdac checker reports path is up
>>
>>
>> The multipath configuration (/etc/multipath.conf) is below, and is correct
>> according to the vendor (SGI).
>>
>>
>> defaults {
>>        user_friendly_names no
>> }
>>
>> blacklist {
>>        wwid "*"
>> }
>>
>> blacklist_exceptions {
>>        wwid "360080e500029eaec0000012656951fca"
>>        wwid "360080e500029eaec0000012956951fcb"
>>        wwid "360080e500029eaec0000012c56951fcb"
>>        wwid "360080e500029eaec0000012f56951fcb"
>> }
>>
>> devices {
>>       device {
>>         vendor                       "SGI"
>>         product                      "IS.*"
>>         product_blacklist            "Universal Xport"
>>         getuid_callout               "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
>>         prio                         "rdac"
>>         features                     "2 pg_init_retries 50"
>>         hardware_handler             "1 rdac"
>>         path_grouping_policy         "group_by_prio"
>>         failback                     "immediate"
>>         rr_weight                    "uniform"
>>         no_path_retry                30
>>         retain_attached_hw_handler   "yes"
>>         detect_prio                  "yes"
>>         #rr_min_io                   1000
>>         path_checker                 "rdac"
>>         #selector                    "round-robin 0"
>>         #polling_interval            10
>>       }
>> }
>>
>> multipaths {
>>        multipath {
>>                wwid "360080e500029eaec0000012656951fca"
>>        }
>>        multipath {
>>                wwid "360080e500029eaec0000012956951fcb"
>>        }
>>        multipath {
>>                wwid "360080e500029eaec0000012c56951fcb"
>>        }
>>        multipath {
>>                wwid "360080e500029eaec0000012f56951fcb"
>>        }
>> }
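>>
>> (To verify what multipathd actually loaded, including the built-in defaults it
>> merges for this device type, the running configuration can be dumped with a
>> standard command:)
>>
>> # multipathd -k"show config"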
>>
>>
>> Many combinations of OST formatting options were tried, including internal and
>> external journaling, but the same errors persist.
>>
>>
>> The same bonnie++ tests were repeated on all volumes of the storage using plain
>> ext4 (without Lustre), and all completed successfully.
>>
>>
>> Regards,
>>
>> Angelo
>>
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>
>
>
> --
> Dr Stuart Midgley
> sdm900 at sdm900.com



-- 
Dr Stuart Midgley
sdm900 at sdm900.com

