[lustre-discuss] Error Lustre/multipath/storage

Stu Midgley sdm900 at gmail.com
Mon Mar 28 08:23:01 PDT 2016


Upgrade your IS5600 firmware.  We were seeing this until we upgraded to
the latest NetApp firmware.
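
You can check what firmware a path member currently reports from the host
side, e.g. (assumes sg3_utils is installed; pick any one sd device behind
the multipath map):

# print the SCSI INQUIRY data; the revision field is the controller firmware
sg_inq /dev/sdi | grep -i revision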

On Mon, Mar 28, 2016 at 10:30 PM, Ben Evans <bevans at cray.com> wrote:
> You're getting multipathing errors, which means it's most likely not a
> filesystem-level issue.  See if you can get the logs from the storage array
> as well, there might be some detail there as to what is happening.
>
> Can you check your logs and determine if it's a single connection that is
> always failing?  If so, can you try replacing the cable and see if that
> clears it up?  Next would be checking to make sure that the source and
> destination SAS ports are good.
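>
> One quick way to see whether it is always the same path (a sketch, assuming
> the logs land in /var/log/messages as in the excerpt below):
>
> # show topology and per-path state for every multipath map
> multipath -ll
> # count "mark as failed" events per major:minor to spot a repeat offender
> grep "mark as failed" /var/log/messages | awk '{print $6}' | sort | uniq -c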
>
> -Ben Evans
>
> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of
> Angelo Cavalcanti <acrribeiro at gmail.com>
> Date: Monday, March 28, 2016 at 10:01 AM
> To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
> Subject: [lustre-discuss] Error Lustre/multipath/storage
>
> Dear all,
>
> We're having trouble with a Lustre 2.5.3 deployment. This is our setup:
>
>
> One server for MGS/MDS/MDT. The MDT is served from a RAID-6 backed partition
> of 2 TB (what type of HD?)
>
>
> Two OSS/OST servers in an active/active HA configuration with Pacemaker. Both
> are connected to the storage via SAS.
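>
> For reference, an OST mount can be managed by Pacemaker as an
> ocf:heartbeat:Filesystem resource, roughly like this (a sketch with a
> hypothetical resource name and mount point, not our exact CIB):
>
> primitive ost0000 ocf:heartbeat:Filesystem \
>         params device="/dev/mapper/360080e500029eaec0000012656951fcap1" \
>                directory="/mnt/ost0000" fstype="lustre" \
>         op monitor interval="120s" timeout="60s"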
>
>
> One SGI Infinite Storage IS5600 with two RAID-6 backed volume groups. Each
> group has two volumes; each volume has 15 TB of capacity.
>
>
> Volumes are recognized by the OSSs as multipath devices; each volume has 4
> paths. Volumes were created with a GPT partition table and a single
> partition.
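>
> For reference, a GPT label and a single partition on a multipath device can
> be created like this (a sketch; the 1MiB-aligned start is an assumption, not
> our exact command):
>
> # parted -s /dev/mapper/360080e500029eaec0000012656951fca mklabel gpt
> # parted -s /dev/mapper/360080e500029eaec0000012656951fca mkpart primary 1MiB 100%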
>
>
> Volume partitions were then formatted as OSTs with the following command:
>
>
> # mkfs.lustre --replace --reformat --ost \
>     --mkfsoptions="-E stride=128,stripe_width=1024" \
>     --mountfsoptions="errors=remount-ro,extents,mballoc" \
>     --fsname=lustre1 --mgsnode=10.149.0.153@o2ib1 --index=0 \
>     --servicenode=10.149.0.151@o2ib1 --servicenode=10.149.0.152@o2ib1 \
>     /dev/mapper/360080e500029eaec0000012656951fcap1
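>
> (The mkfsoptions assume the default 4 KiB ext4 block size; my arithmetic:
>
> stride       =  128 blocks * 4 KiB = 512 KiB  per-disk segment
> stripe_width = 1024 blocks * 4 KiB = 4 MiB    full stripe
>
> 4 MiB / 512 KiB = 8 data disks, consistent with an 8+2 RAID-6 group.)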
>
>
> Testing with bonnie++ on a client, with the command below:
>
> $ ./bonnie++-1.03e/bonnie++ -m lustre1 -d /mnt/lustre -s 128G:1024k -n 0 -f -b -u vhpc
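>
> What the flags do, for clarity (per the bonnie++ man page):
>
> -m lustre1       machine label for the report
> -d /mnt/lustre   directory to test (the Lustre mount point)
> -s 128G:1024k    128 GiB of file data, written in 1 MiB chunks
> -n 0             skip the small-file creation tests
> -f               fast mode, skip the per-character I/O tests
> -b               no write buffering, fsync() after every write
> -u vhpc          run as user vhpc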
>
>
> No problem creating files inside the Lustre mount point, but *rewriting* the
> same files results in the errors below:
>
>
> Mar 18 17:46:13 oss01 multipathd: 8:128: mark as failed
> Mar 18 17:46:13 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 3
> Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
> Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
> Mar 18 17:46:13 oss01 kernel: __ratelimit: 109 callbacks suppressed
> Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:128.
> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
> Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:192.
> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
> Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Unhandled error code
> Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
> Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:64.
> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
> Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:0.
> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi - rdac checker reports path is up
> Mar 18 17:46:14 oss01 multipathd: 8:128: reinstated
> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 4
> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
> Mar 18 17:46:14 oss01 kernel: device-mapper: multipath: Failing path 8:128.
> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
> Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
> Mar 18 17:46:14 oss01 multipathd: 8:128: mark as failed
> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 3
> Mar 18 17:46:14 oss01 multipathd: 8:192: mark as failed
> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 2
> Mar 18 17:46:14 oss01 multipathd: 8:0: mark as failed
> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 1
> Mar 18 17:46:14 oss01 multipathd: 8:64: mark as failed
> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: Entering recovery mode: max_retries=30
> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 0
> Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: Entering recovery mode: max_retries=30
> Mar 18 17:46:19 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi - rdac checker reports path is up
>
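>
> Decoding one of the failing CDBs (my reading of the SCSI READ(10) layout,
> assuming 512-byte logical blocks):
>
> 28 00 00 06 d8 22 00 20 00 00
> 28          -> READ(10) opcode
> 00 06 d8 22 -> logical block address 0x0006d822
> 20 00       -> transfer length 0x2000 = 8192 blocks = 4 MiB
>
> So the paths are being failed on large (3-4 MiB) reads rather than on any one
> bad sector.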
>
> The multipath configuration (/etc/multipath.conf) is below; according to the
> vendor (SGI) it is correct.
>
>
> defaults {
>        user_friendly_names no
> }
>
> blacklist {
>        wwid "*"
> }
>
> blacklist_exceptions {
>        wwid "360080e500029eaec0000012656951fca"
>        wwid "360080e500029eaec0000012956951fcb"
>        wwid "360080e500029eaec0000012c56951fcb"
>        wwid "360080e500029eaec0000012f56951fcb"
> }
>
> devices {
>       device {
>         vendor                       "SGI"
>         product                      "IS.*"
>         product_blacklist            "Universal Xport"
>         getuid_callout               "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
>         prio                         "rdac"
>         features                     "2 pg_init_retries 50"
>         hardware_handler             "1 rdac"
>         path_grouping_policy         "group_by_prio"
>         failback                     "immediate"
>         rr_weight                    "uniform"
>         no_path_retry                30
>         retain_attached_hw_handler   "yes"
>         detect_prio                  "yes"
>         #rr_min_io                   1000
>         path_checker                 "rdac"
>         #selector                    "round-robin 0"
>         #polling_interval            10
>       }
> }
>
> multipaths {
>        multipath {
>                wwid "360080e500029eaec0000012656951fca"
>        }
>        multipath {
>                wwid "360080e500029eaec0000012956951fcb"
>        }
>        multipath {
>                wwid "360080e500029eaec0000012c56951fcb"
>        }
>        multipath {
>                wwid "360080e500029eaec0000012f56951fcb"
>        }
> }
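>
> If the configuration is changed, it can be applied without a reboot
> (standard multipath-tools usage, to the best of my knowledge):
>
> # multipathd -k"reconfigure"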
>
>
> Many combinations of OST formatting options were tried, with internal and
> external journaling … but the same errors persist.
>
>
> The same bonnie++ tests were repeated on all volumes of the storage using
> plain ext4, and all completed successfully.
>
>
> Regards,
>
> Angelo
>
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>



-- 
Dr Stuart Midgley
sdm900 at sdm900.com

