[lustre-discuss] Error Lustre/multipath/storage

Ben Evans bevans at cray.com
Mon Mar 28 07:30:52 PDT 2016


You're getting multipathing errors, which means it's most likely not a filesystem-level issue.  See if you can get the logs from the storage array as well; there may be some detail there as to what is happening.

Can you check your logs and determine if it's a single connection that is always failing?  If so, can you try replacing the cable and see if that clears it up?  Next would be checking to make sure that the source and destination SAS ports are good.
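
If it helps, a quick way to see whether one path dominates the failures is to tally the device-mapper messages per major:minor number and then map those numbers back to sd devices.  This is only a rough sketch, assuming syslog goes to /var/log/messages on the OSS:

# count path failures per major:minor device number
grep 'multipath: Failing path' /var/log/messages | awk '{print $NF}' | sort | uniq -c | sort -rn

# map the major:minor numbers back to /dev/sdX names (and from there to SAS hosts)
ls -l /dev/sd*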

-Ben Evans

From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Angelo Cavalcanti <acrribeiro at gmail.com>
Date: Monday, March 28, 2016 at 10:01 AM
To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: [lustre-discuss] Error Lustre/multipath/storage


Dear all,

We're having trouble with a Lustre 2.5.3 deployment. This is our setup:


  *   One server for MGS/MDS/MDT. The MDT is served from a 2 TB RAID-6 backed partition (disk type to be confirmed).


  *   Two OSS/OST servers in an active/active HA configuration with Pacemaker. Both are connected to the storage via SAS.


  *   One SGI InfiniteStorage IS5600 with two RAID-6 backed volume groups. Each group has two volumes, and each volume has 15 TB of capacity.


The volumes are recognized by the OSSs as multipath devices; each volume has 4 paths. The volumes were created with a GPT partition table and a single partition.
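
The per-volume path layout and state can be checked with multipath's own tools, for example:

# list every mapped volume with its four paths and their current state
multipath -ll

# or watch a single volume by WWID while the test runs
multipath -ll 360080e500029eaec0000012656951fca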


Volume partitions were then formatted as OSTs with the following command:


# mkfs.lustre --replace --reformat --ost --mkfsoptions=" -E stride=128,stripe_width=1024" --mountfsoptions="errors=remount-ro,extents,mballoc" --fsname=lustre1 --mgsnode=10.149.0.153 at o2ib1 --index=0 --servicenode=10.149.0.151 at o2ib1 --servicenode=10.149.0.152 at o2ib1 /dev/mapper/360080e500029eaec0000012656951fcap1
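
For reference, the -E values are expressed in 4 KiB ldiskfs blocks, so the options above imply a 512 KiB per-disk chunk and 8 data disks. This is only the arithmetic behind the numbers; the real segment size and data-disk count of the IS5600 volume groups may differ and should be checked against the array configuration:

# stride       = per-disk chunk size / 4 KiB block size
# stripe_width = stride * number of data disks
echo $((128 * 4))      # 512 KiB chunk implied by stride=128
echo $((1024 / 128))   # 8 data disks implied by stripe_width=1024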


Testing from a client with bonnie++, using the command below:

$ ./bonnie++-1.03e/bonnie++ -m lustre1 -d /mnt/lustre -s 128G:1024k -n 0 -f -b -u vhpc


There is no problem creating files inside the Lustre mount point, but *rewriting* the same files results in the errors below:
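
For anyone who wants to reproduce this without bonnie++, an in-place overwrite with dd should exercise roughly the same rewrite pattern (the file name is just an example; oflag=direct keeps the client cache out of the picture):

# write a file once, then overwrite it in place without truncating,
# which is roughly what bonnie++'s rewrite pass does
dd if=/dev/zero of=/mnt/lustre/rwtest bs=1M count=1024 oflag=direct
dd if=/dev/zero of=/mnt/lustre/rwtest bs=1M count=1024 conv=notrunc oflag=direct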


Mar 18 17:46:13 oss01 multipathd: 8:128: mark as failed

Mar 18 17:46:13 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 3

Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code

Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK

Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00

Mar 18 17:46:13 oss01 kernel: __ratelimit: 109 callbacks suppressed

Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:128.

Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code

Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK

Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00

Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:192.

Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code

Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK

Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00

Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Unhandled error code

Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK

Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00

Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:64.

Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code

Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK

Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00

Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:0.

Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code

Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK

Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00

Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi - rdac checker reports path is up

Mar 18 17:46:14 oss01 multipathd: 8:128: reinstated

Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 4

Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code

Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK

Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00

Mar 18 17:46:14 oss01 kernel: device-mapper: multipath: Failing path 8:128.

Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code

Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK

Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00

Mar 18 17:46:14 oss01 multipathd: 8:128: mark as failed

Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 3

Mar 18 17:46:14 oss01 multipathd: 8:192: mark as failed

Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 2

Mar 18 17:46:14 oss01 multipathd: 8:0: mark as failed

Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 1

Mar 18 17:46:14 oss01 multipathd: 8:64: mark as failed

Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: Entering recovery mode: max_retries=30

Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 0

Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: Entering recovery mode: max_retries=30

Mar 18 17:46:19 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi - rdac checker reports path is up


The multipath configuration (/etc/multipath.conf) is below, and according to the vendor (SGI) it is correct.


defaults {

       user_friendly_names no

}


blacklist {

       wwid "*"

}


blacklist_exceptions {

       wwid "360080e500029eaec0000012656951fca"

       wwid "360080e500029eaec0000012956951fcb"

       wwid "360080e500029eaec0000012c56951fcb"

       wwid "360080e500029eaec0000012f56951fcb"

}


devices {

      device {

        vendor                       "SGI"

        product                      "IS.*"

        product_blacklist            "Universal Xport"

        getuid_callout               "/lib/udev/scsi_id --whitelisted --device=/dev/%n"

        prio                         "rdac"

        features                     "2 pg_init_retries 50"

        hardware_handler             "1 rdac"

        path_grouping_policy         "group_by_prio"

        failback                     "immediate"

        rr_weight                    "uniform"

        no_path_retry                30

        retain_attached_hw_handler   "yes"

        detect_prio                  "yes"

        #rr_min_io                   1000

        path_checker                 "rdac"

        #selector                    "round-robin 0"

        #polling_interval            10

      }

}



multipaths {

       multipath {

               wwid "360080e500029eaec0000012656951fca"

       }

       multipath {

               wwid "360080e500029eaec0000012956951fcb"

       }

       multipath {

               wwid "360080e500029eaec0000012c56951fcb"

       }

       multipath {

               wwid "360080e500029eaec0000012f56951fcb"

       }

}
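
To rule out a mismatch between this file and what the daemon actually loaded, the running configuration and live path states can be queried from multipathd itself (multipathd -k is the interactive CLI on EL6; this is a generic check, not vendor-specific):

# dump the configuration multipathd is actually using (defaults + per-device)
multipathd -k'show config'

# live path states as seen by the rdac checker
multipathd -k'show paths'
multipath -ll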


Many combinations of OST formatting options were tried, including internal and external journaling, but the same errors persist.


The same bonnie++ tests were repeated on all volumes of the storage formatted with plain ext4, and all completed successfully.


Regards,

Angelo