[lustre-discuss] Error Lustre/multipath/storage

Angelo Cavalcanti acrribeiro at gmail.com
Fri Mar 18 15:17:01 PDT 2016


Dear all,

We're having trouble with a Lustre 2.5.3 deployment. This is our setup:


   - One server for MGS/MDS/MDT. The MDT is served from a 2 TB RAID-6
     backed partition (what type of disk?).

   - Two OSS/OST nodes in an active/active HA configuration with
     Pacemaker. Both are connected to the storage via SAS.

   - One SGI InfiniteStorage IS5600 with two RAID-6 backed volume groups.
     Each group has two volumes, and each volume has 15 TB of capacity.


The volumes are recognized by the OSSs as multipath devices, and each
volume has 4 paths. The volumes were created with a GPT partition table
and a single partition.
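
For reference, the path topology of one volume can be confirmed from an
OSS like this (a small sketch; the WWID is the first volume from the
multipath.conf shown further down):

# multipath -ll 360080e500029eaec0000012656951fca
# multipathd -k"show paths"    # per-path checker state for all maps

With four healthy paths, each should show as active/ready under its
priority group.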

Volume partitions were then formatted as OSTs with the following command:

# mkfs.lustre --replace --reformat --ost \
    --mkfsoptions="-i 1048576 -E stride=128,stripe_width=1024" \
    --mountfsoptions="errors=remount-ro,extents,mballoc" \
    --fsname=lustre1 --mgsnode=10.149.0.153@o2ib1 --index=0 \
    --servicenode=10.149.0.151@o2ib1 --servicenode=10.149.0.152@o2ib1 \
    /dev/mapper/360080e500029eaec0000012656951fcap1
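
The parameters written to an OST can be read back non-destructively (a
minimal sketch, using the same device node):

# tunefs.lustre --dryrun /dev/mapper/360080e500029eaec0000012656951fcap1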


Testing with bonnie++ on a client with the command below:

$ ./bonnie++-1.03e/bonnie++ -m lustre1 -d /mnt/lustre -s 128G:1024k -n 0 -f \
    -b -u vhpc
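
For context, my reading of those bonnie++ 1.03 options:

#   -m lustre1      machine label used in the report
#   -d /mnt/lustre  test directory (the Lustre mount point)
#   -s 128G:1024k   128 GB of file data in 1024 kB chunks
#   -n 0            skip the file-creation tests
#   -f              fast mode, skip the per-character I/O tests
#   -b              no write buffering, fsync() after every write
#   -u vhpc         run the test as user vhpc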


There is no problem creating files inside the Lustre mount point, but
*rewriting* the same files produces the errors below:


Mar 18 17:46:13 oss01 multipathd: 8:128: mark as failed
Mar 18 17:46:13 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 3
Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:13 oss01 kernel: __ratelimit: 109 callbacks suppressed
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:128.
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:192.
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:64.
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:0.
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi - rdac checker reports path is up
Mar 18 17:46:14 oss01 multipathd: 8:128: reinstated
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 4
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:14 oss01 kernel: device-mapper: multipath: Failing path 8:128.
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:14 oss01 multipathd: 8:128: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 3
Mar 18 17:46:14 oss01 multipathd: 8:192: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 2
Mar 18 17:46:14 oss01 multipathd: 8:0: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 1
Mar 18 17:46:14 oss01 multipathd: 8:64: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: Entering recovery mode: max_retries=30
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 0
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: Entering recovery mode: max_retries=30
Mar 18 17:46:19 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi - rdac checker reports path is up
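
While the test runs, the per-path state and the mapping of the failing
sd devices to their SAS targets can be watched from the OSS (a hedged
sketch; lsscsi may need to be installed):

# watch -n 1 'multipath -ll 360080e500029eaec0000012656951fca'
# lsscsi -t    # maps sda/sde/sdi/sdm to their SAS end devices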


The multipath configuration (/etc/multipath.conf) is below, and it is
correct according to the vendor (SGI).


defaults {
       user_friendly_names no
}

blacklist {
       wwid "*"
}

blacklist_exceptions {
       wwid "360080e500029eaec0000012656951fca"
       wwid "360080e500029eaec0000012956951fcb"
       wwid "360080e500029eaec0000012c56951fcb"
       wwid "360080e500029eaec0000012f56951fcb"
}

devices {
      device {
        vendor                       "SGI"
        product                      "IS.*"
        product_blacklist            "Universal Xport"
        getuid_callout               "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        prio                         "rdac"
        features                     "2 pg_init_retries 50"
        hardware_handler             "1 rdac"
        path_grouping_policy         "group_by_prio"
        failback                     "immediate"
        rr_weight                    "uniform"
        no_path_retry                30
        retain_attached_hw_handler   "yes"
        detect_prio                  "yes"
        #rr_min_io                   1000
        path_checker                 "rdac"
        #selector                    "round-robin 0"
        #polling_interval            10
      }
}

multipaths {
       multipath {
               wwid "360080e500029eaec0000012656951fca"
       }
       multipath {
               wwid "360080e500029eaec0000012956951fcb"
       }
       multipath {
               wwid "360080e500029eaec0000012c56951fcb"
       }
       multipath {
               wwid "360080e500029eaec0000012f56951fcb"
       }
}
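
If the file is edited between test runs, the running daemon can re-read
it and the effective merged settings can be inspected (a hedged sketch
for this generation of multipath-tools):

# multipathd -k"reconfigure"    # re-read /etc/multipath.conf
# multipath -t | less           # dump the effective configuration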


Many combinations of OST formatting options were tried, with both
internal and external journaling … but the same errors persist.


The same bonnie++ tests were repeated on all volumes of the storage
using plain ext4 (no Lustre), and all completed successfully.


Finally, I used the Lustre debug daemon with the command below:

# lctl debug_daemon start /tmp/lustre.bin
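
To stop the daemon and convert the binary dump to readable text (a short
sketch; df is the lctl shorthand for debug_file):

# lctl debug_daemon stop
# lctl df /tmp/lustre.bin /tmp/lustre.txt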

Message file is attached.


Regards,

Angelo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 20160315-lustre.log
Type: text/x-log
Size: 55951 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20160318/b6a757e5/attachment-0001.bin>

