[lustre-discuss] Error Lustre/multipath/storage
Angelo Cavalcanti
acrribeiro at gmail.com
Fri Mar 18 15:17:01 PDT 2016
Dear all,
We're having trouble with a Lustre 2.5.3 installation. This is our setup:
- One server for MGS/MDS/MDT. The MDT is served from a RAID-6 backed
  partition of 2TB (what type of disk?).
- Two OSS/OST servers in an active/active HA configuration with Pacemaker.
  Both are connected to the storage via SAS.
- One SGI Infinite Storage IS5600 with two RAID-6 backed volume groups.
  Each group has two volumes, and each volume has 15TB capacity.
Volumes are recognized by the OSSs as multipath devices; each volume has 4
paths. Volumes were created with a GPT partition table and a single
partition.
Volume partitions were then formatted as OSTs with the following command:
# mkfs.lustre --replace --reformat --ost \
    --mkfsoptions="-i 1048576 -E stride=128,stripe_width=1024" \
    --mountfsoptions="errors=remount-ro,extents,mballoc" \
    --fsname=lustre1 --mgsnode=10.149.0.153@o2ib1 --index=0 \
    --servicenode=10.149.0.151@o2ib1 --servicenode=10.149.0.152@o2ib1 \
    /dev/mapper/360080e500029eaec0000012656951fcap1
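As a sanity check on the -E values in the command above, the RAID geometry they imply can be worked out in a few lines of Python. This is only a sketch: it assumes a 4 KiB filesystem block size, which the post does not state, and the function name raid_geometry is made up for illustration. In mke2fs terms, stride is the RAID chunk size in filesystem blocks and stripe_width is the number of blocks in one full stripe across the data disks.

```python
def raid_geometry(stride_blocks, stripe_width_blocks, block_size=4096):
    """Return (chunk_bytes, data_disks, full_stripe_bytes) implied by mke2fs -E values.

    block_size=4096 is an assumption (the usual ext4/ldiskfs default),
    not a value taken from the post.
    """
    chunk_bytes = stride_blocks * block_size
    data_disks, remainder = divmod(stripe_width_blocks, stride_blocks)
    if remainder:
        raise ValueError("stripe_width is not a whole multiple of stride")
    full_stripe_bytes = stripe_width_blocks * block_size
    return chunk_bytes, data_disks, full_stripe_bytes


# Values taken from the mkfs.lustre command: stride=128, stripe_width=1024
chunk, disks, stripe = raid_geometry(stride_blocks=128, stripe_width_blocks=1024)
print(f"chunk per disk : {chunk // 1024} KiB")
print(f"data disks     : {disks}")
print(f"full stripe    : {stripe // (1024 * 1024)} MiB")
```

Under the 4 KiB assumption, the options describe a 512 KiB chunk across 8 data disks, giving a 4 MiB full stripe. An 8+2 RAID-6 layout would match, but the post does not say whether that is the actual geometry of the IS5600 volume groups.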
We are testing with bonnie++ on a client, using the command below:
$ ./bonnie++-1.03e/bonnie++ -m lustre1 -d /mnt/lustre -s 128G:1024k -n 0 -f -b -u vhpc
Creating files inside the Lustre mount point works without problems, but
*rewriting* the same files results in the errors below:
Mar 18 17:46:13 oss01 multipathd: 8:128: mark as failed
Mar 18 17:46:13 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 3
Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:13 oss01 kernel: __ratelimit: 109 callbacks suppressed
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:128.
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:192.
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:64.
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:0.
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi - rdac checker reports path is up
Mar 18 17:46:14 oss01 multipathd: 8:128: reinstated
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 4
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:14 oss01 kernel: device-mapper: multipath: Failing path 8:128.
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:14 oss01 multipathd: 8:128: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 3
Mar 18 17:46:14 oss01 multipathd: 8:192: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 2
Mar 18 17:46:14 oss01 multipathd: 8:0: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 1
Mar 18 17:46:14 oss01 multipathd: 8:64: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: Entering recovery mode: max_retries=30
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 0
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: Entering recovery mode: max_retries=30
Mar 18 17:46:19 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi - rdac checker reports path is up
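The CDB bytes in the kernel messages above can be decoded to see which reads were failing. The sketch below follows the standard SCSI READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8). The 512-byte logical sector size is an assumption, since the post does not give it, and decode_read10 is a name made up for this example.

```python
def decode_read10(cdb_hex, sector_size=512):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks, bytes). sector_size=512 is an assumption;
    the post does not state the logical block size of the volumes.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length in blocks
    return lba, blocks, blocks * sector_size


# The two distinct CDBs that appear in the log excerpt
for cdb in ("28 00 00 06 d8 22 00 20 00 00",
            "28 00 00 07 18 22 00 18 00 00"):
    lba, blocks, nbytes = decode_read10(cdb)
    print(f"{cdb}: LBA {lba}, {blocks} blocks, {nbytes // 1024} KiB")
```

With 512-byte sectors, the two commands are reads of 8192 blocks (4 MiB) at LBA 448546 and 6144 blocks (3 MiB) at LBA 464930. The byte sizes are only as reliable as the sector-size assumption.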
Multipath configuration ( /etc/multipath.conf ) is below, and is correct
according to the vendor (SGI).
defaults {
    user_friendly_names no
}
blacklist {
    wwid "*"
}
blacklist_exceptions {
    wwid "360080e500029eaec0000012656951fca"
    wwid "360080e500029eaec0000012956951fcb"
    wwid "360080e500029eaec0000012c56951fcb"
    wwid "360080e500029eaec0000012f56951fcb"
}
devices {
    device {
        vendor "SGI"
        product "IS.*"
        product_blacklist "Universal Xport"
        getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        prio "rdac"
        features "2 pg_init_retries 50"
        hardware_handler "1 rdac"
        path_grouping_policy "group_by_prio"
        failback "immediate"
        rr_weight "uniform"
        no_path_retry 30
        retain_attached_hw_handler "yes"
        detect_prio "yes"
        #rr_min_io 1000
        path_checker "rdac"
        #selector "round-robin 0"
        #polling_interval 10
    }
}
multipaths {
    multipath {
        wwid "360080e500029eaec0000012656951fca"
    }
    multipath {
        wwid "360080e500029eaec0000012956951fcb"
    }
    multipath {
        wwid "360080e500029eaec0000012c56951fcb"
    }
    multipath {
        wwid "360080e500029eaec0000012f56951fcb"
    }
}
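Because the configuration blacklists every WWID and then re-admits four of them, it is worth confirming that the blacklist_exceptions and multipaths sections list the same devices. The Python sketch below does that with a minimal brace-matching parser written for illustration. It is not the parser multipath-tools uses, the helper name wwids_in_section is hypothetical, and SAMPLE is a trimmed copy of the configuration above.

```python
import re


def wwids_in_section(conf_text, section):
    """Return the set of wwid values inside the named top-level section.

    Illustrative brace matcher only; it does not handle comments or
    braces inside quoted strings.
    """
    start = re.search(rf"^\s*{re.escape(section)}\s*\{{", conf_text, re.MULTILINE)
    if start is None:
        return set()
    depth, i = 1, start.end()
    while i < len(conf_text) and depth:
        if conf_text[i] == "{":
            depth += 1
        elif conf_text[i] == "}":
            depth -= 1
        i += 1
    return set(re.findall(r'wwid\s+"([^"]+)"', conf_text[start.end():i]))


SAMPLE = """
blacklist {
    wwid "*"
}
blacklist_exceptions {
    wwid "360080e500029eaec0000012656951fca"
    wwid "360080e500029eaec0000012956951fcb"
    wwid "360080e500029eaec0000012c56951fcb"
    wwid "360080e500029eaec0000012f56951fcb"
}
multipaths {
    multipath { wwid "360080e500029eaec0000012656951fca" }
    multipath { wwid "360080e500029eaec0000012956951fcb" }
    multipath { wwid "360080e500029eaec0000012c56951fcb" }
    multipath { wwid "360080e500029eaec0000012f56951fcb" }
}
"""

exceptions = wwids_in_section(SAMPLE, "blacklist_exceptions")
multipaths = wwids_in_section(SAMPLE, "multipaths")
print("only in blacklist_exceptions:", sorted(exceptions - multipaths))
print("only in multipaths          :", sorted(multipaths - exceptions))
```

For the configuration as posted, both differences are empty: the same four WWIDs appear in both sections.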
Many combinations of OST formatting options were tried (internal and
external journaling, …), but the same errors persist.
The same bonnie++ tests were repeated on all volumes of the storage using
plain ext4 only, and all of them succeeded.
Finally, I ran the Lustre debug daemon with the command below:
# lctl debug_daemon start /tmp/lustre.bin
The resulting message file is attached.
Regards,
Angelo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 20160315-lustre.log
Type: text/x-log
Size: 55951 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20160318/b6a757e5/attachment-0001.bin>