[lustre-discuss] Disk failures triggered during OST creation and mounting on OSS Servers

Jane Liu zhliu at rcf.rhic.bnl.gov
Mon May 15 13:24:38 PDT 2023


During the installation of the RHEL 8.7 OS and Lustre 2.15.2, it appeared 
that the iDRAC cards had "OS to iDRAC Pass-through" enabled by default. 
As a result, Lustre mistakenly identified the iDRAC pass-through network 
interface as its primary network connection and attempted to use it 
instead of the actual data network.
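
For anyone hitting the same symptom, a quick way to confirm this is to 
check which interface and NID LNet actually bound to; in our logs the 
giveaway was the 169.254.x.x link-local address of the iDRAC 
pass-through NIC. A minimal check (interface names and output will of 
course differ on your systems):

    # list interfaces/addresses; the iDRAC pass-through NIC usually
    # shows up with a 169.254.x.x link-local address
    ip -brief addr

    # with the lustre/lnet modules loaded, see which NID LNet selected
    lctl list_nids
    lnetctl net show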

To resolve this, we manually disconnected the iDRAC connection and 
entered the correct NID in the modprobe configuration file, which 
eliminated the disk failure issue.
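
For reference, pinning LNet to the data network in the modprobe file 
looks roughly like the sketch below; the file name and the interface 
name eno1 are placeholders, so substitute your own data NIC:

    # /etc/modprobe.d/lnet.conf (example only; use your data interface)
    options lnet networks="tcp0(eno1)"

After reloading the lnet module, lctl list_nids should then report a NID 
on the data network rather than the 169.254.x.x iDRAC address.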

Here are the steps we took to disconnect the iDRAC connection:
1. We ran nmcli connection show to list all connections and confirmed 
which entry belonged to the iDRAC (on our systems it was listed as 
"Wired connection 1").
2. We then ran nmcli connection delete "Wired connection 1" (substituting 
whatever NAME is assigned to the iDRAC connection) to delete it; see the 
example below.
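
Roughly, the command sequence looked like this ("Wired connection 1" is 
just the name our systems assigned to the iDRAC interface; use whatever 
NAME nmcli reports on yours):

    nmcli connection show                          # find the iDRAC entry
    nmcli connection delete "Wired connection 1"   # delete it, using the NAME from above
    nmcli device status                            # confirm the iDRAC interface is no longer connected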

Jane


On 2023-05-13 09:25, John Hearns wrote:
> Can you say more about these networking issues?
> Good to make a note of them in case anyone sees similar in the future.
> 
> 
> On Fri, 12 May 2023, 20:40 Jane Liu via lustre-discuss,
> <lustre-discuss at lists.lustre.org> wrote:
> 
>> Hi Jeff,
>> 
>> Thanks for your response. We discovered later that the network issues
>> originating from the iDRAC IP were causing the SAS driver to hang or
>> experience timeouts when trying to access the drives. This resulted in
>> the drives being kicked out.
>> 
>> Once we resolved this issue, both the mkfs and mount operations started
>> working fine.
>> 
>> Thanks,
>> Jane
>> 
>> On 2023-05-10 12:43, Jeff Johnson wrote:
>>> Jane,
>>> 
>>> You're having hardware errors, the codes in those mpt3sas errors
>>> define as "PL_LOGINFO_SUB_CODE_OPEN_FAILURE_ORR_TIMEOUT", or in other
>>> words your SAS HBA cannot open a command dialogue with your disk. I'd
>>> suspect backplane or cabling issues as an internal disk failure will
>>> be reported by the target disk with its own error code. In this case
>>> your HBA can't even talk to it properly.
>>> 
>>> Is sdah the partner mpath device to sdef? Or is sdah a second failing
>>> disk interface?
>>> 
>>> Looking at this, I don't think your hardware is deploy-ready.
>>> 
>>> --Jeff
>>> 
>>> On Wed, May 10, 2023 at 9:29 AM Jane Liu via lustre-discuss
>>> <lustre-discuss at lists.lustre.org> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> We recently attempted to add several new OSS servers (RHEL 8.7 and
>>>> Lustre 2.15.2). While creating new OSTs, I noticed that mdstat
>>>> reported some disk failures after the mkfs, even though the disks
>>>> were functional before the mkfs command. Our hardware admins managed
>>>> to resolve the mdstat issue and restore the disks to normal
>>>> operation. However, when I ran the mount OST command (when the
>>>> network had a problem and the mount command timed out), similar
>>>> problems occurred, and several disks were kicked out. The relevant
>>>> /var/log/messages entries are provided below.
>>>> 
>>>> This problem was consistent across all our OSS servers. Any insights
>>>> into the possible cause would be appreciated.
>>>> 
>>>> Jane
>>>> 
>>>> -----------------------------
>>>> 
>>>> May  9 13:33:15 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: errors=remount-ro
>>>> May  9 13:33:15 sphnxoss47 systemd[1]: tmp-mntmirJ5z.mount: Succeeded.
>>>> May  9 13:33:16 sphnxoss47 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 72, npartitions: 2
>>>> May  9 13:33:16 sphnxoss47 kernel: alg: No test for adler32 (adler32-zlib)
>>>> May  9 13:33:16 sphnxoss47 kernel: Key type ._llcrypt registered
>>>> May  9 13:33:16 sphnxoss47 kernel: Key type .llcrypt registered
>>>> May  9 13:33:16 sphnxoss47 kernel: Lustre: Lustre: Build Version: 2.15.2
>>>> May  9 13:33:16 sphnxoss47 kernel: LNet: Added LNI 169.254.1.2 at tcp [8/256/0/180]
>>>> May  9 13:33:16 sphnxoss47 kernel: LNet: Accept secure, port 988
>>>> May  9 13:33:17 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
>>>> May  9 13:33:17 sphnxoss47 kernel: Lustre: sphnx01-OST0244-osd: enabled 'large_dir' feature on device /dev/md0
>>>> May  9 13:33:25 sphnxoss47 systemd-logind[8609]: New session 7 of user root.
>>>> May  9 13:33:25 sphnxoss47 systemd[1]: Started Session 7 of user root.
>>>> May  9 13:34:36 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0244: cannot register this server with the MGS: rc = -110. Is the MGS running?
>>>> May  9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -110
>>>> May  9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:1644:server_put_super()) no obd sphnx01-OST0244
>>>> May  9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:131:server_deregister_mount()) sphnx01-OST0244 not registered
>>>> May  9 13:34:39 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0244 complete
>>>> May  9 13:34:39 sphnxoss47 kernel: LustreError: 45314:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -110
>>>> May  9 13:34:40 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro
>>>> May  9 13:34:40 sphnxoss47 systemd[1]: tmp-mntXT85fz.mount: Succeeded.
>>>> May  9 13:34:41 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
>>>> May  9 13:34:41 sphnxoss47 kernel: Lustre: sphnx01-OST0245-osd: enabled 'large_dir' feature on device /dev/md1
>>>> May  9 13:36:00 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0245: cannot register this server with the MGS: rc = -110. Is the MGS running?
>>>> May  9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -110
>>>> May  9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:1644:server_put_super()) no obd sphnx01-OST0245
>>>> May  9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:131:server_deregister_mount()) sphnx01-OST0245 not registered
>>>> May  9 13:36:08 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0245 complete
>>>> May  9 13:36:08 sphnxoss47 kernel: LustreError: 46127:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -110
>>>> May  9 13:36:08 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem with ordered data mode. Opts: errors=remount-ro
>>>> May  9 13:36:08 sphnxoss47 systemd[1]: tmp-mnt17IOaq.mount: Succeeded.
>>>> May  9 13:36:09 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
>>>> 
>>>> -----------------------------
>>>> 
>>>> it just repeats for all of the md raids, then the errors start and
>>>> the drive fails and is disabled:
>>>> 
>>>> May  9 13:44:31 sphnxoss47 kernel: LustreError: 48069:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -110
>>>> May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
>>>> May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
>>>> May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
>>>> May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
>>>> May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
>>>> May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
>>>> May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
>>>> ....
>>>> ....
>>>> May  9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#1102 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=1s
>>>> May  9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#1102 CDB: Read(10) 28 00 00 00 87 79 00 00 01 00
>>>> May  9 13:44:33 sphnxoss47 kernel: blk_update_request: I/O error, dev sdef, sector 277448 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 0
>>>> May  9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#6800 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=1s
>>>> May  9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#6800 CDB: Read(10) 28 00 00 00 87 dd 00 00 01 00
>>>> May  9 13:44:33 sphnxoss47 kernel: blk_update_request: I/O error, dev sdef, sector 278248 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 0
>>>> May  9 13:44:33 sphnxoss47 kernel: device-mapper: multipath: 253:52: Failing path 128:112.
>>>> May  9 13:44:33 sphnxoss47 multipathd[6051]: sdef: mark as failed
>>>> May  9 13:44:33 sphnxoss47 multipathd[6051]: mpathae: remaining active paths: 1
>>>> ...
>>>> ...
>>>> May  9 13:44:34 sphnxoss47 kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
>>>> May  9 13:44:34 sphnxoss47 kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
>>>> May  9 13:44:34 sphnxoss47 kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
>>>> May  9 13:44:34 sphnxoss47 kernel: md: super_written gets error=-5
>>>> May  9 13:44:34 sphnxoss47 kernel: md/raid:md8: Disk failure on dm-55, disabling device.
>>>> May  9 13:44:34 sphnxoss47 kernel: md: super_written gets error=-5
>>>> May  9 13:44:34 sphnxoss47 kernel: md/raid:md8: Operation continuing on 9 devices.
>>>> May  9 13:44:34 sphnxoss47 multipathd[6051]: sdah: mark as failed
>>>> _______________________________________________
>>>> lustre-discuss mailing list
>>>> lustre-discuss at lists.lustre.org
>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>> [1]
>>> 
>>> --
>>> 
>>> ------------------------------
>>> Jeff Johnson
>>> Co-Founder
>>> Aeon Computing
>>> 
>>> jeff.johnson at aeoncomputing.com
>>> www.aeoncomputing.com [2]
>>> t: 858-412-3810 x1001   f: 858-412-3845
>>> m: 619-204-9061
>>> 
>>> 4170 Morena Boulevard, Suite C - San Diego, CA 92117
>>> 
>>> High-Performance Computing / Lustre Filesystems / Scale-out
>> Storage
>>> 
>>> Links:
>>> ------
>>> [1] http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>> [2] http://www.aeoncomputing.com
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org [1]
> 
> 
> Links:
> ------
> [1] http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> [2] http://www.aeoncomputing.com

