[lustre-discuss] Disk failures triggered during OST creation and mounting on OSS Servers

John Hearns hearnsj at gmail.com
Sat May 13 06:25:27 PDT 2023


Can you say more about these networking issues?
It would be good to make a note of them in case anyone sees something
similar in the future.

On Fri, 12 May 2023, 20:40 Jane Liu via lustre-discuss, <
lustre-discuss at lists.lustre.org> wrote:

> Hi Jeff,
>
> Thanks for your response. We discovered later that the network issues
> originating from the iDRAC IP were causing the SAS driver to hang or
> experience timeouts when trying to access the drives. This resulted in
> the drives being kicked out.
>
> Once we resolved this issue, both the mkfs and mount operations started
> working fine.
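>
> (For anyone who hits the same rc = -110 mount timeouts: -110 is
> ETIMEDOUT, so before mounting it is worth verifying LNet connectivity
> from the OSS to the MGS. A quick sketch, where <MGS-NID> is a
> placeholder for your MGS's NID:
>
>   lctl list_nids        # confirm the local NID is configured
>   lctl ping <MGS-NID>   # e.g. lctl ping 10.0.0.1@tcp
>
> Both are standard lctl subcommands.)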
>
> Thanks,
> Jane
>
>
>
> On 2023-05-10 12:43, Jeff Johnson wrote:
> > Jane,
> >
> > You're having hardware errors. The codes in those mpt3sas errors
> > decode as "PL_LOGINFO_SUB_CODE_OPEN_FAILURE_ORR_TIMEOUT"; in other
> > words, your SAS HBA cannot open a command dialogue with your disk.
> > I'd suspect backplane or cabling issues, since an internal disk
> > failure would be reported by the target disk with its own error
> > code. In this case your HBA can't even talk to the disk properly.
> >
> > Is sdah the partner mpath device to sdef? Or is sdah a second failing
> > disk interface?
> >
> > Looking at this, I don't think your hardware is deploy-ready.
> >
> > --Jeff
> >
> > On Wed, May 10, 2023 at 9:29 AM Jane Liu via lustre-discuss
> > <lustre-discuss at lists.lustre.org> wrote:
> >
> >> Hi,
> >>
> >> We recently attempted to add several new OSS servers (RHEL 8.7 and
> >> Lustre 2.15.2). While creating the new OSTs, I noticed that mdstat
> >> reported some disk failures after mkfs, even though the disks were
> >> functional before the mkfs command. Our hardware admins managed to
> >> resolve the mdstat issue and restore the disks to normal operation.
> >> However, when I ran the mount command for the OSTs (at a time when
> >> the network had a problem and the mount command timed out), similar
> >> problems occurred and several disks were kicked out. The relevant
> >> /var/log/messages entries are provided below.
> >>
> >> This problem was consistent across all our OSS servers. Any insights
> >> into the possible cause would be appreciated.
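> >>
> >> For reference, the array and path state can be inspected with
> >> standard tools (a quick sketch; the device names here are taken
> >> from the logs below):
> >>
> >>   cat /proc/mdstat           # overview of md arrays and failed members
> >>   mdadm --detail /dev/md8    # kicked devices show up as faulty/removed
> >>   multipath -ll mpathae      # path states for the affected mpath device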
> >>
> >> Jane
> >>
> >> -----------------------------
> >>
> >> May  9 13:33:15 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: errors=remount-ro
> >> May  9 13:33:15 sphnxoss47 systemd[1]: tmp-mntmirJ5z.mount: Succeeded.
> >> May  9 13:33:16 sphnxoss47 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 72, npartitions: 2
> >> May  9 13:33:16 sphnxoss47 kernel: alg: No test for adler32 (adler32-zlib)
> >> May  9 13:33:16 sphnxoss47 kernel: Key type ._llcrypt registered
> >> May  9 13:33:16 sphnxoss47 kernel: Key type .llcrypt registered
> >> May  9 13:33:16 sphnxoss47 kernel: Lustre: Lustre: Build Version: 2.15.2
> >> May  9 13:33:16 sphnxoss47 kernel: LNet: Added LNI 169.254.1.2@tcp [8/256/0/180]
> >> May  9 13:33:16 sphnxoss47 kernel: LNet: Accept secure, port 988
> >> May  9 13:33:17 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
> >> May  9 13:33:17 sphnxoss47 kernel: Lustre: sphnx01-OST0244-osd: enabled 'large_dir' feature on device /dev/md0
> >> May  9 13:33:25 sphnxoss47 systemd-logind[8609]: New session 7 of user root.
> >> May  9 13:33:25 sphnxoss47 systemd[1]: Started Session 7 of user root.
> >> May  9 13:34:36 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0244: cannot register this server with the MGS: rc = -110. Is the MGS running?
> >> May  9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -110
> >> May  9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:1644:server_put_super()) no obd sphnx01-OST0244
> >> May  9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:131:server_deregister_mount()) sphnx01-OST0244 not registered
> >> May  9 13:34:39 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0244 complete
> >> May  9 13:34:39 sphnxoss47 kernel: LustreError: 45314:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -110
> >> May  9 13:34:40 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro
> >> May  9 13:34:40 sphnxoss47 systemd[1]: tmp-mntXT85fz.mount: Succeeded.
> >> May  9 13:34:41 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
> >> May  9 13:34:41 sphnxoss47 kernel: Lustre: sphnx01-OST0245-osd: enabled 'large_dir' feature on device /dev/md1
> >> May  9 13:36:00 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0245: cannot register this server with the MGS: rc = -110. Is the MGS running?
> >> May  9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -110
> >> May  9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:1644:server_put_super()) no obd sphnx01-OST0245
> >> May  9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:131:server_deregister_mount()) sphnx01-OST0245 not registered
> >> May  9 13:36:08 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0245 complete
> >> May  9 13:36:08 sphnxoss47 kernel: LustreError: 46127:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -110
> >> May  9 13:36:08 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem with ordered data mode. Opts: errors=remount-ro
> >> May  9 13:36:08 sphnxoss47 systemd[1]: tmp-mnt17IOaq.mount: Succeeded.
> >> May  9 13:36:09 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
> >>
> >> -----------------------------
> >>
> >> It just repeats like this for all of the md RAIDs; then the errors
> >> start, and the drive fails and is disabled:
> >>
> >> May  9 13:44:31 sphnxoss47 kernel: LustreError: 48069:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -110
> >> May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> ....
> >> ....
> >> May  9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#1102 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=1s
> >> May  9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#1102 CDB: Read(10) 28 00 00 00 87 79 00 00 01 00
> >> May  9 13:44:33 sphnxoss47 kernel: blk_update_request: I/O error, dev sdef, sector 277448 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 0
> >> May  9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#6800 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=1s
> >> May  9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#6800 CDB: Read(10) 28 00 00 00 87 dd 00 00 01 00
> >> May  9 13:44:33 sphnxoss47 kernel: blk_update_request: I/O error, dev sdef, sector 278248 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 0
> >> May  9 13:44:33 sphnxoss47 kernel: device-mapper: multipath: 253:52: Failing path 128:112.
> >> May  9 13:44:33 sphnxoss47 multipathd[6051]: sdef: mark as failed
> >> May  9 13:44:33 sphnxoss47 multipathd[6051]: mpathae: remaining active paths: 1
> >> ...
> >> ...
> >> May  9 13:44:34 sphnxoss47 kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May  9 13:44:34 sphnxoss47 kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May  9 13:44:34 sphnxoss47 kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May  9 13:44:34 sphnxoss47 kernel: md: super_written gets error=-5
> >> May  9 13:44:34 sphnxoss47 kernel: md/raid:md8: Disk failure on dm-55, disabling device.
> >> May  9 13:44:34 sphnxoss47 kernel: md: super_written gets error=-5
> >> May  9 13:44:34 sphnxoss47 kernel: md/raid:md8: Operation continuing on 9 devices.
> >> May  9 13:44:34 sphnxoss47 multipathd[6051]: sdah: mark as failed
> >> _______________________________________________
> >> lustre-discuss mailing list
> >> lustre-discuss at lists.lustre.org
> >> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org [1]
> >
> > --
> >
> > ------------------------------
> > Jeff Johnson
> > Co-Founder
> > Aeon Computing
> >
> > jeff.johnson at aeoncomputing.com
> > www.aeoncomputing.com [2]
> > t: 858-412-3810 x1001   f: 858-412-3845
> > m: 619-204-9061
> >
> > 4170 Morena Boulevard, Suite C - San Diego, CA 92117
> >
> > High-Performance Computing / Lustre Filesystems / Scale-out Storage
> >
> > Links:
> > ------
> > [1] http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> > [2] http://www.aeoncomputing.com
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>