[lustre-discuss] Disk failures triggered during OST creation and mounting on OSS Servers
Jane Liu
zhliu at rcf.rhic.bnl.gov
Wed May 10 09:28:34 PDT 2023
Hi,
We recently attempted to add several new OSS servers (RHEL 8.7 and
Lustre 2.15.2). While creating new OSTs, I noticed that mdstat reported
some disk failures after the mkfs, even though the disks were functional
before the mkfs command. Our hardware admins managed to resolve the
mdstat issue and restore the disks to normal operation. However, when I
later ran the OST mount command (at a time when the network had a problem
and the mount timed out), similar problems occurred, and several disks
were kicked out. The relevant /var/log/messages entries are provided below.
This problem was consistent across all our OSS servers. Any insights
into the possible cause would be appreciated.
Jane
-----------------------------
May 9 13:33:15 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem
with ordered data mode. Opts: errors=remount-ro
May 9 13:33:15 sphnxoss47 systemd[1]: tmp-mntmirJ5z.mount: Succeeded.
May 9 13:33:16 sphnxoss47 kernel: LNet: HW NUMA nodes: 2, HW CPU cores:
72, npartitions: 2
May 9 13:33:16 sphnxoss47 kernel: alg: No test for adler32
(adler32-zlib)
May 9 13:33:16 sphnxoss47 kernel: Key type ._llcrypt registered
May 9 13:33:16 sphnxoss47 kernel: Key type .llcrypt registered
May 9 13:33:16 sphnxoss47 kernel: Lustre: Lustre: Build Version: 2.15.2
May 9 13:33:16 sphnxoss47 kernel: LNet: Added LNI 169.254.1.2 at tcp
[8/256/0/180]
May 9 13:33:16 sphnxoss47 kernel: LNet: Accept secure, port 988
May 9 13:33:17 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem
with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
May 9 13:33:17 sphnxoss47 kernel: Lustre: sphnx01-OST0244-osd: enabled
'large_dir' feature on device /dev/md0
May 9 13:33:25 sphnxoss47 systemd-logind[8609]: New session 7 of user
root.
May 9 13:33:25 sphnxoss47 systemd[1]: Started Session 7 of user root.
May 9 13:34:36 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0244:
cannot register this server with the MGS: rc = -110. Is the MGS running?
May 9 13:34:36 sphnxoss47 kernel: LustreError:
45314:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start
targets: -110
May 9 13:34:36 sphnxoss47 kernel: LustreError:
45314:0:(obd_mount_server.c:1644:server_put_super()) no obd
sphnx01-OST0244
May 9 13:34:36 sphnxoss47 kernel: LustreError:
45314:0:(obd_mount_server.c:131:server_deregister_mount())
sphnx01-OST0244 not registered
May 9 13:34:39 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0244
complete
May 9 13:34:39 sphnxoss47 kernel: LustreError:
45314:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount
<unknown>: rc = -110
May 9 13:34:40 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem
with ordered data mode. Opts: errors=remount-ro
May 9 13:34:40 sphnxoss47 systemd[1]: tmp-mntXT85fz.mount: Succeeded.
May 9 13:34:41 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem
with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
May 9 13:34:41 sphnxoss47 kernel: Lustre: sphnx01-OST0245-osd: enabled
'large_dir' feature on device /dev/md1
May 9 13:36:00 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0245:
cannot register this server with the MGS: rc = -110. Is the MGS running?
May 9 13:36:00 sphnxoss47 kernel: LustreError:
46127:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start
targets: -110
May 9 13:36:00 sphnxoss47 kernel: LustreError:
46127:0:(obd_mount_server.c:1644:server_put_super()) no obd
sphnx01-OST0245
May 9 13:36:00 sphnxoss47 kernel: LustreError:
46127:0:(obd_mount_server.c:131:server_deregister_mount())
sphnx01-OST0245 not registered
May 9 13:36:08 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0245
complete
May 9 13:36:08 sphnxoss47 kernel: LustreError:
46127:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount
<unknown>: rc = -110
May 9 13:36:08 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem
with ordered data mode. Opts: errors=remount-ro
May 9 13:36:08 sphnxoss47 systemd[1]: tmp-mnt17IOaq.mount: Succeeded.
May 9 13:36:09 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem
with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
-----------------------------
This pattern repeats for all of the md RAID devices. Then the errors start,
and a drive fails and is disabled:
May 9 13:44:31 sphnxoss47 kernel: LustreError:
48069:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount
<unknown>: rc = -110
May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a):
originator(PL), code(0x12), sub_code(0x011a)
May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a):
originator(PL), code(0x12), sub_code(0x011a)
May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a):
originator(PL), code(0x12), sub_code(0x011a)
May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a):
originator(PL), code(0x12), sub_code(0x011a)
May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a):
originator(PL), code(0x12), sub_code(0x011a)
May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a):
originator(PL), code(0x12), sub_code(0x011a)
May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a):
originator(PL), code(0x12), sub_code(0x011a)
....
....
May 9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#1102 FAILED
Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=1s
May 9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#1102 CDB:
Read(10) 28 00 00 00 87 79 00 00 01 00
May 9 13:44:33 sphnxoss47 kernel: blk_update_request: I/O error, dev
sdef, sector 277448 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 0
May 9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#6800 FAILED
Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=1s
May 9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#6800 CDB:
Read(10) 28 00 00 00 87 dd 00 00 01 00
May 9 13:44:33 sphnxoss47 kernel: blk_update_request: I/O error, dev
sdef, sector 278248 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 0
May 9 13:44:33 sphnxoss47 kernel: device-mapper: multipath: 253:52:
Failing path 128:112.
May 9 13:44:33 sphnxoss47 multipathd[6051]: sdef: mark as failed
May 9 13:44:33 sphnxoss47 multipathd[6051]: mpathae: remaining active
paths: 1
...
...
May 9 13:44:34 sphnxoss47 kernel: mpt3sas_cm0: log_info(0x3112011a):
originator(PL), code(0x12), sub_code(0x011a)
May 9 13:44:34 sphnxoss47 kernel: mpt3sas_cm0: log_info(0x3112011a):
originator(PL), code(0x12), sub_code(0x011a)
May 9 13:44:34 sphnxoss47 kernel: mpt3sas_cm0: log_info(0x3112011a):
originator(PL), code(0x12), sub_code(0x011a)
May 9 13:44:34 sphnxoss47 kernel: md: super_written gets error=-5
May 9 13:44:34 sphnxoss47 kernel: md/raid:md8: Disk failure on dm-55,
disabling device.
May 9 13:44:34 sphnxoss47 kernel: md: super_written gets error=-5
May 9 13:44:34 sphnxoss47 kernel: md/raid:md8: Operation continuing on
9 devices.
May 9 13:44:34 sphnxoss47 multipathd[6051]: sdah: mark as failed
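
As a reading aid for the log excerpts above, here is a minimal Python sketch that decodes two kinds of values appearing in them: the negative return codes (rc = -110, error=-5) and the Read(10) CDB bytes. The helper names decode_rc and decode_read10 are hypothetical, and the errno table is hard-coded with the standard Linux values rather than taken from the running system. The final comparison assumes the kernel reports "sector" in 512-byte units; the resulting factor of 8 would be consistent with 4096-byte logical blocks on sdef, which is an inference and is not stated anywhere in the post.

```python
# Linux errno values for the two codes seen in the log (hard-coded so the
# sketch gives the same answer on any platform).
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def decode_rc(rc):
    """Map a negative kernel return code such as -110 to its errno name."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return f"{rc}: {name} ({text})"


def decode_read10(cdb_hex):
    """Decode a 10-byte SCSI READ(10) CDB given as space-separated hex.

    Layout: byte 0 opcode (0x28), bytes 2-5 big-endian LBA,
    bytes 7-8 big-endian transfer length in logical blocks.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    blocks = int.from_bytes(cdb[7:9], "big")
    return lba, blocks


if __name__ == "__main__":
    # Return codes from the Lustre mount failure and the md superblock write.
    print(decode_rc(-110))
    print(decode_rc(-5))

    # The two failed reads on sdef, paired with the sector the kernel reported.
    for cdb_hex, sector in [
        ("28 00 00 00 87 79 00 00 01 00", 277448),
        ("28 00 00 00 87 dd 00 00 01 00", 278248),
    ]:
        lba, blocks = decode_read10(cdb_hex)
        print(f"LBA {lba}, {blocks} block(s), sector/LBA = {sector / lba}")
```

Read this way, rc = -110 is a timeout, which matches the network problem mentioned above, while error=-5 and the failed reads are plain I/O errors reported by the block layer. The decoding only restates what the log already says and does not by itself identify why the paths failed.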