[lustre-discuss] ZFS file error of MDT

Ian Yi-Feng Chang ian.yfchang at gmail.com
Tue Sep 20 20:56:48 PDT 2022


Dear All,
I think this problem is more related to ZFS, but I would like to ask
for help from experts in all fields.
Our MDT stopped working properly after the IB switch was accidentally
rebooted (power issue). Everything looks fine except that the MDT cannot
be started, and our MDT's ZFS pool has no backup or snapshot.
Could this problem be fixed, and if so, how?

Thanks for your help in advance.

Best,
Ian

Lustre: Build Version: 2.10.4
OS: CentOS Linux release 7.5.1804 (Core)
uname -r: 3.10.0-862.el7.x86_64


[root@mds1 etc]# pcs status
Cluster name: mdsgroup01
Stack: corosync
Current DC: mds1 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with
quorum
Last updated: Wed Sep 21 11:46:25 2022
Last change: Wed Sep 21 11:46:13 2022 by root via cibadmin on mds1

2 nodes configured
9 resources configured

Online: [ mds1 mds2 ]

Full list of resources:

 Resource Group: group-MDS
     zfs-LustreMDT      (ocf::heartbeat:ZFS):   Started mds1
     MGT        (ocf::lustre:Lustre):   Started mds1
     MDT        (ocf::lustre:Lustre):   Stopped
 ipmi-fencingMDS1       (stonith:fence_ipmilan):        Started mds2
 ipmi-fencingMDS2       (stonith:fence_ipmilan):        Started mds2
 Clone Set: healthLUSTRE-clone [healthLUSTRE]
     Started: [ mds1 mds2 ]
 Clone Set: healthLNET-clone [healthLNET]
     Started: [ mds1 mds2 ]

Failed Actions:
* MDT_start_0 on mds1 'unknown error' (1): call=44, status=complete,
exitreason='',
    last-rc-change='Tue Sep 20 15:01:51 2022', queued=0ms, exec=317ms
* MDT_start_0 on mds2 'unknown error' (1): call=48, status=complete,
exitreason='',
    last-rc-change='Tue Sep 20 14:38:18 2022', queued=0ms, exec=25168ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
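

In case it is relevant: between attempts I believe the failed start
actions can be cleared so pacemaker retries the resource (resource name
MDT as in the status output above) -- this is standard pacemaker
housekeeping, not anything Lustre-specific:

```shell
# Clear the failure history for the MDT resource so the cluster
# attempts to start it again.
pcs resource cleanup MDT

# Then watch the result.
pcs status
```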



After running zpool scrub on the MDT pool, zpool status -v reported:

  pool: LustreMDT
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0h35m with 1 errors on Wed Sep 21 09:38:24 2022
config:

        NAME        STATE     READ WRITE CKSUM
        LustreMDT   ONLINE       0     0     2
          SSD       ONLINE       0     0     8

errors: Permanent errors have been detected in the following files:

        LustreMDT/mdt0-work:/oi.3/0x200000003:0x2:0x0
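
For completeness, the scrub and status commands were roughly the
following (pool name LustreMDT as above); zpool clear is included only
as the standard way to reset the error counters before a re-scrub if a
repair is attempted later -- we have not run it yet:

```shell
# Scrub the MDT pool, then check the result; -v lists the files
# with permanent errors.
zpool scrub LustreMDT
zpool status -v LustreMDT

# Optional, only after an attempted repair: reset the error counters
# so a fresh scrub reports only remaining damage.
zpool clear LustreMDT
```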



# dmesg -T
[Tue Sep 20 15:01:43 2022] Lustre: Lustre: Build Version: 2.10.4
[Tue Sep 20 15:01:43 2022] LNet: Using FMR for registration
[Tue Sep 20 15:01:43 2022] LNet: Added LNI 172.29.32.21@o2ib [8/256/0/180]
[Tue Sep 20 15:01:50 2022] Lustre: MGS: Connection restored to
b5823059-e620-64ac-79f6-e5282f2fa442 (at 0@lo)
[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(llog.c:1296:llog_backup())
MGC172.29.32.21@o2ib: failed to open log work-MDT0000: rc = -5
[Tue Sep 20 15:01:50 2022] LustreError:
3839:0:(mgc_request.c:1897:mgc_llog_local_copy()) MGC172.29.32.21@o2ib:
failed to copy remote log work-MDT0000: rc = -5
[Tue Sep 20 15:01:50 2022] LustreError: 13a-8: Failed to get MGS log
work-MDT0000 and no local copy.
[Tue Sep 20 15:01:50 2022] LustreError: 15c-8: MGC172.29.32.21@o2ib: The
configuration from log 'work-MDT0000' failed (-2). This may be the result
of communication errors between this node and the MGS, a bad configuration,
or other errors. See the syslog for more information.
[Tue Sep 20 15:01:50 2022] LustreError:
3839:0:(obd_mount_server.c:1386:server_start_targets()) failed to start
server work-MDT0000: -2
[Tue Sep 20 15:01:50 2022] LustreError:
3839:0:(obd_mount_server.c:1879:server_fill_super()) Unable to start
targets: -2
[Tue Sep 20 15:01:50 2022] LustreError:
3839:0:(obd_mount_server.c:1589:server_put_super()) no obd work-MDT0000
[Tue Sep 20 15:01:50 2022] Lustre: server umount work-MDT0000 complete
[Tue Sep 20 15:01:50 2022] LustreError:
3839:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-2)
[Tue Sep 20 15:01:56 2022] Lustre:
4112:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1663657311/real 1663657311]
 req@ffff8d6f0e728000 x1744471122247856/t0(0) o251->MGC172.29.32.21@o2ib
@0@lo:26/25 lens 224/224 e 0 to 1 dl 1663657317 ref 2 fl Rpc:XN/0/ffffffff
rc 0/-1
[Tue Sep 20 15:01:56 2022] Lustre: server umount MGS complete
[Tue Sep 20 15:02:29 2022] Lustre: MGS: Connection restored to
b5823059-e620-64ac-79f6-e5282f2fa442 (at 0@lo)
[Tue Sep 20 15:02:54 2022] Lustre: MGS: Connection restored to
28ec81ea-0d51-d721-7be2-4f557da2546d (at 172.29.32.1@o2ib)
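

If I read the dmesg output correctly, the mount fails because the
configuration log work-MDT0000 cannot be read (rc = -5, i.e. EIO), which
seems to match the permanently corrupted object found by the scrub. One
recovery path I found in the Lustre manual -- which I have NOT tried and
would very much like confirmation on, given the corrupted OI file -- is
regenerating the configuration logs with writeconf. Assuming our backend
dataset LustreMDT/mdt0-work, the steps would be roughly:

```shell
# Stop the Lustre resources cluster-wide first so nothing mounts
# the targets while the logs are being regenerated.
pcs resource disable group-MDS

# Regenerate the configuration logs on the MDT. The ZFS backend
# dataset name LustreMDT/mdt0-work is taken from the zpool status
# output above.
tunefs.lustre --writeconf LustreMDT/mdt0-work

# Per the manual, --writeconf must then also be run on every OST,
# and the MGS/MDT mounted before the OSTs.
pcs resource enable group-MDS
```

Is this safe in our situation, or does the corrupted object index make
it pointless? Any advice is appreciated.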