<div dir="ltr">Dear All,<div>I think this problem is mostly ZFS-related, but I would welcome help from experts in any area.</div><div>Our MDT stopped working after the IB switch was accidentally rebooted (power issue). <br></div><div>Everything else looks fine, but the MDT cannot be started.</div><div>The MDT's ZFS pool has no backup or snapshot. <br></div><div>Can this problem be fixed, and if so, how?</div><div><br></div><div>Thanks in advance for your help.</div><div><br class="gmail-Apple-interchange-newline">Best,<div>Ian</div></div><div><br></div><div><div>Lustre: Build Version: 2.10.4<br>OS: CentOS Linux release 7.5.1804 (Core)<br>uname -r: 3.10.0-862.el7.x86_64</div><div><br></div></div><div><br></div><div>[root@mds1 etc]# pcs status<br>Cluster name: mdsgroup01<br>Stack: corosync<br>Current DC: mds1 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum<br>Last updated: Wed Sep 21 11:46:25 2022<br>Last change: Wed Sep 21 11:46:13 2022 by root via cibadmin on mds1<br><br>2 nodes configured<br>9 resources configured<br><br>Online: [ mds1 mds2 ]<br><br>Full list of resources:<br><br> Resource Group: group-MDS<br>     zfs-LustreMDT      (ocf::heartbeat:ZFS):   Started mds1<br>     MGT        (ocf::lustre:Lustre):   Started mds1<br>     MDT        (ocf::lustre:Lustre):   Stopped<br> ipmi-fencingMDS1       (stonith:fence_ipmilan):        Started mds2<br> ipmi-fencingMDS2       (stonith:fence_ipmilan):        Started mds2<br> Clone Set: healthLUSTRE-clone [healthLUSTRE]<br>     Started: [ mds1 mds2 ]<br> Clone Set: healthLNET-clone [healthLNET]<br>     Started: [ mds1 mds2 ]<br><br>Failed Actions:<br>* MDT_start_0 on mds1 'unknown error' (1): call=44, status=complete, exitreason='',<br>    last-rc-change='Tue Sep 20 15:01:51 2022', queued=0ms, exec=317ms<br>* MDT_start_0 on mds2 'unknown error' (1): call=48, status=complete, exitreason='',<br>    last-rc-change='Tue Sep 20 14:38:18 2022', queued=0ms, 
exec=25168ms<br><br><br>Daemon Status:<br>  corosync: active/enabled<br>  pacemaker: active/enabled<br>  pcsd: active/enabled<br></div><div><br></div><div><div><br></div><div><br></div><div>After running zpool scrub on the MDT pool, zpool status -v reported:<br><br>  pool: LustreMDT<br> state: ONLINE<br>status: One or more devices has experienced an error resulting in data<br>        corruption.  Applications may be affected.<br>action: Restore the file in question if possible.  Otherwise restore the<br>        entire pool from backup.<br>   see: <a href="http://zfsonlinux.org/msg/ZFS-8000-8A" target="_blank">http://zfsonlinux.org/msg/ZFS-8000-8A</a><br>  scan: scrub repaired 0B in 0h35m with 1 errors on Wed Sep 21 09:38:24 2022<br>config:<br><br>        NAME        STATE     READ WRITE CKSUM<br>        LustreMDT   ONLINE       0     0     2<br>          SSD       ONLINE       0     0     8<br><br>errors: Permanent errors have been detected in the following files:<br><br>        LustreMDT/mdt0-work:/oi.3/0x200000003:0x2:0x0<div><br></div></div><div><br></div><div><br></div><div># dmesg -T<br>[Tue Sep 20 15:01:43 2022] Lustre: Lustre: Build Version: 2.10.4<br>[Tue Sep 20 15:01:43 2022] LNet: Using FMR for registration<br>[Tue Sep 20 15:01:43 2022] LNet: Added LNI 172.29.32.21@o2ib [8/256/0/180]<br>[Tue Sep 20 15:01:50 2022] Lustre: MGS: Connection restored to b5823059-e620-64ac-79f6-e5282f2fa442 (at 0@lo)<br>[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(llog.c:1296:llog_backup()) MGC172.29.32.21@o2ib: failed to open log work-MDT0000: rc = -5<br>[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(mgc_request.c:1897:mgc_llog_local_copy()) MGC172.29.32.21@o2ib: failed to copy remote log work-MDT0000: rc = -5<br>[Tue Sep 20 15:01:50 2022] LustreError: 13a-8: Failed to get MGS log work-MDT0000 and no local copy.<br>[Tue Sep 20 15:01:50 2022] LustreError: 15c-8: MGC172.29.32.21@o2ib: The configuration from log 'work-MDT0000' failed (-2). 
This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.<br>[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount_server.c:1386:server_start_targets()) failed to start server work-MDT0000: -2<br>[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount_server.c:1879:server_fill_super()) Unable to start targets: -2<br>[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount_server.c:1589:server_put_super()) no obd work-MDT0000<br>[Tue Sep 20 15:01:50 2022] Lustre: server umount work-MDT0000 complete<br>[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-2)<br>[Tue Sep 20 15:01:56 2022] Lustre: 4112:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1663657311/real 1663657311]  req@ffff8d6f0e728000 x1744471122247856/t0(0) o251->MGC172.29.32.21@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1663657317 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1<br>[Tue Sep 20 15:01:56 2022] Lustre: server umount MGS complete<br>[Tue Sep 20 15:02:29 2022] Lustre: MGS: Connection restored to b5823059-e620-64ac-79f6-e5282f2fa442 (at 0@lo)<br>[Tue Sep 20 15:02:54 2022] Lustre: MGS: Connection restored to 28ec81ea-0d51-d721-7be2-4f557da2546d (at 172.29.32.1@o2ib)<br clear="all"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><br></div></div></div></div></div><div><br></div></div></div>