[lustre-discuss] ZFS file error of MDT
Ian Yi-Feng Chang
ian.yfchang at gmail.com
Tue Sep 20 20:56:48 PDT 2022
Dear All,
I think this problem is more related to ZFS, but I would like to ask for
help from experts in all fields.
Our MDT stopped working after the IB switch was accidentally rebooted
(a power issue).
Everything looks fine except that the MDT cannot be started.
Our MDT's ZFS pool has no backup or snapshot.
Could this problem be fixed, and if so, how?
Thanks for your help in advance.
Best,
Ian
Lustre: Build Version: 2.10.4
OS: CentOS Linux release 7.5.1804 (Core)
uname -r: 3.10.0-862.el7.x86_64
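For reference, starting the MDT by hand (outside Pacemaker) would look roughly
like the following on our setup; the dataset name is taken from the zpool
output below, and the mountpoint /lustre/mdt is only an example path:

# zpool import LustreMDT                     (only if the pool is not already imported)
# mkdir -p /lustre/mdt
# mount -t lustre LustreMDT/mdt0-work /lustre/mdt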
[root@mds1 etc]# pcs status
Cluster name: mdsgroup01
Stack: corosync
Current DC: mds1 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Sep 21 11:46:25 2022
Last change: Wed Sep 21 11:46:13 2022 by root via cibadmin on mds1
2 nodes configured
9 resources configured
Online: [ mds1 mds2 ]
Full list of resources:
 Resource Group: group-MDS
     zfs-LustreMDT    (ocf::heartbeat:ZFS):      Started mds1
     MGT              (ocf::lustre:Lustre):      Started mds1
     MDT              (ocf::lustre:Lustre):      Stopped
 ipmi-fencingMDS1     (stonith:fence_ipmilan):   Started mds2
 ipmi-fencingMDS2     (stonith:fence_ipmilan):   Started mds2
 Clone Set: healthLUSTRE-clone [healthLUSTRE]
     Started: [ mds1 mds2 ]
 Clone Set: healthLNET-clone [healthLNET]
     Started: [ mds1 mds2 ]
Failed Actions:
* MDT_start_0 on mds1 'unknown error' (1): call=44, status=complete, exitreason='',
    last-rc-change='Tue Sep 20 15:01:51 2022', queued=0ms, exec=317ms
* MDT_start_0 on mds2 'unknown error' (1): call=48, status=complete, exitreason='',
    last-rc-change='Tue Sep 20 14:38:18 2022', queued=0ms, exec=25168ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
After running zpool scrub on the MDT pool, zpool status -v reported:
  pool: LustreMDT
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0h35m with 1 errors on Wed Sep 21 09:38:24 2022
config:

        NAME        STATE     READ WRITE CKSUM
        LustreMDT   ONLINE       0     0     2
          SSD       ONLINE       0     0     8
errors: Permanent errors have been detected in the following files:
LustreMDT/mdt0-work:/oi.3/0x200000003:0x2:0x0
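As far as I understand, the pool has only one device (no redundancy), so the
scrub cannot repair this object on its own. Clearing the error counters and
scrubbing again should at least confirm whether the corruption is persistent
rather than a one-off checksum error, roughly:

# zpool clear LustreMDT
# zpool scrub LustreMDT
# zpool status -v LustreMDT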
# dmesg -T
[Tue Sep 20 15:01:43 2022] Lustre: Lustre: Build Version: 2.10.4
[Tue Sep 20 15:01:43 2022] LNet: Using FMR for registration
[Tue Sep 20 15:01:43 2022] LNet: Added LNI 172.29.32.21@o2ib [8/256/0/180]
[Tue Sep 20 15:01:50 2022] Lustre: MGS: Connection restored to b5823059-e620-64ac-79f6-e5282f2fa442 (at 0@lo)
[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(llog.c:1296:llog_backup()) MGC172.29.32.21@o2ib: failed to open log work-MDT0000: rc = -5
[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(mgc_request.c:1897:mgc_llog_local_copy()) MGC172.29.32.21@o2ib: failed to copy remote log work-MDT0000: rc = -5
[Tue Sep 20 15:01:50 2022] LustreError: 13a-8: Failed to get MGS log work-MDT0000 and no local copy.
[Tue Sep 20 15:01:50 2022] LustreError: 15c-8: MGC172.29.32.21@o2ib: The configuration from log 'work-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount_server.c:1386:server_start_targets()) failed to start server work-MDT0000: -2
[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount_server.c:1879:server_fill_super()) Unable to start targets: -2
[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount_server.c:1589:server_put_super()) no obd work-MDT0000
[Tue Sep 20 15:01:50 2022] Lustre: server umount work-MDT0000 complete
[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount (-2)
[Tue Sep 20 15:01:56 2022] Lustre: 4112:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1663657311/real 1663657311] req@ffff8d6f0e728000 x1744471122247856/t0(0) o251->MGC172.29.32.21@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1663657317 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[Tue Sep 20 15:01:56 2022] Lustre: server umount MGS complete
[Tue Sep 20 15:02:29 2022] Lustre: MGS: Connection restored to b5823059-e620-64ac-79f6-e5282f2fa442 (at 0@lo)
[Tue Sep 20 15:02:54 2022] Lustre: MGS: Connection restored to 28ec81ea-0d51-d721-7be2-4f557da2546d (at 172.29.32.1@o2ib)
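From the dmesg output above, the mount fails because the configuration log
work-MDT0000 cannot be read (rc = -5), and the object the scrub flagged sits
under oi.3. One thing I was wondering about, but have not tried because I am
not sure it is safe in this state, is regenerating the configuration logs with
writeconf on all targets, roughly:

(unmount all clients and all Lustre targets first)
# tunefs.lustre --writeconf LustreMDT/mdt0-work        (on the MDS)
# tunefs.lustre --writeconf <ost-pool>/<ost-dataset>   (on each OSS, for every OST)
(then remount the MGS/MDT first, followed by the OSTs)

Would this be a reasonable thing to try here, or could it make things worse
given that the flagged object looks like an OI file?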