[lustre-discuss] ZFS file error of MDT

Laura Hild lsh at jlab.org
Fri Sep 23 14:35:05 PDT 2022


Hi Ian-
It looks to me like the hardware RAID array is giving ZFS back data that is not what ZFS thinks it wrote.  Since, from ZFS's perspective, there is no redundancy in the pool, only what the RAID array returns, ZFS cannot reconstruct the file to its satisfaction, and rather than return data it considers corrupt, it refuses to allow the file to be accessed at all.  Lustre, which relies on the lower layers for redundancy, expects the file to be accessible, and it is not.
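
As an aside, on a pool with a single top-level vdev, ZFS keeps checksums but has no second copy to repair from.  A metadata-critical dataset on such a pool is sometimes given an extra in-pool copy so ZFS has something to self-heal from, along the lines of:

zfs set copies=2 LustreMDT/mdt0-work   # keeps two copies of each newly written block; does not repair existing corruption and roughly doubles the dataset's space usage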
-Laura

________________________________________
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Ian Yi-Feng Chang via lustre-discuss <lustre-discuss at lists.lustre.org>
Sent: Wednesday, 21 September 2022 10:53
To: Robert Anderson; lustre-discuss at lists.lustre.org
Subject: [EXTERNAL] Re: [lustre-discuss] ZFS file error of MDT

Thanks, Robert, for the feedback. Actually, I do not know about Lustre at all.
I am also trying to contact the engineer who built the Lustre system for more details about the drives.
To my knowledge, the LustreMDT pool is a group of 4 SSDs in hardware RAID5, presented as /dev/mapper/SSD.
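
If it helps, a quick way to double-check what actually backs the pool, assuming the /dev/mapper/SSD name above, would be something like:

zpool status -v LustreMDT    # shows the vdev(s) that make up the pool
lsblk /dev/mapper/SSD        # shows the device-mapper device and its size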

I can manually mount LustreMDT/mdt0-work with the following steps:

pcs cluster standby --all                  # stop the MDS and OSS resources
zpool import LustreMDT
zfs set canmount=on LustreMDT/mdt0-work
zfs mount LustreMDT/mdt0-work
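
For reference, reversing these steps before handing things back to Pacemaker would presumably look something like the following (a sketch only; the exact resource handling depends on the cluster configuration):

zfs umount LustreMDT/mdt0-work
zfs set canmount=off LustreMDT/mdt0-work   # so the cluster's ZFS agent controls mounting again
zpool export LustreMDT
pcs cluster unstandby --all                # let Pacemaker take the resources back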

Then I ran ls on /LustreMDT/mdt0-work/oi.3/0x200000003:0x2:0x0 and it returned an I/O error, but other files look fine.
[root at mds1 mdt0-work]# ls -ahlt "/LustreMDT/mdt0-work/oi.3/0x200000003:0x2:0x0"
ls: reading directory /LustreMDT/mdt0-work/oi.3/0x200000003:0x2:0x0: Input/output error
total 23M
drwxr-xr-x 2 root root 2 Jan  1  1970 .
drwxr-xr-x 0 root root 0 Jan  1  1970 ..

Is this the drive failure situation you were referring to?

Best,
Ian


On Wed, Sep 21, 2022 at 9:32 PM Robert Anderson <roberta at usnh.edu> wrote:
I could be reading your zpool status output wrong, but it looks like you had 2 drives in that pool. Not mirrored, so no fault tolerance. Any drive failure would lose half of the pool data.

Unless you can get that drive working, you are missing half of your data and have no resilience to errors, with nothing to recover from.

However you proceed, you should ensure that you have a mirrored ZFS pool, or more drives and raidz (I like raidz2).
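
For illustration only, a ZFS-managed layout for a rebuilt MDT pool might look like one of the following; the device names are placeholders, and options such as ashift have to match the real drives:

zpool create -o ashift=12 LustreMDT mirror /dev/disk/by-id/ssd0 /dev/disk/by-id/ssd1
zpool create -o ashift=12 LustreMDT raidz2 /dev/disk/by-id/ssd0 /dev/disk/by-id/ssd1 /dev/disk/by-id/ssd2 /dev/disk/by-id/ssd3   # tolerates two device failures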


On September 20, 2022 11:57:09 PM Ian Yi-Feng Chang via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:

Dear All,
I think this problem is more related to ZFS, but I would like to ask for help from experts in all fields.
Our MDT stopped working properly after the IB switch was accidentally rebooted (power issue).
Everything else looks good, except that the MDT cannot be started.
Our MDT's ZFS pool has no backup or snapshot.
Could this problem be fixed, and if so, how?

Thanks for your help in advance.

Best,
Ian

Lustre: Build Version: 2.10.4
OS: CentOS Linux release 7.5.1804 (Core)
uname -r: 3.10.0-862.el7.x86_64


[root at mds1 etc]# pcs status
Cluster name: mdsgroup01
Stack: corosync
Current DC: mds1 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Sep 21 11:46:25 2022
Last change: Wed Sep 21 11:46:13 2022 by root via cibadmin on mds1

2 nodes configured
9 resources configured

Online: [ mds1 mds2 ]

Full list of resources:

 Resource Group: group-MDS
     zfs-LustreMDT      (ocf::heartbeat:ZFS):   Started mds1
     MGT        (ocf::lustre:Lustre):   Started mds1
     MDT        (ocf::lustre:Lustre):   Stopped
 ipmi-fencingMDS1       (stonith:fence_ipmilan):        Started mds2
 ipmi-fencingMDS2       (stonith:fence_ipmilan):        Started mds2
 Clone Set: healthLUSTRE-clone [healthLUSTRE]
     Started: [ mds1 mds2 ]
 Clone Set: healthLNET-clone [healthLNET]
     Started: [ mds1 mds2 ]

Failed Actions:
* MDT_start_0 on mds1 'unknown error' (1): call=44, status=complete, exitreason='',
    last-rc-change='Tue Sep 20 15:01:51 2022', queued=0ms, exec=317ms
* MDT_start_0 on mds2 'unknown error' (1): call=48, status=complete, exitreason='',
    last-rc-change='Tue Sep 20 14:38:18 2022', queued=0ms, exec=25168ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled



After running zpool scrub on the MDT pool, zpool status -v reported:

  pool: LustreMDT
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0h35m with 1 errors on Wed Sep 21 09:38:24 2022
config:

        NAME        STATE     READ WRITE CKSUM
        LustreMDT   ONLINE       0     0     2
          SSD       ONLINE       0     0     8

errors: Permanent errors have been detected in the following files:

        LustreMDT/mdt0-work:/oi.3/0x200000003:0x2:0x0
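
Per the "action" line above, once the affected file has been restored or removed, verifying that the pool is otherwise clean, assuming no further device errors, would look something like:

zpool clear LustreMDT        # reset the error counters
zpool scrub LustreMDT        # re-read and verify everything in the pool
zpool status -v LustreMDT    # check whether the permanent-error list has cleared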



# dmesg -T
[Tue Sep 20 15:01:43 2022] Lustre: Lustre: Build Version: 2.10.4
[Tue Sep 20 15:01:43 2022] LNet: Using FMR for registration
[Tue Sep 20 15:01:43 2022] LNet: Added LNI 172.29.32.21 at o2ib [8/256/0/180]
[Tue Sep 20 15:01:50 2022] Lustre: MGS: Connection restored to b5823059-e620-64ac-79f6-e5282f2fa442 (at 0 at lo)
[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(llog.c:1296:llog_backup()) MGC172.29.32.21 at o2ib: failed to open log work-MDT0000: rc = -5
[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(mgc_request.c:1897:mgc_llog_local_copy()) MGC172.29.32.21 at o2ib: failed to copy remote log work-MDT0000: rc = -5
[Tue Sep 20 15:01:50 2022] LustreError: 13a-8: Failed to get MGS log work-MDT0000 and no local copy.
[Tue Sep 20 15:01:50 2022] LustreError: 15c-8: MGC172.29.32.21 at o2ib: The configuration from log 'work-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount_server.c:1386:server_start_targets()) failed to start server work-MDT0000: -2
[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount_server.c:1879:server_fill_super()) Unable to start targets: -2
[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount_server.c:1589:server_put_super()) no obd work-MDT0000
[Tue Sep 20 15:01:50 2022] Lustre: server umount work-MDT0000 complete
[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-2)
[Tue Sep 20 15:01:56 2022] Lustre: 4112:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1663657311/real 1663657311]  req at ffff8d6f0e728000 x1744471122247856/t0(0) o251->MGC172.29.32.21 at o2ib@0 at lo:26/25 lens 224/224 e 0 to 1 dl 1663657317 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[Tue Sep 20 15:01:56 2022] Lustre: server umount MGS complete
[Tue Sep 20 15:02:29 2022] Lustre: MGS: Connection restored to b5823059-e620-64ac-79f6-e5282f2fa442 (at 0 at lo)
[Tue Sep 20 15:02:54 2022] Lustre: MGS: Connection restored to 28ec81ea-0d51-d721-7be2-4f557da2546d (at 172.29.32.1 at o2ib)

