<div dir="ltr">Thanks, Robert, for the feedback. To be honest, I am not familiar with Lustre at all. <div>I am also trying to contact the engineer who built the Lustre system for more information about the drives.<br clear="all"><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>To my knowledge, the LustreMDT pool is a group of 4 SSDs (named /dev/mapper/SSD) with hardware RAID5.</div><div><br></div><div>I can manually mount LustreMDT/mdt0-work with the following steps:</div><div><br></div><div>pcs cluster standby --all (Stop MDS and OSS)</div><div>zpool import LustreMDT</div><div>zfs set canmount=on LustreMDT/mdt0-work</div><div>zfs mount LustreMDT/mdt0-work</div><div><br></div><div>When I then ran ls on /LustreMDT/mdt0-work/oi.3/0x200000003:0x2:0x0, it returned an I/O error, but other files look fine.<br>[root@mds1 mdt0-work]# ls -ahlt "/LustreMDT/mdt0-work/oi.3/0x200000003:0x2:0x0"<br>ls: reading directory /LustreMDT/mdt0-work/oi.3/0x200000003:0x2:0x0: Input/output error<br>total 23M<br>drwxr-xr-x 2 root root 2 Jan  1  1970 .<br>drwxr-xr-x 0 root root 0 Jan  1  1970 ..<br></div><div><br></div><div><div>Is this the drive failure situation you are referring to?</div><div></div></div><div><br></div><div dir="ltr">Best,<div>Ian</div></div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Sep 21, 2022 at 9:32 PM Robert Anderson <<a href="mailto:roberta@usnh.edu">roberta@usnh.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><u></u>

<div style="font-size:12pt">

<div dir="auto">

<div dir="auto">I could be reading your zpool status output wrong, but it looks like you had 2 drives in that pool. Not mirrored, so no fault tolerance. Any drive failure would lose half of the pool data. </div>

<div dir="auto"><br>

</div>

<div dir="auto">Unless you can get that drive working you are missing half of your data and have no resilience to errors, nothing to recover from. </div>

<div dir="auto"><br>

</div>

<div dir="auto">However you proceed, you should ensure that you have a mirrored ZFS pool, or more drives with raidz (I like raidz2). </div>

<div dir="auto"><br>

</div>
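<div dir="auto">A minimal sketch of how the redundancy point above could be checked from zpool status output: it lists the vdevs directly under the pool row and reports whether each one is a mirror or raidz group. The helper names (top_level_vdevs, has_zfs_redundancy) are hypothetical, the sample text is the config section quoted further down in this thread, and the parser assumes the usual two-character indentation of vdevs under the pool row. It only sees ZFS-level redundancy, so it cannot tell whether a device is backed by hardware RAID.</div>
<pre>
```python
# Sample copied from the zpool status output quoted in this thread.
SAMPLE = """\
config:

        NAME        STATE     READ WRITE CKSUM
        LustreMDT   ONLINE       0     0     2
          SSD       ONLINE       0     0     8

errors: Permanent errors have been detected in the following files:
"""

# Assumption: redundant top-level vdevs are named mirror-N or raidzK-N.
REDUNDANT_PREFIXES = ("mirror", "raidz")


def top_level_vdevs(status_text, pool):
    """Return names of the vdevs indented directly under the pool row."""
    vdevs = []
    pool_indent = None
    in_config = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped == "config:":
            in_config = True
            continue
        if not in_config or not stripped:
            continue
        if not line[0].isspace():
            break  # an unindented line (e.g. "errors:") ends the config table
        indent = len(line) - len(line.lstrip())
        name = stripped.split()[0]
        if name == pool and pool_indent is None:
            pool_indent = indent
        elif pool_indent is not None and indent == pool_indent + 2:
            vdevs.append(name)
    return vdevs


def has_zfs_redundancy(vdevs):
    """True only if every top-level vdev is a mirror or raidz group."""
    return bool(vdevs) and all(v.startswith(REDUNDANT_PREFIXES) for v in vdevs)


vdevs = top_level_vdevs(SAMPLE, "LustreMDT")
print(vdevs, has_zfs_redundancy(vdevs))
```
</pre>
<div dir="auto">For the pool in this thread it prints ['SSD'] False: a single top-level vdev with no mirror or raidz group at the ZFS level, so a scrub can detect a checksum error but has no second copy to repair it from.</div>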

<div id="m_1963718352064535580aqm-original" style="color:black">

<div>

<div style="color:black">

<p style="color:black;font-size:10pt;font-family:sans-serif;margin:8pt 0px">

On September 20, 2022 11:57:09 PM Ian Yi-Feng Chang via lustre-discuss <<a href="mailto:lustre-discuss@lists.lustre.org" target="_blank">lustre-discuss@lists.lustre.org</a>> wrote:</p>

<blockquote type="cite" class="gmail_quote" style="margin:0px 0px 0px 0.75ex;border-left:1px solid rgb(128,128,128);padding-left:0.75ex">

<p></p>


<br>

<p></p>

<div>

<div dir="ltr">Dear All,

<div>I think this problem is more related to ZFS, but I would like to ask for help from experts in all fields.</div>

<div>Our MDT has not worked properly since the IB switch was accidentally rebooted (power issue). <br>

</div>

<div>Everything looks good except that the MDT cannot be started.</div>

<div>Our MDT's ZFS didn't have a backup or snapshot. <br>

</div>

<div>Can this problem be fixed, and if so, how?</div>

<div><br>

</div>

<div>Thanks for your help in advance.</div>

<div><br>

Best,

<div>Ian</div>

</div>

<div><br>

</div>

<div>

<div>Lustre: Build Version: 2.10.4<br>

OS: CentOS Linux release 7.5.1804 (Core)<br>

uname -r: 3.10.0-862.el7.x86_64</div>

<div><br>

</div>

</div>

<div><br>

</div>

<div>[root@mds1 etc]# pcs status<br>

Cluster name: mdsgroup01<br>

Stack: corosync<br>

Current DC: mds1 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum<br>

Last updated: Wed Sep 21 11:46:25 2022<br>

Last change: Wed Sep 21 11:46:13 2022 by root via cibadmin on mds1<br>

<br>

2 nodes configured<br>

9 resources configured<br>

<br>

Online: [ mds1 mds2 ]<br>

<br>

Full list of resources:<br>

<br>

 Resource Group: group-MDS<br>

     zfs-LustreMDT      (ocf::heartbeat:ZFS):   Started mds1<br>

     MGT        (ocf::lustre:Lustre):   Started mds1<br>

     MDT        (ocf::lustre:Lustre):   Stopped<br>

 ipmi-fencingMDS1       (stonith:fence_ipmilan):        Started mds2<br>

 ipmi-fencingMDS2       (stonith:fence_ipmilan):        Started mds2<br>

 Clone Set: healthLUSTRE-clone [healthLUSTRE]<br>

     Started: [ mds1 mds2 ]<br>

 Clone Set: healthLNET-clone [healthLNET]<br>

     Started: [ mds1 mds2 ]<br>

<br>

Failed Actions:<br>

* MDT_start_0 on mds1 'unknown error' (1): call=44, status=complete, exitreason='',<br>

    last-rc-change='Tue Sep 20 15:01:51 2022', queued=0ms, exec=317ms<br>

* MDT_start_0 on mds2 'unknown error' (1): call=48, status=complete, exitreason='',<br>

    last-rc-change='Tue Sep 20 14:38:18 2022', queued=0ms, exec=25168ms<br>

<br>

<br>

Daemon Status:<br>

  corosync: active/enabled<br>

  pacemaker: active/enabled<br>

  pcsd: active/enabled<br>

</div>

<div><br>

</div>

<div>

<div><br>

</div>

<div><br>

</div>

<div>After running zpool scrub on the MDT pool, zpool status -v reported:<br>

<br>

  pool: LustreMDT<br>

 state: ONLINE<br>

status: One or more devices has experienced an error resulting in data<br>

        corruption.  Applications may be affected.<br>

action: Restore the file in question if possible.  Otherwise restore the<br>

        entire pool from backup.<br>

   see: <a href="http://zfsonlinux.org/msg/ZFS-8000-8A" target="_blank">http://zfsonlinux.org/msg/ZFS-8000-8A</a><br>

  scan: scrub repaired 0B in 0h35m with 1 errors on Wed Sep 21 09:38:24 2022<br>

config:<br>

<br>

        NAME        STATE     READ WRITE CKSUM<br>

        LustreMDT   ONLINE       0     0     2<br>

          SSD       ONLINE       0     0     8<br>

<br>

errors: Permanent errors have been detected in the following files:<br>

<br>

        LustreMDT/mdt0-work:/oi.3/0x200000003:0x2:0x0

<div><br>

</div>

</div>

<div><br>

</div>

<div><br>

</div>

<div># dmesg -T<br>

[Tue Sep 20 15:01:43 2022] Lustre: Lustre: Build Version: 2.10.4<br>

[Tue Sep 20 15:01:43 2022] LNet: Using FMR for registration<br>

[Tue Sep 20 15:01:43 2022] LNet: Added LNI 172.29.32.21@o2ib [8/256/0/180]<br>

[Tue Sep 20 15:01:50 2022] Lustre: MGS: Connection restored to b5823059-e620-64ac-79f6-e5282f2fa442 (at 0@lo)<br>

[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(llog.c:1296:llog_backup()) MGC172.29.32.21@o2ib: failed to open log work-MDT0000: rc = -5<br>

[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(mgc_request.c:1897:mgc_llog_local_copy()) MGC172.29.32.21@o2ib: failed to copy remote log work-MDT0000: rc = -5<br>

[Tue Sep 20 15:01:50 2022] LustreError: 13a-8: Failed to get MGS log work-MDT0000 and no local copy.<br>

[Tue Sep 20 15:01:50 2022] LustreError: 15c-8: MGC172.29.32.21@o2ib: The configuration from log 'work-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for

 more information.<br>

[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount_server.c:1386:server_start_targets()) failed to start server work-MDT0000: -2<br>

[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount_server.c:1879:server_fill_super()) Unable to start targets: -2<br>

[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount_server.c:1589:server_put_super()) no obd work-MDT0000<br>

[Tue Sep 20 15:01:50 2022] Lustre: server umount work-MDT0000 complete<br>

[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-2)<br>

[Tue Sep 20 15:01:56 2022] Lustre: 4112:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1663657311/real 1663657311]  req@ffff8d6f0e728000 x1744471122247856/t0(0) o251->MGC172.29.32.21@o2ib@0@lo:26/25 lens 224/224

 e 0 to 1 dl 1663657317 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1<br>

[Tue Sep 20 15:01:56 2022] Lustre: server umount MGS complete<br>

[Tue Sep 20 15:02:29 2022] Lustre: MGS: Connection restored to b5823059-e620-64ac-79f6-e5282f2fa442 (at 0@lo)<br>

[Tue Sep 20 15:02:54 2022] Lustre: MGS: Connection restored to 28ec81ea-0d51-d721-7be2-4f557da2546d (at 172.29.32.1@o2ib)<br clear="all">


</div>

<div><br>

</div>

</div>

</div>

</div>

</blockquote>

</div>

</div>

</div>

<div dir="auto"><br>

</div>

</div>

</div>

</blockquote></div>