[lustre-discuss] [EXTERNAL] Re: Help with recovery of data

Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] darby.vicker-1 at nasa.gov
Wed Jun 22 20:52:59 PDT 2022


Thanks Andreas – I appreciate the info.

I am dd’ing the MDT block devices (both of them – more details below) to separate storage now.

I’ve written this up on the ZFS mailing list.

https://zfsonlinux.topicbox.com/groups/zfs-discuss/Tcb8a3ef663db0031/need-help-with-data-recovery-if-possible

Actually, in the process of doing that, I think I see what is going on.  More details are in the ZFS post, but it looks like the block device names for the ZFS volumes got swapped on the crash and reboot.  So /dev/zd0 is a clone of the February snapshot and /dev/zd16 is actually our primary (current) MDT.  If I mount zd16 and poke around, I see lots of files newer than February.


[root at hpfs-fsl-mds1 ~]# mount -t ldiskfs -o ro /dev/zd16 /mnt/mdt_backup/
[root at hpfs-fsl-mds1 ~]# cd /mnt/mdt_backup/
[root at hpfs-fsl-mds1 mdt_backup]# ls -l PENDING/
total 0
-rw------- 1 ecdavis2 damocles 0 Jun 17 11:03 0x200021094:0x3b26:0x0
-rw------- 1 rharpold rharpold 0 Jun 14 12:27 0x200021096:0x1337:0x0
[root at hpfs-fsl-mds1 mdt_backup]#
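
For completeness, I also plan to double-check the zvol-to-device mapping via the symlinks ZFS creates under /dev/zvol – just a sketch, since our actual pool and volume names may differ:

ls -lR /dev/zvol/          # each zvol symlink should point at its /dev/zdN node
zfs list -t volume         # list the zvols themselves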

So it looks like we have a shot at recovery.  I hope to get more guidance on the ZFS list on how to properly swap zd0 and zd16 back.  I’m also tarring up the contents of the read-only mount of zd16.  In all:

dd if=/dev/zd0 of=/internal/zd0.dd.2022.06.22 bs=1M
dd if=/dev/zd16 of=/internal/zd16.dd.2022.06.22 bs=1M
cd /mnt/mdt_backup ; tar cf /internal/zd16.tar --xattrs --xattrs-include="trusted.*" --sparse .
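
Once those finish, I’ll do a quick sanity check that the images actually match the devices before touching anything – untested, just the plan:

cmp /dev/zd0  /internal/zd0.dd.2022.06.22  && echo zd0 image OK
cmp /dev/zd16 /internal/zd16.dd.2022.06.22 && echo zd16 image OK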

Please let me know if there is something else we should consider doing before attempting recovery.

Actually, I’m 100% certain this is our current MDT.  I see files and directories in /mnt/mdt_backup/ROOT that were just created in the last couple weeks.  Happy day.

One other question.  We are seeing a ton of these in the MDS logs since the crash.

Jun 22 21:53:16 hpfs-fsl-mds1 kernel: LustreError: 14346:0:(qmt_handler.c:699:qmt_dqacq0()) $$$ Release too much! uuid:scratch-MDT0000-lwp-OST000f_UUID release: 67108864 granted:0, total:0  qmt:scratch-QMT0000 pool:dt-0x0 id:5697 enforced:0 hard:0 soft:0 granted:0 time:0 qunit: 0 edquot:0 may_rel:0 revoke:0 default:yes

I assume this is not unexpected with an MDT that got reverted?

From: Andreas Dilger <adilger at whamcloud.com>
Date: Wednesday, June 22, 2022 at 4:48 PM
To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" <darby.vicker-1 at nasa.gov>
Cc: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: [EXTERNAL] Re: [lustre-discuss] Help with recovery of data

The first thing, if you haven't already done so, would be to make a separate "dd" backup of the ldiskfs MDT(s) to some external storage before you do anything else.  That will give you a fallback in case whatever changes you make don't work out well.

I would also suggest contacting the ZFS mailing list to ask if they can help restore the "new version" of the MDT at the ZFS level.  You may also want to consider a separate ZFS-level backup, because the core of the problem appears to be ZFS-related.  Unfortunately, the chance of recovering a newer version of the ldiskfs MDT at the ZFS level declines the more changes are made to the ZFS pool.
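
For the ZFS-level copy, something along these lines would work (a sketch only – substitute your actual pool/zvol names, which I don't know):

zfs snapshot <pool>/<mdt-zvol>@pre-recovery-2022-06-22
zfs send <pool>/<mdt-zvol>@pre-recovery-2022-06-22 > /external/mdt.zfssend.2022.06.22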

I don't think LFSCK will repair the missing files on the MDT, since the OSTs don't have enough information to regenerate the namespace.  At most, LFSCK will create stub files on the MDT under .lustre/lost+found that connect to the objects for the new files created after your MDT snapshot, but they won't have proper filenames.  They will only have UID/GID/timestamps to identify the owners/age, and the users would need to identify the files by content.
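
If you do go down that path, the relevant step would be the layout LFSCK with orphan handling, roughly as follows (a sketch, run on the MDS – adjust the target name to your setup):

lctl lfsck_start -M scratch-MDT0000 -t layout -o    # link orphan OST objects as stub files under .lustre/lost+found
lctl get_param mdd.scratch-MDT0000.lfsck_layout     # monitor progress/status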



