[lustre-discuss] Help with recovery of data

Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] darby.vicker-1 at nasa.gov
Tue Jun 21 16:27:03 PDT 2022


Hi everyone,

We ran into a problem with our lustre filesystem this weekend and could use a sanity check and/or advice on recovery.

We are running CentOS 7.9, ZFS 2.1.4 and Lustre 2.14. We are using ZFS OSTs but an ldiskfs MDT (for better MDT performance). For various reasons, the ldiskfs is built on a zvol. Every night we (intend to) back up the metadata by snapshotting the zvol, mounting the MDT via ldiskfs, tarring up the contents, then unmounting and removing the ZFS snapshot (a rough sketch of that flow is below). On Sunday (6/19 at about 4 pm), the metadata server crashed. It came back up fine, but users started reporting many missing files and directories today (6/21): everything since about February 9th is gone. After quite a bit of investigation, it looks like the MDT got rolled back to a snapshot of the metadata from February.

[root@hpfs-fsl-mds1 ~]# zfs list -t snap mds1-0/meta-scratch
NAME                      USED  AVAIL     REFER  MOUNTPOINT
mds1-0/meta-scratch@snap  52.3G      -     1.34T  -
[root@hpfs-fsl-mds1 ~]# zfs get all mds1-0/meta-scratch@snap | grep creation
mds1-0/meta-scratch@snap  creation              Thu Feb 10  3:35 2022          -
[root@hpfs-fsl-mds1 ~]#
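
For reference, the intended nightly flow is roughly the following. This is only a sketch, not our actual script: the mountpoint is made up, and the snapshot device path assumes snapdev=visible on the zvol.

zfs snapshot mds1-0/meta-scratch@snap
# mount the snapshot read-only via ldiskfs (may also need noload to skip journal replay)
mount -t ldiskfs -o ro /dev/zvol/mds1-0/meta-scratch@snap /mnt/mdt-backup
# tar up the metadata, preserving the Lustre EAs; the name matches the backups listed further down
tar --xattrs --xattrs-include='trusted.*' -cf /internal/ldiskfs_backups/mds1-0_meta-scratch-$(date +%Y_%m_%d).tar -C /mnt/mdt-backup .
umount /mnt/mdt-backup
zfs destroy mds1-0/meta-scratch@snap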

We discovered that our MDT backups have been stalled since February: the first step is to create mds1-0/meta-scratch@snap, and that snapshot already existed, so the script was erroring out with the old snapshot still in place. We have rebooted this MDS several times (gracefully) since February with no issues, but apparently whatever happened in the server crash on Sunday caused the MDT to revert to the February data. So, in theory, the data on the OSTs is still there; we are just missing the metadata due to the ZFS glitch.

So the first question: is anyone familiar with this failure mode of ZFS, or with a way to recover from it? I think it's unlikely there are any direct ZFS recovery options, but I wanted to ask.
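
One thing we can check on our end is the pool's command history, to see whether an explicit rollback was ever issued rather than the dataset just coming back at the snapshot's contents after the crash:

zpool history mds1-0 | grep -iE 'rollback|snapshot'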

Obviously, MDT backups would be our best recovery option, but since this was all caused by the backup script stalling (and the subsequent rollback to the last snapshot), our backups are the same age as the current data on the filesystem.

[root@hpfs-fsl-mds1 ~]# ls -lrt /internal/ldiskfs_backups/
total 629789909
-rw-r--r-- 1 root root         1657 Apr 30  2019 process.txt
-rw-r--r-- 1 root root 445317560320 Jan 25 15:36 mds1-0_meta-scratch-2022_01_25.tar
-rw-r--r-- 1 root root 446230016000 Jan 26 15:31 mds1-0_meta-scratch-2022_01_26.tar
-rw-r--r-- 1 root root 448093808640 Jan 27 15:46 mds1-0_meta-scratch-2022_01_27.tar
-rw-r--r-- 1 root root 440368783360 Jan 28 16:56 mds1-0_meta-scratch-2022_01_28.tar
-rw-r--r-- 1 root root 442342113280 Jan 29 14:45 mds1-0_meta-scratch-2022_01_29.tar
-rw-r--r-- 1 root root 442922567680 Jan 30 15:03 mds1-0_meta-scratch-2022_01_30.tar
-rw-r--r-- 1 root root 443076515840 Jan 31 15:17 mds1-0_meta-scratch-2022_01_31.tar
-rw-r--r-- 1 root root 444589025280 Feb  1 15:11 mds1-0_meta-scratch-2022_02_01.tar
-rw-r--r-- 1 root root 443741409280 Feb  2 15:17 mds1-0_meta-scratch-2022_02_02.tar
-rw-r--r-- 1 root root 448209367040 Feb  3 15:24 mds1-0_meta-scratch-2022_02_03.tar
-rw-r--r-- 1 root root 453777090560 Feb  4 15:55 mds1-0_meta-scratch-2022_02_04.tar
-rw-r--r-- 1 root root 454211307520 Feb  5 14:37 mds1-0_meta-scratch-2022_02_05.tar
-rw-r--r-- 1 root root 454619084800 Feb  6 14:30 mds1-0_meta-scratch-2022_02_06.tar
-rw-r--r-- 1 root root 455459276800 Feb  7 15:26 mds1-0_meta-scratch-2022_02_07.tar
-rw-r--r-- 1 root root 457470945280 Feb  8 15:07 mds1-0_meta-scratch-2022_02_08.tar
-rw-r--r-- 1 root root 460592517120 Feb  9 15:21 mds1-0_meta-scratch-2022_02_09.tar
-rw-r--r-- 1 root root 332377712640 Feb 10 12:04 mds1-0_meta-scratch-2022_02_10.tar
[root@hpfs-fsl-mds1 ~]#


Yes, I know, we will put in some monitoring for this in the future...
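
Something as simple as a guard at the top of the backup script would have caught this months earlier. A sketch (the alert address is a placeholder):

# refuse to run if the previous night's snapshot is still hanging around
if zfs list -t snapshot mds1-0/meta-scratch@snap >/dev/null 2>&1; then
    echo "stale MDT snapshot found; last backup did not complete" | mail -s "MDT backup stalled on $(hostname)" admin@example.com
    exit 1
fi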

Fortunately, we also have a robinhood system syncing with this filesystem. The sync is fairly up to date: the logs indicate the last sync was a few days ago, and I've used rbh-find to find some files that were created in the last few days. So I think we have a shot at recovery. We have this command running now to see what it will do:

rbh-diff --apply=fs --dry-run --scan=/scratch-lustre
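
My understanding is that if the dry-run output looks sane, the apply step is the same command without --dry-run:

rbh-diff --apply=fs --scan=/scratch-lustre

which I assume recreates the missing namespace entries from the robinhood DB and re-attaches them to the existing OST objects.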

But the dry run has already been running a long time with no output. Our filesystem is fairly large:


[root@hpfs-fsl-lmon0 ~]# lfs df -h /scratch-lustre
UUID                       bytes        Used   Available Use% Mounted on
scratch-MDT0000_UUID     1011.8G       82.6G      826.7G  10% /scratch-lustre[MDT:0]
scratch-OST0000_UUID       49.6T       16.2T       33.4T  33% /scratch-lustre[OST:0]
scratch-OST0001_UUID       49.6T       17.4T       32.3T  35% /scratch-lustre[OST:1]
scratch-OST0002_UUID       49.6T       16.8T       32.8T  34% /scratch-lustre[OST:2]
scratch-OST0003_UUID       49.6T       17.2T       32.4T  35% /scratch-lustre[OST:3]
scratch-OST0004_UUID       49.6T       16.7T       32.9T  34% /scratch-lustre[OST:4]
scratch-OST0005_UUID       49.6T       16.9T       32.7T  35% /scratch-lustre[OST:5]
scratch-OST0006_UUID       49.6T       16.4T       33.2T  34% /scratch-lustre[OST:6]
scratch-OST0007_UUID       49.6T       15.6T       34.0T  32% /scratch-lustre[OST:7]
scratch-OST0008_UUID       49.6T       16.2T       33.4T  33% /scratch-lustre[OST:8]
scratch-OST0009_UUID       49.6T       16.4T       33.2T  34% /scratch-lustre[OST:9]
scratch-OST000a_UUID       49.6T       15.8T       33.8T  32% /scratch-lustre[OST:10]
scratch-OST000b_UUID       49.6T       17.4T       32.2T  36% /scratch-lustre[OST:11]
scratch-OST000c_UUID       49.6T       17.1T       32.5T  35% /scratch-lustre[OST:12]
scratch-OST000d_UUID       49.6T       15.8T       33.8T  32% /scratch-lustre[OST:13]
scratch-OST000e_UUID       49.6T       15.7T       33.9T  32% /scratch-lustre[OST:14]
scratch-OST000f_UUID       49.6T       16.4T       33.2T  33% /scratch-lustre[OST:15]
scratch-OST0010_UUID       49.6T       15.5T       34.1T  32% /scratch-lustre[OST:16]
scratch-OST0011_UUID       49.6T       16.6T       33.1T  34% /scratch-lustre[OST:17]
scratch-OST0012_UUID       49.6T       16.4T       33.2T  34% /scratch-lustre[OST:18]
scratch-OST0013_UUID       48.4T       16.3T       32.1T  34% /scratch-lustre[OST:19]
scratch-OST0014_UUID       49.6T       15.1T       34.5T  31% /scratch-lustre[OST:20]
scratch-OST0015_UUID       49.6T       16.0T       33.6T  33% /scratch-lustre[OST:21]
scratch-OST0016_UUID       49.6T       15.2T       34.4T  31% /scratch-lustre[OST:22]
scratch-OST0017_UUID       49.6T       16.1T       33.5T  33% /scratch-lustre[OST:23]

filesystem_summary:         1.2P      391.1T      798.3T  33% /scratch-lustre

[root@hpfs-fsl-lmon0 ~]#


We still have the robinhood process running (syncing the filesystem and the SQL DB), but we've unmounted the Lustre filesystem from all user-facing machines, so there should be no further changes to the filesystem.

Does anyone have experience recovering from this kind of situation with robinhood?

FWIW, the SQL DB that robinhood lives on is also on a ZFS filesystem that we snapshot. We don't have much history, and it's unclear where the current RBH scans sit relative to the data loss, but it's likely the SQL DB in the oldest snapshot below would not be affected by the 6/19 reboot event. (I sketch one possible way to use those snapshots after the listing.)


[root@hpfs-fsl-lmon0 ~]# zfs list -t snap
NAME                                          USED  AVAIL  REFER  MOUNTPOINT
lmon0-0/mysql@zincrsend_2022-06-20-16:01:01  26.8G      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-17:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-18:01:01   557K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-19:01:01   558K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-20:01:01   558K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-21:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-22:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-23:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-00:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-01:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-02:01:02   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-03:01:01   561K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-04:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-05:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-06:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-07:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-08:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-09:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-10:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-11:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-12:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-13:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-14:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-15:01:01   518K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-16:01:01   512K      -   105G  -
[root@hpfs-fsl-lmon0 ~]#
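
If we do need to fall back to one of those DB snapshots, I assume the least disruptive approach is to clone the snapshot read-write and point a second MySQL instance at the clone, rather than rolling back the live dataset. Something like this (the mountpoint, socket and port are made up):

zfs clone lmon0-0/mysql@zincrsend_2022-06-20-16:01:01 lmon0-0/mysql_recovery
zfs set mountpoint=/var/lib/mysql_recovery lmon0-0/mysql_recovery
# run a second mysqld against the cloned data directory for inspection
mysqld_safe --datadir=/var/lib/mysql_recovery --socket=/tmp/mysql_recovery.sock --port=3307 &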


Is RBH our best recovery option?

Would lfsck recover from this situation?  I don’t think so...
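
(My understanding, possibly wrong, is that a layout LFSCK with orphan handling, something like

lctl lfsck_start -M scratch-MDT0000 -t layout -o

only re-links otherwise-orphaned OST objects under .lustre/lost+found without their original names or paths, which is why I don't think it helps here.)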

Any advice on recovery would be appreciated.

Thanks,
Darby