[lustre-discuss] [BULK] Re: [EXTERNAL] Re: Data recovery with lost MDT data

Mon Sep 25 07:52:37 PDT 2023

Our lfsck finished.  It repaired a lot, and we have over 13 million files in lost+found to go through.  I'll be writing a script to move these somewhere accessible to the users, grouped by owner and probably by date too (trying not to get too many files in a single directory).  Thanks again for the help with this.
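
One possible shape for such a triage script, sketched in Python. Nothing here is the script actually used at the site: the destination layout (UID/YYYY-MM/NNNN), the cap of 10000 entries per directory, and the names plan_moves and apply_plan are all assumptions made for illustration. Planning and moving are kept separate so the plan can be reviewed first.

```python
import os
import shutil
import time
from collections import defaultdict

MAX_PER_DIR = 10000  # assumed cap on entries per destination directory


def plan_moves(src, dest_root, max_per_dir=MAX_PER_DIR):
    """Return (source, destination) pairs grouped by owner UID and month."""
    counts = defaultdict(int)  # (uid, month) -> entries planned so far
    plan = []
    with os.scandir(src) as entries:
        for entry in entries:
            st = entry.stat(follow_symlinks=False)
            uid = str(st.st_uid)  # numeric UID; map to a user name if preferred
            month = time.strftime("%Y-%m", time.gmtime(st.st_mtime))
            # A numbered bucket keeps each directory below max_per_dir entries.
            bucket = "%04d" % (counts[(uid, month)] // max_per_dir)
            counts[(uid, month)] += 1
            dest = os.path.join(dest_root, uid, month, bucket, entry.name)
            plan.append((entry.path, dest))
    return plan


def apply_plan(plan):
    """Carry out the moves; run only after reviewing the plan."""
    for source, dest in plan:
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.move(source, dest)
```

With 13 million entries the plan list itself is large, so a real script would probably stream the moves or work in batches rather than hold everything in memory.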

For the benefit of others, this is how we started our lfsck:

[root at hpfs-fsl-mds1 hpfs3-eg3]# lctl set_param printk=+lfsck
[root at hpfs-fsl-mds1 hpfs3-eg3]# lctl lfsck_start -M scratch-MDT0000 -o
Started LFSCK on the device scratch-MDT0000: scrub layout namespace
[root at hpfs-fsl-mds1 hpfs3-eg3]#

It took most of the weekend to run.  Here are the results.

[root at hpfs-fsl-mds1 ~]# lctl lfsck_query -M scratch-MDT0000
layout_mdts_init: 0
layout_mdts_scanning-phase1: 0
layout_mdts_scanning-phase2: 0
layout_mdts_completed: 1
layout_mdts_failed: 0
layout_mdts_stopped: 0
layout_mdts_paused: 0
layout_mdts_crashed: 0
layout_mdts_partial: 0
layout_mdts_co-failed: 0
layout_mdts_co-stopped: 0
layout_mdts_co-paused: 0
layout_mdts_unknown: 0
layout_osts_init: 0
layout_osts_scanning-phase1: 0
layout_osts_scanning-phase2: 0
layout_osts_completed: 22
layout_osts_failed: 0
layout_osts_stopped: 0
layout_osts_paused: 0
layout_osts_crashed: 0
layout_osts_partial: 2
layout_osts_co-failed: 0
layout_osts_co-stopped: 0
layout_osts_co-paused: 0
layout_osts_unknown: 0
layout_repaired: 38587653
namespace_mdts_init: 0
namespace_mdts_scanning-phase1: 0
namespace_mdts_scanning-phase2: 0
namespace_mdts_completed: 1
namespace_mdts_failed: 0
namespace_mdts_stopped: 0
namespace_mdts_paused: 0
namespace_mdts_crashed: 0
namespace_mdts_partial: 0
namespace_mdts_co-failed: 0
namespace_mdts_co-stopped: 0
namespace_mdts_co-paused: 0
namespace_mdts_unknown: 0
namespace_osts_init: 0
namespace_osts_scanning-phase1: 0
namespace_osts_scanning-phase2: 0
namespace_osts_completed: 0
namespace_osts_failed: 0
namespace_osts_stopped: 0
namespace_osts_paused: 0
namespace_osts_crashed: 0
namespace_osts_partial: 0
namespace_osts_co-failed: 0
namespace_osts_co-stopped: 0
namespace_osts_co-paused: 0
namespace_osts_unknown: 0
namespace_repaired: 1429495

[root at hpfs-fsl-mds1 ~]# lctl get_param -n mdd.scratch-MDT0000.lfsck_layout
name: lfsck_layout
magic: 0xb1732fed
version: 2
status: completed
flags:
param: all_targets,orphan
last_completed_time: 1695615657
time_since_last_completed: 35014 seconds
latest_start_time: 1695335260
time_since_latest_start: 315411 seconds
last_checkpoint_time: 1695615657
time_since_last_checkpoint: 35014 seconds
latest_start_position: 15
last_checkpoint_position: 1015668480
first_failure_position: 0
success_count: 2
repaired_dangling: 22199282
repaired_unmatched_pair: 0
repaired_multiple_referenced: 0
repaired_orphan: 13489715
repaired_inconsistent_owner: 2898656
repaired_others: 0
skipped: 0
failed_phase1: 0
failed_phase2: 1798679
checked_phase1: 369403698
checked_phase2: 15
run_time_phase1: 82842 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 4459 items/sec
average_speed_phase2: 15 objs/sec
real_time_speed_phase1: N/A
real_time_speed_phase2: N/A
current_position: N/A

[root at hpfs-fsl-mds1 ~]# lctl get_param -n mdd.scratch-MDT0000.lfsck_namespace
name: lfsck_namespace
magic: 0xa06249ff
version: 2
status: completed
flags:
param: all_targets,orphan
last_completed_time: 1695419689
time_since_last_completed: 231012 seconds
latest_start_time: 1695335262
time_since_latest_start: 315439 seconds
last_checkpoint_time: 1695419689
time_since_last_checkpoint: 231012 seconds
latest_start_position: 15, N/A, N/A
last_checkpoint_position: 1015668480, N/A, N/A
first_failure_position: N/A, N/A, N/A
checked_phase1: 315743757
checked_phase2: 231782
updated_phase1: 1429495
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
directories: 15911850
dirent_repaired: 0
linkea_repaired: 1429495
nlinks_repaired: 0
multiple_linked_checked: 1983873
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 0
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
linkea_overflow_cleared: 0
agent_entries_repaired: 0
success_count: 2
run_time_phase1: 82885 seconds
run_time_phase2: 1542 seconds
average_speed_phase1: 3809 items/sec
average_speed_phase2: 150 objs/sec
average_speed_total: 3742 items/sec
real_time_speed_phase1: N/A
real_time_speed_phase2: N/A
current_position: N/A
[root at hpfs-fsl-mds1 ~]#
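
For anyone who wants to post-process these counters, here is a minimal sketch that turns the "key: value" output above into a dictionary and cross-checks the reported phase 1 speed of the layout scan. The function name parse_lfsck_stats is made up for this example, and the sample text is a subset of the lfsck_layout values pasted above.

```python
def parse_lfsck_stats(text):
    """Parse 'key: value' lines into a dict, converting leading integers."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            continue  # blank lines and shell prompts carry no "key: value" pair
        value = value.strip()
        fields = value.split()
        # Values such as "82842 seconds" or "4459 items/sec" start with an integer.
        if fields and fields[0].isdigit():
            stats[key.strip()] = int(fields[0])
        else:
            stats[key.strip()] = value
    return stats


sample = """\
status: completed
checked_phase1: 369403698
run_time_phase1: 82842 seconds
average_speed_phase1: 4459 items/sec
repaired_orphan: 13489715
failed_phase2: 1798679
"""

stats = parse_lfsck_stats(sample)
# Items checked divided by run time reproduces the reported average speed.
print(stats["checked_phase1"] // stats["run_time_phase1"])  # 4459
# Phase 1 run time in hours.
print(round(stats["run_time_phase1"] / 3600, 1))  # 23.0
```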

And on a client, the resulting lost+found directory:

[root at hpfs-fsl-lmon0 MDT0000]# pwd
/scratch-lustre/.lustre/lost+found/MDT0000
[root at hpfs-fsl-lmon0 MDT0000]# time \ls | wc -l
13063930

real    15m20.604s
user    2m38.186s
sys     0m4.116s
[root at hpfs-fsl-lmon0 MDT0000]# \ls | head
[0x20000bdbe:0x1eae1:0x0]-R-0
[0x20000bdeb:0x12c3e:0x0]-R-0
[0x20001f801:0x296f:0x0]-R-0
[0x20001f801:0x57:0x0]-R-0
[0x20001f801:0x58:0x0]-R-0
[0x20001f805:0x1000:0x0]-R-0
[0x20001f805:0x100:0x0]-R-0
[0x20001f805:0x1001:0x0]-R-0
[0x20001f805:0x1002:0x0]-R-0
[0x20001f805:0x1003:0x0]-R-0
[root at hpfs-fsl-lmon0 MDT0000]# ls -l [0x20000bdbe:0x1eae1:0x0]-R-0
-r-------- 1 damocles_runner damocles 3162 Sep 24 14:45 [0x20000bdbe:0x1eae1:0x0]-R-0
[root at hpfs-fsl-lmon0 MDT0000]#
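
The entry names in the listing above can be split into their parts with a small parser. This is only a sketch: the pattern is inferred from the ten names shown, the function name parse_entry is hypothetical, and the thread does not explain the "R" and trailing "0" fields, so they are passed through as opaque values.

```python
import re

# Pattern inferred from the listing: "[seq:oid:ver]-<letters>-<number>".
ENTRY_RE = re.compile(
    r"^\[(0x[0-9a-f]+):(0x[0-9a-f]+):(0x[0-9a-f]+)\]-([A-Za-z]+)-(\d+)$"
)


def parse_entry(name):
    """Return the fields of a lost+found entry name, or None if it does not match."""
    m = ENTRY_RE.match(name)
    if m is None:
        return None
    seq, oid, ver = (int(g, 16) for g in m.group(1, 2, 3))
    return {
        "seq": seq,
        "oid": oid,
        "ver": ver,
        "infix": m.group(4),        # meaning not covered in this thread
        "suffix": int(m.group(5)),  # meaning not covered in this thread
    }


names = ["[0x20001f805:0x1000:0x0]-R-0", "[0x20001f805:0x100:0x0]-R-0"]
# Sorting on the numeric fields puts 0x100 before 0x1000, unlike the ls order.
names.sort(key=lambda n: (parse_entry(n)["seq"], parse_entry(n)["oid"]))
```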

From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss" <lustre-discuss at lists.lustre.org>
Reply-To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" <darby.vicker-1 at nasa.gov>
Date: Friday, September 22, 2023 at 2:49 PM
To: Andreas Dilger <adilger at whamcloud.com>
Cc: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: [BULK] Re: [lustre-discuss] [EXTERNAL] Re: Data recovery with lost MDT data

I’m only showing you the last 10 directories below but there are about 30 or 40 directories with a pretty uniform distribution between 6/20 and now.  If it was a situation where we had been rolled back to 6/20 but directories were starting to be updated again, there should be a big gap with no updates.  The rollback (when we deleted the “snapshot”) happened on Monday, 9/18.  We could do another snapshot of the MDT, mount it read only and poke around in there if you think that would help.  Actually, our backup process (which is running normally again) is doing just that.  It takes quite a long time to complete so there is opportunity for me to investigate.
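
The gap check described above can be sketched as follows: collect the modification times of the top-level directories and report the widest interval between consecutive ones. A rollback to 6/20 followed by renewed activity would show up as one very large gap. The function names are made up for this illustration.

```python
import os


def largest_gap(mtimes):
    """Return (gap_seconds, start, end) for the widest gap between sorted mtimes."""
    ordered = sorted(mtimes)
    best = (0, None, None)
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > best[0]:
            best = (later - earlier, earlier, later)
    return best


def directory_mtimes(top):
    """Modification times of the directories directly under top."""
    with os.scandir(top) as entries:
        return [
            e.stat(follow_symlinks=False).st_mtime
            for e in entries
            if e.is_dir(follow_symlinks=False)
        ]
```

Calling largest_gap(directory_mtimes("/ephemeral")) on a client would give the widest gap in seconds along with the two timestamps that bound it.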

From: Andreas Dilger <adilger at whamcloud.com>
Date: Friday, September 22, 2023 at 1:36 AM
To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" <darby.vicker-1 at nasa.gov>
Cc: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: Re: [EXTERNAL] Re: [lustre-discuss] Data recovery with lost MDT data

CAUTION: This email originated from outside of NASA.  Please take care when clicking links or opening attachments.  Use the "Report Message" button to report suspicious messages to the NASA SOC.

On Sep 21, 2023, at 16:06, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] <darby.vicker-1 at nasa.gov> wrote:

I knew an lfsck would identify the orphaned objects.  That’s great that it will move those objects to an area we can triage.  With ownership still intact (and I assume time stamps too), I think this will be helpful for at least some of the users to recover some of their data.  Thanks Andreas.

I do have another question.  Even with the MDT loss, the top level user directories on the file system are still showing current modification times.  I was a little surprised to see this – my expectation was that the most current time would be from the snapshot that we accidentally reverted to, 6/20/2023 in this case.  Does this make sense?

The timestamps of the directories are only stored on the MDT (unlike regular files, which keep the timestamp on both the MDT and OST).  Is it possible that users (or possibly recovered clients with existing mountpoints) have started to access the filesystem in the past few days since it was recovered, or an admin was doing something that would have caused the directories to be modified?

Is it possible you have a newer copy of the MDT than you thought?

[dvicker at dvicker ~]$ ls -lrt /ephemeral/ | tail
  4 drwx------     2 abjuarez               abjuarez             4096 Sep 12 13:24 abjuarez/
  4 drwxr-x---     2 ksmith29               ksmith29             4096 Sep 13 15:37 ksmith29/
  4 drwxr-xr-x    55 bjjohn10               bjjohn10             4096 Sep 13 16:36 bjjohn10/
  4 drwxrwx---     3 cbrownsc               ccp_fast             4096 Sep 14 12:27 cbrownsc/
  4 drwx------     3 fgholiza               fgholiza             4096 Sep 18 06:41 fgholiza/
  4 drwx------     5 mtfoste2               mtfoste2             4096 Sep 19 11:35 mtfoste2/
  4 drwx------     4 abenini                abenini              4096 Sep 19 15:33 abenini/
  4 drwx------     9 pdetremp               pdetremp             4096 Sep 19 16:49 pdetremp/
[dvicker at dvicker ~]$

From: Andreas Dilger <adilger at whamcloud.com>
Date: Thursday, September 21, 2023 at 2:33 PM
To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" <darby.vicker-1 at nasa.gov>
Cc: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: [EXTERNAL] Re: [lustre-discuss] Data recovery with lost MDT data

In the absence of backups, you could try LFSCK to link all of the orphan OST objects into .lustre/lost+found (see lctl-lfsck_start.8 man page for details).

The data is still in the objects, and they should have UID/GID/PRJID assigned (if used) but they have no filenames.  It would be up to you to make e.g. per-user lost+found directories in their home directories and move the files where they could access them and see if they want to keep or delete the files.

How easy/hard this is to do depends on whether the files have any content that can help identify them.
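
As one illustration of using file content for identification, the sketch below looks at the first bytes of a file and matches them against a few well-known format signatures. The signature list is a small hand-picked assumption, the function names are hypothetical, and the standard file(1) utility covers far more formats.

```python
# A few well-known leading byte signatures; extend as needed for local workflows.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\x89HDF\r\n\x1a\n", "hdf5"),
    (b"%PDF-", "pdf"),
    (b"\x7fELF", "elf"),
    (b"PK\x03\x04", "zip"),
    (b"\x1f\x8b", "gzip"),
]


def sniff(head):
    """Classify a file from its leading bytes."""
    if not head:
        return "empty"
    for magic, label in MAGIC:
        if head.startswith(magic):
            return label
    if b"\x00" in head:
        return "unknown"  # NUL bytes are a strong hint of binary data
    try:
        head.decode("utf-8")
    except UnicodeDecodeError as err:
        # A multi-byte character cut off at the read boundary is still text.
        if err.start < len(head) - 3:
            return "unknown"
    return "text"


def sniff_file(path, nbytes=512):
    """Read the first nbytes of path and classify them."""
    with open(path, "rb") as f:
        return sniff(f.read(nbytes))
```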

There was a Lustre hackathon project to save the Lustre JobID in a "user.job" xattr on every object, exactly to help identify the provenance of files after the fact (regardless of whether there is corruption), but it only just landed to master and will be in 2.16. That is cold comfort, but would help in the future.
Cheers, Andreas

On Sep 20, 2023, at 15:34, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
Hello,

We recently deleted some of our MDT data by accident.  I think it’s gone for good, but I’m looking for advice to see if there is any way to recover.  Thoughts appreciated.

We run two LFSs on the same set of hardware.  We didn’t set out to do this, but it kind of evolved.  The original setup was only a single filesystem and was all ZFS – MDT and OSTs.  Eventually, we had some small-file workflows that we wanted to get better performance on.  To address this, we stood up another filesystem on the same hardware and used an ldiskfs MDT.  However, since we were already using ZFS, under the hood the storage device we build the ldiskfs MDT on comes from ZFS.  That gets presented to the OS as /dev/zd0.  We do a nightly backup of the MDT by cloning the ZFS dataset (this creates /dev/zd16, for whatever reason), snapshotting the clone, mounting that as ldiskfs, tarring up the data and then destroying the snapshot and clone.  Well, occasionally this process gets interrupted, leaving the ZFS snapshot and clone hanging around.  This is where things go south.  Something happens that swaps the clone with the primary dataset.  ZFS says you’re working with the primary but it’s really the clone, and vice versa.  This happened about a year ago and we caught it, were able to “zfs promote” to swap them back, and moved on.  More details are on the ZFS list and this mailing list here.

https://zfsonlinux.topicbox.com/groups/zfs-discuss/Tcb8a3ef663db0031-M5a79e71768b20b2389efc4a4

http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2022-June/018154.html

It happened again earlier this week but we didn’t remember to check this and, in an effort to get the backups going again, destroyed what we thought were the snapshot and clone.  In reality, we destroyed the primary dataset.  Even more unfortunately, the stale “snapshot” was about 3 months old.  This stale snapshot was also preventing our MDT backups from running so we don’t have those to restore from either.  (I know, we need better monitoring and alerting on this; we learned that lesson the hard way.  We had it in place after the June 2022 incident, it just wasn’t working properly.)  So at the end of the day, the data lives on the OSTs; we just can’t access it due to the lost metadata.  Is there any chance at data recovery?  I don’t think so but want to explore any options.
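
A minimal sketch of the kind of monitoring check mentioned above, written against the tab-separated output of `zfs list -H -o name,origin` and `zfs list -H -p -t snapshot -o name,creation`. It assumes that a swapped clone shows up as a dataset whose origin property is set, and that any leftover backup snapshot older than a chosen threshold deserves an alert. The function names and the dataset names used in the comments are hypothetical.

```python
import time


def find_clones(listing):
    """Return (name, origin) for datasets that are clones.

    listing is the output of: zfs list -H -o name,origin
    Non-clone datasets report "-" as their origin.
    """
    clones = []
    for line in listing.splitlines():
        if not line.strip():
            continue
        name, origin = line.split("\t")
        if origin != "-":
            clones.append((name, origin))
    return clones


def stale_snapshots(listing, max_age_days, now=None):
    """Return (name, age_days) for snapshots older than max_age_days.

    listing is the output of: zfs list -H -p -t snapshot -o name,creation
    With -p the creation time is printed as seconds since the epoch.
    """
    now = time.time() if now is None else now
    stale = []
    for line in listing.splitlines():
        if not line.strip():
            continue
        name, creation = line.split("\t")
        age_days = (now - int(creation)) / 86400
        if age_days > max_age_days:
            stale.append((name, age_days))
    return stale
```

A nightly job could alert if the dataset backing the MDT (for example a hypothetical tank/mdt) ever appears in the find_clones result, or if stale_snapshots returns anything once the backup window has passed.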

Darby

_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud
