[lustre-discuss] [EXTERNAL] Cannot mount MDT after upgrading from Lustre 2.12.6 to 2.15.3

Tung-Han Hsieh tunghan.hsieh at gmail.com
Tue Sep 26 06:41:37 PDT 2023


We did run "tunefs.lustre --writeconf <dev>" for all the MDT and OST
partitions on each file server. But after that, trying to mount the MDT
resulted in that error message. Note that running tunefs.lustre --writeconf
as the final step of the upgrade followed the instructions in the Lustre
documentation.
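
For reference, the writeconf pass looked roughly like the following. This
is only a sketch: /dev/md0 is our MDT device (as in the original report
quoted below), the OST device names are placeholders for our actual
partitions, and all targets were unmounted first.

    # On the MDS, regenerate the configuration logs for the MDT
    tunefs.lustre --writeconf /dev/md0

    # On each OSS, repeat for every OST partition (example devices)
    tunefs.lustre --writeconf /dev/sdb
    tunefs.lustre --writeconf /dev/sdc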

After a lot of struggle, we finally cured this problem with the following
procedure. It was based on two assumptions:
1. During the upgrade, some important config files in the MDT were somehow
not generated and are missing.
2. During the upgrade, some important config files in the MDT were somehow
corrupted, but hopefully they could be regenerated.

Therefore, we followed the procedure of moving the data in the MDT to
another device through an ldiskfs mount. Since this procedure rsyncs all of
the data in the original MDT to a newly created MDT partition, we hoped
that if either of the above assumptions held, we could get back the
missing / corrupted config this way. The procedure exactly follows the
Lustre documentation for manually backing up an MDT to another device
(a rough command sketch is given right after the list):
1. Mount the broken MDT to /mnt via ldiskfs.
2. Find an empty partition and use mkfs.lustre to create a new MDT with an
ldiskfs backend, with the same file system name and the same index as the
broken MDT. Then mount it to /mnt2 via ldiskfs.
3. Use getfattr to extract the extended file attributes of all files in
/mnt.
4. Use rsync -av --sparse to copy everything from /mnt to /mnt2.
5. Restore the extended file attributes of all files in /mnt2 with setfattr.
6. Remove the OI and LFSCK files in /mnt2, i.e., rm -rf oi.16* lfsck_* LFSCK CATALOG
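
Roughly, the commands for the whole procedure were as follows. This is only
a sketch of what we did: /dev/md0 is our broken MDT, /dev/sdd1 stands in
for the empty partition, the backup file name is arbitrary, and the
getfattr/setfattr and cleanup steps are the ones from the file-level MDT
backup/restore section of the Lustre manual.

    # 1. Mount the broken MDT via ldiskfs
    mount -t ldiskfs /dev/md0 /mnt

    # 2. Create a new MDT with the same fsname and index, then mount it
    #    (--mgs only because the MGS is combined with the MDT in our setup)
    mkfs.lustre --mdt --mgs --fsname=chome --index=0 /dev/sdd1
    mount -t ldiskfs /dev/sdd1 /mnt2

    # 3. Save the extended attributes of all files on the old MDT
    cd /mnt
    getfattr -R -d -m '.*' -e hex -P . > /tmp/mdt_ea.bak

    # 4. Copy everything to the new MDT
    rsync -av --sparse /mnt/ /mnt2/

    # 5. Restore the extended attributes on the new MDT
    cd /mnt2
    setfattr --restore=/tmp/mdt_ea.bak

    # 6. Remove the OI / LFSCK state so it is regenerated on next mount
    rm -rf oi.16* lfsck_* LFSCK CATALOG

    cd /
    umount /mnt /mnt2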

Then we unmounted /mnt and /mnt2 and tried to mount the newly created MDT.
The error message said that index 0 was already assigned to the original
MDT, and that we should run tunefs.lustre --writeconf to clear it again.
After running tunefs.lustre, we were very lucky to get the MDT mounted
back.
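
That final step was essentially a second writeconf on the new device
followed by a normal Lustre mount; /dev/sdd1 and the mount point below are
again just placeholders for our actual ones:

    tunefs.lustre --writeconf /dev/sdd1
    mount -t lustre /dev/sdd1 /lustre/chome-mdt0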

Now we have recovered the whole Lustre file system. But I still worry that
there might be lingering issues, since I am not sure whether this was the
correct way to solve the problem, so I am still watching the system
closely. If we really were lucky enough to cure the problem this way, then
perhaps this procedure offers a way of rescuing a broken MDT when no other
solution can be found.

Best Regards,

T.H.Hsieh

Mohr, Rick <mohrrf at ornl.gov> wrote on Tue, Sep 26, 2023 at 2:08 PM:

> Typically after an upgrade you do not need to perform a writeconf.  Did
> you perform the writeconf only on the MDT?  If so, that could be your
> problem.  When you do a writeconf to regenerate the lustre logs, you need
> to follow the whole procedure listed in the lustre manual.  You can try
> that to see if it fixes your issue.
>
> --Rick
>
> On 9/23/23, 2:22 PM, "lustre-discuss on behalf of Tung-Han Hsieh via
> lustre-discuss" <lustre-discuss-bounces at lists.lustre.org on behalf of
> lustre-discuss at lists.lustre.org> wrote:
>
>
> Dear All,
>
>
> Today we tried to upgrade our Lustre file system from version 2.12.6 to
> 2.15.3. But after the work, we cannot mount the MDT successfully. Our MDT
> has an ldiskfs backend. The upgrade procedure was:
>
>
> 1. Install the new version of e2fsprogs-1.47.0
> 2. Install Lustre-2.15.3
> 3. After reboot, run: tunefs.lustre --writeconf /dev/md0
>
>
> Then when mounting the MDT, we got the following error messages in dmesg:
>
>
> ===========================================================
> [11662.434724] LDISKFS-fs (md0): mounted filesystem with ordered data
> mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
> [11662.584593] Lustre: 3440:0:(scrub.c:189:scrub_file_load())
> chome-MDT0000: reset scrub OI count for format change (LU-16655)
> [11666.036253] Lustre: MGS: Logs for fs chome were removed by user
> request. All servers must be restarted in order to regenerate the logs: rc
> = 0
> [11666.523144] Lustre: chome-MDT0000: Imperative Recovery not enabled,
> recovery window 300-900
> [11666.594098] LustreError: 3440:0:(mdd_device.c:1355:mdd_prepare())
> chome-MDD0000: get default LMV of root failed: rc = -2
> [11666.594291] LustreError:
> 3440:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start
> targets: -2
> [11666.594951] Lustre: Failing over chome-MDT0000
> [11672.868438] Lustre: 3440:0:(client.c:2295:ptlrpc_expire_one_request())
> @@@ Request sent has timed out for slow reply: [sent 1695492248/real
> 1695492248] req at 000000005dfd9b53 x1777852464760768/t0(0)
> o251->MGC192.168.32.240 at o2ib@0 at lo:26/25 lens 224/224 e 0 to 1 dl
> 1695492254 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
> [11672.925905] Lustre: server umount chome-MDT0000 complete
> [11672.926036] LustreError: 3440:0:(super25.c:183:lustre_fill_super())
> llite: Unable to mount <unknown>: rc = -2
> [11872.893970] LDISKFS-fs (md0): mounted filesystem with ordered data
> mode. Opts: (null)
>
>
> ============================================================
>
>
> Could anyone help solve this problem? Sorry, it is really urgent.
>
>
> Thank you very much.
>
>
> T.H.Hsieh