<div dir="ltr">We did run "tunefs.lustre --writeconf <dev>" for all the MDT and OST partitions in each file servers. But after that, trying to mount MDT resulted that error message. Note that doing tunefs.lustre --writeconf as the final step to upgrade Lustre file system was followed the instructions in Lustre documentation.<div><br></div><div>After a lot of struggle, we finally cure this problem by the following procedure. This was based on two assumptions:</div><div>1. During upgrade, somehow some important config files in MDT were not generated and missing.</div><div>2. During upgrade, somehow some important config files in MDT were corrupted, but hopefully they could be regenerated.</div><div><br></div><div>Therefore, we followed the procedure to move data in MDT to another device in the ldiskfs mount. Since this procedure involves rsync all the data in the original MDT to the newly created MDT partition, we hope that if any one of the above assumptions stands, hopefully we could get back the missing / corrupted config by this way. The procedure exactly follows the Lustre documentation of manually backing up MDT to another device.</div><div>1. Mount the broken MDT to /mnt via ldiskfs.</div><div>2. Find an empty partition, use mkfs.lustre to create a new MDT with ldiskfs backend, with the same file system name and the same index as the broken MDT. Then we mount it to /mnt2 via ldiskfs.</div><div>3. Use getfattr to extract the extended file attributes of all files in /mnt.</div><div>4. Use rsync -av --sparse to backup everything from /mnt to /mnt2.</div><div>5. Restore the extended file attributes of all files in /mnt2 by setfattr.</div><div>6. Remove the log files in /mnt2/, i.e., rm -rf oi.16* lfsck_* LFSCK CATALOG</div><div><br></div><div>Then we umount /mnt and /mnt2, trying to mount the newly created MTD. The error message told that the index 0 was already assigned to the original MDT. We should run tunefs.lustre --writeconf to clear it again. After running tunefs.lustre,</div><div>we were very lucky to mount MDT back.</div><div><br></div><div>Now we have recovered the whole Lustre file system. But I still quite worry that there might still have potential issues, since I am not sure whether I did it correctly to solve this problem. So I am still watching the system closely. If we were really so lucky that this problem was cured by this way, then probably it could provide an opportunity of rescuing a broken MDT if unfortunately we could not find any solutions.</div><div><br></div><div>Best Regards,</div><div><br></div><div>T.H.Hsieh</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Mohr, Rick <<a href="mailto:mohrrf@ornl.gov">mohrrf@ornl.gov</a>> 於 2023年9月26日 週二 下午2:08寫道:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Typically after an upgrade you do not need to perform a writeconf. Did you perform the writeconf only on the MDT? If so, that could be your problem. When you do a writeconf to regenerate the lustre logs, you need to follow the whole procedure listed in the lustre manual. You can try that to see if it fixes your issue.<br>
Best Regards,

T.H.Hsieh


On Tue, Sep 26, 2023 at 2:08 PM, Mohr, Rick <mohrrf@ornl.gov> wrote:

Typically after an upgrade you do not need to perform a writeconf. Did you perform the writeconf only on the MDT? If so, that could be your problem. When you do a writeconf to regenerate the lustre logs, you need to follow the whole procedure listed in the lustre manual. You can try that to see if it fixes your issue.

--Rick
On 9/23/23, 2:22 PM, "lustre-discuss on behalf of Tung-Han Hsieh via lustre-discuss" <lustre-discuss-bounces@lists.lustre.org on behalf of lustre-discuss@lists.lustre.org> wrote:

Dear All,
Today we tried to upgrade our Lustre file system from version 2.12.6 to 2.15.3, but after the work we could not mount the MDT successfully. Our MDT uses the ldiskfs backend. The upgrade procedure was:
1. Install the new version of e2fsprogs-1.47.0
2. Install Lustre-2.15.3
3. After reboot, run: tunefs.lustre --writeconf /dev/md0
Then when mounting the MDT, we got the following error messages in dmesg:
===========================================================
[11662.434724] LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[11662.584593] Lustre: 3440:0:(scrub.c:189:scrub_file_load()) chome-MDT0000: reset scrub OI count for format change (LU-16655)
[11666.036253] Lustre: MGS: Logs for fs chome were removed by user request. All servers must be restarted in order to regenerate the logs: rc = 0
[11666.523144] Lustre: chome-MDT0000: Imperative Recovery not enabled, recovery window 300-900
[11666.594098] LustreError: 3440:0:(mdd_device.c:1355:mdd_prepare()) chome-MDD0000: get default LMV of root failed: rc = -2
[11666.594291] LustreError: 3440:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -2
[11666.594951] Lustre: Failing over chome-MDT0000
[11672.868438] Lustre: 3440:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1695492248/real 1695492248] req@000000005dfd9b53 x1777852464760768/t0(0) o251->MGC192.168.32.240@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1695492254 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
[11672.925905] Lustre: server umount chome-MDT0000 complete
[11672.926036] LustreError: 3440:0:(super25.c:183:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -2
[11872.893970] LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
============================================================
Could anyone help to solve this problem? Sorry, it is really urgent.

Thank you very much.

T.H.Hsieh