[lustre-discuss] OST not recognized as lustre volume. - group descriptors corrupted

Scott Wood woodystrash at hotmail.com
Sat May 7 02:04:56 PDT 2022

Hey folks,

We had a power "incident".  Not sure if it was the cause of our issues or if it just brought previous issues to light.  We're a CentOS 7, Lustre 2.10.6-1.el7 (from the provided binaries) site, with SAS direct-connect, HA-paired OSSs running Pacemaker to manage failover.  Standard stuff.  After the power incident, some OSTs were dropped and remounted, but one did not come back.  At this point, that OST does not seem to be recognized as a Lustre volume.

First step I took was to disable the pacemaker resource and try to mount it manually to see how it was doing:

[root@hpcoss02 ~]# mount -t lustre /dev/mapper/mpathg /mnt/OST78
mount.lustre: /dev/mapper/mpathg has not been formatted with mkfs.lustre or the backend filesystem type is not supported by this tool

The syslog shows the following at that time (syslog is from a subsequent attempt but logs match):
May 07 12:04:15 hpcoss02.adqimr.ad.lan kernel: LDISKFS-fs (dm-6): ldiskfs_check_descriptors: Checksum for group 192 failed (39981!=25867)
May 07 12:04:15 hpcoss02.adqimr.ad.lan kernel: LDISKFS-fs (dm-6): group descriptors corrupted!
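The two numbers in the checksum line are printed in decimal, whereas e2fsck reports group descriptor checksums in hex.  A minimal sketch for converting them so the two can be compared directly; "to_hex16" is a hypothetical helper name, not part of any Lustre or e2fsprogs tool:

```shell
#!/bin/sh
# Convert a decimal checksum from the kernel log into the 16-bit hex
# form that e2fsck prints. to_hex16 is a hypothetical helper name.
to_hex16() {
    printf '0x%04x\n' "$1"
}

# Values taken from the ldiskfs_check_descriptors line above.
to_hex16 39981   # prints 0x9c2d
to_hex16 25867   # prints 0x650b
```

0x9c2d and 0x650b are the same pair e2fsck reports for group 192 later in this message, so the kernel and e2fsck appear to be complaining about the same descriptor.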

No fun.  Next attempt was to try mounting ldiskfs in case a journal replay would help:

[root@hpcoss02 ~]# mount -t ldiskfs /dev/mapper/mpathg /mnt/OST78
mount: wrong fs type, bad option, bad superblock on /dev/mapper/mpathg,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

Unhappy chappie.  This OST has been up and running happily for months, so it knows it's part of the stack.  I ran a "tune2fs -l" against it and, from the "Filesystem magic number:  0xEF53", the superblock at least still reads as a valid ext4/ldiskfs one (that magic number is the generic ext2/3/4 value rather than anything Lustre-specific).  I also ran an "e2fsck -n" against it, and it does not look good.  I'll spare you the full output, but I'm happy to answer specifics if you have ideas about what to look for.  stdout went to "out" and stderr went to "err".  "out" shows the following:

[root@hpcoss01 fsck]# grep "Group descriptor.*checksum is.*, should be" out |head -n1
Group descriptor 192 checksum is 0x650b, should be 0x9c2d.  IGNORED.
[root@hpcoss01 fsck]# grep "Group descriptor.*checksum is.*, should be" out |wc -l
[root@hpcoss01 fsck]# grep "Free blocks count wrong for group" out |head -n1
Free blocks count wrong for group #192 (32768, counted=0).
[root@hpcoss01 fsck]# grep "Free blocks count wrong for group" out |wc -l
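
The greps above can be rolled into one pass over the saved stdout.  This is only a sketch, assuming the message formats shown above; "summarize_fsck" is a hypothetical helper name, and it reports the first and last group in file order rather than a true min/max:

```shell
#!/bin/sh
# Summarize group-level findings in saved e2fsck stdout.
# summarize_fsck is a hypothetical helper; call it as: summarize_fsck out
summarize_fsck() {
    awk '
        # e.g. "Group descriptor 192 checksum is 0x650b, should be 0x9c2d."
        /^Group descriptor [0-9]+ checksum is/ {
            csum++
            g = $3 + 0
            if (!seen) { first = g; seen = 1 }
            last = g
        }
        # e.g. "Free blocks count wrong for group #192 (32768, counted=0)."
        /^Free blocks count wrong for group #[0-9]+/ { free++ }
        END {
            printf "checksum mismatches: %d\n", csum
            if (seen) printf "first group: %d, last group: %d\n", first, last
            printf "free-block mismatches: %d\n", free
        }
    ' "$1"
}
```

Knowing whether the bad descriptors form one contiguous run of groups or are scattered across the device may help narrow down what was damaged.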

"err" showed quota issues, but we're not too concerned about them as we don't enforce quotas.

We are currently replicating the OST's block device to a logical volume so we can run non-destructive tests against an LVM snapshot and see what we get (thanks @stu for the suggestion).  We're also running an "lfs find mountpoint -obd lustre-OST004e" to get a list of the files that could be lost.  Once we have a usable copy of the OST, we intend to run "e2fsck -fy" on the snapshot to see if the objects come back or go to lost+found.  If they go to lost+found, we're considering replicating the MDT and MGT in a sandbox, mounting them and the OST, and "lfsck"ing the OST to see if the MDT knows how to move the lost objects out of lost+found to their happy places.
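
The copy, snapshot and fsck sequence could be sketched roughly as below.  The script only prints the commands for review and runs nothing.  Apart from the source device, every name and size here (vg_scratch, ost78_copy, ost78_snap, 200G) is a hypothetical placeholder, not our actual layout:

```shell
#!/bin/sh
# Print (do not run) a copy -> snapshot -> fsck plan for review.
# Everything except SRC is a hypothetical placeholder.
SRC=${SRC:-/dev/mapper/mpathg}    # OST device from this message
VG=${VG:-vg_scratch}              # hypothetical volume group
LV=${LV:-ost78_copy}              # hypothetical LV holding the raw copy
SNAP=${SNAP:-ost78_snap}          # hypothetical snapshot name
SNAP_SIZE=${SNAP_SIZE:-200G}      # hypothetical copy-on-write space

print_plan() {
    # 1. Raw copy of the OST onto the logical volume.
    echo "dd if=$SRC of=/dev/$VG/$LV bs=4M"
    # 2. Writable snapshot, so the raw copy itself stays untouched.
    echo "lvcreate --snapshot --name $SNAP --size $SNAP_SIZE /dev/$VG/$LV"
    # 3. Read-only check first, then the repairing pass on the snapshot.
    echo "e2fsck -fn /dev/$VG/$SNAP"
    echo "e2fsck -fy /dev/$VG/$SNAP"
}

print_plan
```

Running the repairing pass only against the snapshot means a bad outcome can be thrown away by removing the snapshot and taking a fresh one, without repeating the long raw copy.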

Are there any other troubleshooting steps we can take while we wait for the OST block device to be copied (that'll take a bit) for our test e2fsck?  Is there any output from the "tune2fs -l" or "e2fsck" that we can provide that could shed any light on the issue and provide possible solutions? Any other tips and tricks?  Thanks in advance for any insight.

