[Lustre-discuss] fsck of OST problems - endless loop restarting pass 1

Craig Prescott prescott at hpc.ufl.edu
Tue Dec 1 18:01:16 PST 2009

Thanks for the reply, Andreas.

Andreas Dilger wrote:
> I would start by simply trying to mount the OST filesystem with ldiskfs 
> directly (mount options "-o ro" to avoid any further corruption or 
> errors, and possibly also "noload" to avoid recovering the journal), and 
> seeing if you can copy out the data from the filesystem into a backup 
> filesystem, and then just reformat the OST.

Unfortunately, this did not work:

[root@tebow2 ~]# mount -t ldiskfs -o ro /dev/F3P1L0/T2-F3P1L0 /mnt
mount: wrong fs type, bad option, bad superblock on /dev/F3P1L0/T2-F3P1L0,
        missing codepage or other error
        In some cases useful info is found in syslog - try
        dmesg | tail  or so

In dmesg I see this:

LDISKFS-fs error (device dm-7): ldiskfs_check_descriptors: Checksum for group 256 failed (18306!=0)

LDISKFS-fs: group descriptors corrupted!

Adding "noload" to the options list did not change anything.

> You should copy out the files with a tool that has xattr support, like 
> rsync v3, or the RHEL tar using the --xattr option.
> Failing that, you may be able to e2fsck using a backup superblock and 
> group descriptor with the "-B 4096 -b {blocknr}", where:
> blocknr = 32768 * {3,5,7}^n
> I don't think the first backup group descriptor is valid (that would be 
> n=0 above, or 32768), so you could try (at random) 32768 * 3^2 = 294912.
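
For reference, the candidate block numbers from the formula above can be listed with a small script. This is only a sketch of the arithmetic: the function name and the max_n cutoff are my own choices for illustration, and it assumes 32768 blocks per group as in the formula.

```python
def backup_superblock_candidates(blocks_per_group=32768, bases=(3, 5, 7), max_n=3):
    """Return sorted block numbers of the form blocks_per_group * base**n.

    blocks_per_group=32768 is taken from the formula quoted above;
    max_n is an arbitrary cutoff chosen for illustration.
    """
    candidates = set()
    for base in bases:
        for n in range(max_n + 1):
            # n = 0 gives 32768 for every base, so a set removes the duplicates
            candidates.add(blocks_per_group * base ** n)
    return sorted(candidates)


if __name__ == "__main__":
    for blocknr in backup_superblock_candidates():
        print(blocknr)
```

With max_n=1 this yields 32768, 98304, 163840 and 229376; the value 294912 suggested above is 32768 * 3^2. Each value would be passed to fsck as "-b {blocknr}" together with "-B 4096".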

I tried fsck from the 1.41.6 Lustre package with the '-p' option, 
using several values of n and all three multipliers {3,5,7}.  Nearly 
all attempts look like this one, with the same block reported 
*almost* every time:

[root@tebow2 ~]# fsck -b 294912 -B 4096 -f -p /dev/F3P1L0/T2-F3P1L0
fsck 1.41.6.sun1 (30-May-2009)
crn-OST0011: Block bitmap for group 6016 is not in group.  (block 484237063)

FWIW, the same two groups get complained about each time: 6016 and 10112.
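
To see why fsck says the bitmap block is "not in group", here is a rough sketch of the arithmetic. It assumes 32768 blocks per group (the factor in the backup superblock formula, with 4096-byte blocks) and that group g covers blocks g*32768 through (g+1)*32768 - 1; the helper names are made up for illustration.

```python
BLOCKS_PER_GROUP = 32768  # assumption: 4096-byte blocks, 32768 blocks per group


def group_of_block(blocknr, blocks_per_group=BLOCKS_PER_GROUP):
    """Return the index of the block group that contains blocknr."""
    return blocknr // blocks_per_group


def group_block_range(group, blocks_per_group=BLOCKS_PER_GROUP):
    """Return (first_block, last_block) covered by the given group."""
    first = group * blocks_per_group
    return first, first + blocks_per_group - 1


if __name__ == "__main__":
    print(group_block_range(6016))    # block range covered by group 6016
    print(group_of_block(484237063))  # group that the reported block falls in
```

Under those assumptions, group 6016 covers blocks 197132288 through 197165055, while block 484237063 falls in group 14777, so the descriptor being read for group 6016 points far outside that group.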

However, with n=1 and 7 as the multiplier, the fsck -p output was a bit 
different (different block, zeroed some checksums for group descriptors). 
I am trying an fsck with that superblock and "-y" now.

> If you can get it mounted at all you should copy the data out.  If you 
> have a very new kernel you may be able to mount the filesystem with ext4 
> (so that you don't need to re-create the journal) to copy the data out.
> For the objects in the lost+found directory ll_recover_lost_found_objs 
> will "rescue" all of these objects and put them back into the right 
> directory structure for Lustre to find them again.

Hopefully we can get it mounted and rescue the data.

We appreciate your help.

Craig Prescott
UF HPC Center

