[Lustre-discuss] fsck of OST problems - endless loop restarting pass 1

Craig Prescott prescott at hpc.ufl.edu
Tue Dec 1 18:01:16 PST 2009

Thanks for the reply, Andreas.

Andreas Dilger wrote:
> I would start by simply trying to mount the OST filesystem with ldiskfs 
> directly (mount options "-o ro" to avoid any further corruption or 
> errors, and possibly also "noload" to avoid recovering the journal), and 
> seeing if you can copy out the data from the filesystem into a backup 
> filesystem, and then just reformat the OST.

Unfortunately, this did not work:

[root@tebow2 ~]# mount -t ldiskfs -o ro /dev/F3P1L0/T2-F3P1L0 /mnt
mount: wrong fs type, bad option, bad superblock on /dev/F3P1L0/T2-F3P1L0,
        missing codepage or other error
        In some cases useful info is found in syslog - try
        dmesg | tail  or so

In dmesg I see this:

LDISKFS-fs error (device dm-7): ldiskfs_check_descriptors: Checksum for group 256 failed (18306!=0)

LDISKFS-fs: group descriptors corrupted!

Adding "noload" to the options list did not change anything.

> You should copy out the files with a tool that has xattr support, like 
> rsync v3, or the RHEL tar using the --xattr option.
> Failing that, you may be able to e2fsck using a backup superblock and 
> group descriptor with the "-B 4096 -b {blocknr}", where:
> blocknr = 32768 * {3,5,7}^n
> I don't think the first backup group descriptor is valid (that would be 
> n=0 above, or 32768), so you could try (at random) 32768 * 3^2 = 294912.
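
For anyone following along, the candidate block numbers for "-b" can be
listed with a short script. This is only a sketch: it assumes 4096-byte
blocks with 32768 blocks per group (as in the formula above) and the
usual sparse_super layout, where backups sit in group 1 and in groups
that are powers of 3, 5 and 7. The function name and the group count
passed to it are hypothetical, not values read from this OST.

```python
# Sketch: list candidate backup superblock locations for e2fsck -b.
# Assumes 4096-byte blocks, 32768 blocks per group, sparse_super layout.
BLOCKS_PER_GROUP = 32768


def backup_superblock_candidates(total_groups, blocks_per_group=BLOCKS_PER_GROUP):
    """Return block numbers of backup superblocks below total_groups groups."""
    groups = {1}  # n = 0 for every base: group 1, block 32768
    for base in (3, 5, 7):
        group = base
        while group < total_groups:
            groups.add(group)
            group *= base
    return [group * blocks_per_group for group in sorted(groups)]


if __name__ == "__main__":
    # 10 groups is an illustrative value only; a real OST has far more.
    print(backup_superblock_candidates(10))
    # prints [32768, 98304, 163840, 229376, 294912]
```

The value 294912 suggested above is the 3^2 entry, and 229376 is the
7^1 entry.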

I tried fsck from the 1.41.6 Lustre package with the '-p' option, using 
several values of n and all three multipliers {3,5,7}.  Nearly all 
attempts look like this one - the same block is complained about 
*almost* every time:

[root@tebow2 ~]# fsck -b 294912 -B 4096 -f -p /dev/F3P1L0/T2-F3P1L0
fsck 1.41.6.sun1 (30-May-2009)
crn-OST0011: Block bitmap for group 6016 is not in group.  (block 484237063)

FWIW, only particular groups seem to get complained about: 6016 and 10112.
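
To put the reported block in context, here is a rough sketch of the
arithmetic behind a "not in group" complaint. It assumes 32768 blocks
per group, a first data block of 0 (the usual case for 4096-byte
blocks), and no flex_bg; the helper names are hypothetical.

```python
# Sketch: which blocks belong to a group, and which group a block falls in.
# Assumes 32768 blocks per group and first data block 0 (4096-byte blocks).
BLOCKS_PER_GROUP = 32768


def group_block_range(group, blocks_per_group=BLOCKS_PER_GROUP):
    """Return (first_block, last_block) covered by the given group."""
    first = group * blocks_per_group
    return first, first + blocks_per_group - 1


def group_of_block(block, blocks_per_group=BLOCKS_PER_GROUP):
    """Return the group number that contains the given block."""
    return block // blocks_per_group


if __name__ == "__main__":
    print(group_block_range(6016))    # prints (197132288, 197165055)
    print(group_of_block(484237063))  # prints 14777
```

Under those assumptions, block 484237063 lies well outside the range
for group 6016, which is consistent with the descriptor for that group
being damaged rather than slightly off.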

However, with n=1 and 7 as the multiplier, the fsck -p output was a bit 
different (different block, zeroed some checksums for group descriptors) 
- am trying an fsck with that superblock and "-y" now.

> If you can get it mounted at all you should copy the data out.  If you 
> have a very new kernel you may be able to mount the filesystem with ext4 
> (so that you don't need to re-create the journal) to copy the data out.
> For the objects in the lost+found directory ll_recover_lost_found_objs 
> will "rescue" all of these objects and put them back into the right 
> directory structure for Lustre to find them again.

Hopefully we can get it mounted and rescue the data.

We appreciate your help.

Craig Prescott
UF HPC Center
