[Lustre-discuss] fsck of OST problems - endless loop restarting pass 1
Andreas Dilger
adilger at sun.com
Tue Dec 1 18:43:44 PST 2009
On 2009-12-01, at 19:01, Craig Prescott wrote:
> Andreas Dilger wrote:
>> I would start by simply trying to mount the OST filesystem with
>> ldiskfs directly (mount options "-o ro" to avoid any further
>> corruption or errors, and possibly also "noload" to avoid
>> recovering the journal), and seeing if you can copy out the data
>> from the filesystem into a backup filesystem, and then just
>> reformat the OST.
>
> Unfortunately, this did not work:
>
> [root at tebow2 ~]# mount -t ldiskfs -o ro /dev/F3P1L0/T2-F3P1L0 /mnt
> mount: wrong fs type, bad option, bad superblock on /dev/F3P1L0/T2-F3P1L0,
> missing codepage or other error
> In some cases useful info is found in syslog - try
> dmesg | tail or so
>
> In dmesg I see this:
>
> LDISKFS-fs error (device dm-7): ldiskfs_check_descriptors: Checksum for group 256 failed (18306!=0)
> LDISKFS-fs: group descriptors corrupted!
You may want to disable the group descriptor checksums with:
debugfs -w -R "feature ^uninit_bg" {dev}
(debugfs opens the device read-only unless -w is given) and then retry
the mount and/or e2fsck. For some reason this feature is making it
more difficult to use the backup descriptors.
>
> Adding "noload" to the options list did not change anything.
>
>> You should copy out the files with a tool that has xattr support,
>> like rsync v3, or the RHEL tar using the --xattrs option.
>> Failing that, you may be able to e2fsck using a backup superblock
>> and group descriptor with the "-B 4096 -b {blocknr}", where:
>> blocknr = 32768 * {3,5,7}^n
>> I don't think the first backup group descriptor is valid (that
>> would be n=0 above, or 32768), so you could try (at random) 32768 *
>> 3^2 = 294912.
>
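To make the arithmetic above concrete, here is a minimal sketch that lists the candidate backup superblock locations. It assumes 4096-byte blocks (so 32768 blocks per group) and the usual sparse_super layout with backups in group 1 and in groups that are powers of 3, 5, and 7; the function name and the max_power cutoff are only illustrative.

```python
def backup_superblock_blocks(blocks_per_group=32768, max_power=2):
    """Return sorted candidate block numbers for backup superblocks.

    Assumes backups live in groups 3^n, 5^n and 7^n (n = 0 gives group 1).
    Candidates beyond the end of the device must be discarded by the caller.
    """
    groups = set()
    for base in (3, 5, 7):
        for n in range(max_power + 1):
            groups.add(base ** n)
    return [g * blocks_per_group for g in sorted(groups)]


if __name__ == "__main__":
    for blocknr in backup_superblock_blocks():
        # Each value is a candidate for "e2fsck -B 4096 -b {blocknr}".
        print(blocknr)
```

With the defaults this lists seven candidates, from 32768 (group 1) up to 1605632 (group 49), and includes the 294912 mentioned above.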
> I tried fsck from the 1.41.6 Lustre package with the '-p'
> option with several values of n and all three values {3,5,7}.
> Nearly all attempts look like this one - the same block is
> complained about *almost* every time:
>
> [root at tebow2 ~]# fsck -b 294912 -B 4096 -f -p /dev/F3P1L0/T2-F3P1L0
> fsck 1.41.6.sun1 (30-May-2009)
> crn-OST0011: Block bitmap for group 6016 is not in group. (block 484237063)
>
> FWIW, particular groups seem to get complained about: 6016 and
> 10112.
>
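As a rough sanity check on the "not in group" message, the sketch below works out which block range a group covers and which group a given block falls in. It assumes 32768 blocks per group, a first data block of 0, and no flex_bg-style placement of bitmaps outside their own group; the helper names are hypothetical.

```python
BLOCKS_PER_GROUP = 32768  # assumption: 4096-byte blocks, one bitmap block per group


def group_block_range(group, blocks_per_group=BLOCKS_PER_GROUP):
    """Return (first_block, last_block) covered by the given group."""
    first = group * blocks_per_group
    return first, first + blocks_per_group - 1


def group_of_block(block, blocks_per_group=BLOCKS_PER_GROUP):
    """Return the group number that contains the given block."""
    return block // blocks_per_group


if __name__ == "__main__":
    first, last = group_block_range(6016)
    print("group 6016 covers blocks", first, "to", last)
    print("block 484237063 falls in group", group_of_block(484237063))
```

Under these assumptions group 6016 covers blocks 197132288 through 197165055, while block 484237063 would fall in group 14777. That is consistent with e2fsck rejecting the bitmap location, although it does not say whether the descriptor itself or the chosen backup is at fault.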
> However, with n=1 and 7 as the multiplier, the fsck -p output was a
> bit different (different block, zeroed some checksums for group
> descriptors). I am trying an fsck with that superblock and "-y" now.
>
>> If you can get it mounted at all you should copy the data out. If
>> you have a very new kernel you may be able to mount the filesystem
>> with ext4 (so that you don't need to re-create the journal) to copy
>> the data out.
>> For the objects in the lost+found directory
>> ll_recover_lost_found_objs will "rescue" all of these objects and
>> put them back into the right directory structure for Lustre to find
>> them again.
>
> Hopefully we can get it mounted and rescue the data.
>
> We appreciate your help.
>
> Thanks,
> Craig Prescott
> UF HPC Center
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.