[Lustre-discuss] fsck of OST problems - endless loop restarting pass 1
prescott at hpc.ufl.edu
Thu Dec 3 09:27:08 PST 2009
Craig Prescott wrote:
> Andreas Dilger wrote:
>> Hmm, the code shouldn't be checking the checksums if the uninit_bg
>> feature is not enabled. I believe this was fixed in ext4 already:
>> in ldiskfs_group_desc_csum_verify() change it to be:
>> int ldiskfs_group_desc_csum_verify(struct ext4_sb_info *sbi,
>>                                    __u32 block_group,
>>                                    struct ext4_group_desc *gdp)
>> {
>>         if ((sbi->s_es->s_feature_ro_compat &
>>              cpu_to_le32(LDISKFS_FEATURE_RO_COMPAT_GDT_CSUM)) &&
>>             (gdp->bg_checksum != ldiskfs_group_desc_csum(sbi,
>>                                         block_group, gdp)))
>>                 return 0;
>>
>>         return 1;
>> }
> Ok, thanks. I'll try that.
> Again, I really appreciate the help, and will let the list know how it
Sadly, we didn't have any luck with this. We had written off the OST in
our minds anyway, so getting any of the data back would have been a windfall.
The OST wouldn't mount as ldiskfs with the group descriptor checksum check disabled:
Dec 3 10:58:05 tebow2 kernel: LDISKFS-fs error (device dm-7):
ldiskfs_check_descriptors: Block bitmap for group 10112 not in group (block
Dec 3 10:58:05 tebow2 kernel: LDISKFS-fs: group descriptors corrupted!
Disabling that check and trying to mount yielded this one:
Dec 3 11:01:13 tebow2 kernel: LDISKFS-fs error (device dm-7):
ldiskfs_check_descriptors: Inode bitmap for group 10112 not in group (block
Dec 3 11:01:13 tebow2 kernel: LDISKFS-fs: group descriptors corrupted!
Disabling that check yielded this one:
Dec 3 11:01:59 tebow2 kernel: LDISKFS-fs error (device dm-7):
ldiskfs_check_descriptors: Inode table for group 10112 not in group (block
Dec 3 11:01:59 tebow2 kernel: LDISKFS-fs: group descriptors corrupted!
All these messages were seen repeatedly in our fsck attempts. If we had
been able to get past this group, several thousand more would have followed.
Disabling the inode-table-in-group check yielded this one:
Dec 3 11:02:35 tebow2 kernel: ldiskfs: No journal on filesystem on dm-7
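For anyone who hits the same messages: as far as we understand it, the three checks we disabled are per-group range tests of roughly the following shape. This is a simplified userspace sketch, not the ldiskfs source. The demo_* names and struct layouts are made up for illustration, and it assumes a layout in which each group's bitmaps and inode table live inside the group itself.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for a group descriptor. */
struct demo_group_desc {
    uint64_t block_bitmap;  /* block number of the block bitmap */
    uint64_t inode_bitmap;  /* block number of the inode bitmap */
    uint64_t inode_table;   /* first block of the inode table */
};

/* Hypothetical, simplified stand-in for the superblock geometry. */
struct demo_layout {
    uint64_t first_data_block;
    uint64_t blocks_per_group;
    uint64_t blocks_count;   /* total blocks in the filesystem */
    uint64_t groups_count;
    uint64_t itb_per_group;  /* inode table blocks per group */
};

enum demo_desc_error {
    DEMO_OK = 0,
    DEMO_BAD_BLOCK_BITMAP,  /* "Block bitmap for group N not in group" */
    DEMO_BAD_INODE_BITMAP,  /* "Inode bitmap for group N not in group" */
    DEMO_BAD_INODE_TABLE    /* "Inode table for group N not in group" */
};

static int demo_in_range(uint64_t blk, uint64_t first, uint64_t last)
{
    return blk >= first && blk <= last;
}

/* Check that one descriptor points only at blocks inside its own group. */
enum demo_desc_error demo_check_descriptor(const struct demo_layout *fs,
                                           uint64_t group,
                                           const struct demo_group_desc *gd)
{
    uint64_t first = fs->first_data_block + group * fs->blocks_per_group;
    /* The last group may be short, so it ends at the end of the device. */
    uint64_t last = (group == fs->groups_count - 1)
                        ? fs->blocks_count - 1
                        : first + fs->blocks_per_group - 1;

    if (!demo_in_range(gd->block_bitmap, first, last))
        return DEMO_BAD_BLOCK_BITMAP;
    if (!demo_in_range(gd->inode_bitmap, first, last))
        return DEMO_BAD_INODE_BITMAP;
    /* The whole inode table, not just its first block, must fit. */
    if (gd->inode_table < first ||
        gd->inode_table + fs->itb_per_group - 1 > last)
        return DEMO_BAD_INODE_TABLE;
    return DEMO_OK;
}
```

The tests run in the same order as the messages we saw, which is why disabling one check simply exposed the next one for the same group.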
We then tried to rewrite the superblocks with mkfs.lustre and
--mkfsoptions="-S", which panicked the OSS. At that point, we gave up.
Though it didn't work out this time, we'll be in a better position to
succeed if this ever happens again.
UF HPC Center