[Lustre-discuss] fsck of OST problems - endless loop restarting pass 1

Craig Prescott prescott at hpc.ufl.edu
Thu Dec 3 09:27:08 PST 2009


Craig Prescott wrote:
> Andreas Dilger wrote:
>> Hmm, the code shouldn't be checking the checksums if the uninit_bg
>> feature is not enabled.  I believe this was fixed in ext4 already:
>>
>> in ldiskfs_group_desc_csum_verify() change it to be:
>>
>> int ldiskfs_group_desc_csum_verify(struct ext4_sb_info *sbi,
>>                                    __u32 block_group,
>>                                    struct ext4_group_desc *gdp)
>> {
>>         if ((sbi->s_es->s_feature_ro_compat &
>>              cpu_to_le32(LDISKFS_FEATURE_RO_COMPAT_GDT_CSUM)) &&
>>             (gdp->bg_checksum !=
>>              ldiskfs_group_desc_csum(sbi, block_group, gdp)))
>>                 return 0;
>>         return 1;
>> }
> 
> Ok, thanks.  I'll try that.
> 
<snip>
> Again, I really appreciate the help, and will let the list know how it 
> goes.

Sadly, we didn't have any luck with this.  We had written off the OST in 
our minds anyway, so getting any of the data back would have been a windfall.

The OST wouldn't mount as ldiskfs with the group descriptor checksum 
check disabled:

Dec  3 10:58:05 tebow2 kernel: LDISKFS-fs error (device dm-7):
ldiskfs_check_descriptors: Block bitmap for group 10112 not in group (block
484237063)!
Dec  3 10:58:05 tebow2 kernel: LDISKFS-fs: group descriptors corrupted!

Disabling that check and trying to mount yielded this one:

Dec  3 11:01:13 tebow2 kernel: LDISKFS-fs error (device dm-7):
ldiskfs_check_descriptors: Inode bitmap for group 10112 not in group (block
14342712)!
Dec  3 11:01:13 tebow2 kernel: LDISKFS-fs: group descriptors corrupted!

Disabling that check yielded this one:

Dec  3 11:01:59 tebow2 kernel: LDISKFS-fs error (device dm-7):
ldiskfs_check_descriptors: Inode table for group 10112 not in group (block
3538357782)!
Dec  3 11:01:59 tebow2 kernel: LDISKFS-fs: group descriptors corrupted!

All these messages were seen repeatedly in our fsck attempts.  If we had 
been able to get past this group, several thousand more would have followed.
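
For anyone reading this later, here is a rough userspace sketch of the kind 
of range test behind the "not in group" messages above.  It is not the 
ldiskfs source.  The function names are made up for illustration, and the 
geometry in the usage note (first data block 0, 32768 blocks per group, 
flex_bg not enabled) is an assumption, since the thread does not state the 
OST's block size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: first block number belonging to a block group.
 * first_data_block and blocks_per_group would come from the superblock. */
static uint64_t group_first_block(uint64_t first_data_block,
                                  uint64_t blocks_per_group,
                                  uint32_t group)
{
        assert(blocks_per_group > 0);
        return first_data_block + (uint64_t)group * blocks_per_group;
}

/* Hypothetical helper: returns 1 if `block` lies inside the group's own
 * block range, 0 otherwise.  This ignores flex_bg (which relaxes the
 * range to the whole filesystem) and the shorter final group. */
static int block_in_group(uint64_t first_data_block,
                          uint64_t blocks_per_group,
                          uint32_t group, uint64_t block)
{
        uint64_t first = group_first_block(first_data_block,
                                           blocks_per_group, group);
        uint64_t last = first + blocks_per_group - 1;

        return block >= first && block <= last;
}
```

Under the assumed geometry, group 10112 would cover blocks 331350016 
through 331382783, so block_in_group(0, 32768, 10112, 484237063) returns 
0, as it does for 14342712 and 3538357782.  All three locations reported 
for that group fall outside it.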

Disabling the "inode table not in group" check as well yielded this:

Dec  3 11:02:35 tebow2 kernel: ldiskfs: No journal on filesystem on dm-7

We then tried to rewrite the superblocks with mkfs.lustre and 
--mkfsoptions="-S", which panicked the OSS.  At that point, we gave up.

Though it didn't work out this time, we'll be in a better position to 
succeed if this ever happens again.

Thanks,
Craig Prescott
UF HPC Center