[Lustre-discuss] OST crash with group descriptors corrupted

thhsieh thhsieh at piano.rcas.sinica.edu.tw
Tue Mar 10 03:14:18 PDT 2009


Hello,

Thank you very much for your kind reply.

We ran e2fsck (version 1.41.4) on all the OST partitions, and it
reported thousands of errors. But now we have encountered a serious
error which I have no idea how to fix. Even though e2fsck
has finished checking, one of the OST partitions still has a
problem. The command:

./tunefs.lustre --writeconf /dev/sdb1

shows:

checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     cwork2-OST0000
Index:      0
Lustre FS:  cwork2
Mount type: ldiskfs
Flags:      0x2
              (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.10.50@tcp


   Permanent disk data:
Target:     cwork2-OST0000
Index:      0
Lustre FS:  cwork2
Mount type: ldiskfs
Flags:      0x102
              (OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.10.50@tcp

tunefs.lustre: Unable to mount /dev/sdb1: Invalid argument

tunefs.lustre FATAL: failed to write local files
tunefs.lustre: exiting with 22 (Invalid argument)

and the kernel log shows the following errors:

[80083.964462] LDISKFS-fs: group descriptors corrupted!
[81423.119834] LDISKFS-fs error (device sdb1): ldiskfs_check_descriptors: Checksum for group 11165 failed (0!=20224)
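
To make the descriptor messages easier to read, here is a minimal sketch
of the arithmetic behind the earlier "Block bitmap for group 11152 not in
group" error quoted further down. The geometry is an assumption, not
something the log states: 4 KiB blocks, hence 8 * 4096 = 32768 blocks per
group (consistent with a backup superblock at 32768), first data block 0,
and no flex_bg-style relocation of bitmaps. The helper names are
hypothetical.

```python
import re

# Assumed geometry (not stated in the kernel log): 4 KiB blocks give
# 8 * 4096 = 32768 blocks per group, and the first data block is 0.
BLOCKS_PER_GROUP = 32768
FIRST_DATA_BLOCK = 0

# Matches the "not in group" message format shown in the dmesg output.
BITMAP_RE = re.compile(
    r"Block bitmap for group (\d+) not in group \(block (\d+)\)"
)


def group_block_range(group):
    """Return (first, last) block numbers that belong to a block group."""
    first = FIRST_DATA_BLOCK + group * BLOCKS_PER_GROUP
    return first, first + BLOCKS_PER_GROUP - 1


def check_bitmap_line(line):
    """Parse one log line.

    Returns (group, block, first, last, in_group), or None when the line
    is not a "Block bitmap ... not in group" message.
    """
    m = BITMAP_RE.search(line)
    if m is None:
        return None
    group, block = int(m.group(1)), int(m.group(2))
    first, last = group_block_range(group)
    return group, block, first, last, first <= block <= last


if __name__ == "__main__":
    line = ("LDISKFS-fs error (device sdb1): ldiskfs_check_descriptors: "
            "Block bitmap for group 11152 not in group (block 3407085568)!")
    print(check_bitmap_line(line))
```

Under these assumptions group 11152 would span blocks 365428736 to
365461503, so a bitmap pointer of 3407085568 lies far outside the group,
which suggests the descriptor itself holds garbage rather than a slightly
wrong value.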


We tried running e2fsck with the backup superblock at 32768, but
after it corrected some errors we hit the same problem again. How
can we fix this kind of problem? In any case, we are trying to
rescue as much of the existing data as possible, and will reformat
the whole filesystem after that.
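
For reference, a read-only check against a backup superblock could be
assembled roughly as follows. This is only a sketch: the function names
are hypothetical, the 4096-byte block size is an assumption inferred from
the backup superblock location 32768, and the only e2fsck options used
are -f (force), -n (open read-only and answer "no" to every question),
-b (alternative superblock) and -B (block size).

```python
import subprocess


def e2fsck_dry_run_cmd(device, backup_superblock=None, block_size=None):
    """Build an e2fsck command line that never modifies the device."""
    cmd = ["e2fsck", "-f", "-n"]
    if backup_superblock is not None:
        cmd += ["-b", str(backup_superblock)]
        # -B is only meaningful together with -b.
        if block_size is not None:
            cmd += ["-B", str(block_size)]
    cmd.append(device)
    return cmd


def run_dry_run(device, **kwargs):
    """Run the read-only check and return the CompletedProcess.

    e2fsck uses non-zero exit codes to report problems, so the return
    code is left for the caller to inspect rather than raising.
    """
    cmd = e2fsck_dry_run_cmd(device, **kwargs)
    return subprocess.run(cmd, capture_output=True, text=True)


if __name__ == "__main__":
    # Only prints the command; nothing is executed against the disk here.
    print(" ".join(e2fsck_dry_run_cmd("/dev/sdb1", 32768, 4096)))
```

Keeping the command construction separate from the call that runs it
makes it easy to review exactly what would be executed before touching
the damaged OST.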

Is there any other information I should provide to make the
situation clearer? Please let me know.

I am really thankful for your kind suggestions.


Best Regards,

T.H.Hsieh


On Mon, Mar 09, 2009 at 02:13:15PM -0400, Brian J. Murrell wrote:
> On Mon, 2009-03-09 at 19:39 +0800, thhsieh wrote:
> > Dear All,
> > 
> > We have an emergency on the Lustre filesystem.
> > 
> > But today
> > we encountered a disk array hardware problem (one of the hard disks
> > in the RAID 6 disk array crashed), and soon after that the Lustre
> > filesystem crashed, too.
> 
> > The dmesg message shows:
> > 
> > [ 3314.530762] LDISKFS-fs error (device sdb1): ldiskfs_check_descriptors: Block bitmap for group 11152 not in group (block 3407085568)!
> > [ 3314.531701] LDISKFS-fs: group descriptors corrupted!
> 
> It looks like your disk error has resulted in on-disk corruption.
> AFAIK, RAID is supposed to prevent this.  No idea why it didn't in this
> case.  Maybe check with your RAID vendor.
> 
> > It seems that the backend ext3 file system is still there, but has
> > errors.
> 
> Indeed.
> 
> > Could anyone suggest a way to recover the OST partitions? Can I use
> > e2fsck to fix the problems of the OST partitions?
> 
> Yes, e2fsck should correct the problem(s).  Be aware that there is a
> possibility that the only way for e2fsck to correct the state of the
> filesystem is to (re-)move data from the filesystem.  To what extent,
> will depend completely on how much on-disk corruption has taken place.
> 
> You can get an idea of what e2fsck will do without actually doing
> anything to the disk data by giving it the "-n" argument.  You can
> decide based on that "dry-run" e2fsck output whether the corrective
> action it will take is acceptable to you.
> 
> b.
> 


