[Lustre-discuss] fsck of OST problems - endless loop restarting pass 1

Craig Prescott prescott at hpc.ufl.edu
Tue Dec 1 12:56:51 PST 2009


Hope someone can help us out with this one.

We are running Lustre 1.8.1.1.  One of our two OSS nodes (12 OSTs)
became unresponsive on Sunday night, so we issued an IPMI power cycle.

After the node was back up, we tried to fsck the OSTs 
(e2fsprogs-1.41.6.sun1-0redhat.x86_64) with 'fsck -f -y'.  Eleven of the 
twelve OSTs fsck'd normally.  The 12th OST showed heavy corruption, with 
many inodes moved to /lost+found.  This fsck never finished, and we 
killed it after ~14 hours.

All further fsck attempts seem to endlessly get kicked back to pass 1 
after many zero dtime corrections, and relocating many group block 
bitmaps, inode bitmaps, and inode tables.  It seems that many of these 
changes are never written out to the filesystem, as we encounter the 
same corrections on subsequent pass 1 restarts.  Actually, it looks like 
every *other* attempt to run pass 1 yields similar output, as if fsck is 
bouncing back and forth between two solutions.
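
One way to check the "every other attempt" impression is to split a saved
fsck log at each pass 1 banner and compare which inode numbers each attempt
mentions.  The following is only a rough sketch, not something we have run
against these logs: the function names are made up for illustration, the
inode regex is a guess based on the message formats quoted further down, and
it assumes every restart prints the same "Pass 1: Checking inodes, blocks,
and sizes" line.

```python
import gzip
import re
import sys

# Banner that e2fsck prints at the start of pass 1 (as seen in our log).
PASS1 = "Pass 1: Checking inodes, blocks, and sizes"
# Assumed pattern: matches "Inode 32784385" and "inode #114786307".
INODE_RE = re.compile(r"[Ii]node #?(\d+)")


def split_attempts(lines):
    """Return one list of log lines per pass 1 attempt."""
    attempts = []
    for line in lines:
        if line.startswith(PASS1):
            attempts.append([])          # a new attempt begins here
        elif attempts:
            attempts[-1].append(line.rstrip("\n"))
    return attempts


def inode_set(attempt):
    """Collect every inode number mentioned in one attempt."""
    return {int(m.group(1)) for line in attempt for m in INODE_RE.finditer(line)}


def jaccard(a, b):
    """Similarity of two sets: 1.0 means identical, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def overlap_matrix(attempts):
    """Pairwise similarity of the inode sets of all attempts."""
    sets = [inode_set(a) for a in attempts]
    return [[jaccard(a, b) for b in sets] for a in sets]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical local copy): python compare.py fsck.log.1.41.6.gz
    with gzip.open(sys.argv[1], "rt", errors="replace") as f:
        for row in overlap_matrix(split_attempts(f)):
            print(" ".join(f"{v:.2f}" for v in row))
```

If fsck really is alternating between two states, attempts 1 and 3 (and 2
and 4) should score close to 1.00 against each other, while neighbouring
attempts score noticeably lower.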

We have tried e2fsprogs 1.41.6.sun1-0redhat and 1.41.9 from SourceForge.
Logs (enormous) of the fsck attempts are available here:

http://hpc.ufl.edu/logs/fsck.log.1.41.9.gz (2 full pass 1 fsck attempts)
http://hpc.ufl.edu/logs/fsck.log.1.41.6.gz (4 full pass 1 fsck attempts)

Can any part of this OST be salvaged?

Thanks,
Craig Prescott
UF HPC Center


From the initial fsck:

fsck.ext4: Group descriptors look bad... trying backup blocks...
Superblock has an invalid journal (inode 8).
Clear? yes

*** ext3 journal has been deleted - filesystem is now ext2 only ***

Superblock has_journal flag is clear, but a journal inode is present.
Clear? yes

Pass 1: Checking inodes, blocks, and sizes
Journal inode is not in use, but contains data.  Clear? yes


Inodes that were part of a corrupted orphan linked list found.  Fix? yes

Inode 32784385 was part of the orphaned inode list.  FIXED.
Inode 32784385 has imagic flag set.  Clear? yes

...

File ??? (inode #114786307, mod time Fri Oct 10 14:03:48 2008)
   has 506488 multiply-claimed block(s), shared with 7 file(s):
         ??? (inode #114786319, mod time Fri Oct 10 14:03:48 2008)
         ... (inode #114786317, mod time Fri Oct 10 14:03:48 2008)
         ... (inode #114786315, mod time Fri Oct 10 14:03:48 2008)
         ??? (inode #114786313, mod time Fri Oct 10 14:03:48 2008)
         ... (inode #114786311, mod time Fri Oct 10 14:03:48 2008)
         ... (inode #114786309, mod time Fri Oct 10 14:03:48 2008)
         ??? (inode #114786305, mod time Fri Oct 10 14:03:48 2008)
Clone multiply-claimed blocks? yes

...


