[Lustre-discuss] OSS: bad header in inode - invalid magic

Michael Sternberg sternberg at anl.gov
Tue Jul 1 17:52:29 PDT 2008


Hi,

I repeatedly encounter an "invalid magic" error in one particular inode
of one of my OSS volumes (1 of 4, each 5 TB), with the consequence that
Lustre remounts the filesystem read-only.

I run 2.6.18-53.1.13.el5_lustre.1.6.4.3smp on RHEL5.1 on a cluster  
with approx. 150 client nodes.

The error appears on the OSS as:

Jul  1 15:43:58 oss01 kernel: LDISKFS-fs error (device dm-3): ldiskfs_ext_find_extent: bad header in inode #405012501: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jul  1 15:43:58 oss01 kernel: Remounting filesystem read-only
Jul  1 15:43:58 oss01 kernel: LDISKFS-fs error (device dm-3): ldiskfs_ext_find_extent: bad header in inode #405012501: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jul  1 15:43:58 oss01 kernel: LustreError: 25462:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) can't get handle for 45 credits: rc = -30
Jul  1 15:43:58 oss01 kernel: LustreError: 25462:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) Skipped 6 previous similar messages
Jul  1 15:43:58 oss01 kernel: LustreError: 25462:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
Jul  1 15:43:58 oss01 kernel: LustreError: 19569:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
[... many repeats]


Three login nodes signaled the same wall(8) message, about 10-15
minutes apart:

Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
login1 kernel: LustreError: 5612:0:(ptlrpcd.c:72:ptlrpcd_wake()) ASSERTION(pc != NULL) failed
Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
login1 kernel: LustreError: 5612:0:(tracefile.c:431:libcfs_assertion_failed()) LBUG



Twice in the past, I followed this recovery procedure from the Manual  
and the Wiki:

     http://wiki.lustre.org/index.php?title=Fsck_Support#Using_e2fsck_on_a_backing_filesystem%7Cusing
	Using e2fsck on a backing filesystem
	-- nice walk-through

     http://manual.lustre.org/manual/LustreManual16_HTML/Failover.html#50446391_pgfId-1287654
	8.4.1 Starting/Stopping a Resource

	[i.e., simply umounting the device on the OSS - is this correct?]

     http://manual.lustre.org/manual/LustreManual16_HTML/LustreInstallation.html#50446385_43530
	4.2.1.5	Stopping a Server


In other words:
	umount the OSS
	perform fsck on the block device
	remount the OSS
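
For what it's worth, before touching a device I also double-check which
block device actually backs which OST, since the dm-* names can shift;
the label check via tune2fs is just my own habit, not something taken
from the manual:

	# Confirm the device/OST mapping before running e2fsck;
	# the ldiskfs volume label should match the OST name.
	[root@oss01 ~]# grep ost2 /proc/mounts
	[root@oss01 ~]# tune2fs -l /dev/dm-3 | grep -i 'volume name'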

So, last time I did:

	[root@oss01 ~]# umount /mnt/ost2
	[root@oss01 ~]# e2fsck -fp /dev/dm-3

	lustre-OST0002: recovering journal
	lustre-OST0002: ext3 recovery flag is clear, but journal has data.
	lustre-OST0002: Run journal anyway

	lustre-OST0002: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
		(i.e., without -a or -p options)

	[root@oss01 ~]# mount -t ldiskfs /dev/dm-3 /mnt/ost2
	[root@oss01 ~]# umount /mnt/ost2
	[root@oss01 ~]# e2fsck -fp /dev/dm-3
	lustre-OST0002: 342355/427253760 files (4.2% non-contiguous), 139324997/1708984375 blocks


To my surprise, there were no errors.  I did the same today after the  
error above, but left out the "-p" flag; still, fsck did not find an  
error (except the journal replay??):

	[root@oss01 ~]# e2fsck -f /dev/dm-3
	e2fsck 1.40.4.cfs1 (31-Dec-2007)
	lustre-OST0002: recovering journal
	Pass 1: Checking inodes, blocks, and sizes
	Pass 2: Checking directory structure
	Pass 3: Checking directory connectivity
	Pass 4: Checking reference counts
	Pass 5: Checking group summary information

	lustre-OST0002: ***** FILE SYSTEM WAS MODIFIED *****
	lustre-OST0002: 343702/427253760 files (4.4% non-contiguous), 137003893/1708984375 blocks
	[root@oss01 ~]#

I haven't mounted it back yet, for fear this would stall the system
again in a couple of days.


How can I locate the "bad" inode, and should I even try?  Is this an
inode of the Lustre filesystem or of the underlying ext3 on the OST?
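
If locating it is worthwhile, I was considering something along these
lines to map the inode to an object on the OST and from there to a
client-visible file; the O/0/d*/ object layout, the lfs invocations,
and /lustre as the client mount point are my assumptions from reading
the manual, so please correct me if this is the wrong approach:

	# On the OSS, with the OST unmounted (or mounted as ldiskfs):
	# map the inode number from the syslog message to a pathname
	# on the backing filesystem.
	[root@oss01 ~]# debugfs -R 'ncheck 405012501' /dev/dm-3

	# Assuming that returns an object path like O/0/d*/<objid>,
	# list files striped over this OST from a client and compare
	# their object ids as reported by lfs getstripe.
	[login1 ~]$ lfs find --obd lustre-OST0002_UUID /lustre | head
	[login1 ~]$ lfs getstripe /lustre/<candidate-file>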

Are there version dependencies between e2fsck and Lustre?  I am running
lustre-1.6.4.3 and e2fsck-1.40.4.
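
For reference, this is where those version numbers come from (assuming
/proc/fs/lustre/version and the package names are the right places to
look):

	[root@oss01 ~]# cat /proc/fs/lustre/version
	[root@oss01 ~]# rpm -q lustre e2fsprogs
	[root@oss01 ~]# e2fsck -V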


I would appreciate any pointers.


Thank you for your attention and help.
Michael
