[Lustre-discuss] OSS: bad header in inode - invalid magic
Michael Sternberg
sternberg at anl.gov
Tue Jul 1 17:52:29 PDT 2008
Hi,
I repeatedly encounter "invalid magic" in one particular inode on one
of my OSS volumes (1 of 4, each 5 TB), with the consequence of Lustre
remounting the filesystem read-only.
I run 2.6.18-53.1.13.el5_lustre.1.6.4.3smp on RHEL5.1 on a cluster
with approx. 150 client nodes.
The error appears on the OSS as:
Jul 1 15:43:58 oss01 kernel: LDISKFS-fs error (device dm-3): ldiskfs_ext_find_extent: bad header in inode #405012501: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jul 1 15:43:58 oss01 kernel: Remounting filesystem read-only
Jul 1 15:43:58 oss01 kernel: LDISKFS-fs error (device dm-3): ldiskfs_ext_find_extent: bad header in inode #405012501: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jul 1 15:43:58 oss01 kernel: LustreError: 25462:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) can't get handle for 45 credits: rc = -30
Jul 1 15:43:58 oss01 kernel: LustreError: 25462:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) Skipped 6 previous similar messages
Jul 1 15:43:58 oss01 kernel: LustreError: 25462:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
Jul 1 15:43:58 oss01 kernel: LustreError: 19569:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
[... many repeats]
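As a side note, the "rc = -30" in these messages is a negative errno; 30 is EROFS, which matches the "Remounting filesystem read-only" just before it. A quick way to confirm the mapping (assuming python3 is on the node):

```shell
# Decode the "rc = -30" from the LustreError lines: kernel code returns
# negative errno values, and errno 30 is EROFS.
python3 -c 'import errno, os; print(errno.errorcode[30], "-", os.strerror(30))'
# -> EROFS - Read-only file system
```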
Three login nodes, about 10-15 minutes apart, signaled the same
wall(8) message:
Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
login1 kernel: LustreError: 5612:0:(ptlrpcd.c:72:ptlrpcd_wake())
ASSERTION(pc != NULL) failed
Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
login1 kernel: LustreError: 5612:0:(tracefile.c:431:libcfs_assertion_failed()) LBUG
Twice in the past, I followed this recovery procedure from the Manual
and the Wiki:
http://wiki.lustre.org/index.php?title=Fsck_Support#Using_e2fsck_on_a_backing_filesystem
Using e2fsck on a backing filesystem
-- nice walk-through
http://manual.lustre.org/manual/LustreManual16_HTML/Failover.html#50446391_pgfId-1287654
8.4.1 Starting/Stopping a Resource
[i.e., simply umounting the device on the OSS - is this correct?]
http://manual.lustre.org/manual/LustreManual16_HTML/LustreInstallation.html#50446385_43530
4.2.1.5 Stopping a Server
In other words:
  umount the OST volume on the OSS
  run fsck on the underlying block device
  remount the OST
So, last time I did:
[root@oss01 ~]# umount /mnt/ost2
[root@oss01 ~]# e2fsck -fp /dev/dm-3
lustre-OST0002: recovering journal
lustre-OST0002: ext3 recovery flag is clear, but journal has data.
lustre-OST0002: Run journal anyway
lustre-OST0002: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
[root@oss01 ~]# mount -t ldiskfs /dev/dm-3 /mnt/ost2
[root@oss01 ~]# umount /mnt/ost2
[root@oss01 ~]# e2fsck -fp /dev/dm-3
lustre-OST0002: 342355/427253760 files (4.2% non-contiguous), 139324997/1708984375 blocks
To my surprise, there were no errors. I did the same today after the
error above, but left out the "-p" flag; still, fsck did not find an
error (except the journal replay??):
[root@oss01 ~]# e2fsck -f /dev/dm-3
e2fsck 1.40.4.cfs1 (31-Dec-2007)
lustre-OST0002: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
lustre-OST0002: ***** FILE SYSTEM WAS MODIFIED *****
lustre-OST0002: 343702/427253760 files (4.4% non-contiguous), 137003893/1708984375 blocks
[root@oss01 ~]#
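Incidentally, one can see beforehand whether only a journal replay is pending: the superblock's feature list carries a needs_recovery flag, and dumpe2fs -h reads just the superblock without modifying the device (a sketch, using the device path from above):

```shell
# Print the superblock feature flags of the unmounted OST device;
# "needs_recovery" in the feature list means a journal replay is
# still pending and e2fsck (or a mount) will recover the journal first.
dumpe2fs -h /dev/dm-3 2>/dev/null | grep -i 'filesystem features'
```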
I haven't mounted it back yet, for fear this would stall the system
again in a couple of days.
How can I locate the "bad" inode, and should I try? Is this an inode
of the Lustre FS or of the underlying ext3 on the OST?
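One avenue I considered for locating it: debugfs can map an inode number back to a pathname on the unmounted backing filesystem with ncheck, and stat shows the inode itself (a sketch; both are read-only requests, inode number taken from the kernel log above):

```shell
# Map inode #405012501 from the kernel log to a pathname on the
# backing ldiskfs/ext3 filesystem (run with the OST unmounted):
debugfs -R 'ncheck 405012501' /dev/dm-3
# Inspect the inode itself (size, block count, block/extent map):
debugfs -R 'stat <405012501>' /dev/dm-3
```

On an OST the pathname would presumably be an object under O/, not a client-visible file; as I understand it, something like `lfs find --obd lustre-OST0002_UUID <mountpoint>` on a client could then list the Lustre files with objects on that OST.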
Are there version dependencies between e2fsck and Lustre? I am running
lustre-1.6.4.3 and e2fsck-1.40.4.
I would appreciate any pointers.
Thank you for your attention and help.
Michael