[Lustre-discuss] Lustre Mount Crashing
Charles Taylor
taylor at hpc.ufl.edu
Mon Jun 2 08:35:35 PDT 2008
Well, I figured someone would ask that. :) The last messages that
make it to syslog prior to the crash are:
Jun 2 10:29:54 hpcmds kernel: LDISKFS FS on md2, internal journal
Jun 2 10:29:54 hpcmds kernel: LDISKFS-fs: recovery complete.
Jun 2 10:29:54 hpcmds kernel: LDISKFS-fs: mounted filesystem with
ordered data mode.
Jun 2 10:29:54 hpcmds kernel: kjournald starting. Commit interval 5
seconds
Jun 2 10:29:54 hpcmds kernel: LDISKFS FS on md2, internal journal
Jun 2 10:29:54 hpcmds kernel: LDISKFS-fs: mounted filesystem with
ordered data mode.
Jun 2 10:29:54 hpcmds kernel: Lustre: MGS MGS started
Jun 2 10:29:54 hpcmds kernel: Lustre: Enabling user_xattr
Jun 2 10:29:54 hpcmds kernel: Lustre: 4540:0:(mds_fs.c:
446:mds_init_server_data()) RECOVERY: service ufhpc-MDT0000, 100
recoverable clients, last_transno 9412464331
Jun 2 10:29:54 hpcmds kernel: Lustre: MDT ufhpc-MDT0000 now serving
dev (ufhpc-MDT0000/cac99db5-a66a-a6ac-4649-6ec8cc2dc0e7), but will be
in recovery until 100 clients reconnect, or if no clients reconnect
for 4:10; during that time new clients will not be allowed to connect.
Recovery progress can be monitored by watching /proc/fs/lustre/mds/
ufhpc-MDT0000/recovery_status.
Jun 2 10:29:55 hpcmds kernel: Lustre: 4540:0:(mds_lov.c:
858:mds_notify()) MDS ufhpc-MDT0000: in recovery, not resetting
orphans on ufhpc-OST0004_UUID
Jun 2 10:29:55 hpcmds kernel: Lustre: 4540:0:(mds_lov.c:
858:mds_notify()) MDS ufhpc-MDT0000: in recovery, not resetting
orphans on ufhpc-OST0005_UUID
Note that all of the clients are powered off and the OSSes are
currently unmounted (though they appear to be fine).
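Since the log above points at /proc/fs/lustre/mds/ufhpc-MDT0000/recovery_status, recovery can be watched from another terminal with a small polling loop. A rough sketch (the exact `status:` field name may vary by Lustre version, so treat it as an assumption):

```shell
# wait_recovery: poll a Lustre recovery_status proc file until it
# reports COMPLETE, printing the state each pass.
# $1 - path to the recovery_status file
wait_recovery() {
    status_file="$1"
    while true; do
        # Pull the value of the "status:" line (e.g. RECOVERING, COMPLETE).
        state=$(awk '/^status:/ {print $2}' "$status_file")
        echo "recovery state: $state"
        [ "$state" = "COMPLETE" ] && break
        sleep 5
    done
}

# On the MDS this would be invoked as:
# wait_recovery /proc/fs/lustre/mds/ufhpc-MDT0000/recovery_status
```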
Unfortunately, getting the messages off the console (in the machine
room) means using pencil and paper (you'd think we'd have something as
fancy as an IP-KVM console server, but alas, we do things, ahem,
"inexpensively" here). I'm going to let the md mirrors resync before
I try it again (although I don't think that should be an issue).
If it crashes a third time, and I suspect it will, I'll include some
of the stack trace. Of course, part of the problem is that the trace
is deep enough that it scrolls off screen and we can't see the top
(which is the useful part). :)
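One way around the pencil-and-paper problem, short of a serial console or IP-KVM, is to ship kernel messages to another box with the stock netconsole module. A sketch, with the log-host address and NIC name as placeholders:

```shell
# Stream kernel/console messages over UDP to a remote host.
# Format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# 10.1.1.5 and eth0 are placeholders -- substitute your log host and NIC.
modprobe netconsole netconsole=@/eth0,6666@10.1.1.5/

# On the receiving host, capture the stream (netcat flags vary by
# flavor; this is BSD netcat syntax):
# nc -u -l 6666 | tee mds-console.log
```

This captures the full oops, including the top of the trace that scrolls off the physical screen, as long as networking is still up when the crash happens.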
I was hoping for a silver bullet, but...
Thanks,
Charlie Taylor
UF HPC Center
On Jun 2, 2008, at 11:16 AM, Johann Lombardi wrote:
> On Mon, Jun 02, 2008 at 11:02:11AM -0400, Charles Taylor wrote:
>> We lost our MDS/MGS to a power failure yesterday evening. Just to
>> be safe, we ran e2fsck on the combined MDT/MGT and there were only a
>> couple of minor complaints about HTREE issues that it fixed. The
>> MDT/MGT now fsck's cleanly. The problem is that, despite the clean
>> e2fsck, the MGS is crashing in the lustre mount code when attempting
>> to mount the MDT.
>
> Where is it crashing exactly? Any stack traces, assertion failures ...
> on the console?
>
> Johann