[Lustre-discuss] Lustre Mount Crashing
Charles Taylor
taylor at hpc.ufl.edu
Mon Jun 2 08:35:35 PDT 2008
Well, I figured someone would ask that. :) The last messages that
make it to syslog prior to the crash are:
Jun 2 10:29:54 hpcmds kernel: LDISKFS FS on md2, internal journal
Jun 2 10:29:54 hpcmds kernel: LDISKFS-fs: recovery complete.
Jun 2 10:29:54 hpcmds kernel: LDISKFS-fs: mounted filesystem with
ordered data mode.
Jun 2 10:29:54 hpcmds kernel: kjournald starting. Commit interval 5
seconds
Jun 2 10:29:54 hpcmds kernel: LDISKFS FS on md2, internal journal
Jun 2 10:29:54 hpcmds kernel: LDISKFS-fs: mounted filesystem with
ordered data mode.
Jun 2 10:29:54 hpcmds kernel: Lustre: MGS MGS started
Jun 2 10:29:54 hpcmds kernel: Lustre: Enabling user_xattr
Jun 2 10:29:54 hpcmds kernel: Lustre: 4540:0:(mds_fs.c:
446:mds_init_server_data()) RECOVERY: service ufhpc-MDT0000, 100
recoverable clients, last_transno 9412464331
Jun 2 10:29:54 hpcmds kernel: Lustre: MDT ufhpc-MDT0000 now serving
dev (ufhpc-MDT0000/cac99db5-a66a-a6ac-4649-6ec8cc2dc0e7), but will be
in recovery until 100 clients reconnect, or if no clients reconnect
for 4:10; during that time new clients will not be allowed to connect.
Recovery progress can be monitored by watching /proc/fs/lustre/mds/
ufhpc-MDT0000/recovery_status.
Jun 2 10:29:55 hpcmds kernel: Lustre: 4540:0:(mds_lov.c:
858:mds_notify()) MDS ufhpc-MDT0000: in recovery, not resetting
orphans on ufhpc-OST0004_UUID
Jun 2 10:29:55 hpcmds kernel: Lustre: 4540:0:(mds_lov.c:
858:mds_notify()) MDS ufhpc-MDT0000: in recovery, not resetting
orphans on ufhpc-OST0005_UUID
Note that all of the clients are powered off and the OSSes are
currently unmounted (though they appear to be fine).
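Since the log above points at /proc/fs/lustre/mds/ufhpc-MDT0000/recovery_status, recovery can be watched from another terminal with a small polling loop. A rough sketch (the exact `status:` field name may vary by Lustre version, so treat it as an assumption):

```shell
# wait_recovery: poll a Lustre recovery_status proc file until it
# reports COMPLETE, printing the state each pass.
# $1 - path to the recovery_status file
wait_recovery() {
    status_file="$1"
    while true; do
        # Pull the value of the "status:" line (e.g. RECOVERING, COMPLETE).
        state=$(awk '/^status:/ {print $2}' "$status_file")
        echo "recovery state: $state"
        [ "$state" = "COMPLETE" ] && break
        sleep 5
    done
}

# On the MDS this would be invoked as:
# wait_recovery /proc/fs/lustre/mds/ufhpc-MDT0000/recovery_status
```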
Unfortunately, getting the messages off the console (in the machine
room) means using pencil and paper (you'd think we'd have something as
fancy as an IP-KVM console server, but alas, we do things, ahem,
"inexpensively" here). I'm going to let the md mirrors resync before
I try it again (although I don't think that should be an issue).
If it crashes a third time, and I suspect it will, I'll include some
of the stack trace. Of course, part of the problem is that the trace
is deep enough that it scrolls off screen and we can't see the top
(which is the useful part). :)
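One way around the pencil-and-paper problem, short of a serial console or IP-KVM, is to ship kernel messages to another box with the stock netconsole module. A sketch, with the log-host address and NIC name as placeholders:

```shell
# Stream kernel/console messages over UDP to a remote host.
# Format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# 10.1.1.5 and eth0 are placeholders -- substitute your log host and NIC.
modprobe netconsole netconsole=@/eth0,6666@10.1.1.5/

# On the receiving host, capture the stream (netcat flags vary by
# flavor; this is BSD netcat syntax):
# nc -u -l 6666 | tee mds-console.log
```

This captures the full oops, including the top of the trace that scrolls off the physical screen, as long as networking is still up when the crash happens.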
I was hoping for a silver bullet, but...
Thanks,
Charlie Taylor
UF HPC Center
On Jun 2, 2008, at 11:16 AM, Johann Lombardi wrote:
> On Mon, Jun 02, 2008 at 11:02:11AM -0400, Charles Taylor wrote:
>> We lost our MDS/MGS to a power failure yesterday evening. Just to
>> be safe, we ran e2fsck on the combined MDT/MGT and there were only a
>> couple of minor complaints about HTREE issues that it fixed. The
>> MDT/MGT now fsck's cleanly. The problem is that, despite the clean
>> e2fsck, the MGS is crashing in the lustre mount code when attempting
>> to mount the MDT.
>
> Where is it crashing exactly? Any stack traces, assertion failures ...
> on the console?
>
> Johann