[Lustre-discuss] OST denied reconnect to MGS; kernel panic led to system crash

Ms. Megan Larko dobsonunit at gmail.com
Fri Feb 20 11:10:32 PST 2009


Hello,

I had an unusual event on my Lustre OSS today. At 6:20 a.m. there
appeared to be a communication failure: one of several OSTs on the OSS
would not communicate with the MGS (the other OSTs did not generate
any such error). The error seemed to lead to a kernel panic and a
crash.

I am attaching the February 20 sections of the OSS /var/log/messages
file:  lustre.error.20Feb09.gz

The system tried to communicate with the OST crew8-OST0009 a couple of
times. Then some sort of system memory error appears to have
occurred. The Lustre error number was -16, which I did not see in
http://manual.lustre.org/manual/LustreManual16_HTML/LustreTroubleshootingTips.html.
I also could not find -16 in any of the following errno.h files I
checked:

[root@oss4 log]# vi /usr/include/asm-x86_64/errno.h
[root@oss4 log]# vi /usr/include/errno.h
[root@oss4 log]# vi /usr/include/asm/errno.h
[root@oss4 log]# vi /usr/include/linux/errno.h
[root@oss4 log]# vi /usr/include/sys/errno.h
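
For what it is worth, kernel-side code (Lustre included) conventionally
reports errors as negated errno values, so -16 would correspond to
errno 16 rather than to a literal "-16" entry in a header. Below is a
minimal sketch of that lookup using Python's standard errno module
instead of reading the headers by hand. The helper name describe_rc is
made up for illustration, and the assumption is that the logged -16 is
an ordinary negated errno.

```python
import errno
import os


def describe_rc(rc):
    """Map a kernel-style return code (e.g. -16) to (symbol, message).

    Hypothetical helper: assumes rc is a plain errno value, possibly negated.
    """
    code = abs(rc)  # kernel code reports errno values as negatives
    name = errno.errorcode.get(code, "UNKNOWN")  # symbolic name, e.g. EBUSY
    message = os.strerror(code)  # message text depends on the platform's libc
    return name, message


if __name__ == "__main__":
    name, message = describe_rc(-16)
    print(name, "-", message)  # the symbolic name for errno 16 is EBUSY
```

On Linux the low-numbered codes are typically defined in
asm-generic/errno-base.h, which the errno.h files listed above pull in
only indirectly; that may be why searching those files directly turns
up nothing for 16.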

The drives are contained in 16-bay JBODs connected to the server via
an LSI 1078 card. The LSI utility MegaCli64 gave no indication of any
errors on either the card or any of the drives.

The OSS in question did reboot and (eventually) recover all the disks
on its own.

The kernel is 2.6.18-53.1.13.el5_lustre.1.6.4.3smp on CentOS 5.

Any insights? Thoughts welcome.

megan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre.error.20Feb09.gz
Type: application/x-gzip
Size: 18904 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20090220/8143d4d0/attachment.bin>

