[Lustre-discuss] How to read/understand a Lustre error

Brian J. Murrell Brian.Murrell at Sun.COM
Wed Mar 4 09:29:13 PST 2009


On Wed, 2009-03-04 at 12:11 -0500, Ms. Megan Larko wrote:
> Greetings,
> 
> I have a Lustre OSS with eleven (0-11) OSTs.  Every once in a while
> the OSS hosting the OSTs fails with a kernel panic.

To be clear, what you are reporting is not a kernel panic.  It is a
watchdog timeout.  Kernel panics halt the machine.  Watchdog timeouts do
not, although both will print stack traces so they are easily confused.
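
As a rough illustration of that distinction, a script along these lines
could sort console or syslog lines into the two cases. This is only a
sketch: the function names are hypothetical, and the marker strings for
the panic and watchdog cases are assumptions about what the kernel
prints, so check them against your own logs. Only the "lock callback
timer expired" text comes from this thread.

```python
import re

# Assumed marker strings; verify them against your own console output.
PANIC_MARKER = re.compile(r"Kernel panic")          # assumption
WATCHDOG_MARKER = re.compile(r"[Ww]atchdog")        # assumption
# This string is quoted from the message discussed in this thread.
LOCK_TIMEOUT_MARKER = re.compile(r"lock callback timer expired")


def classify(line):
    """Return a coarse label for one log line (hypothetical helper)."""
    if PANIC_MARKER.search(line):
        return "panic"
    if LOCK_TIMEOUT_MARKER.search(line):
        return "lock-timeout"
    if WATCHDOG_MARKER.search(line):
        return "watchdog"
    return "other"


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = {"panic": 0, "lock-timeout": 0, "watchdog": 0, "other": 0}
    for line in lines:
        counts[classify(line)] += 1
    return counts
```

If the "panic" count stays at zero while the "watchdog" count does not,
the stack traces you are seeing are most likely watchdog dumps rather
than a real panic.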

>   The system runs
> CentOS 5.1 using Lustre kernel 2.6.18-53.1.13.el5_lustre.1.6.4.3smp.

I would suggest upgrading to 1.6.7.  We fix quite a number of bugs with
each point release we do, and given that you are 3 releases behind, that
is a lot of bugs.

> How do I understand the "lock callback timer expired" message below?

A client was asked to give back a lock it held and timed out doing so.
This usually indicates a bug or a network failure.

> After the dump the system shows "kernel panic" on console and requires
> a manual reboot.

Maybe you are getting a panic, but there is no evidence of that in what
you pasted below, just the watchdog timeout and its stack trace.

> Any tips and insight greatly appreciated.

Really.  If at all possible, upgrade.

b.