[Lustre-discuss] How to read/understand a Lustre error
Brian J. Murrell
Brian.Murrell at Sun.COM
Wed Mar 4 09:29:13 PST 2009
On Wed, 2009-03-04 at 12:11 -0500, Ms. Megan Larko wrote:
> Greetings,
>
> I have a Lustre OSS with eleven (0-11) OSTs. Every once in a while
> the OSS hosting the OSTs fails with a kernel panic.
To be clear, what you are reporting is not a kernel panic. It is a
watchdog timeout. Kernel panics halt the machine. Watchdog timeouts do
not, although both will print stack traces so they are easily confused.
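One rough way to tell the two apart in a saved console or syslog capture is to scan for the marker text each event prints. The sketch below is only an illustration: the "Kernel panic - not syncing" and "Watchdog triggered for pid" strings are assumed markers (exact wording can differ between kernel and Lustre versions), and the names classify_log and summarize are hypothetical helpers, not part of any Lustre tool. Only the "lock callback timer expired" text comes from the report itself.

```python
import re

# Marker strings are assumptions; adjust them to match your own logs.
# Only "lock callback timer expired" is taken from the original report.
PATTERNS = {
    "panic": re.compile(r"Kernel panic - not syncing"),
    "watchdog": re.compile(r"Watchdog triggered for pid \d+"),
    "lock_timeout": re.compile(r"lock callback timer expired"),
}


def classify_log(lines):
    """Count how many lines match each marker pattern."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def summarize(counts):
    """Give a one-phrase verdict; a panic takes precedence over a watchdog."""
    if counts["panic"]:
        return "kernel panic"
    if counts["watchdog"]:
        return "watchdog timeout only"
    return "neither"
```

Passing the lines of an open log file to classify_log and the result to summarize gives a quick first answer; a stack trace with no panic marker anywhere in the capture points to a watchdog timeout rather than a panic.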
> The system runs
> CentOS 5.1 using Lustre kernel 2.6.18-53.1.13.el5_lustre.1.6.4.3smp.
I would suggest upgrading to 1.6.7. We fix quite a number of bugs with
each point release we do, and given that you are 3 releases behind, that
is a lot of bugs.
> How do I understand the "lock callback timer expired" message below?
A client was asked to give back a lock it held and timed out doing
so. This usually indicates a bug or a network failure.
> After the dump the system shows "kernel panic" on console and requires
> a manual reboot.
Maybe you are getting a panic, but there is no evidence of that in what
you pasted below, just the watchdog timeout and its stack trace.
> Any tips and insight greatly appreciated.
Really. If at all possible, upgrade.
b.