[Lustre-discuss] How to read/understand a Lustre error

Andreas Dilger adilger at sun.com
Thu Mar 5 11:26:37 PST 2009


On Mar 04, 2009  12:11 -0500, Ms. Megan Larko wrote:
> I have a Lustre OSS with eleven (0-11) OSTs.  Every once in a while
> the OSS hosting the OSTs fails with a kernel panic.  The system runs
> CentOS 5.1 using Lustre kernel 2.6.18-53.1.13.el5_lustre.1.6.4.3smp.
> There are no error messages in the /var/log/messages file for March 4
> prior to the message printed below.   The last line in the
> /var/log/messages file was a routine stamp from March 2.
> 
> How do I understand the "lock callback timer expired" message below?
> After the dump the system shows "kernel panic" on console and requires
> a manual reboot.

Note that a real kernel panic will NOT be written into /var/log/messages,
EVEN if you have "remote syslog" configured, because at the time of a panic
or OOM the kernel cannot write to the filesystem or hand a message to the
userspace syslog daemon for forwarding over the network.

If you have a serial console attached to the OSS, or if you have "netconsole"
or "netdump" configured (special network consoles that can send messages
directly from the kernel using a low-level network interface), you can
capture the messages printed just before the panic on another node.

> Mar  4 09:43:47 oss4 kernel:  [<ffffffff80061bb1>] __mutex_lock_slowpath+0x55/0x90
> Mar  4 09:43:47 oss4 kernel:  [<ffffffff80061bf1>] .text.lock.mutex+0x5/0x14
> Mar  4 09:43:47 oss4 kernel:  [<ffffffff8002d201>] shrink_icache_memory+0x40/0x1e6
> Mar  4 09:43:47 oss4 kernel:  [<ffffffff8003e778>] shrink_slab+0xdc/0x153
> Mar  4 09:43:47 oss4 kernel:  [<ffffffff800c2cd7>] try_to_free_pages+0x189/0x275
> Mar  4 09:43:47 oss4 kernel:  [<ffffffff8000efd1>] __alloc_pages+0x1a8/0x2ab
> Mar  4 09:43:47 oss4 kernel:  [<ffffffff80017026>] cache_grow+0x137/0x395

This looks like a memory allocation problem/deadlock.  With 11 OSTs on
a single OSS that means (by default, if you didn't specify otherwise)
11 * 400MB = 4400MB = 4.4GB of memory JUST for the journals of those
filesystems.  If your OSS doesn't have at least that much RAM (preferably
about 16GB) it will not be usable under heavy load.
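
If you want to confirm what size the journals actually are, the journal on
an ldiskfs OST lives in inode 8, so a quick read-only check (with
/dev/{ostdev} standing in for each OST device, as in the commands further
down) is something like:

root> debugfs -R 'stat <8>' /dev/{ostdev} 2>/dev/null | grep -i size

A default 400MB journal shows up as Size: 419430400; after shrinking it to
128MB the same check would report 134217728.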

I would ask why you have so many OSTs on a single OSS.  Is this a system
where you need a lot of capacity, but not much performance?  Are the
OSTs smaller than the maximum possible size (8TB), and could they be merged
together?  Having fewer OSTs is usually best because it reduces per-OST
overhead like the journal, and also avoids free space fragmentation.
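
As a quick check before deciding whether merging makes sense, the per-OST
size and usage can be seen with lfs df (the commands below assume a mounted
Lustre client and the OST devices mounted as usual on the OSS):

root> lfs df -h     # on any Lustre client: size and usage of every OST
root> df -h         # on the OSS itself: the mounted OST devices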

If you really have 11x 8TB OSTs (or near that limit) and you want to
keep them all on an OSS without enough RAM, but don't expect peak
performance out of all OSTs at the same time, or have relatively few
clients, then I would suggest reducing the journal size to 128MB
(= 1408MB total) or 64MB (= 704MB total), which can be done on a cleanly
unmounted filesystem:

root> umount /dev/{ostdev}                    # take the OST offline cleanly
root> e2fsck /dev/{ostdev}                    # verify the filesystem is clean
root> tune2fs -O ^has_journal /dev/{ostdev}   # remove the existing 400MB journal
root> tune2fs -j -J size=128 /dev/{ostdev}    # recreate it at 128MB (or "-J size=64")
root> mount /dev/{ostdev}                     # remount the OST (assumes an fstab entry)

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



