[Lustre-discuss] How to read/understand a Lustre error
Ms. Megan Larko
dobsonunit at gmail.com
Wed Mar 4 09:11:50 PST 2009
Greetings,
I have a Lustre OSS with twelve (0-11) OSTs. Every once in a while
the OSS hosting the OSTs fails with a kernel panic. The system runs
CentOS 5.1 using Lustre kernel 2.6.18-53.1.13.el5_lustre.1.6.4.3smp.
There are no error messages in the /var/log/messages file for March 4
prior to the message printed below. The last line in the
/var/log/messages file was a routine stamp from March 2.
How do I understand the "lock callback timer expired" message below?
After the dump the system shows "kernel panic" on console and requires
a manual reboot.
Any tips and insight would be greatly appreciated.
megan
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Mar 4 09:42:57 oss4 kernel: LustreError:
0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback
timer expired: evicting client
9e1d3bc1-201b-4d0b-cc1a-2d52d619c937@NET_0x50000c0a840d6_UUID nid
192.168.64.214@o2ib ns: filter-crew8-OST0004_UUID lock:
ffff81039e9bdd80/0x99e7393d0850f39f lrc: 2/0,0 mode: PR/PR res:
1267155/0 rrc: 3 type: EXT [0->18446744073709551615] (req
0->18446744073709551615) flags: 20 remote: 0x8a3b31c963e264e8 expref:
830 pid: 4250
Mar 4 09:42:59 oss4 kernel: LustreError:
4989:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107) req@ffff8102fdae7400 x4386790/t0 o101-><?>@<?>:-1 lens 232/0
ref 0 fl Interpret:/0/0 rc -107/0
Mar 4 09:42:59 oss4 kernel: LustreError:
4989:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 3 previous
similar messages
Mar 4 09:43:47 oss4 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb())
Watchdog triggered for pid 4206: it was inactive for 100s
Mar 4 09:43:47 oss4 kernel: Lustre:
0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for
process 4206
Mar 4 09:43:47 oss4 kernel: ll_ost_66 D 0000000000000580 0
4206 1 4207 4205 (L-TLB)
Mar 4 09:43:47 oss4 kernel: ffff810424c99618 0000000000000046
0000000000000001 0000000000000080
Mar 4 09:43:47 oss4 kernel: 000000000000000a ffff810425134860
ffffffff802dcae0 0003a25c7b1d51cd
Mar 4 09:43:47 oss4 kernel: 000000000000052d ffff810425134a48
ffffffff00000000 ffff81042ed6f7e0
Mar 4 09:43:47 oss4 kernel: Call Trace:
Mar 4 09:43:47 oss4 kernel: [<ffffffff80061bb1>]
__mutex_lock_slowpath+0x55/0x90
Mar 4 09:43:47 oss4 kernel: [<ffffffff80061bf1>] .text.lock.mutex+0x5/0x14
Mar 4 09:43:47 oss4 kernel: [<ffffffff8002d201>]
shrink_icache_memory+0x40/0x1e6
Mar 4 09:43:47 oss4 kernel: [<ffffffff8003e778>] shrink_slab+0xdc/0x153
Mar 4 09:43:47 oss4 kernel: [<ffffffff800c2cd7>] try_to_free_pages+0x189/0x275
Mar 4 09:43:47 oss4 kernel: [<ffffffff8000efd1>] __alloc_pages+0x1a8/0x2ab
Mar 4 09:43:47 oss4 kernel: [<ffffffff80017026>] cache_grow+0x137/0x395
(...etc to end of kernel panic dump)
Mar 4 09:43:48 oss4 kernel: LustreError: dumping log to
/tmp/lustre-log.1236177827.4206
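For anyone reading along, here is a minimal sketch of how the fields in the messages above might be pulled apart. It is not a Lustre tool: the regular expression and the helper names (parse_eviction, parse_dump_name) are illustrative assumptions based only on the lines quoted here. It assumes the e-mail line wrapping has been undone, and it accepts either "@" or the archive's " at " spelling. The lustre-log.<epoch>.<pid> reading of the dump name is inferred from this one log, where the number decodes to the watchdog time and the suffix matches pid 4206.

```python
import re
from datetime import datetime, timezone

# The eviction message from the log above, re-joined onto a single line.
EVICTION = (
    "LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock "
    "callback timer expired: evicting client "
    "9e1d3bc1-201b-4d0b-cc1a-2d52d619c937 at NET_0x50000c0a840d6_UUID nid "
    "192.168.64.214 at o2ib ns: filter-crew8-OST0004_UUID lock: "
    "ffff81039e9bdd80/0x99e7393d0850f39f lrc: 2/0,0 mode: PR/PR res: "
    "1267155/0 rrc: 3 type: EXT [0->18446744073709551615] (req "
    "0->18446744073709551615) flags: 20 remote: 0x8a3b31c963e264e8 expref: "
    "830 pid: 4250"
)

# Hypothetical pattern written for this one message layout only.
EVICTION_RE = re.compile(
    r"evicting client (?P<client>\S+?)(?:@| at )\S+ "
    r"nid (?P<nid>\S+?)(?:@| at )(?P<net>\S+) "
    r"ns: (?P<ns>\S+) .*?"
    r"mode: (?P<mode>\S+) res: (?P<res>\S+) .*?"
    r"type: (?P<type>\S+) \[(?P<start>\d+)->(?P<end>\d+)\]"
)


def parse_eviction(line):
    """Return the main fields of a 'lock callback timer expired' message."""
    m = EVICTION_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    # An extent of [0 -> 2**64 - 1] spans every possible byte offset.
    fields["whole_object"] = (
        int(fields["start"]) == 0 and int(fields["end"]) == 2**64 - 1
    )
    return fields


def parse_dump_name(path):
    """Split a dump name, assuming the form lustre-log.<epoch>.<pid>."""
    m = re.search(r"lustre-log\.(\d+)\.(\d+)$", path)
    if m is None:
        return None
    when = datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
    return when, int(m.group(2))


if __name__ == "__main__":
    print(parse_eviction(EVICTION))
    print(parse_dump_name("/tmp/lustre-log.1236177827.4206"))
```

Read this way, the first message says that the client at nid 192.168.64.214 on o2ib was evicted while holding a PR lock over the full extent of resource 1267155 on crew8-OST0004. On Linux, the -107 in the following message is errno 107, ENOTCONN.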
A "lctl dl" after rebooting the computer:
[root@oss4 log]# lctl
lctl > dl
0 UP mgc MGC192.168.64.210@o2ib 5df96fa8-528f-de53-c2e1-d4db598b057d 5
1 UP ost OSS OSS_uuid 3
2 UP obdfilter crew8-OST0000 crew8-OST0000_UUID 11
3 UP obdfilter crew8-OST0001 crew8-OST0001_UUID 11
4 UP obdfilter crew8-OST0002 crew8-OST0002_UUID 11
5 UP obdfilter crew8-OST0003 crew8-OST0003_UUID 11
6 UP obdfilter crew8-OST0004 crew8-OST0004_UUID 11
7 UP obdfilter crew8-OST0005 crew8-OST0005_UUID 11
8 UP obdfilter crew8-OST0006 crew8-OST0006_UUID 11
9 UP obdfilter crew8-OST0007 crew8-OST0007_UUID 11
10 UP obdfilter crew8-OST0008 crew8-OST0008_UUID 11
11 UP obdfilter crew8-OST0009 crew8-OST0009_UUID 11
12 UP obdfilter crew8-OST000a crew8-OST000a_UUID 11
13 UP obdfilter crew8-OST000b crew8-OST000b_UUID 11
The computer comes up normally without errors.
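As a quick cross-check of the device listing above, a small sketch that counts the obdfilter rows. The helper name ost_names and the column reading in the comment are assumptions drawn from how the rows appear here, not from lctl documentation. The OST suffix is hexadecimal, so OST000a and OST000b are indices 10 and 11.

```python
# Device listing re-typed from the "lctl dl" output above.
LCTL_DL = """\
0 UP mgc MGC192.168.64.210@o2ib 5df96fa8-528f-de53-c2e1-d4db598b057d 5
1 UP ost OSS OSS_uuid 3
2 UP obdfilter crew8-OST0000 crew8-OST0000_UUID 11
3 UP obdfilter crew8-OST0001 crew8-OST0001_UUID 11
4 UP obdfilter crew8-OST0002 crew8-OST0002_UUID 11
5 UP obdfilter crew8-OST0003 crew8-OST0003_UUID 11
6 UP obdfilter crew8-OST0004 crew8-OST0004_UUID 11
7 UP obdfilter crew8-OST0005 crew8-OST0005_UUID 11
8 UP obdfilter crew8-OST0006 crew8-OST0006_UUID 11
9 UP obdfilter crew8-OST0007 crew8-OST0007_UUID 11
10 UP obdfilter crew8-OST0008 crew8-OST0008_UUID 11
11 UP obdfilter crew8-OST0009 crew8-OST0009_UUID 11
12 UP obdfilter crew8-OST000a crew8-OST000a_UUID 11
13 UP obdfilter crew8-OST000b crew8-OST000b_UUID 11
"""


def ost_names(listing):
    """Return the names of obdfilter devices that are reported UP."""
    names = []
    for row in listing.splitlines():
        cols = row.split()
        # Columns as they appear above: index, state, type, name, uuid, count.
        if len(cols) >= 4 and cols[1] == "UP" and cols[2] == "obdfilter":
            names.append(cols[3])
    return names


if __name__ == "__main__":
    names = ost_names(LCTL_DL)
    # Convert the hexadecimal suffix of each name to a decimal OST index.
    indices = [int(name.rsplit("OST", 1)[1], 16) for name in names]
    print(len(names), "OSTs up, indices", indices)
```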