[Lustre-discuss] How to read/understand a Lustre error

Ms. Megan Larko dobsonunit at gmail.com
Wed Mar 4 09:11:50 PST 2009


Greetings,

I have a Lustre OSS with twelve (0-11) OSTs.  Every once in a while
the OSS hosting the OSTs fails with a kernel panic.  The system runs
CentOS 5.1 using Lustre kernel 2.6.18-53.1.13.el5_lustre.1.6.4.3smp.
There are no error messages in the /var/log/messages file for March 4
prior to the message printed below.   The last line in the
/var/log/messages file was a routine stamp from March 2.

How do I understand the "lock callback timer expired" message below?
After the dump the system shows "kernel panic" on console and requires
a manual reboot.

Any tips and insight greatly appreciated.

megan

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Mar  4 09:42:57 oss4 kernel: LustreError:
0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback
timer expired: evicting client
9e1d3bc1-201b-4d0b-cc1a-2d52d619c937@NET_0x50000c0a840d6_UUID nid
192.168.64.214@o2ib  ns: filter-crew8-OST0004_UUID lock:
ffff81039e9bdd80/0x99e7393d0850f39f lrc: 2/0,0 mode: PR/PR res:
1267155/0 rrc: 3 type: EXT [0->18446744073709551615] (req
0->18446744073709551615) flags: 20 remote: 0x8a3b31c963e264e8 expref:
830 pid: 4250
Mar  4 09:42:59 oss4 kernel: LustreError:
4989:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107)  req@ffff8102fdae7400 x4386790/t0 o101-><?>@<?>:-1 lens 232/0
ref 0 fl Interpret:/0/0 rc -107/0
Mar  4 09:42:59 oss4 kernel: LustreError:
4989:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 3 previous
similar messages
Mar  4 09:43:47 oss4 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb())
Watchdog triggered for pid 4206: it was inactive for 100s
Mar  4 09:43:47 oss4 kernel: Lustre:
0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for
process 4206
Mar  4 09:43:47 oss4 kernel: ll_ost_66     D 0000000000000580     0
4206      1          4207  4205 (L-TLB)
Mar  4 09:43:47 oss4 kernel:  ffff810424c99618 0000000000000046
0000000000000001 0000000000000080
Mar  4 09:43:47 oss4 kernel:  000000000000000a ffff810425134860
ffffffff802dcae0 0003a25c7b1d51cd
Mar  4 09:43:47 oss4 kernel:  000000000000052d ffff810425134a48
ffffffff00000000 ffff81042ed6f7e0
Mar  4 09:43:47 oss4 kernel: Call Trace:
Mar  4 09:43:47 oss4 kernel:  [<ffffffff80061bb1>]
__mutex_lock_slowpath+0x55/0x90
Mar  4 09:43:47 oss4 kernel:  [<ffffffff80061bf1>] .text.lock.mutex+0x5/0x14
Mar  4 09:43:47 oss4 kernel:  [<ffffffff8002d201>]
shrink_icache_memory+0x40/0x1e6
Mar  4 09:43:47 oss4 kernel:  [<ffffffff8003e778>] shrink_slab+0xdc/0x153
Mar  4 09:43:47 oss4 kernel:  [<ffffffff800c2cd7>] try_to_free_pages+0x189/0x275
Mar  4 09:43:47 oss4 kernel:  [<ffffffff8000efd1>] __alloc_pages+0x1a8/0x2ab
Mar  4 09:43:47 oss4 kernel:  [<ffffffff80017026>] cache_grow+0x137/0x395
(...etc to end of kernel panic dump)

Mar  4 09:43:48 oss4 kernel: LustreError: dumping log to
/tmp/lustre-log.1236177827.4206
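
For anyone reading along, here is a rough sketch of how the fields in the messages above could be pulled apart. It is only an illustration: the helper names (parse_eviction, parse_dump_name) are made up for this example, the regular expression is fitted to this one message rather than to every Lustre release, and I am assuming the dump file name has the form lustre-log.<epoch seconds>.<pid>. The pattern accepts either "@" or the " at " that the list archive substitutes for it.

```python
import errno
import re
from datetime import datetime, timezone

# The eviction message, re-joined onto one line (the mail wrapped it),
# with "@" written out in the client and NID fields.
EVICTION = (
    "LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock "
    "callback timer expired: evicting client "
    "9e1d3bc1-201b-4d0b-cc1a-2d52d619c937@NET_0x50000c0a840d6_UUID nid "
    "192.168.64.214@o2ib  ns: filter-crew8-OST0004_UUID lock: "
    "ffff81039e9bdd80/0x99e7393d0850f39f lrc: 2/0,0 mode: PR/PR res: "
    "1267155/0 rrc: 3 type: EXT [0->18446744073709551615] (req "
    "0->18446744073709551615) flags: 20 remote: 0x8a3b31c963e264e8 expref: "
    "830 pid: 4250"
)

# Hypothetical pattern, written against the message above only.
EVICT_RE = re.compile(
    r"evicting client (?P<client>\S+?)(?:@| at )(?P<net>\S+) "
    r"nid (?P<nid>\S+?)(?:@| at )(?P<lnd>\S+)\s+"
    r"ns: (?P<ns>\S+) .*?"
    r"mode: (?P<mode>\S+) res: (?P<res>\S+) .*?"
    r"type: (?P<type>\S+) \[(?P<start>\d+)->(?P<end>\d+)\]"
)


def parse_eviction(message):
    """Return the client, NID, namespace and lock extent from an eviction line."""
    m = EVICT_RE.search(message)
    if m is None:
        raise ValueError("not an eviction message in the expected form")
    info = m.groupdict()
    info["start"] = int(info["start"])
    info["end"] = int(info["end"])
    # An extent of [0, 2**64 - 1] covers the whole object.
    info["whole_object"] = info["start"] == 0 and info["end"] == 2**64 - 1
    return info


def parse_dump_name(path):
    """Split a name like /tmp/lustre-log.<epoch>.<pid> into (UTC time, pid)."""
    m = re.fullmatch(r".*lustre-log\.(\d+)\.(\d+)", path)
    if m is None:
        raise ValueError("unexpected dump file name")
    when = datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
    return when, int(m.group(2))


if __name__ == "__main__":
    info = parse_eviction(EVICTION)
    print(info["nid"], info["lnd"], info["ns"], info["mode"], info["whole_object"])
    when, pid = parse_dump_name("/tmp/lustre-log.1236177827.4206")
    print(when.isoformat(), pid)
    # The "-107" in the second message is a negated errno value. The errno
    # table is platform-specific; on Linux 107 is ENOTCONN.
    print(errno.errorcode.get(107, "unknown on this platform"))
```

Read this way, the first message names the client that was evicted (NID 192.168.64.214 on o2ib), the OST namespace involved (crew8-OST0004), and a PR-mode extent lock spanning the whole object. The dump file name decodes to pid 4206, the same pid the watchdog reported.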

A "lctl dl" after rebooting the computer:
[root@oss4 log]# lctl
lctl > dl
  0 UP mgc MGC192.168.64.210@o2ib 5df96fa8-528f-de53-c2e1-d4db598b057d 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter crew8-OST0000 crew8-OST0000_UUID 11
  3 UP obdfilter crew8-OST0001 crew8-OST0001_UUID 11
  4 UP obdfilter crew8-OST0002 crew8-OST0002_UUID 11
  5 UP obdfilter crew8-OST0003 crew8-OST0003_UUID 11
  6 UP obdfilter crew8-OST0004 crew8-OST0004_UUID 11
  7 UP obdfilter crew8-OST0005 crew8-OST0005_UUID 11
  8 UP obdfilter crew8-OST0006 crew8-OST0006_UUID 11
  9 UP obdfilter crew8-OST0007 crew8-OST0007_UUID 11
 10 UP obdfilter crew8-OST0008 crew8-OST0008_UUID 11
 11 UP obdfilter crew8-OST0009 crew8-OST0009_UUID 11
 12 UP obdfilter crew8-OST000a crew8-OST000a_UUID 11
 13 UP obdfilter crew8-OST000b crew8-OST000b_UUID 11

The computer comes up normally without errors.
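
In case it helps anyone checking a similar listing, below is a small sketch that tallies the "lctl dl" output above. It assumes the columns are device index, status, type, name, UUID and reference count; parse_lctl_dl is a name invented for this example, not part of Lustre. The name field is joined from whatever sits between the type and the UUID, so the mgc line parses whether its NID is written with "@" or with the archive's " at ".

```python
# The device list exactly as pasted above (mgc NID written with "@").
LCTL_DL = """\
  0 UP mgc MGC192.168.64.210@o2ib 5df96fa8-528f-de53-c2e1-d4db598b057d 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter crew8-OST0000 crew8-OST0000_UUID 11
  3 UP obdfilter crew8-OST0001 crew8-OST0001_UUID 11
  4 UP obdfilter crew8-OST0002 crew8-OST0002_UUID 11
  5 UP obdfilter crew8-OST0003 crew8-OST0003_UUID 11
  6 UP obdfilter crew8-OST0004 crew8-OST0004_UUID 11
  7 UP obdfilter crew8-OST0005 crew8-OST0005_UUID 11
  8 UP obdfilter crew8-OST0006 crew8-OST0006_UUID 11
  9 UP obdfilter crew8-OST0007 crew8-OST0007_UUID 11
 10 UP obdfilter crew8-OST0008 crew8-OST0008_UUID 11
 11 UP obdfilter crew8-OST0009 crew8-OST0009_UUID 11
 12 UP obdfilter crew8-OST000a crew8-OST000a_UUID 11
 13 UP obdfilter crew8-OST000b crew8-OST000b_UUID 11
"""


def parse_lctl_dl(text):
    """Turn 'lctl dl' lines into dicts; lines that do not fit are skipped."""
    devices = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 6 or not parts[0].isdigit():
            continue
        devices.append({
            "index": int(parts[0]),
            "status": parts[1],
            "type": parts[2],
            # Everything between the type and the UUID is the device name.
            "name": " ".join(parts[3:-2]),
            "uuid": parts[-2],
            "refcount": int(parts[-1]),
        })
    return devices


devices = parse_lctl_dl(LCTL_DL)
osts = [d for d in devices if d["type"] == "obdfilter"]
all_up = all(d["status"] == "UP" for d in osts)
print(len(osts), "OSTs,", "all UP" if all_up else "some not UP")
# prints: 12 OSTs, all UP
```

On the listing above this counts twelve obdfilter devices (OST0000 through OST000b), all in the UP state.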


