[Lustre-discuss] MDT Panic

Timh Bergström timh.bergstrom at diino.net
Sun Nov 16 23:28:24 PST 2008


Hi,

Last night one of our MDT servers panicked with the following errors:

LustreError: 26312:0:(llog_server.c:433:llog_origin_handle_cancel())
cancel 124 llog-records failed: -22
Lustre: diinofs-OST000a-osc: Connection to service diinofs-OST000a via
nid 10.4.22.213@tcp was lost; in progress operations using this
service will wait for recovery to complete.
Lustre: 4588:0:(quota_master.c:1576:mds_quota_recovery()) Not all osts
are active, abort quota recovery
Lustre: diinofs-OST000a-osc: Connection restored to service
diinofs-OST000a using nid 10.4.22.213@tcp.
Lustre: MDS diinofs-MDT0000: diinofs-OST000a_UUID now active, resetting orphans
LustreError: 6823:0:(llog_server.c:433:llog_origin_handle_cancel())
cancel 124 llog-records failed: -22
Lustre: diinofs-OST000b-osc: Connection to service diinofs-OST000b via
nid 10.4.22.213@tcp was lost; in progress operations using this
service will wait for recovery to complete.
Lustre: diinofs-OST000b-osc: Connection restored to service
diinofs-OST000b using nid 10.4.22.213@tcp.
Lustre: MDS diinofs-MDT0000: diinofs-OST000b_UUID now active, resetting orphans
LustreError: 6822:0:(llog_server.c:433:llog_origin_handle_cancel())
cancel 124 llog-records failed: -22
process `syslogd' is using obsolete setsockopt SO_BSDCOMPAT
LustreError: 6844:0:(mds_reint.c:1528:mds_orphan_add_link())
ASSERTION(inode->i_nlink == 1) failed:dir nlink == 0
LustreError: 6844:0:(mds_reint.c:1528:mds_orphan_add_link()) LBUG
Lustre: 6844:0:(linux-debug.c:167:libcfs_debug_dumpstack()) showing
stack for process 6844
ll_mdt_07     R  running task       0  6844      1          6845  6843 (L-TLB)
 0000003000000030 ffff810113b876b0 ffff810113b875d0 0000000000000001
 0000000000000002 ffffffff802ecde9 3830326563646539 3338333033323635
 ffff810008afb000 0000000000000000 0000000000000000 ffffffff8032875e
Call Trace:
 [<ffffffff8032875e>] scrup+0x60/0xc5
 [<ffffffff803287df>] lf+0x1c/0x3a
 [<ffffffff8032c51a>] vt_console_print+0x213/0x228
 [<ffffffff8032c51a>] vt_console_print+0x213/0x228
 [<ffffffff80228520>] printk+0x4e/0x56
 [<ffffffff80228520>] printk+0x4e/0x56
 [<ffffffff802431d5>] kallsyms_lookup+0xe7/0x1af
 [<ffffffff802431d5>] kallsyms_lookup+0xe7/0x1af
 [<ffffffff8020ac42>] printk_address+0x9f/0xac
 [<ffffffff8032875e>] scrup+0x60/0xc5
 [<ffffffff802403b7>] module_text_address+0x33/0x3b
 [<ffffffff80238a51>] kernel_text_address+0x1a/0x26
 [<ffffffff80238a51>] kernel_text_address+0x1a/0x26
 [<ffffffff8020ae48>] show_trace+0x1f9/0x21f
 [<ffffffff8020af67>] _show_stack+0xe2/0xf1
 [<ffffffff88234b8a>] :libcfs:lbug_with_loc+0x7a/0xc0
 [<ffffffff88535021>] :mds:mds_orphan_add_link+0x641/0x7e0
 [<ffffffff884aa02d>] :ldiskfs:__ldiskfs_journal_stop+0x2d/0x60
 [<ffffffff8029b1ea>] dnotify_parent+0x1c/0x6e
 [<ffffffff80282ec5>] dput+0x23/0x152
 [<ffffffff8853fae5>] :mds:mds_reint_unlink+0x1665/0x23b0
 [<ffffffff8835608a>] :ptlrpc:ptlrpc_set_destroy+0x26a/0x2d0
 [<ffffffff8835bbc3>] :ptlrpc:ptl_send_buf+0x3f3/0x5b0
 [<ffffffff80282ec5>] dput+0x23/0x152
 [<ffffffff88534909>] :mds:mds_reint_rec+0x1d9/0x2b0
 [<ffffffff88559cfc>] :mds:mds_unlink_unpack+0x28c/0x3b0
 [<ffffffff88558c8f>] :mds:mds_update_unpack+0x20f/0x2b0
 [<ffffffff8832572f>] :ptlrpc:ldlm_reprocess_all+0x17f/0x210
 [<ffffffff8852745a>] :mds:mds_reint+0x35a/0x420
 [<ffffffff88364eb9>] :ptlrpc:lustre_pack_reply+0x29/0xb0
 [<ffffffff8836028f>] :ptlrpc:lustre_msg_get_opc+0x4f/0x100
 [<ffffffff8852dd2b>] :mds:mds_handle+0x247b/0x4c30
 [<ffffffff802206c2>] find_busiest_group+0x20d/0x634
 [<ffffffff882c8c29>] :obdclass:class_handle2object+0xd9/0x160
 [<ffffffff8835f99f>] :ptlrpc:lustre_msg_get_conn_cnt+0x4f/0x100
 [<ffffffff88369dea>] :ptlrpc:ptlrpc_check_req+0x1a/0x110
 [<ffffffff8836bf39>] :ptlrpc:ptlrpc_server_handle_request+0x999/0x1030
 [<ffffffff8020d490>] do_gettimeofday+0x50/0x94
 [<ffffffff8823d326>] :libcfs:lcw_update_time+0x16/0x100
 [<ffffffff8021f971>] __wake_up_common+0x3e/0x68
 [<ffffffff8836ee1b>] :ptlrpc:ptlrpc_main+0xd0b/0xe80
 [<ffffffff802213f2>] default_wake_function+0x0/0xe
 [<ffffffff8020a6a4>] child_rip+0xa/0x12
 [<ffffffff8836e110>] :ptlrpc:ptlrpc_main+0x0/0xe80
 [<ffffffff8020a69a>] child_rip+0x0/0x12

LustreError: dumping log to /tmp/lustre-log.1226867690.6844
Lustre: 6855:0:(ldlm_lib.c:525:target_handle_reconnect())
diinofs-MDT0000: ec04ce81-b4df-720d-47a0-ca676c240a24 reconnecting
Lustre: 6855:0:(ldlm_lib.c:525:target_handle_reconnect()) Skipped 60
previous similar messages
Lustre: 6855:0:(ldlm_lib.c:760:target_handle_connect())
diinofs-MDT0000: refuse reconnection from
ec04ce81-b4df-720d-47a0-ca676c240a24@10.4.31.103@tcp to
0xffff81021a606000; still busy with 2 active RPCs
Lustre: 6855:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 50
previous similar messages
LustreError: 6855:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@
processing error (-16)  req@ffff8102214e5400 x8763679/t0
o38->ec04ce81-b4df-720d-47a0-ca676c240a24@NET_0x200000a041f67_UUID:0/0
lens 304/200 e 0 to 0 dl 1226868090 ref 1 fl Interpret:/0/0 rc -16/0
LustreError: 6855:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped
46 previous similar messages
Lustre: 6864:0:(ldlm_lib.c:525:target_handle_reconnect())
diinofs-MDT0000: bed5bd81-726a-ebda-b9c7-2dfe270f90c0 reconnecting
Lustre: 6864:0:(ldlm_lib.c:525:target_handle_reconnect()) Skipped 13
previous similar messages
Lustre: 6864:0:(ldlm_lib.c:760:target_handle_connect())
diinofs-MDT0000: refuse reconnection from
bed5bd81-726a-ebda-b9c7-2dfe270f90c0@10.4.31.114@tcp to
0xffff8101192bc000; still busy with 3 active RPCs
Lustre: 6864:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 13
previous similar messages
LustreError: 6864:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@
processing error (-16)  req@ffff8101139edc00 x12195715/t0
o38->bed5bd81-726a-ebda-b9c7-2dfe270f90c0@NET_0x200000a041f72_UUID:0/0
lens 304/200 e 0 to 0 dl 1226868167 ref 1 fl Interpret:/0/0 rc -16/0
LustreError: 6864:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped
13 previous similar messages

What does this mean? We now cannot reboot either of the MDTs without
hitting the panic again (the metadata is DRBD-synced between them). I
see an "LBUG" error, but that shouldn't keep panicking the system
after a complete cluster reboot(?). We are running Lustre 1.6.5.1 on
Debian Etch with a vanilla 2.6.18 kernel, and the clients run
patchless on Debian Etch 2.6.18 as well. The strange thing is that
this setup has worked for about eight months, but now we suddenly see
a lot of small errors from both the OSSes and the MDTs. The panic is
the most urgent thing to solve, though. Any ideas?
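
For anyone digging into the log above: the negative return codes are
plain kernel errnos, and the short Python sketch below decodes them
and also peeks at the libcfs panic-on-LBUG tunable. The
/proc/sys/lnet/panic_on_lbug path is an assumption on my part for a
1.6-era server; it may be a module parameter or live elsewhere on
other builds.

#!/usr/bin/env python
# Decode the negative return codes seen above (-22, -16) into errno
# names, then report the panic-on-LBUG setting if the tunable exists.
# NOTE: the /proc path below is an assumption for Lustre 1.6-era servers.
import errno
import os

for rc in (-22, -16):
    name = errno.errorcode.get(-rc, "unknown")
    print("rc %d = %s (%s)" % (rc, name, os.strerror(-rc)))

PANIC_ON_LBUG = "/proc/sys/lnet/panic_on_lbug"  # assumed location
if os.path.exists(PANIC_ON_LBUG):
    f = open(PANIC_ON_LBUG)
    print("panic_on_lbug = %s" % f.read().strip())
    f.close()
else:
    print("panic_on_lbug tunable not found at %s" % PANIC_ON_LBUG)

That at least confirms the -22s are EINVAL (the llog cancel failures)
and the -16s are EBUSY, i.e. the MDT refusing reconnects while it
still has RPCs in flight.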

-- 
Regards,
Timh Bergström
Server Operations Manager
Diino AB - www.diino.com
:wq


