[Lustre-discuss] MDT Panic

Timh Bergström timh.bergstrom at diino.net
Mon Nov 17 07:41:17 PST 2008


I solved it by mounting the MDT first, then the OSTs, so that they could
all reconnect, and then waiting out the recovery period before mounting
the clients.

I'm still curious as to what caused it, though.

//T

2008/11/17 Timh Bergström <timh.bergstrom at diino.net>:
> Hi,
>
> Last night one of our MDT servers panicked with the following errors:
>
> LustreError: 26312:0:(llog_server.c:433:llog_origin_handle_cancel())
> cancel 124 llog-records failed: -22
> Lustre: diinofs-OST000a-osc: Connection to service diinofs-OST000a via
> nid 10.4.22.213 at tcp was lost; in progress operations using this
> service will wait for recovery to complete.
> Lustre: 4588:0:(quota_master.c:1576:mds_quota_recovery()) Not all osts
> are active, abort quota recovery
> Lustre: diinofs-OST000a-osc: Connection restored to service
> diinofs-OST000a using nid 10.4.22.213 at tcp.
> Lustre: MDS diinofs-MDT0000: diinofs-OST000a_UUID now active, resetting orphans
> LustreError: 6823:0:(llog_server.c:433:llog_origin_handle_cancel())
> cancel 124 llog-records failed: -22
> Lustre: diinofs-OST000b-osc: Connection to service diinofs-OST000b via
> nid 10.4.22.213 at tcp was lost; in progress operations using this
> service will wait for recovery to complete.
> Lustre: diinofs-OST000b-osc: Connection restored to service
> diinofs-OST000b using nid 10.4.22.213 at tcp.
> Lustre: MDS diinofs-MDT0000: diinofs-OST000b_UUID now active, resetting orphans
> LustreError: 6822:0:(llog_server.c:433:llog_origin_handle_cancel())
> cancel 124 llog-records failed: -22
> process `syslogd' is using obsolete setsockopt SO_BSDCOMPAT
> LustreError: 6844:0:(mds_reint.c:1528:mds_orphan_add_link())
> ASSERTION(inode->i_nlink == 1) failed:dir nlink == 0
> LustreError: 6844:0:(mds_reint.c:1528:mds_orphan_add_link()) LBUG
> Lustre: 6844:0:(linux-debug.c:167:libcfs_debug_dumpstack()) showing
> stack for process 6844
> ll_mdt_07     R  running task       0  6844      1          6845  6843 (L-TLB)
>  0000003000000030 ffff810113b876b0 ffff810113b875d0 0000000000000001
>  0000000000000002 ffffffff802ecde9 3830326563646539 3338333033323635
>  ffff810008afb000 0000000000000000 0000000000000000 ffffffff8032875e
> Call Trace:
>  [<ffffffff8032875e>] scrup+0x60/0xc5
>  [<ffffffff803287df>] lf+0x1c/0x3a
>  [<ffffffff8032c51a>] vt_console_print+0x213/0x228
>  [<ffffffff8032c51a>] vt_console_print+0x213/0x228
>  [<ffffffff80228520>] printk+0x4e/0x56
>  [<ffffffff80228520>] printk+0x4e/0x56
>  [<ffffffff802431d5>] kallsyms_lookup+0xe7/0x1af
>  [<ffffffff802431d5>] kallsyms_lookup+0xe7/0x1af
>  [<ffffffff8020ac42>] printk_address+0x9f/0xac
>  [<ffffffff8032875e>] scrup+0x60/0xc5
>  [<ffffffff802403b7>] module_text_address+0x33/0x3b
>  [<ffffffff80238a51>] kernel_text_address+0x1a/0x26
>  [<ffffffff80238a51>] kernel_text_address+0x1a/0x26
>  [<ffffffff8020ae48>] show_trace+0x1f9/0x21f
>  [<ffffffff8020af67>] _show_stack+0xe2/0xf1
>  [<ffffffff88234b8a>] :libcfs:lbug_with_loc+0x7a/0xc0
>  [<ffffffff88535021>] :mds:mds_orphan_add_link+0x641/0x7e0
>  [<ffffffff884aa02d>] :ldiskfs:__ldiskfs_journal_stop+0x2d/0x60
>  [<ffffffff8029b1ea>] dnotify_parent+0x1c/0x6e
>  [<ffffffff80282ec5>] dput+0x23/0x152
>  [<ffffffff8853fae5>] :mds:mds_reint_unlink+0x1665/0x23b0
>  [<ffffffff8835608a>] :ptlrpc:ptlrpc_set_destroy+0x26a/0x2d0
>  [<ffffffff8835bbc3>] :ptlrpc:ptl_send_buf+0x3f3/0x5b0
>  [<ffffffff80282ec5>] dput+0x23/0x152
>  [<ffffffff88534909>] :mds:mds_reint_rec+0x1d9/0x2b0
>  [<ffffffff88559cfc>] :mds:mds_unlink_unpack+0x28c/0x3b0
>  [<ffffffff88558c8f>] :mds:mds_update_unpack+0x20f/0x2b0
>  [<ffffffff8832572f>] :ptlrpc:ldlm_reprocess_all+0x17f/0x210
>  [<ffffffff8852745a>] :mds:mds_reint+0x35a/0x420
>  [<ffffffff88364eb9>] :ptlrpc:lustre_pack_reply+0x29/0xb0
>  [<ffffffff8836028f>] :ptlrpc:lustre_msg_get_opc+0x4f/0x100
>  [<ffffffff8852dd2b>] :mds:mds_handle+0x247b/0x4c30
>  [<ffffffff802206c2>] find_busiest_group+0x20d/0x634
>  [<ffffffff882c8c29>] :obdclass:class_handle2object+0xd9/0x160
>  [<ffffffff8835f99f>] :ptlrpc:lustre_msg_get_conn_cnt+0x4f/0x100
>  [<ffffffff88369dea>] :ptlrpc:ptlrpc_check_req+0x1a/0x110
>  [<ffffffff8836bf39>] :ptlrpc:ptlrpc_server_handle_request+0x999/0x1030
>  [<ffffffff8020d490>] do_gettimeofday+0x50/0x94
>  [<ffffffff8823d326>] :libcfs:lcw_update_time+0x16/0x100
>  [<ffffffff8021f971>] __wake_up_common+0x3e/0x68
>  [<ffffffff8836ee1b>] :ptlrpc:ptlrpc_main+0xd0b/0xe80
>  [<ffffffff802213f2>] default_wake_function+0x0/0xe
>  [<ffffffff8020a6a4>] child_rip+0xa/0x12
>  [<ffffffff8836e110>] :ptlrpc:ptlrpc_main+0x0/0xe80
>  [<ffffffff8020a69a>] child_rip+0x0/0x12
>
> LustreError: dumping log to /tmp/lustre-log.1226867690.6844
> Lustre: 6855:0:(ldlm_lib.c:525:target_handle_reconnect())
> diinofs-MDT0000: ec04ce81-b4df-720d-47a0-ca676c240a24 reconnecting
> Lustre: 6855:0:(ldlm_lib.c:525:target_handle_reconnect()) Skipped 60
> previous similar messages
> Lustre: 6855:0:(ldlm_lib.c:760:target_handle_connect())
> diinofs-MDT0000: refuse reconnection from
> ec04ce81-b4df-720d-47a0-ca676c240a24 at 10.4.31.103@tcp to
> 0xffff81021a606000; still busy with 2 active RPCs
> Lustre: 6855:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 50
> previous similar messages
> LustreError: 6855:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@
> processing error (-16)  req at ffff8102214e5400 x8763679/t0
> o38->ec04ce81-b4df-720d-47a0-ca676c240a24 at NET_0x200000a041f67_UUID:0/0
> lens 304/200 e 0 to 0 dl 1226868090 ref 1 fl Interpret:/0/0 rc -16/0
> LustreError: 6855:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped
> 46 previous similar messages
> Lustre: 6864:0:(ldlm_lib.c:525:target_handle_reconnect())
> diinofs-MDT0000: bed5bd81-726a-ebda-b9c7-2dfe270f90c0 reconnecting
> Lustre: 6864:0:(ldlm_lib.c:525:target_handle_reconnect()) Skipped 13
> previous similar messages
> Lustre: 6864:0:(ldlm_lib.c:760:target_handle_connect())
> diinofs-MDT0000: refuse reconnection from
> bed5bd81-726a-ebda-b9c7-2dfe270f90c0 at 10.4.31.114@tcp to
> 0xffff8101192bc000; still busy with 3 active RPCs
> Lustre: 6864:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 13
> previous similar messages
> LustreError: 6864:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@
> processing error (-16)  req at ffff8101139edc00 x12195715/t0
> o38->bed5bd81-726a-ebda-b9c7-2dfe270f90c0 at NET_0x200000a041f72_UUID:0/0
> lens 304/200 e 0 to 0 dl 1226868167 ref 1 fl Interpret:/0/0 rc -16/0
> LustreError: 6864:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped
> 13 previous similar messages
>
> What does it mean? Now we can't reboot either of the MDTs without
> panics (the metadata is DRBD-synced). I see an LBUG error, but that
> shouldn't still panic the system after a complete cluster reboot(?).
> We are running Lustre 1.6.5.1 on Debian Etch with a vanilla 2.6.18
> kernel, and the clients are running patchless on Debian Etch 2.6.18 as
> well. The strange thing is that it had been working for about 8 months,
> but now we are suddenly getting a lot of small errors from both the
> OSSes and the MDTs. The panic is the most urgent thing to solve,
> though. Any ideas?
>
> --
> Regards,
> Timh Bergström
> Server Operations Manager
> Diino AB - www.diino.com
> :wq
>



-- 
Timh Bergström
System Operations Manager
Diino AB - www.diino.com
:wq
