[Lustre-discuss] slow journal/commitrw on OSTs lead to crash

Hendelman, Rob Rob.Hendelman at magnetar.com
Wed Apr 8 11:24:39 PDT 2009


I forgot to mention that about an hour after this started, we got the
following:
================
Apr  7 18:49:06 maglustre04 kernel: Lustre:
6483:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0000:
fs01-mdtlov_UUID reconnecting
Apr  7 18:49:06 maglustre04 kernel: Lustre:
6483:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 111 previous
similar messages
Apr  7 18:49:26 maglustre04 kernel: Lustre:
5075:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request
x1039597 took longer than estimated (50+75s); client may timeout
.  req at ffff8104bb35e400 x1039597/t0 o5->fs01-mdtlov_UUID@:0/0 lens
336/336 e 0 to 0 dl 1239148091 ref 1 fl Complete:/0/0 rc 0/0
Apr  7 18:51:58 maglustre04 kernel: Lustre:
5075:0:(service.c:1317:ptlrpc_server_handle_request()) Skipped 12
previous similar messages
Apr  7 18:51:58 maglustre04 kernel: Lustre: fs01-OST0006: received MDS
connection from 10.5.10.11 at tcp
Apr  7 18:51:58 maglustre04 kernel: Lustre: Skipped 10 previous similar
messages
Apr  7 18:51:58 maglustre04 kernel: LustreError:
5146:0:(lustre_fsfilt.h:229:fsfilt_start_log()) fs01-OST0008: slow
journal start 135s
Apr  7 18:51:58 maglustre04 kernel: Lustre:
462:0:(watchdog.c:148:lcw_cb()) Watchdog triggered for pid 5024: it was
inactive for 200s
Apr  7 18:51:58 maglustre04 kernel: Lustre:
462:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for
process 5024
Apr  7 18:51:58 maglustre04 kernel: ll_ost_02     D ffff8102687a20c0
0  5024      1          5025  5023 (L-TLB)
Apr  7 18:51:58 maglustre04 kernel:  ffff81027a69f5a8 0000000000000046
0000000000000000 00000000ffffffff
Apr  7 18:51:58 maglustre04 kernel:  0000000000000000 0000000000000001
ffff8102687a20c0 ffff8105272907a0
Apr  7 18:51:58 maglustre04 kernel:  0004a4ba546cd448 0000000000009669
ffff8102687a22a8 000000008882434d
Apr  7 18:51:58 maglustre04 kernel: Call Trace:
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff80046dac>]
sprintf+0x51/0x59
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff80063be0>]
__mutex_lock_slowpath+0x60/0x9b
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff80063c20>]
.text.lock.mutex+0x5/0x14
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff887fe756>]
:obdfilter:filter_parent_lock+0x36/0x220
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff8009ddc3>]
autoremove_wake_function+0x9/0x2e
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff800891f6>]
__wake_up_common+0x3e/0x68
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff888026cf>]
:obdfilter:filter_fid2dentry+0x2cf/0x730
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff88652b72>]
:lquota:filter_quota_adjust+0x172/0x2a0
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff8000d046>]
dput+0x23/0x10a
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff88809fe8>]
:obdfilter:filter_destroy+0x148/0x1dd0
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff88546520>]
:ptlrpc:ldlm_blocking_ast+0x0/0x2a0
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff88549db0>]
:ptlrpc:ldlm_completion_ast+0x0/0x7c0
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff800646b6>]
__down_failed_trylock+0x35/0x3a
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff8880c7ce>]
:obdfilter:filter_create+0xb5e/0x1530
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff8856f004>]
:ptlrpc:lustre_msg_set_timeout+0x34/0x110
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff88572f09>]
:ptlrpc:lustre_pack_reply+0x29/0xb0
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff887cd01f>]
:ost:ost_handle+0x136f/0x5cd0
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff80143809>]
__next_cpu+0x19/0x28
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff80143809>]
__next_cpu+0x19/0x28
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff800898e3>]
find_busiest_group+0x20d/0x621
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff8856d5a5>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff88575cfa>]
:ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff88577b7d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff8857a103>]
:ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff80062f4b>]
thread_return+0x0/0xdf
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff8006d8a2>]
do_gettimeofday+0x40/0x8f
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff8841d7c6>]
:libcfs:lcw_update_time+0x16/0x100
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff800891f6>]
__wake_up_common+0x3e/0x68
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff8857d5f8>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff8008abb9>]
default_wake_function+0x0/0xe
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff800b4382>]
audit_syscall_exit+0x31b/0x336
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff8005dfb1>]
child_rip+0xa/0x11
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff8857c3e0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Apr  7 18:51:58 maglustre04 kernel:  [<ffffffff8005dfa7>]
child_rip+0x0/0x11
Apr  7 18:51:58 maglustre04 kernel:
==================

Are these due to watchdog timeouts?
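
In case it helps anyone line things up, here is a rough Python sketch
that pulls the "slow journal start Ns" delays and the watchdog
inactivity times out of syslog so they can be compared.  The log path
and regexes are just guesses based on the messages quoted above, not a
tested tool:
================
#!/usr/bin/env python
# Rough sketch: pull "slow journal start" delays and watchdog triggers
# out of an OSS syslog.  Path and regexes are assumptions based on the
# messages quoted above.
import re
import sys

SLOW_RE = re.compile(r'slow journal start (\d+)s')
WATCHDOG_RE = re.compile(r'Watchdog triggered for pid (\d+): '
                         r'it was\s+inactive for (\d+)s')

def scan(path):
    slow, dogs = [], []
    with open(path) as f:
        for line in f:
            m = SLOW_RE.search(line)
            if m:
                slow.append(int(m.group(1)))
            m = WATCHDOG_RE.search(line)
            if m:
                dogs.append((int(m.group(1)), int(m.group(2))))
    print('slow journal starts: %d (max %ss)' %
          (len(slow), max(slow) if slow else 0))
    for pid, secs in dogs:
        print('watchdog fired for pid %d after %ds inactive' % (pid, secs))

if __name__ == '__main__':
    scan(sys.argv[1] if len(sys.argv) > 1 else '/var/log/messages')
================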

Eventually (just after 8pm or so), while we were looking at the logs,
the box hit OOM and the kernel started killing processes.  At that
point we rebooted, ran e2fsck, and remounted, and things have been OK
since.  Clients went through their recovery and came back functional.
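
For what it's worth, a trivial check along these lines (not something
we actually run; the threshold and polling interval are arbitrary, the
/proc/meminfo fields are standard Linux) might at least have warned us
before the OOM killer kicked in:
================
#!/usr/bin/env python
# Minimal sketch: warn when free + reclaimable memory on an OSS drops
# below a threshold.  Threshold and interval are arbitrary choices.
import time

THRESHOLD_KB = 512 * 1024   # warn below roughly 512 MB

def meminfo():
    info = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, rest = line.split(':', 1)
            info[key] = int(rest.strip().split()[0])   # values are in kB
    return info

while True:
    m = meminfo()
    avail = m.get('MemFree', 0) + m.get('Buffers', 0) + m.get('Cached', 0)
    if avail < THRESHOLD_KB:
        print('WARNING: only %d kB free/reclaimable memory left' % avail)
    time.sleep(30)
================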

Robert Hendelman Jr
Magnetar Capital LLC
Rob.Hendelman at magnetar.com
1-847-905-4557





