[Lustre-discuss] System Deadlock
Roger Spellman
Roger.Spellman at terascala.com
Wed Aug 17 13:23:00 PDT 2011
Hi,
I am in the process of porting the Lustre 1.8.4 client to a recent kernel,
2.6.38.8. This has been a challenge for a variety of reasons, such as
the removal of the dcache_lock from the kernel. I am pretty close to
having it working, but I can still trigger a system lockup. So far, I
have only tested Lustre over TCP over Ethernet.
Using magic sysrq, I was able to get a call trace, but I am not sure I
fully understand it. Can someone please validate my analysis?
Here is one of the running threads:
rm R running task 0 2039 2030 0x00000088
ffff88011cfbb578 ffffffffa059c8df 0000000000000000 0000000000000000
ffff88011caf8880 000000000001ce7d 00000000000000c1 0000000000000000
ffff88011cfbb548 ffffffffa06bfe34 0000000000000000 ffff88011ce9c800
Call Trace:
[<ffffffffa059c8df>] ? LNetMDUnlink+0x6f/0x110 [lnet]
[<ffffffffa06bfe34>] ? lustre_msg_get_slv+0x94/0x100 [ptlrpc]
[<ffffffff8104ce53>] ? __wake_up+0x53/0x70
[<ffffffffa0698c29>] ? ldlm_completion_ast+0x349/0x8d0 [ptlrpc]
[<ffffffffa067af48>] ? ldlm_lock_enqueue+0x228/0xbb0 [ptlrpc]
[<ffffffffa0675148>] ? lock_res_and_lock+0x58/0xe0 [ptlrpc]
[<ffffffffa067b9fd>] ? ldlm_lock_change_resource+0x12d/0x3f0 [ptlrpc]
[<ffffffffa067e6f9>] ? ldlm_resource_get+0xe9/0xc00 [ptlrpc]
[<ffffffffa067e1c3>] ? ldlm_resource_putref+0x73/0x430 [ptlrpc]
[<ffffffffa067c7b3>] ? ldlm_lock_match+0x273/0x8f0 [ptlrpc]
[<ffffffff8116b342>] ? find_inode+0x62/0xb0
[<ffffffffa097dea2>] ? ll_update_inode+0x3a2/0x1140 [lustre]
[<ffffffffa099d150>] ? fid_test_inode+0x0/0x80 [lustre]
[<ffffffff8116c766>] ? ifind+0x66/0xc0
[<ffffffffa099d150>] ? fid_test_inode+0x0/0x80 [lustre]
[<ffffffffa06764d1>] ? ldlm_lock_add_to_lru_nolock+0x51/0xe0 [ptlrpc]
[<ffffffffa0676846>] ? ldlm_lock_add_to_lru+0x46/0x110 [ptlrpc]
[<ffffffffa067e6f9>] ? ldlm_resource_get+0xe9/0xc00 [ptlrpc]
[<ffffffffa067664d>] ? ldlm_lock_remove_from_lru_nolock+0x3d/0xe0 [ptlrpc]
[<ffffffffa0676951>] ? ldlm_lock_remove_from_lru+0x41/0x110 [ptlrpc]
[<ffffffffa067e1c3>] ? ldlm_resource_putref+0x73/0x430 [ptlrpc]
[<ffffffffa0676a41>] ? ldlm_lock_addref_internal_nolock+0x21/0xa0 [ptlrpc]
[<ffffffffa0677866>] ? search_queue+0xc6/0x170 [ptlrpc]
[<ffffffffa067c62a>] ? ldlm_lock_match+0xea/0x8f0 [ptlrpc]
[<ffffffffa043369e>] ? cfs_free+0xe/0x10 [libcfs]
[<ffffffffa06aedad>] ? __ptlrpc_req_finished+0x59d/0xb30 [ptlrpc]
[<ffffffffa09579f0>] ? ll_lookup_finish_locks+0x80/0x140 [lustre]
[<ffffffffa06764d1>] ? ldlm_lock_add_to_lru_nolock+0x51/0xe0 [ptlrpc]
[<ffffffffa0676846>] ? ldlm_lock_add_to_lru+0x46/0x110 [ptlrpc]
[<ffffffffa067bfb0>] ? ldlm_lock_decref_internal+0x2f0/0x880 [ptlrpc]
[<ffffffffa0678b8f>] ? __ldlm_handle2lock+0x9f/0x3d0 [ptlrpc]
[<ffffffffa0678b8f>] ? __ldlm_handle2lock+0x9f/0x3d0 [ptlrpc]
[<ffffffffa067cfa1>] ? ldlm_lock_decref+0x41/0xb0 [ptlrpc]
[<ffffffffa08e92f4>] ? mdc_set_lock_data+0xd4/0x270 [mdc]
[<ffffffffa0957950>] ? ll_intent_drop_lock+0xa0/0xc0 [lustre]
[<ffffffff8116966a>] ? d_kill+0xaa/0x110
[<ffffffffa09579f0>] ? ll_lookup_finish_locks+0x80/0x140 [lustre]
[<ffffffffa099d7ee>] ? ll_prepare_mdc_op_data+0xbe/0x120 [lustre]
[<ffffffffa04390c3>] ? ts_kernel_list_record_file_line+0x123/0x3c0 [libcfs]
[<ffffffffa0963f41>] ? __ll_inode_revalidate_it+0x191/0x6e0 [lustre]
[<ffffffffa099d990>] ? ll_mdc_blocking_ast+0x0/0x890 [lustre]
[<ffffffff8116a03b>] ? dput+0x9b/0x190
[<ffffffff8108854f>] ? up+0x2f/0x50
[<ffffffff814abb8e>] ? common_interrupt+0xe/0x13
[<ffffffffa099c9bb>] ? ll_stats_ops_tally+0x6b/0xd0 [lustre]
[<ffffffff8116fb1f>] ? mntput_no_expire+0x4f/0x1c0
[<ffffffff8116fcad>] ? mntput+0x1d/0x30
[<ffffffff8115d8b2>] ? path_put+0x22/0x30
[<ffffffff81157b23>] ? vfs_fstatat+0x73/0x80
[<ffffffff81157b54>] ? sys_newfstatat+0x24/0x50
[<ffffffff8100bf82>] ? system_call_fastpath+0x16/0x1b
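For reference, each frame above follows the usual kernel format of symbol+offset/length, with an optional [module] suffix. The parser below is my own illustrative sketch (not a kernel or Lustre tool) of how such a line splits into its fields:

```python
import re

# A frame line looks like:
#   [<address>] ? symbol+offset/length [module]
# offset and length are hex values; the [module] suffix is absent for
# code built into the kernel. This regex and helper are illustrative only.
TRACE_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<qmark>\?\s+)?"                       # optional leading '?'
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<length>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)

def parse_frame(line):
    """Split one stack-trace line into its fields, or return None."""
    m = TRACE_RE.search(line)
    if not m:
        return None
    return {
        "addr": int(m.group("addr"), 16),
        "has_qmark": m.group("qmark") is not None,
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "length": int(m.group("length"), 16),
        "module": m.group("module"),
    }

frame = parse_frame("[<ffffffffa099d990>] ? ll_mdc_blocking_ast+0x0/0x890 [lustre]")
print(frame["symbol"], hex(frame["offset"]), hex(frame["length"]), frame["module"])
# → ll_mdc_blocking_ast 0x0 0x890 lustre
```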
Am I to understand that mntput_no_expire was running, and then an
interrupt arrived (i.e., common_interrupt)? The interrupt handler then
made some calls, including one to ll_mdc_blocking_ast. Is that right?
Is ll_mdc_blocking_ast supposed to run in interrupt context?
Look how deep the stack goes after that. No wonder there is a lockup!
Normally, I would expect an interrupt service routine to do its work
quickly and hand anything more substantial off to a worker thread.
Perhaps I am completely misunderstanding this stack trace. Can someone
please advise me?
Also, how do I interpret a line like this:
[<ffffffffa099d990>] ? ll_mdc_blocking_ast+0x0/0x890 [lustre]
Why is there a question mark? What are the two hex values separated by
the slash?
Thanks.
Roger Spellman
Staff Engineer
Terascala, Inc.
508-588-1501
www.terascala.com