[Lustre-discuss] System Deadlock
    Roger Spellman
    Roger.Spellman at terascala.com
    Wed Aug 17 13:23:00 PDT 2011

Hi,
I am in the process of porting the Lustre 1.8.4 client to a recent
kernel, 2.6.38.8.  This has been a challenge for a variety of reasons,
such as the dcache_lock having been removed from the kernel.  I am
pretty close to having it working, but I can still generate a system
lockup.  So far, I have only tested Lustre over TCP over Ethernet.
 
Using magic-sysrq, I was able to get a call trace.  But, I am not sure
if I fully understand it.  Can someone please validate my analysis?  
 
Here is one of the running threads:
 
rm              R  running task        0  2039   2030 0x00000088
 ffff88011cfbb578 ffffffffa059c8df 0000000000000000 0000000000000000
 ffff88011caf8880 000000000001ce7d 00000000000000c1 0000000000000000
 ffff88011cfbb548 ffffffffa06bfe34 0000000000000000 ffff88011ce9c800
Call Trace:
 [<ffffffffa059c8df>] ? LNetMDUnlink+0x6f/0x110 [lnet]
 [<ffffffffa06bfe34>] ? lustre_msg_get_slv+0x94/0x100 [ptlrpc]
 [<ffffffff8104ce53>] ? __wake_up+0x53/0x70
 [<ffffffffa0698c29>] ? ldlm_completion_ast+0x349/0x8d0 [ptlrpc]
 [<ffffffffa067af48>] ? ldlm_lock_enqueue+0x228/0xbb0 [ptlrpc]
 [<ffffffffa0675148>] ? lock_res_and_lock+0x58/0xe0 [ptlrpc]
 [<ffffffffa067b9fd>] ? ldlm_lock_change_resource+0x12d/0x3f0 [ptlrpc]
 [<ffffffffa067e6f9>] ? ldlm_resource_get+0xe9/0xc00 [ptlrpc]
 [<ffffffffa067e1c3>] ? ldlm_resource_putref+0x73/0x430 [ptlrpc]
 [<ffffffffa067c7b3>] ? ldlm_lock_match+0x273/0x8f0 [ptlrpc]
 [<ffffffff8116b342>] ? find_inode+0x62/0xb0
 [<ffffffffa097dea2>] ? ll_update_inode+0x3a2/0x1140 [lustre]
 [<ffffffffa099d150>] ? fid_test_inode+0x0/0x80 [lustre]
 [<ffffffff8116c766>] ? ifind+0x66/0xc0
 [<ffffffffa099d150>] ? fid_test_inode+0x0/0x80 [lustre]
 [<ffffffffa06764d1>] ? ldlm_lock_add_to_lru_nolock+0x51/0xe0 [ptlrpc]
 [<ffffffffa0676846>] ? ldlm_lock_add_to_lru+0x46/0x110 [ptlrpc]
 [<ffffffffa067e6f9>] ? ldlm_resource_get+0xe9/0xc00 [ptlrpc]
 [<ffffffffa067664d>] ? ldlm_lock_remove_from_lru_nolock+0x3d/0xe0 [ptlrpc]
 [<ffffffffa0676951>] ? ldlm_lock_remove_from_lru+0x41/0x110 [ptlrpc]
 [<ffffffffa067e1c3>] ? ldlm_resource_putref+0x73/0x430 [ptlrpc]
 [<ffffffffa0676a41>] ? ldlm_lock_addref_internal_nolock+0x21/0xa0 [ptlrpc]
 [<ffffffffa0677866>] ? search_queue+0xc6/0x170 [ptlrpc]
 [<ffffffffa067c62a>] ? ldlm_lock_match+0xea/0x8f0 [ptlrpc]
 [<ffffffffa043369e>] ? cfs_free+0xe/0x10 [libcfs]
 [<ffffffffa06aedad>] ? __ptlrpc_req_finished+0x59d/0xb30 [ptlrpc]
 [<ffffffffa09579f0>] ? ll_lookup_finish_locks+0x80/0x140 [lustre]
 [<ffffffffa06764d1>] ? ldlm_lock_add_to_lru_nolock+0x51/0xe0 [ptlrpc]
 [<ffffffffa0676846>] ? ldlm_lock_add_to_lru+0x46/0x110 [ptlrpc]
 [<ffffffffa067bfb0>] ? ldlm_lock_decref_internal+0x2f0/0x880 [ptlrpc]
 [<ffffffffa0678b8f>] ? __ldlm_handle2lock+0x9f/0x3d0 [ptlrpc]
 [<ffffffffa0678b8f>] ? __ldlm_handle2lock+0x9f/0x3d0 [ptlrpc]
 [<ffffffffa067cfa1>] ? ldlm_lock_decref+0x41/0xb0 [ptlrpc]
 [<ffffffffa08e92f4>] ? mdc_set_lock_data+0xd4/0x270 [mdc]
 [<ffffffffa0957950>] ? ll_intent_drop_lock+0xa0/0xc0 [lustre]
 [<ffffffff8116966a>] ? d_kill+0xaa/0x110
 [<ffffffffa09579f0>] ? ll_lookup_finish_locks+0x80/0x140 [lustre]
 [<ffffffffa099d7ee>] ? ll_prepare_mdc_op_data+0xbe/0x120 [lustre]
 [<ffffffffa04390c3>] ? ts_kernel_list_record_file_line+0x123/0x3c0 [libcfs]
 [<ffffffffa0963f41>] ? __ll_inode_revalidate_it+0x191/0x6e0 [lustre]
 [<ffffffffa099d990>] ? ll_mdc_blocking_ast+0x0/0x890 [lustre]
 [<ffffffff8116a03b>] ? dput+0x9b/0x190
 [<ffffffff8108854f>] ? up+0x2f/0x50
 [<ffffffff814abb8e>] ? common_interrupt+0xe/0x13
 [<ffffffffa099c9bb>] ? ll_stats_ops_tally+0x6b/0xd0 [lustre]
 [<ffffffff8116fb1f>] ? mntput_no_expire+0x4f/0x1c0
 [<ffffffff8116fcad>] ? mntput+0x1d/0x30
 [<ffffffff8115d8b2>] ? path_put+0x22/0x30
 [<ffffffff81157b23>] ? vfs_fstatat+0x73/0x80
 [<ffffffff81157b54>] ? sys_newfstatat+0x24/0x50
 [<ffffffff8100bf82>] ? system_call_fastpath+0x16/0x1b
 
Am I to understand that mntput_no_expire was running when an interrupt
arrived (i.e. common_interrupt), and that the interrupt handler then
made further calls, including one to ll_mdc_blocking_ast?  Is that
right?
 
Is ll_mdc_blocking_ast supposed to run in interrupt context?
 
Look how deep the stack goes after that; no wonder there is a lockup!
Normally I would expect an interrupt service routine to do something
quickly, and then hand anything more substantial off to a worker
thread.
 
Perhaps I am completely misunderstanding this stack trace.  Can someone
please advise me?
 
Also, how do I interpret a line like this:
 
[<ffffffffa099d990>] ? ll_mdc_blocking_ast+0x0/0x890 [lustre]

Why is there a question mark?  What are the two hex numbers separated
by the slash?
 
Thanks.
 
Roger Spellman
Staff Engineer
Terascala, Inc.
508-588-1501
www.terascala.com
 