[Lustre-discuss] Apparent deadlock of MDS

Mark Hills Mark.Hills at framestore.com
Thu Mar 31 14:35:17 PDT 2011


We are have recently begun having continual problems with blockages at our 
Lustre MDS. Lustre v1.6.7.2.

The symptom of the problem appears to be timeouts when taking locks. There 
are a diverse range, including backtraces in dmesg. Some examples are 
below. The MDS stalls and all clients are evicted.

It looks like some kind of deadlock; commonly IBT locks, CR/CR bits 0x3. 
At this stage I have only a little understanding of what this actually 
means. A range of resource IDs are present, but some more than others.

Is there a way to trace the resource ID in question, perhaps to a 
directory or file that could hint at which client is causing this?

Or am I missing a clear indication of what the problem is from these 
messages?

We're so far unable to trace this to the thread which holds the lock. All 
kernel threads are in 'S' state; none are spinning. CPU usage and I/O wait 
at the server is near 0%.

There is no correlation between these symptoms and high load or data 
transfer; Lustre performs well inbetween failures. A weak pattern is that 
all of the 19 failures in the last 2 weeks have been during office hours.

Thanks in advance for any assistance.

-- 
Mark


00010000:00020000:2:1301592505.733870:0:0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 255s: evicting client at 172.16.165.14 at tcp  ns: mds-mdt1d_UUID lock: ffff8103f6a8ee00/0xc49aa2b4caebde2b lrc: 3/0,0 mode: CR/CR res: 358522091/1997454789 bits 0x3 rrc: 25 type: IBT flags: 0x4000030 remote: 0xbafba6311dca2ad2 expref: 33 pid: 30246 timeout: 4297990001
00000100:00020000:0:1301592505.735005:0:16327:0:(service.c:1277:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-172.16.164.37 at tcp: deadline 40+11s ago
  req at ffff8103d1166c00 x1364539602108384/t0 o400->cb62963a-6d1d-2ae5-b87e-6c2393924052 at NET_0x20000ac10a425_UUID:0/0 lens 192/0 e 0 to 0 dl 1301592494 ref 1 fl Interpret:H/0/0 rc 0/0
00000100:00000400:0:1301592505.735018:0:16327:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x1364539602108384 took longer than estimated (40+11s); client may timeout.  req at ffff8103d1166c00 x1364539602108384/t0 o400->cb62963a-6d1d-2ae5-b87e-6c2393924052 at NET_0x20000ac10a425_UUID:0/0 lens 192/0 e 0 to 0 dl 1301592494 ref 1 fl Interpret:H/0/0 rc 0/0

00010000:00020000:6:1300711826.071353:0:0:0:(ldlm_lockd.c:249:waiting_locks_callback()) ### lock callback timer expired after 51s: evicting client at 172.16.165.23 at tcp  ns: mds-mdt1d_UUID lock: ffff8101663d7a00/0x75ebf93823ef9ba2 lrc: 1/0,0 mode: CR/CR res: 4100920/1836250168 bits 0x3 rrc: 4 type: IBT flags: 4000020 remote: 0x4efd46d542073180 expref: 35 pid 6431
00010000:00020000:6:1300711826.071359:0:0:0:(ldlm_lockd.c:249:waiting_locks_callback()) ### lock callback timer expired after 153s: evicting client at 172.16.165.65 at tcp  ns: mds-mdt1d_UUID lock: ffff81030822ee00/0x75ebf93823ddb366 lrc: 1/0,0 mode: CR/CR res: 4100773/1836249963 bits 0x3 rrc: 89 type: IBT flags: 4000030 remote: 0x159e3bd63d4c7bda expref: 730 pid 6454
00010000:00020000:6:1300711826.071365:0:0:0:(ldlm_lockd.c:249:waiting_locks_callback()) ### lock callback timer expired after 143s: evicting client at 172.16.226.67 at tcp  ns: mds-mdt1d_UUID lock: ffff81007ee41600/0x75ebf93823ef58ea lrc: 1/0,0 mode: CR/CR res: 4100773/1836249963 bits 0x3 rrc: 89 type: IBT flags: 4000030 remote: 0x37f1359dbb3d76d2 expref: 42 pid 6419


ll_mdt_52     S ffff81000100c980     0 12316      1         12317 12315 
(L-TLB)
 ffff8103f08335f0 0000000000000046 ffff8103df6837c0 ffff81042ff0f100
 0000000000000286 000000000000000a ffff8103f1fce7e0 ffff81042ff0f100
 000002f16f45dd62 0000000000001e46 ffff8103f1fce9c8 00000001ffffffff
Call Trace:
 [<ffffffff885518a0>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
 [<ffffffff88553435>] :ptlrpc:ldlm_completion_ast+0x455/0x7c0
 [<ffffffff885395a9>] :ptlrpc:ldlm_lock_enqueue+0x9b9/0xb00
 [<ffffffff8008a5a0>] default_wake_function+0x0/0xe
 [<ffffffff88534b5a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90
 [<ffffffff88551fe8>] :ptlrpc:ldlm_cli_enqueue_local+0x4f8/0x5b0
 [<ffffffff887f3027>] :mds:enqueue_ordered_locks+0x387/0x4d0
 [<ffffffff8854f750>] :ptlrpc:ldlm_blocking_ast+0x0/0x2a0
 [<ffffffff88552fe0>] :ptlrpc:ldlm_completion_ast+0x0/0x7c0
 [<ffffffff887f37f9>] :mds:mds_get_parent_child_locked+0x689/0x8e0
 [<ffffffff88534ab4>] :ptlrpc:ldlm_lock_remove_from_lru+0x74/0xe0
 [<ffffffff8853319e>] :ptlrpc:lock_res_and_lock+0xbe/0xe0
 [<ffffffff887e5372>] :mds:mds_getattr_lock+0x612/0xca0
 [<ffffffff8846e763>] :lnet:lnet_ni_send+0x93/0xd0
 [<ffffffff887e031a>] :mds:fixup_handle_for_resent_req+0x5a/0x2c0
 [<ffffffff887ebccf>] :mds:mds_intent_policy+0x62f/0xc10
 [<ffffffff8853b726>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
 [<ffffffff88538d76>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb00
 [<ffffffff8855392f>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
 [<ffffffff884bc6a8>] :obdclass:lustre_hash_add+0x208/0x2d0
 [<ffffffff8855c500>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x833
 [<ffffffff88559619>] :ptlrpc:ldlm_handle_enqueue+0xc19/0x1210
 [<ffffffff887e9a6f>] :mds:mds_handle+0x406f/0x4d20
 [<ffffffff8854f8ee>] :ptlrpc:ldlm_blocking_ast+0x19e/0x2a0
 [<ffffffff800890f4>] find_busiest_group+0x20d/0x621
 [<ffffffff8853319e>] :ptlrpc:lock_res_and_lock+0xbe/0xe0
 [<ffffffff88576815>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
 [<ffffffff88534ab4>] :ptlrpc:ldlm_lock_remove_from_lru+0x74/0xe0
 [<ffffffff8857ef5a>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
 [<ffffffff88580ddd>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
 [<ffffffff88583363>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
 [<ffffffff8006dd33>] do_gettimeofday+0x40/0x8f
 [<ffffffff884267b6>] :libcfs:lcw_update_time+0x16/0x100
 [<ffffffff8857fd41>] :ptlrpc:ptlrpc_server_handle_reply+0x4a1/0x5c0
 [<ffffffff88586858>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
 [<ffffffff800419d5>] end_bio_bh_io_sync+0x0/0x3b
 [<ffffffff80050aa8>] bio_fs_destructor+0x0/0xc
 [<ffffffff8008a5a0>] default_wake_function+0x0/0xe
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff88585640>] :ptlrpc:ptlrpc_main+0x0/0x13e0
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


ll_mdt_104    S ffff8103f1f8a820     0 30262      1         30263 30259 
(L-TLB)
 ffff8103de9815a0 0000000000000046 ffff8104143f8000 ffff8103de9815c0
 ffff8104076df118 000000000000000a ffff810412a6f0c0 ffff8103f1f8a820
 0000032d30d14bf1 0000000000003e66 ffff810412a6f2a8 00000004076df128
Call Trace:
 [<ffffffff8003d52e>] lock_timer_base+0x1b/0x3c
 [<ffffffff8001c793>] __mod_timer+0xb0/0xbe
 [<ffffffff885518a0>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
 [<ffffffff8006395c>] schedule_timeout+0x8a/0xad
 [<ffffffff8009510e>] process_timeout+0x0/0x5
 [<ffffffff88553448>] :ptlrpc:ldlm_completion_ast+0x468/0x7c0
 [<ffffffff885395a9>] :ptlrpc:ldlm_lock_enqueue+0x9b9/0xb00
 [<ffffffff8008a5a0>] default_wake_function+0x0/0xe
 [<ffffffff88534b5a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90
 [<ffffffff88551fe8>] :ptlrpc:ldlm_cli_enqueue_local+0x4f8/0x5b0
 [<ffffffff887f3027>] :mds:enqueue_ordered_locks+0x387/0x4d0
 [<ffffffff8854f750>] :ptlrpc:ldlm_blocking_ast+0x0/0x2a0
 [<ffffffff88552fe0>] :ptlrpc:ldlm_completion_ast+0x0/0x7c0
 [<ffffffff887f37f9>] :mds:mds_get_parent_child_locked+0x689/0x8e0
 [<ffffffff887e5372>] :mds:mds_getattr_lock+0x612/0xca0
 [<ffffffff8846eea7>] :lnet:lnet_prep_send+0x67/0xb0
 [<ffffffff887e031a>] :mds:fixup_handle_for_resent_req+0x5a/0x2c0
 [<ffffffff887ebccf>] :mds:mds_intent_policy+0x62f/0xc10
 [<ffffffff8853b726>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
 [<ffffffff88538d76>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb00
 [<ffffffff8855392f>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
 [<ffffffff884bc6a8>] :obdclass:lustre_hash_add+0x208/0x2d0
 [<ffffffff8855c500>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x833
 [<ffffffff88559619>] :ptlrpc:ldlm_handle_enqueue+0xc19/0x1210
 [<ffffffff887e9a6f>] :mds:mds_handle+0x406f/0x4d20
 [<ffffffff8854f8ee>] :ptlrpc:ldlm_blocking_ast+0x19e/0x2a0
 [<ffffffff8853319e>] :ptlrpc:lock_res_and_lock+0xbe/0xe0
 [<ffffffff88576815>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
 [<ffffffff88534ab4>] :ptlrpc:ldlm_lock_remove_from_lru+0x74/0xe0
 [<ffffffff8857ef5a>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
 [<ffffffff88580ddd>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
 [<ffffffff88583363>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
 [<ffffffff8006dd33>] do_gettimeofday+0x40/0x8f
 [<ffffffff884267b6>] :libcfs:lcw_update_time+0x16/0x100
 [<ffffffff8003d52e>] lock_timer_base+0x1b/0x3c
 [<ffffffff8001c793>] __mod_timer+0xb0/0xbe
 [<ffffffff88586858>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
 [<ffffffff8008a5a0>] default_wake_function+0x0/0xe
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff88585640>] :ptlrpc:ptlrpc_main+0x0/0x13e0
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


ll_mdt_35     S ffff810001014f00     0 25101      1         25102 25100 
(L-TLB)
 ffff810407bb96f0 0000000000000046 ffff8103ffb30840 ffff81042ff1c080
 0000000000000286 000000000000000a ffff810408d777e0 ffff81042ff1c080
 0000032cbc4c5fd3 0000000000001eb0 ffff810408d779c8 00000002ffffffff
Call Trace:
 [<ffffffff885518a0>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
 [<ffffffff88553435>] :ptlrpc:ldlm_completion_ast+0x455/0x7c0
 [<ffffffff885395a9>] :ptlrpc:ldlm_lock_enqueue+0x9b9/0xb00
 [<ffffffff8008a5a0>] default_wake_function+0x0/0xe
 [<ffffffff88534b5a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90
 [<ffffffff88551fe8>] :ptlrpc:ldlm_cli_enqueue_local+0x4f8/0x5b0
 [<ffffffff887e4c9e>] :mds:mds_fid2locked_dentry+0x1ce/0x290
 [<ffffffff8854f750>] :ptlrpc:ldlm_blocking_ast+0x0/0x2a0
 [<ffffffff88552fe0>] :ptlrpc:ldlm_completion_ast+0x0/0x7c0
 [<ffffffff887e53a7>] :mds:mds_getattr_lock+0x647/0xca0
 [<ffffffff88578c94>] :ptlrpc:lustre_msg_add_version+0x34/0x110
 [<ffffffff8846e763>] :lnet:lnet_ni_send+0x93/0xd0
 [<ffffffff887e031a>] :mds:fixup_handle_for_resent_req+0x5a/0x2c0
 [<ffffffff887ebccf>] :mds:mds_intent_policy+0x62f/0xc10
 [<ffffffff8003d639>] ifind_fast+0x47/0x83
 [<ffffffff8853b726>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
 [<ffffffff88538d76>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb00
 [<ffffffff8855392f>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
 [<ffffffff884bc6a8>] :obdclass:lustre_hash_add+0x208/0x2d0
 [<ffffffff8855c500>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x833
 [<ffffffff88559619>] :ptlrpc:ldlm_handle_enqueue+0xc19/0x1210
 [<ffffffff887e9a6f>] :mds:mds_handle+0x406f/0x4d20
 [<ffffffff88576815>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
 [<ffffffff8857ef5a>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
 [<ffffffff88580ddd>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
 [<ffffffff88583363>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
 [<ffffffff8006dd33>] do_gettimeofday+0x40/0x8f
 [<ffffffff884267b6>] :libcfs:lcw_update_time+0x16/0x100
 [<ffffffff8003d52e>] lock_timer_base+0x1b/0x3c
 [<ffffffff8001c793>] __mod_timer+0xb0/0xbe
 [<ffffffff88586858>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
 [<ffffffff8008a5a0>] default_wake_function+0x0/0xe
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff88585640>] :ptlrpc:ptlrpc_main+0x0/0x13e0
 [<ffffffff8005dfa7>] child_rip+0x0/0x11




More information about the lustre-discuss mailing list