[Lustre-discuss] MDS Problems

Charles Taylor taylor at hpc.ufl.edu
Fri Jun 13 13:03:26 PDT 2008


We have been running the config below on three different lustre file  
systems since early January and, for the most part, things have been  
pretty stable.    We are now experiencing frequent hangs on some  
clients - particularly our interactive login nodes.    All processes   
get blocked behind Lustre I/O requests.   When this happens there are  
*no* messages in either dmesg or syslog on the clients.     They seem  
unaware of a problem.

However, on the MDS we see the following...


1. A ton of lustre-log.M.N files get dumped into /tmp in a  short  
period of time.   Most of them appear to be full of garbage and  
unprintable characters rather than thread stack traces.   Many of them  
are also zero length.

2. Lots of dmesg output similar to that appended (see below).

3. Pretty much the same in syslog.

4. The frequency/period of these events seems to consistent with the  
timeouts associated with the following messages...

                 Lustre: 4534:0:(watchdog.c:312:lcw_update_time())  
Expired watchdog for pid 4534 disabled after 499.9514s


We have been adjusting lru_size on the clients but so far it has made  
no difference.    We have "options mds mds_num_threads=512" and our  
system timeout is 1000 (sure, go ahead and flame me but if we don't do  
that we get tons of "endpoint transport failures" on the clients and  
no, there are no connectivity issues).   :)

We are open to suggestion and wondering if we should update the MDSs  
to 1.6.5.   Can we do that safely without also upgrading the clients  
and OSTs?

Our config is as  below.

Thanks,

Charlie Taylor
UF HPC Center
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
========================================================================

Config:
   Lustre 1.6.4.2
   CentOS 5.0
   Kernel 2.6.18-8.1.14
   OFED 1.2
   320 o2ib clients
     80  tcp clients


dmesg output:
=============
LustreError: dumping log to /tmp/lustre-log.1213385571.4240
ll_mdt_20     S ffff81023d2c9700     0  4070      1          4071   
4069 (L-TLB)
  ffff81023d2c9700 ffff81023d2c9630 ffff81023d2c9630 000000000000000a
  ffff81023fa09100 ffff8101438ca7a0 0002bb03681325fb 0000000000008f21
  ffff81023fa092e8 0000000000000001 ffffffff8866eb51 ffffffffffffffff
Call Trace:
  [<ffffffff8866eb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
  [<ffffffff88686da0>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
  [<ffffffff800611f7>] schedule_timeout+0x8a/0xad
  [<ffffffff80092c5e>] process_timeout+0x0/0x5
  [<ffffffff886881fd>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
  [<ffffffff886705e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
  [<ffffffff80086a74>] default_wake_function+0x0/0xe
  [<ffffffff8866ce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/ 
0x90
  [<ffffffff88684bb0>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
  [<ffffffff88685e24>] :ptlrpc:ldlm_cli_enqueue_local+0x454/0x510
  [<ffffffff888d5e87>] :mds:mds_fid2locked_dentry+0x1d7/0x2a0
  [<ffffffff88687ea0>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
  [<ffffffff888d6647>] :mds:mds_getattr_lock+0x6f7/0xc70
  [<ffffffff885b11c4>] :ksocklnd:ksocknal_alloc_tx+0x1c4/0x270
  [<ffffffff888d7191>] :mds:mds_intent_policy+0x5d1/0xbe0
  [<ffffffff8854dca7>] :lnet:lnet_prep_send+0x67/0xb0
  [<ffffffff88673786>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3b0
  [<ffffffff88670183>] :ptlrpc:ldlm_lock_enqueue+0xf3/0x5c0
  [<ffffffff8866dbbd>] :ptlrpc:ldlm_lock_create+0x98d/0x9c0
  [<ffffffff88690660>] :ptlrpc:ldlm_server_completion_ast+0x0/0x570
  [<ffffffff8868cda0>] :ptlrpc:ldlm_handle_enqueue+0xd90/0x1410
  [<ffffffff88690bd0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x690
  [<ffffffff888e0cad>] :mds:mds_handle+0x46dd/0x58ff
  [<ffffffff8860bcb2>] :obdclass:class_handle2object+0xd2/0x160
  [<ffffffff886a7280>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
  [<ffffffff886a4e35>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
  [<ffffffff886aca8b>] :ptlrpc:ptlrpc_server_handle_request+0xb0b/0x1270
  [<ffffffff800608e8>] thread_return+0x0/0xea
  [<ffffffff8006b165>] do_gettimeofday+0x50/0x92
  [<ffffffff8851b056>] :libcfs:lcw_update_time+0x16/0x100
  [<ffffffff8003cee3>] lock_timer_base+0x1b/0x3c
  [<ffffffff886af4cc>] :ptlrpc:ptlrpc_main+0x7dc/0x950
  [<ffffffff80086a74>] default_wake_function+0x0/0xe
  [<ffffffff8005be25>] child_rip+0xa/0x11
  [<ffffffff886aecf0>] :ptlrpc:ptlrpc_main+0x0/0x950
  [<ffffffff8005be1b>] child_rip+0x0/0x11

LustreError: dumping log to /tmp/lustre-log.1213385600.4070
Lustre: 4338:0:(ldlm_lib.c:519:target_handle_reconnect()) hpcdata- 
MDT0000: a9f365a3-8746-6ed5-e45e-cd292891ece2 reconnecting
Lustre: 4338:0:(ldlm_lib.c:519:target_handle_reconnect()) Skipped 6  
previous similar messages
Lustre: 4338:0:(ldlm_lib.c:747:target_handle_connect()) hpcdata- 
MDT0000: refuse reconnection from a9f365a3-8746-6ed5-e45e-cd292891ece2 at 10.13.25.198 
@o2ib to 0xffff8101cdbd6000; still busy with 2 active RPCs
Lustre: 4338:0:(ldlm_lib.c:747:target_handle_connect()) Skipped 6  
previous similar messages
ll_mdt_501    S ffff81023ca51700     0  4551      1          4552   
4550 (L-TLB)
  ffff81023ca51700 ffff81023ca51630 ffff81023ca51630 000000000000000a
  ffff81023c9ce860 ffff81014395c820 0002bb22256f1724 0000000000009eb2
  ffff81023c9cea48 0000000000000002 ffffffff8866eb51 ffffffffffffffff
Call Trace:
  [<ffffffff8866eb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
  [<ffffffff88686da0>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
  [<ffffffff800611f7>] schedule_timeout+0x8a/0xad
  [<ffffffff80092c5e>] process_timeout+0x0/0x5
  [<ffffffff886881fd>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
  [<ffffffff886705e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
  [<ffffffff80086a74>] default_wake_function+0x0/0xe
  [<ffffffff8866ce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/ 
0x90
  [<ffffffff88684bb0>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
  [<ffffffff88685e24>] :ptlrpc:ldlm_cli_enqueue_local+0x454/0x510
  [<ffffffff888d5e87>] :mds:mds_fid2locked_dentry+0x1d7/0x2a0
  [<ffffffff88687ea0>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
  [<ffffffff888d6647>] :mds:mds_getattr_lock+0x6f7/0xc70
  [<ffffffff8858fd7c>] :ko2iblnd:kiblnd_send+0x88c/0x8d0
  [<ffffffff888d7191>] :mds:mds_intent_policy+0x5d1/0xbe0
  [<ffffffff8854dca7>] :lnet:lnet_prep_send+0x67/0xb0
  [<ffffffff88673786>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3b0
  [<ffffffff88670183>] :ptlrpc:ldlm_lock_enqueue+0xf3/0x5c0
  [<ffffffff8866dbbd>] :ptlrpc:ldlm_lock_create+0x98d/0x9c0
  [<ffffffff88690660>] :ptlrpc:ldlm_server_completion_ast+0x0/0x570
  [<ffffffff8868cda0>] :ptlrpc:ldlm_handle_enqueue+0xd90/0x1410
  [<ffffffff88690bd0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x690
  [<ffffffff888e0cad>] :mds:mds_handle+0x46dd/0x58ff
  [<ffffffff8860bcb2>] :obdclass:class_handle2object+0xd2/0x160
  [<ffffffff886a7280>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
  [<ffffffff886a4e35>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
  [<ffffffff886aca8b>] :ptlrpc:ptlrpc_server_handle_request+0xb0b/0x1270
  [<ffffffff800608e8>] thread_return+0x0/0xea
  [<ffffffff8006b165>] do_gettimeofday+0x50/0x92
  [<ffffffff8851b056>] :libcfs:lcw_update_time+0x16/0x100
  [<ffffffff8003cee3>] lock_timer_base+0x1b/0x3c
  [<ffffffff886af4cc>] :ptlrpc:ptlrpc_main+0x7dc/0x950
  [<ffffffff80086a74>] default_wake_function+0x0/0xe
  [<ffffffff8005be25>] child_rip+0xa/0x11
  [<ffffffff886aecf0>] :ptlrpc:ptlrpc_main+0x0/0x950
  [<ffffffff8005be1b>] child_rip+0x0/0x11

LustreError: dumping log to /tmp/lustre-log.1213385732.4551
ll_mdt_13     S ffff81023d2a7700     0  4063      1          4064   
4062 (L-TLB)
  ffff81023d2a7700 ffff81023d2a7630 ffff81023d2a7630 000000000000000a
  ffff81023e0577a0 ffff8101438ca7a0 0002bb23b76dc267 00000000000104cb
  ffff81023e057988 0000000000000001 ffffffff8866eb51 ffffffffffffffff
Call Trace:
  [<ffffffff8866eb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
  [<ffffffff88686da0>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
  [<ffffffff800611f7>] schedule_timeout+0x8a/0xad
  [<ffffffff80092c5e>] process_timeout+0x0/0x5
  [<ffffffff886881fd>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
  [<ffffffff886705e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
  [<ffffffff80086a74>] default_wake_function+0x0/0xe
  [<ffffffff8866ce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/ 
0x90
  [<ffffffff88684bb0>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
  [<ffffffff88685e24>] :ptlrpc:ldlm_cli_enqueue_local+0x454/0x510
  [<ffffffff888d5e87>] :mds:mds_fid2locked_dentry+0x1d7/0x2a0
  [<ffffffff88687ea0>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
  [<ffffffff888d6647>] :mds:mds_getattr_lock+0x6f7/0xc70
  [<ffffffff8858fd7c>] :ko2iblnd:kiblnd_send+0x88c/0x8d0
  [<ffffffff888d7191>] :mds:mds_intent_policy+0x5d1/0xbe0
  [<ffffffff8854dca7>] :lnet:lnet_prep_send+0x67/0xb0
  [<ffffffff88673786>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3b0
  [<ffffffff88670183>] :ptlrpc:ldlm_lock_enqueue+0xf3/0x5c0
  [<ffffffff8866dbbd>] :ptlrpc:ldlm_lock_create+0x98d/0x9c0
  [<ffffffff88690660>] :ptlrpc:ldlm_server_completion_ast+0x0/0x570
  [<ffffffff8868cda0>] :ptlrpc:ldlm_handle_enqueue+0xd90/0x1410
  [<ffffffff88690bd0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x690
  [<ffffffff888e0cad>] :mds:mds_handle+0x46dd/0x58ff
  [<ffffffff8860bcb2>] :obdclass:class_handle2object+0xd2/0x160
  [<ffffffff886a7280>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
  [<ffffffff886a4e35>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
  [<ffffffff886aca8b>] :ptlrpc:ptlrpc_server_handle_request+0xb0b/0x1270
  [<ffffffff800608e8>] thread_return+0x0/0xea
  [<ffffffff8006b165>] do_gettimeofday+0x50/0x92
  [<ffffffff8851b056>] :libcfs:lcw_update_time+0x16/0x100
  [<ffffffff8003cee3>] lock_timer_base+0x1b/0x3c
  [<ffffffff886af4cc>] :ptlrpc:ptlrpc_main+0x7dc/0x950
  [<ffffffff80086a74>] default_wake_function+0x0/0xe
  [<ffffffff8005be25>] child_rip+0xa/0x11
  [<ffffffff886aecf0>] :ptlrpc:ptlrpc_main+0x0/0x950
  [<ffffffff8005be1b>] child_rip+0x0/0x11

LustreError: dumping log to /tmp/lustre-log.1213385739.4063
ll_mdt_115    S ffff81023d1a9700     0  4165      1          4166   
4164 (L-TLB)
  ffff81023d1a9700 ffff81023d1a9630 ffff81023d1a9630 000000000000000a
  ffff81023ee5b860 ffff8101438ca7a0 0002bb28f86558dc 000000000001089d
  ffff81023ee5ba48 0000000000000001 ffffffff8866eb51 ffffffffffffffff
Call Trace:
  [<ffffffff8866eb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
  [<ffffffff88686da0>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
  [<ffffffff800611f7>] schedule_timeout+0x8a/0xad
  [<ffffffff80092c5e>] process_timeout+0x0/0x5
  [<ffffffff886881fd>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
  [<ffffffff886705e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
  [<ffffffff80086a74>] default_wake_function+0x0/0xe
  [<ffffffff8866ce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/ 
0x90
  [<ffffffff88684bb0>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
  [<ffffffff88685e24>] :ptlrpc:ldlm_cli_enqueue_local+0x454/0x510
  [<ffffffff888d5e87>] :mds:mds_fid2locked_dentry+0x1d7/0x2a0
  [<ffffffff88687ea0>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
  [<ffffffff888d6647>] :mds:mds_getattr_lock+0x6f7/0xc70
  [<ffffffff885b11c4>] :ksocklnd:ksocknal_alloc_tx+0x1c4/0x270
  [<ffffffff888d7191>] :mds:mds_intent_policy+0x5d1/0xbe0
  [<ffffffff8854dca7>] :lnet:lnet_prep_send+0x67/0xb0
  [<ffffffff88673786>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3b0
  [<ffffffff88670183>] :ptlrpc:ldlm_lock_enqueue+0xf3/0x5c0
  [<ffffffff8866dbbd>] :ptlrpc:ldlm_lock_create+0x98d/0x9c0
  [<ffffffff88690660>] :ptlrpc:ldlm_server_completion_ast+0x0/0x570
  [<ffffffff8868cda0>] :ptlrpc:ldlm_handle_enqueue+0xd90/0x1410
  [<ffffffff88690bd0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x690
  [<ffffffff888e0cad>] :mds:mds_handle+0x46dd/0x58ff
  [<ffffffff8860bcb2>] :obdclass:class_handle2object+0xd2/0x160
  [<ffffffff886a7280>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
  [<ffffffff886a4e35>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
  [<ffffffff886aca8b>] :ptlrpc:ptlrpc_server_handle_request+0xb0b/0x1270
  [<ffffffff800608e8>] thread_return+0x0/0xea
  [<ffffffff8006b165>] do_gettimeofday+0x50/0x92
  [<ffffffff8851b056>] :libcfs:lcw_update_time+0x16/0x100
  [<ffffffff8003cee3>] lock_timer_base+0x1b/0x3c
  [<ffffffff886af4cc>] :ptlrpc:ptlrpc_main+0x7dc/0x950
  [<ffffffff80086a74>] default_wake_function+0x0/0xe
  [<ffffffff8005be25>] child_rip+0xa/0x11
  [<ffffffff886aecf0>] :ptlrpc:ptlrpc_main+0x0/0x950
  [<ffffffff8005be1b>] child_rip+0x0/0x11

LustreError: dumping log to /tmp/lustre-log.1213385762.4165
Lustre: 4135:0:(ldlm_lib.c:519:target_handle_reconnect()) hpcdata- 
MDT0000: a9f365a3-8746-6ed5-e45e-cd292891ece2 reconnecting
Lustre: 4135:0:(ldlm_lib.c:519:target_handle_reconnect()) Skipped 11  
previous similar messages
Lustre: 4135:0:(ldlm_lib.c:747:target_handle_connect()) hpcdata- 
MDT0000: refuse reconnection from a9f365a3-8746-6ed5-e45e-cd292891ece2 at 10.13.25.198 
@o2ib to 0xffff8101cdbd6000; still busy with 2 active RPCs
Lustre: 4135:0:(ldlm_lib.c:747:target_handle_connect()) Skipped 11  
previous similar messages
LustreError: 30076:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@  
processing error (-107)  req at ffff810143b27450 x4795081/t0 o101-><?>@<? 
 >:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 30076:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped  
4938 previous similar messages
LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock  
callback timer expired: evicting client ac5fac72- 
d4bf-8ff6-9e28-664d7c46678b at NET_0x500000a0d1972_UUID nid  
10.13.25.114 at o2ib  ns: mds-hpcdata-MDT0000_UUID lock:  
ffff8101207a9d00/0x3d8831a1452e3fc0 lrc: 1/0,0 mode: CR/CR res:  
3673528/1497616058 bits 0x3 rrc: 44 type: IBT flags: 4000020 remote:  
0xa943eb2122dba12d expref: 11 pid 4407
Lustre: 4534:0:(watchdog.c:312:lcw_update_time()) Expired watchdog for  
pid 4534 disabled after 499.9514s
Lustre: 4534:0:(watchdog.c:312:lcw_update_time()) Skipped 40 previous  
similar messages
LustreError: 4039:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs:  
operation 101 on unconnected MGS
LustreError: 4039:0:(mgs_handler.c:515:mgs_handle()) Skipped 4927  
previous similar messages
LustreError: 4557:0:(handler.c:1499:mds_handle()) operation 101 on  
unconnected MDS from 12345-10.13.25.114 at o2ib
[root at homemds tmp]#








More information about the lustre-discuss mailing list