[Lustre-discuss] Problem on MDS.

Θεόδωρος Στυλιανός Κονδύλης theodoros.stylianos.kondylis at gmail.com
Fri Jan 25 08:03:14 PST 2013


Hello everyone,

We are running Lustre v1.8.4, and since last night we have been facing
problems on our scratch FS. So I went to the respective MDS, where in
/var/log/messages I found the following:

> Jan 24 23:12:13 mds kernel: LustreError:
> 5107:0:(ldlm_lock.c:430:__ldlm_handle2lock()) ASSERTION(handle) failed
> Jan 24 23:12:13 mds kernel: LustreError:
> 5107:0:(ldlm_lock.c:430:__ldlm_handle2lock()) LBUG
> Jan 24 23:12:13 mds kernel: Pid: 5107, comm: ll_mdt_106
> Jan 24 23:12:13 mds kernel:
> Jan 24 23:12:13 mds kernel: Call Trace:
> Jan 24 23:12:13 mds kernel:  [<ffffffffa05d88ba>]
> libcfs_debug_dumpstack+0x4a/0x70 [libcfs]
> Jan 24 23:12:13 mds kernel:  [<ffffffffa05d8e2c>] lbug_with_loc+0x6c/0xd0
> [libcfs]
> Jan 24 23:12:13 mds kernel:  [<ffffffffa05e2550>] tracefile_init+0x0/0x170
> [libcfs]
> Jan 24 23:12:13 mds kernel:  [<ffffffffa06eb24f>]
> __ldlm_handle2lock+0x35f/0x3b0 [ptlrpc]
> Jan 24 23:12:13 mds kernel:  [<ffffffffa06ec00f>]
> ldlm_lock_decref+0x1f/0xb0 [ptlrpc]
> Jan 24 23:12:13 mds kernel:  [<ffffffffa09ba124>]
> mds_verify_child+0x4e4/0x8b0 [mds]
> Jan 24 23:12:13 mds kernel:  [<ffffffffa09bcf39>]
> mds_get_parent_child_locked+0x4e9/0x980 [mds]
> Jan 24 23:12:13 mds kernel:  [<ffffffffa09e0886>] mds_open+0x716/0x3680
> [mds]
> Jan 24 23:12:13 mds kernel:  [<ffffffffa09b73c7>]
> mds_reint_rec+0x157/0x2d0 [mds]
> Jan 24 23:12:13 mds kernel:  [<ffffffffa09ac207>] mds_reint+0x1d7/0x450
> [mds]
> Jan 24 23:12:13 mds kernel:  [<ffffffffa09b1ae0>]
> mds_intent_policy+0x6f0/0xd20 [mds]
> Jan 24 23:12:13 mds kernel:  [<ffffffffa06ec958>]
> ldlm_lock_enqueue+0x468/0xbf0 [ptlrpc]
> Jan 24 23:12:13 mds kernel:  [<ffffffffa070fe85>]
> ldlm_handle_enqueue+0x265/0x1300 [ptlrpc]
> Jan 24 23:12:13 mds kernel:  [<ffffffffa09b2bcc>] mds_handle+0xabc/0x5150
> [mds]
> Jan 24 23:12:13 mds kernel:  [<ffffffffa073ea32>]
> ptlrpc_server_handle_request+0x902/0x1330 [ptlrpc]
> Jan 24 23:12:13 mds kernel:  [<ffffffffa07411a4>] ptlrpc_main+0x6d4/0x1610
> [ptlrpc]
> Jan 24 23:12:13 mds kernel:  [<ffffffff8020cf49>] child_rip+0xa/0x11
> Jan 24 23:12:13 mds kernel:
> Jan 24 23:12:13 mds kernel: LustreError: dumping log to
> /tmp/lustre-log.1359065533.5107
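
(In case the details help anyone, that binary dump should, as far as I
know, be convertible to readable text with lctl's debug_file command; a
sketch, using the dump path from the log line above:)

    # convert the binary Lustre debug dump to plain text for inspection
    lctl df /tmp/lustre-log.1359065533.5107 /tmp/lustre-log.1359065533.5107.txt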


By googling, I found that this is a bug that was fixed in version 1.8.6 (
http://jira.whamcloud.com/browse/LU-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel
).
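
(For the record, the version actually running on the MDS can be checked
from proc, which should confirm we are still on 1.8.4:)

    # report the Lustre version running on this server
    cat /proc/fs/lustre/version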

After that follows a long list of Lustre errors like the one below, which
keeps growing as we speak.

> Jan 24 23:25:38 mds kernel: LustreError:
> 23676:0:(ldlm_request.c:83:ldlm_expired_completion_wait()) ### lock timed
> out (enqueued at 1359066138, 200s ago); not entering recovery in server
> code, just going back to sleep ns: mds-scratch-MDT0000_UUID lock:
> ffff88080a81f800/0x48997a721de79abd lrc: 3/0,1 mode: --/EX res:
> 381225206/1751756546 bits 0x3 rrc: 63 type: IBT flags: 0x4004030 remote:
> 0x0 expref: -99 pid: 23676 timeout: 0
> Jan 24 23:25:38 mds kernel: LustreError: dumping log to
> /tmp/lustre-log.1359066338.23676
> Jan 24 23:26:06 mds kernel: Lustre:
> 23677:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
> ba8ef0e3-630b-714c-4a37-b7558adba3d0 reconnecting
> Jan 24 23:26:25 mds kernel: LustreError:
> 23659:0:(ldlm_request.c:83:ldlm_expired_completion_wait()) ### lock timed
> out (enqueued at 1359066185, 200s ago); not entering recovery in server
> code, just going back to sleep ns: mds-scratch-MDT0000_UUID lock:
> ffff880c04ded200/0x48997a721e02cf62 lrc: 3/1,0 mode: --/CR res:
> 381225206/1751756546 bits 0x3 rrc: 64 type: IBT flags: 0x4004000 remote:
> 0x0 expref: -99 pid: 23659 timeout: 0
> Jan 24 23:26:26 mds kernel: LustreError:
> 5071:0:(ldlm_request.c:83:ldlm_expired_completion_wait()) ### lock timed
> out (enqueued at 1359066186, 200s ago); not entering recovery in server
> code, just going back to sleep ns: mds-scratch-MDT0000_UUID lock:
> ffff880b4b213a00/0x48997a721e03a0b3 lrc: 3/1,0 mode: --/CR res:
> 381225206/1751756546 bits 0x3 rrc: 64 type: IBT flags: 0x4004000 remote:
> 0x0 expref: -99 pid: 5071 timeout: 0
> Jan 24 23:26:29 mds kernel: Lustre:
> 5033:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
> 58837d54-5d72-cecb-593b-e7d78c9c54d3 reconnecting
> Jan 24 23:27:15 mds kernel: LustreError:
> 5101:0:(ldlm_request.c:83:ldlm_expired_completion_wait()) ### lock timed
> out (enqueued at 1359066235, 200s ago); not entering recovery in server
> code, just going back to sleep ns: mds-scratch-MDT0000_UUID lock:
> ffff88090f7c7600/0x48997a721e1e8189 lrc: 3/1,0 mode: --/CR res:
> 381225206/1751756546 bits 0x3 rrc: 67 type: IBT flags: 0x4004000 remote:
> 0x0 expref: -99 pid: 5101 timeout: 0
> Jan 24 23:27:18 mds kernel: Lustre:
> 23684:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
> 141ba2f1-5a32-282b-1b9f-fde05c0caa1a reconnecting
> Jan 24 23:27:22 mds kernel: Lustre:
> 5099:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
> e57b39b0-2e26-0e11-3995-e801a3fb4ce3 reconnecting
> Jan 24 23:27:24 mds kernel: Lustre:
> 23672:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
> 72b809ed-de60-1c38-f17e-145a2294c6db reconnecting
> Jan 24 23:27:24 mds kernel: Lustre:
> 23672:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 1 previous
> similar message
> Jan 24 23:27:26 mds kernel: LustreError:
> 23665:0:(ldlm_request.c:83:ldlm_expired_completion_wait()) ### lock timed
> out (enqueued at 1359066246, 200s ago); not entering recovery in server
> code, just going back to sleep ns: mds-scratch-MDT0000_UUID lock:
> ffff8808c4952000/0x48997a721e2588eb lrc: 3/1,0 mode: --/CR res:
> 381225206/1751756546 bits 0x3 rrc: 67 type: IBT flags: 0x4004000 remote:
> 0x0 expref: -99 pid: 23665 timeout: 0
> Jan 24 23:27:29 mds kernel: Lustre:
> 5108:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
> c9029114-746d-964e-8712-906a310f598f reconnecting
> Jan 24 23:27:29 mds kernel: Lustre:
> 5108:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 1 previous
> similar message
> Jan 24 23:27:38 mds kernel: Lustre:
> 5032:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
> d3e7f205-2e08-6614-3bf0-c2344ea5bf09 reconnecting
> Jan 24 23:27:38 mds kernel: Lustre:
> 5032:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 2 previous
> similar messages
> Jan 24 23:27:56 mds kernel: Lustre:
> 5066:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
> 4e67a94e-7e0d-4b29-6272-587a72de2498 reconnecting
> Jan 24 23:27:56 mds kernel: Lustre:
> 5066:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 10 previous
> similar messages
> Jan 24 23:28:36 mds kernel: Lustre:
> 23686:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
> f5637638-bf4a-320a-6967-7cffaea36273 reconnecting
> Jan 24 23:28:36 mds kernel: Lustre:
> 23686:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 8 previous
> similar messages
> Jan 24 23:28:36 mds kernel: Lustre:
> 5039:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any
> time (5/-383), not sending early reply
> Jan 24 23:28:36 mds kernel:   req@ffff88054a4ea400 x1425048289875237/t0
> o101->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0
> lens 1024/3680 e 4 to 0 dl 1359066521 ref 2 fl Interpret:/0/0 rc 0/0

> Jan 24 23:30:54 mds kernel: Lustre:
> 5023:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse
> reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to
> 0xffff880beb978200; still busy with 1 active RPCs
> Jan 24 23:30:54 mds kernel: LustreError:
> 5023:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16)
>  req@ffff880425d99c00 x1425048289890866/t0
> o38->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0
> lens 368/264 e 0 to 0 dl 1359066754 ref 1 fl Interpret:/0/0 rc -16/0
> Jan 24 23:31:01 mds kernel: Lustre:
> 5124:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse
> reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to
> 0xffff880beb978200; still busy with 1 active RPCs
> Jan 24 23:31:01 mds kernel: LustreError:
> 5124:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16)
>  req@ffff88081b5c4000 x1425048289890867/t0
> o38->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0
> lens 368/264 e 0 to 0 dl 1359066761 ref 1 fl Interpret:/0/0 rc -16/0
> Jan 24 23:31:08 mds kernel: Lustre:
> 5086:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse
> reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to
> 0xffff880beb978200; still busy with 1 active RPCs
> Jan 24 23:31:08 mds kernel: LustreError:
> 5086:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16)
>  req@ffff8806213cfc00 x1425048289891170/t0
> o38->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0
> lens 368/264 e 0 to 0 dl 1359066768 ref 1 fl Interpret:/0/0 rc -16/0
> Jan 24 23:31:15 mds kernel: Lustre:
> 5081:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse
> reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to
> 0xffff880beb978200; still busy with 1 active RPCs
> Jan 24 23:31:15 mds kernel: LustreError:
> 5081:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16)
>  req@ffff88011d249000 x1425048289891171/t0
> o38->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0
> lens 368/264 e 0 to 0 dl 1359066775 ref 1 fl Interpret:/0/0 rc -16/0
> Jan 24 23:31:22 mds kernel: Lustre:
> 5017:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse
> reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to
> 0xffff880beb978200; still busy with 1 active RPCs
> Jan 24 23:31:22 mds kernel: LustreError:
> 5017:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16)
>  req@ffff8805d32c3000 x1425048289891172/t0
> o38->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0
> lens 368/264 e 0 to 0 dl 1359066782 ref 1 fl Interpret:/0/0 rc -16/0
> Jan 24 23:31:29 mds kernel: Lustre:
> 5040:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse
> reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to
> 0xffff880beb978200; still busy with 1 active RPCs
> Jan 24 23:31:29 mds kernel: LustreError:
> 5040:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16)
>  req@ffff8803a768a800 x1425048289891173/t0
> o38->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0
> lens 368/264 e 0 to 0 dl 1359066789 ref 1 fl Interpret:/0/0 rc -16/0
> Jan 24 23:31:43 mds kernel: Lustre:
> 5039:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse
> reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to
> 0xffff880beb978200; still busy with 1 active RPCs
> Jan 24 23:31:43 mds kernel: Lustre:
> 5039:0:(ldlm_lib.c:872:target_handle_connect()) Skipped 1 previous similar
> message
> Jan 24 23:31:43 mds kernel: LustreError:
> 5039:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16)
>  req@ffff88040fe6b000 x1425048289891477/t0
> o38->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0
> lens 368/264 e 0 to 0 dl 1359066803 ref 1 fl Interpret:/0/0 rc -16/0
> Jan 24 23:31:43 mds kernel: LustreError:
> 5039:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 1 previous similar
> message
> Jan 24 23:31:50 mds kernel: Lustre:
> 23682:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
> ee75239f-4115-3c1d-a11a-7649df079499 reconnecting
> Jan 24 23:31:50 mds kernel: Lustre:
> 23682:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 22 previous
> similar messages
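
(The refused UUID ee75239f-4115-3c1d-a11a-7649df079499 identifies the
stuck client. If I understand the 1.8 proc interface correctly, such a
client can be evicted by hand as sketched below, although with the
service thread dead after the LBUG this may not free the active RPC:)

    # forcibly evict the client holding the stale RPC; assumes the 1.8
    # mds evict_client proc entry, and may not help after an LBUG
    echo ee75239f-4115-3c1d-a11a-7649df079499 > \
        /proc/fs/lustre/mds/scratch-MDT0000/evict_client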


I was wondering whether failing over the MDT and then rebooting the MDS
would solve the problem.
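
(Roughly what I have in mind, with /dev/mdtdev and /mnt/mdt as
placeholders for our actual MDT device and mount point:)

    # on the current MDS: stop the MDT; clients will block and retry
    umount /mnt/mdt
    # on the failover node: mount the same MDT device to take over service
    mount -t lustre /dev/mdtdev /mnt/mdt
    # reboot the original MDS once the takeover and recovery complete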

If you have any ideas or proposals, I would be more than happy to hear them.
Stelios.