[lustre-discuss] LNet: Service thread pid 37786 was inactive for 200.65s. The thread might be hung, or it might only be slow and will resume later

李子岩 liziyanmail at 126.com
Sat Nov 20 04:08:35 PST 2021




Hi All,
The filesystem hung for about 10 minutes and then recovered automatically.
OS: CentOS 7.6
Lustre version: 2.12.4
Network: Intel Omni-Path
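For reference, the log points at the LNet route between the o2ib and tcp1 networks (RDMA to the router NID 172.18.0.58 on o2ib keeps timing out). A minimal sketch of the checks that can be run on the MDS with the lnetctl/lctl tools shipped in 2.12 is below; the NIDs are taken from the log that follows:

    # Local network interfaces, configured routes, and peer health counters
    lnetctl net show -v
    lnetctl route show -v
    lnetctl peer show --nid 172.18.0.58@o2ib -v
    lnetctl stats show

    # Basic LNet reachability from the MDS to the router and to a tcp1 client
    lctl ping 172.18.0.58@o2ib
    lctl ping 10.0.5.4@tcp1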


MDS log:
Nov 19 16:54:17 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
Nov 19 16:54:17 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 3 previous similar messages
Nov 19 16:54:17 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.18.0.58 at o2ib (1): c: 0, oc: 0, rc: 32
Nov 19 16:54:17 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 3 previous similar messages
Nov 19 16:54:17 hwmds1 kernel: LNetError: 25082:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 172.18.0.60 at o2ib added to recovery queue. Health = 900
Nov 19 16:54:17 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 10.0.5.5 at tcp1, removing former export from same NID
Nov 19 16:54:23 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Nov 19 16:54:23 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.18.0.58 at o2ib (6): c: 0, oc: 1, rc: 32
Nov 19 16:54:23 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) no route to 10.0.5.4 at tcp1 from 172.18.0.60 at o2ib
Nov 19 16:54:23 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) Skipped 31 previous similar messages
Nov 19 16:54:23 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.0.5.4 at tcp1: -113
Nov 19 16:54:23 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Skipped 31 previous similar messages
Nov 19 16:54:29 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Nov 19 16:54:29 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.18.0.58 at o2ib (6): c: 0, oc: 0, rc: 32
Nov 19 16:54:29 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) no route to 10.0.5.3 at tcp1 from 172.18.0.60 at o2ib
Nov 19 16:54:29 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) Skipped 31 previous similar messages
Nov 19 16:54:29 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.0.5.3 at tcp1: -113
Nov 19 16:54:29 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Skipped 31 previous similar messages
Nov 19 16:54:41 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
Nov 19 16:54:41 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.18.0.58 at o2ib (5): c: 0, oc: 0, rc: 32
Nov 19 16:54:41 hwmds1 kernel: LNetError: 25082:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 172.18.0.60 at o2ib added to recovery queue. Health = 900
Nov 19 16:54:41 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) no route to 10.0.5.3 at tcp1 from 172.18.0.60 at o2ib
Nov 19 16:54:41 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) Skipped 30 previous similar messages
Nov 19 16:54:41 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.0.5.3 at tcp1: -113
Nov 19 16:54:41 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Skipped 30 previous similar messages
Nov 19 16:54:43 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 10.0.5.6 at tcp1, removing former export from same NID
Nov 19 16:54:43 hwmds1 kernel: Lustre: MGS: Connection restored to 427c85e3-c0dd-55df-867f-180b784b009a (at 10.0.5.6 at tcp1)
Nov 19 16:54:43 hwmds1 kernel: Lustre: Skipped 1 previous similar message
Nov 19 16:54:53 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Nov 19 16:54:53 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 3 previous similar messages
Nov 19 16:54:53 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.18.0.58 at o2ib (6): c: 0, oc: 0, rc: 32
Nov 19 16:54:53 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 3 previous similar messages
Nov 19 16:54:53 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) no route to 10.0.5.4 at tcp1 from 172.18.0.60 at o2ib
Nov 19 16:54:53 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) Skipped 31 previous similar messages
Nov 19 16:54:53 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.0.5.4 at tcp1: -113
Nov 19 16:54:53 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Skipped 31 previous similar messages
Nov 19 16:55:00 hwmds1 kernel: Lustre: sjtu-MDT0000: Client 852ed01f-1ce7-4684-ff85-f72325b5891f (at 10.0.5.8 at tcp1) reconnecting
Nov 19 16:55:00 hwmds1 kernel: Lustre: Skipped 1 previous similar message
Nov 19 16:55:00 hwmds1 kernel: Lustre: sjtu-MDT0000: Connection restored to bbfdb2d3-9c4e-73f3-49cd-f7c519481608 (at 10.0.5.8 at tcp1)
Nov 19 16:55:00 hwmds1 kernel: Lustre: Skipped 1 previous similar message
Nov 19 16:55:06 hwmds1 kernel: Lustre: sjtu-MDT0000: Client 8a066f59-a368-98c0-39fb-3e7ecdccf7c3 (at 10.0.5.4 at tcp1) reconnecting
Nov 19 16:55:06 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 10.0.5.4 at tcp1, removing former export from same NID
Nov 19 16:55:06 hwmds1 kernel: Lustre: Skipped 1 previous similar message
Nov 19 16:55:08 hwmds1 kernel: Lustre: sjtu-MDT0000: Connection restored to 427c85e3-c0dd-55df-867f-180b784b009a (at 10.0.5.6 at tcp1)
Nov 19 16:55:08 hwmds1 kernel: Lustre: Skipped 4 previous similar messages
Nov 19 16:55:13 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Nov 19 16:55:13 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 6 previous similar messages
Nov 19 16:55:13 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.18.0.58 at o2ib (4): c: 0, oc: 1, rc: 32
Nov 19 16:55:13 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 6 previous similar messages
Nov 19 16:55:33 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 10.0.5.6 at tcp1, removing former export from same NID
Nov 19 16:55:33 hwmds1 kernel: Lustre: Skipped 2 previous similar messages
Nov 19 16:55:33 hwmds1 kernel: Lustre: MGS: Connection restored to 427c85e3-c0dd-55df-867f-180b784b009a (at 10.0.5.6 at tcp1)
Nov 19 16:55:33 hwmds1 kernel: Lustre: Skipped 2 previous similar messages
Nov 19 16:55:47 hwmds1 kernel: LustreError: 37012:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.0.5.4 at tcp1  ns: mdt-sjtu-MDT0000_UUID lock: ffff970132f7bcc0/0xc6adca8fa239d9b9 lrc: 4/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 699 type: IBT flags: 0x60200400000020 nid: 10.0.5.4 at tcp1 remote: 0x1c8f6a4e7b5421fc expref: 11682370 pid: 11858 timeout: 2794572 lvb_type: 0
Nov 19 16:55:49 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Nov 19 16:55:49 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 11 previous similar messages
Nov 19 16:55:49 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.18.0.58 at o2ib (6): c: 0, oc: 0, rc: 32
Nov 19 16:55:49 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 11 previous similar messages
Nov 19 16:55:53 hwmds1 kernel: LNetError: 25121:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 172.18.0.58 at o2ib added to recovery queue. Health = 900
Nov 19 16:55:57 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 10.0.5.4 at tcp1, removing former export from same NID
Nov 19 16:55:57 hwmds1 kernel: Lustre: Skipped 1 previous similar message
Nov 19 16:55:58 hwmds1 kernel: Lustre: sjtu-MDT0000: Client eba02ec3-4f91-bc12-39d7-32641f8e78d0 (at 10.0.5.6 at tcp1) reconnecting
Nov 19 16:55:58 hwmds1 kernel: Lustre: Skipped 3 previous similar messages
Nov 19 16:56:23 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 10.0.5.6 at tcp1, removing former export from same NID
Nov 19 16:56:23 hwmds1 kernel: Lustre: Skipped 2 previous similar messages
Nov 19 16:56:23 hwmds1 kernel: Lustre: MGS: Connection restored to 427c85e3-c0dd-55df-867f-180b784b009a (at 10.0.5.6 at tcp1)
Nov 19 16:56:23 hwmds1 kernel: Lustre: Skipped 5 previous similar messages
Nov 19 16:56:24 hwmds1 kernel: Lustre: sjtu-MDT0000: Client 852ed01f-1ce7-4684-ff85-f72325b5891f (at 10.0.5.8 at tcp1) reconnecting
Nov 19 16:56:30 hwmds1 kernel: LNet: 25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.18.0.58 at o2ib: 0 seconds
Nov 19 16:56:48 hwmds1 kernel: Lustre: sjtu-MDT0000: Client eba02ec3-4f91-bc12-39d7-32641f8e78d0 (at 10.0.5.6 at tcp1) reconnecting
Nov 19 16:56:56 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Nov 19 16:56:56 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 12 previous similar messages
Nov 19 16:56:56 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.18.0.58 at o2ib (5): c: 0, oc: 0, rc: 32
Nov 19 16:56:56 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 12 previous similar messages
Nov 19 16:57:14 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 10.0.5.6 at tcp1, removing former export from same NID
Nov 19 16:57:14 hwmds1 kernel: Lustre: Skipped 4 previous similar messages
Nov 19 16:57:28 hwmds1 kernel: LNet: Service thread pid 37786 was inactive for 200.65s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 19 16:57:28 hwmds1 kernel: Pid: 37786, comm: mdt00_004 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Tue Feb 11 19:11:56 UTC 2020
Nov 19 16:57:28 hwmds1 kernel: Call Trace:
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0f60c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0f617b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc140c5cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc140cc50>] mdt_object_lock_internal+0x70/0x360 [mdt]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc140de0a>] mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc1415fe5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc1412d35>] mdt_intent_policy+0x435/0xd80 [mdt]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0f47e06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0f70516>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0ff8bb2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0fff9ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0fa447b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0fa7de4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffff9b6c61f1>] kthread+0xd1/0xe0
Nov 19 16:57:28 hwmds1 kernel: [<ffffffff9bd8dd1d>] ret_from_fork_nospec_begin+0x7/0x21
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Nov 19 16:57:28 hwmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1637312248.37786
Nov 19 16:57:29 hwmds1 kernel: LNet: Service thread pid 37928 was inactive for 200.87s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 19 16:57:29 hwmds1 kernel: Pid: 37928, comm: mdt01_006 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Tue Feb 11 19:11:56 UTC 2020
Nov 19 16:57:29 hwmds1 kernel: Call Trace:
Nov 19 16:57:29 hwmds1 kernel: LNet: Service thread pid 486 was inactive for 200.88s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Nov 19 16:57:29 hwmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1637312249.42759
Nov 19 16:57:30 hwmds1 kernel: LNet: Service thread pid 6301 was inactive for 200.16s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Nov 19 16:57:30 hwmds1 kernel: LNet: Skipped 143 previous similar messages
Nov 19 16:57:30 hwmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1637312250.6301
Nov 19 16:57:31 hwmds1 kernel: LNet: Service thread pid 6383 was inactive for 200.42s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Nov 19 16:57:31 hwmds1 kernel: LNet: Skipped 22 previous similar messages
Nov 19 16:57:31 hwmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1637312251.6383
Nov 19 16:57:32 hwmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1637312252.6336
Nov 19 16:57:33 hwmds1 kernel: LNet: Service thread pid 42111 was inactive for 200.16s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Nov 19 16:57:33 hwmds1 kernel: LNet: Skipped 28 previous similar messages
Nov 19 16:57:33 hwmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1637312253.42111
Nov 19 16:57:34 hwmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1637312254.6297
Nov 19 16:57:35 hwmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1637312255.6258
Nov 19 16:57:36 hwmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1637312256.6315
Nov 19 16:57:37 hwmds1 kernel: LNet: Service thread pid 6391 was inactive for 200.26s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Nov 19 16:57:37 hwmds1 kernel: LNet: Skipped 32 previous similar messages
Nov 19 16:57:37 hwmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1637312257.6391
Nov 19 16:57:37 hwmds1 kernel: Lustre: MGS: Connection restored to d602f485-3852-99ed-1518-2f5a73c38bd9 (at 10.0.5.4 at tcp1)
Nov 19 16:57:37 hwmds1 kernel: Lustre: Skipped 9 previous similar messages
Nov 19 16:58:04 hwmds1 kernel: Lustre: sjtu-MDT0000: Client eba02ec3-4f91-bc12-39d7-32641f8e78d0 (at 10.0.5.6 at tcp1) reconnecting
Nov 19 16:58:04 hwmds1 kernel: Lustre: Skipped 1 previous similar message
Nov 19 16:58:29 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 10.0.5.8 at tcp1, removing former export from same NID
Nov 19 16:58:29 hwmds1 kernel: Lustre: Skipped 6 previous similar messages
Nov 19 16:59:07 hwmds1 kernel: LustreError: 6237:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1637312047, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff96c81f715f80/0xc6adca8fa23ea12a lrc: 3/1,0 mode: --/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 700 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 6237 timeout: 0 lvb_type: 0
Nov 19 16:59:07 hwmds1 kernel: LustreError: 6237:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 4 previous similar messages
Nov 19 16:59:07 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 9 seconds
Nov 19 16:59:07 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 26 previous similar messages
Nov 19 16:59:07 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.18.0.58 at o2ib (14): c: 0, oc: 0, rc: 32
Nov 19 16:59:07 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 26 previous similar messages
Nov 19 16:59:07 hwmds1 kernel: LustreError: 6253:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1637312047, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff9709bd320480/0xc6adca8fa23f872e lrc: 3/1,0 mode: --/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 699 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 6253 timeout: 0 lvb_type: 0
Nov 19 16:59:07 hwmds1 kernel: LustreError: 6253:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 70 previous similar messages
Nov 19 16:59:08 hwmds1 kernel: LustreError: 42188:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1637312048, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff96ed996a5b00/0xc6adca8fa2414f0e lrc: 3/1,0 mode: --/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 699 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 42188 timeout: 0 lvb_type: 0
Nov 19 16:59:08 hwmds1 kernel: LustreError: 42188:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 29 previous similar messages
Nov 19 16:59:10 hwmds1 kernel: LustreError: 6298:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1637312050, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff970dc9331d40/0xc6adca8fa2455cb9 lrc: 3/1,0 mode: --/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 699 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 6298 timeout: 0 lvb_type: 0
Nov 19 16:59:10 hwmds1 kernel: LustreError: 6298:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 71 previous similar messages
Nov 19 16:59:15 hwmds1 kernel: LustreError: 6335:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1637312055, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff96c732bd4fc0/0xc6adca8fa247ba24 lrc: 3/1,0 mode: --/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 699 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 6335 timeout: 0 lvb_type: 0
Nov 19 16:59:15 hwmds1 kernel: LustreError: 6335:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 28 previous similar messages
Nov 19 16:59:19 hwmds1 kernel: Lustre: sjtu-MDT0000: Client eba02ec3-4f91-bc12-39d7-32641f8e78d0 (at 10.0.5.6 at tcp1) reconnecting
Nov 19 16:59:19 hwmds1 kernel: Lustre: Skipped 1 previous similar message
Nov 19 16:59:39 hwmds1 kernel: LNet: 25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.18.0.58 at o2ib: 2 seconds
Nov 19 16:59:39 hwmds1 kernel: LNet: 25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 9 previous similar messages
Nov 19 17:00:10 hwmds1 kernel: Lustre: MGS: Connection restored to bbfdb2d3-9c4e-73f3-49cd-f7c519481608 (at 10.0.5.8 at tcp1)
Nov 19 17:00:10 hwmds1 kernel: Lustre: Skipped 13 previous similar messages
Nov 19 17:00:36 hwmds1 kernel: Lustre: sjtu-MDT0000: haven't heard from client 37d3a752-14aa-0392-4653-9c396f5f2f4e (at 10.0.5.3 at tcp1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff96afbaa27400, cur 1637312436 expire 1637312286 last 1637312209
Nov 19 17:00:48 hwmds1 kernel: LNet: 25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.18.0.58 at o2ib: 3 seconds
Nov 19 17:00:48 hwmds1 kernel: LNet: 25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 31 previous similar messages
Nov 19 17:00:58 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 10.0.5.4 at tcp1, removing former export from same NID
Nov 19 17:00:58 hwmds1 kernel: Lustre: Skipped 5 previous similar messages
Nov 19 17:01:31 hwmds1 kernel: LNet: 25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.18.0.58 at o2ib: 1 seconds
Nov 19 17:01:31 hwmds1 kernel: LNet: 25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 31 previous similar messages
Nov 19 17:01:50 hwmds1 kernel: Lustre: sjtu-MDT0000: Client eba02ec3-4f91-bc12-39d7-32641f8e78d0 (at 10.0.5.6 at tcp1) reconnecting
Nov 19 17:01:50 hwmds1 kernel: Lustre: Skipped 2 previous similar messages
Nov 19 17:02:51 hwmds1 kernel: LNet: 25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.18.0.58 at o2ib: 0 seconds
Nov 19 17:02:51 hwmds1 kernel: LNet: 25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 31 previous similar messages
Nov 19 17:03:23 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
Nov 19 17:03:23 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 34 previous similar messages
Nov 19 17:03:23 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.18.0.58 at o2ib (8): c: 0, oc: 0, rc: 32
Nov 19 17:03:23 hwmds1 kernel: LNetError: 25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 34 previous similar messages
Nov 19 17:03:23 hwmds1 kernel: LNet: 25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.18.0.58 at o2ib: 3 seconds
Nov 19 17:03:23 hwmds1 kernel: LNet: 25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 31 previous similar messages
Nov 19 17:03:32 hwmds1 kernel: Lustre: sjtu-MDT0000: haven't heard from client 852ed01f-1ce7-4684-ff85-f72325b5891f (at 10.0.5.8 at tcp1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff970e82d5e000, cur 1637312612 expire 1637312462 last 1637312385
Nov 19 17:03:35 hwmds1 kernel: LNet: Service thread pid 6322 completed after 558.36s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Nov 19 17:03:35 hwmds1 kernel: LNet: Skipped 3 previous similar messages
Nov 19 17:03:35 hwmds1 kernel: LustreError: 6337:0:(service.c:2128:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.0.5.4 at tcp1: deadline 100:358s ago#012  req at ffff96b20f0a6300 x1714140739458176/t0(0) o38-><?>@<?>:0/0 lens 520/0 e 0 to 0 dl 1637312257 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Nov 19 17:03:35 hwmds1 kernel: Lustre: 6389:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (100:357s); client may timeout.  req at ffff96bf13b18d80 x1713999142500672/t0(0) o38-><?>@<?>:0/0 lens 520/0 e 0 to 0 dl 1637312258 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Nov 19 17:03:35 hwmds1 kernel: Lustre: 6389:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 1 previous similar message
Nov 19 17:03:35 hwmds1 kernel: LustreError: 6337:0:(service.c:2128:ptlrpc_server_handle_request()) Skipped 2 previous similar messages
Nov 19 17:03:35 hwmds1 kernel: LustreError: 51417:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 10.0.5.8 at tcp1 arrived at 1637312615 with bad export cookie 14316320842890482053
Nov 19 17:03:40 hwmds1 kernel: LNetError: 25082:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 172.18.0.60 at o2ib added to recovery queue. Health = 900
Nov 19 17:03:44 hwmds1 kernel: Lustre: 6281:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1637312617/real 0]  req at ffff96ba37651680 x1713962196237696/t0(0) o104->sjtu-MDT0000 at 10.0.5.4@tcp1:15/16 lens 296/224 e 0 to 1 dl 1637312624 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Nov 19 17:03:44 hwmds1 kernel: Lustre: 6281:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
Nov 19 17:03:51 hwmds1 kernel: LNetError: 25082:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 172.18.0.60 at o2ib added to recovery queue. Health = 900
Nov 19 17:04:38 hwmds1 kernel: LNetError: 25121:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 172.18.0.58 at o2ib added to recovery queue. Health = 900
Nov 19 17:04:47 hwmds1 kernel: Lustre: sjtu-MDT0000: Connection restored to bbfdb2d3-9c4e-73f3-49cd-f7c519481608 (at 10.0.5.8 at tcp1)
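
In case anyone wants to look at them, the /tmp/lustre-log.* files referenced above are binary Lustre debug dumps; as far as I know they can be converted to plain text with lctl, e.g.:

    # Decode one of the dumped debug logs into readable text (path taken from the log above)
    lctl debug_file /tmp/lustre-log.1637312248.37786 /tmp/lustre-log.1637312248.37786.txt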






 