[lustre-discuss] (no subject)

Alastair Basden a.g.basden at durham.ac.uk
Tue May 17 09:02:08 PDT 2022


Hi all,

We had a problem with one of our MDS servers (ldiskfs) on Lustre 2.12.6, which 
we think is a bug, but we haven't been able to identify it.  Can anyone shed 
any light?  We unmounted and remounted the MDT at around 23:00.
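For reference, the remount was just a plain umount/mount of the MDT on the 
MDS, along these lines (the mount point here is illustrative, not our real 
path; the device matches the drbd0 shown in the logs below):

  # on the MDS: stop and restart the MDT
  umount /lustre/lustrefs8/mdt0
  mount -t lustre /dev/drbd0 /lustre/lustrefs8/mdt0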

Client logs:
May 16 22:15:41 m8011 kernel: LustreError: 11-0: lustrefs8-MDT0000-mdc-ffff956fb73c3800: operation ldlm_enqueue to node 172.18.185.1 at o2ib failed: rc = -107
May 16 22:15:41 m8011 kernel: Lustre: lustrefs8-MDT0000-mdc-ffff956fb73c3800: Connection to lustrefs8-MDT0000 (at 172.18.185.1 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
May 16 22:15:41 m8011 kernel: LustreError: Skipped 5 previous similar messages
May 16 22:15:48 m8011 kernel: Lustre: 101710:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1652735641/real 1652735641]  req at ffff949d8cb1de80 x1724290358528896/t0(0) o101->lustrefs8-MDT0000-mdc-ffff956fb73c3800 at 172.18.185.1@o2ib:12/10 lens 480/568 e 4 to 1 dl 1652735748 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
May 16 22:15:48 m8011 kernel: Lustre: 101710:0:(client.c:2146:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
May 16 23:00:15 m8011 kernel: Lustre: 4784:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1652738408/real 1652738408]  req at ffff94ea07314380 x1724290358763776/t0(0) o400->MGC172.18.185.1 at o2ib@172.18.185.1 at o2ib:26/25 lens 224/224 e 0 to 1 dl 1652738415 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 16 23:00:15 m8011 kernel: LustreError: 166-1: MGC172.18.185.1 at o2ib: Connection to MGS (at 172.18.185.1 at o2ib) was lost; in progress operations using this service will fail
May 16 23:00:15 m8011 kernel: Lustre: Evicted from MGS (at MGC172.18.185.1 at o2ib_0) after server handle changed from 0xdb7c7c778c8908d6 to 0xdb7c7cbad3be9e79
May 16 23:00:15 m8011 kernel: Lustre: MGC172.18.185.1 at o2ib: Connection restored to MGC172.18.185.1 at o2ib_0 (at 172.18.185.1 at o2ib)
May 16 23:01:49 m8011 kernel: LustreError: 167-0: lustrefs8-MDT0000-mdc-ffff956fb73c3800: This client was evicted by lustrefs8-MDT0000; in progress operations using this service will fail.
May 16 23:01:49 m8011 kernel: LustreError: 101719:0:(vvp_io.c:1562:vvp_io_init()) lustrefs8: refresh file layout [0x200008107:0x9b08:0x0] error -108.
May 16 23:01:49 m8011 kernel: LustreError: 101719:0:(vvp_io.c:1562:vvp_io_init()) Skipped 3 previous similar messages
May 16 23:01:49 m8011 kernel: Lustre: lustrefs8-MDT0000-mdc-ffff956fb73c3800: Connection restored to 172.18.185.1 at o2ib (at 172.18.185.1 at o2ib)



MDS server logs:
May 16 22:15:40 c8mds1 kernel: LustreError: 10686:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 172.18.181.11 at o2ib  ns: mdt-lustrefs8-MDT0000_UUID lock: ffff97b3730d98c0/0xdb7c7cbad3be1c7b lrc: 3/0,0 mode: PW/PW res: [0x200009119:0x327f:0x0].0x0 bits 0x40/0x0 rrc: 201 type: IBT flags: 0x60200400000020 nid: 172.18.181.11 at o2ib remote: 0xe62e31610edfb808 expref: 90 pid: 10707 timeout: 8482830 lvb_type: 0
May 16 22:15:40 c8mds1 kernel: LustreError: 10712:0:(ldlm_lockd.c:1351:ldlm_handle_enqueue0()) ### lock on destroyed export ffff9769eaf46c00 ns: mdt-lustrefs8-MDT0000_UUID lock: ffff97d828635e80/0xdb7c7cbad3be1c90 lrc: 3/0,0 mode: PW/PW res: [0x200009119:0x327f:0x0].0x0 bits 0x40/0x0 rrc: 199 type: IBT flags: 0x50200400000020 nid: 172.18.181.11 at o2ib remote: 0xe62e31610edfb80f expref: 77 pid: 10712 timeout: 0 lvb_type: 0
May 16 22:15:40 c8mds1 kernel: LustreError: 10712:0:(ldlm_lockd.c:1351:ldlm_handle_enqueue0()) Skipped 27 previous similar messages
May 16 22:17:22 c8mds1 kernel: LNet: Service thread pid 10783 was inactive for 200.73s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
May 16 22:17:22 c8mds1 kernel: LNet: Skipped 3 previous similar messages
May 16 22:17:22 c8mds1 kernel: Pid: 10783, comm: mdt01_040 3.10.0-1160.2.1.el7_lustre.x86_64 #1 SMP Wed Dec 9 20:53:35 UTC 2020
May 16 22:17:22 c8mds1 kernel: Call Trace:
May 16 22:17:22 c8mds1 kernel: [<ffffffffc1b160c0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: [<ffffffffc1b180f1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: [<ffffffffc0e1969b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
May 16 22:17:22 c8mds1 kernel: [<ffffffffc0e19d20>] mdt_object_lock_internal+0x70/0x360 [mdt]
May 16 22:17:22 c8mds1 kernel: [<ffffffffc0e1a030>] mdt_object_lock+0x20/0x30 [mdt]
May 16 22:17:22 c8mds1 kernel: [<ffffffffc0e5966b>] mdt_brw_enqueue+0x44b/0x760 [mdt]
May 16 22:17:22 c8mds1 kernel: [<ffffffffc0e074bf>] mdt_intent_brw+0x1f/0x30 [mdt]
May 16 22:17:22 c8mds1 kernel: [<ffffffffc0e20745>] mdt_intent_policy+0x435/0xd80 [mdt]
May 16 22:17:22 c8mds1 kernel: [<ffffffffc1afd4e6>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: [<ffffffffc1b24b16>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: [<ffffffffc1bae102>] tgt_enqueue+0x62/0x210 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: [<ffffffffc1bb4f1a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: [<ffffffffc1b5988b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: [<ffffffffc1b5d1f4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: [<ffffffff904c5c21>] kthread+0xd1/0xe0
May 16 22:17:22 c8mds1 kernel: [<ffffffff90b93de4>] ret_from_fork_nospec_begin+0xe/0x21
May 16 22:17:22 c8mds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
May 16 22:17:22 c8mds1 kernel: LustreError: dumping log to /tmp/lustre-log.1652735842.10783
May 16 22:17:25 c8mds1 kernel: LNet: Service thread pid 10779 was inactive for 203.65s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
May 16 22:17:25 c8mds1 kernel: Pid: 10779, comm: mdt01_036 3.10.0-1160.2.1.el7_lustre.x86_64 #1 SMP Wed Dec 9 20:53:35 UTC 2020
May 16 22:17:25 c8mds1 kernel: Call Trace:
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b160c0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b180f1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e1969b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e19d20>] mdt_object_lock_internal+0x70/0x360 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e1a030>] mdt_object_lock+0x20/0x30 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e5966b>] mdt_brw_enqueue+0x44b/0x760 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e074bf>] mdt_intent_brw+0x1f/0x30 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e20745>] mdt_intent_policy+0x435/0xd80 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1afd4e6>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b24b16>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1bae102>] tgt_enqueue+0x62/0x210 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1bb4f1a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b5988b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b5d1f4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffff904c5c21>] kthread+0xd1/0xe0
May 16 22:17:25 c8mds1 kernel: [<ffffffff90b93de4>] ret_from_fork_nospec_begin+0xe/0x21
May 16 22:17:25 c8mds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
May 16 22:17:25 c8mds1 kernel: Pid: 10761, comm: mdt01_018 3.10.0-1160.2.1.el7_lustre.x86_64 #1 SMP Wed Dec 9 20:53:35 UTC 2020
May 16 22:17:25 c8mds1 kernel: Call Trace:
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b160c0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b180f1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e1969b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e19d20>] mdt_object_lock_internal+0x70/0x360 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e1a030>] mdt_object_lock+0x20/0x30 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e5966b>] mdt_brw_enqueue+0x44b/0x760 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e074bf>] mdt_intent_brw+0x1f/0x30 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e20745>] mdt_intent_policy+0x435/0xd80 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1afd4e6>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b24b16>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1bae102>] tgt_enqueue+0x62/0x210 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1bb4f1a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b5988b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b5d1f4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffff904c5c21>] kthread+0xd1/0xe0
May 16 22:17:25 c8mds1 kernel: [<ffffffff90b93de4>] ret_from_fork_nospec_begin+0xe/0x21
May 16 22:17:25 c8mds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
May 16 22:17:25 c8mds1 kernel: Pid: 10744, comm: mdt01_001 3.10.0-1160.2.1.el7_lustre.x86_64 #1 SMP Wed Dec 9 20:53:35 UTC 2020
May 16 22:17:25 c8mds1 kernel: Call Trace:
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b160c0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b180f1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e1969b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e19d20>] mdt_object_lock_internal+0x70/0x360 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e1a030>] mdt_object_lock+0x20/0x30 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e5966b>] mdt_brw_enqueue+0x44b/0x760 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e074bf>] mdt_intent_brw+0x1f/0x30 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e20745>] mdt_intent_policy+0x435/0xd80 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1afd4e6>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b24b16>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1bae102>] tgt_enqueue+0x62/0x210 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1bb4f1a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b5988b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b5d1f4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffff904c5c21>] kthread+0xd1/0xe0
May 16 22:17:25 c8mds1 kernel: [<ffffffff90b93de4>] ret_from_fork_nospec_begin+0xe/0x21
May 16 22:17:25 c8mds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
May 16 22:17:25 c8mds1 kernel: Pid: 10775, comm: mdt01_032 3.10.0-1160.2.1.el7_lustre.x86_64 #1 SMP Wed Dec 9 20:53:35 UTC 2020
May 16 22:17:25 c8mds1 kernel: Call Trace:
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b160c0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1b180f1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e1969b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e19d20>] mdt_object_lock_internal+0x70/0x360 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e1a030>] mdt_object_lock+0x20/0x30 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e5966b>] mdt_brw_enqueue+0x44b/0x760 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e074bf>] mdt_intent_brw+0x1f/0x30 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc0e20745>] mdt_intent_policy+0x435/0xd80 [mdt]
May 16 22:17:25 c8mds1 kernel: [<ffffffffc1afd4e6>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
May 16 22:17:26 c8mds1 kernel: [<ffffffffc1b24b16>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
May 16 22:17:26 c8mds1 kernel: [<ffffffffc1bae102>] tgt_enqueue+0x62/0x210 [ptlrpc]
May 16 22:17:26 c8mds1 kernel: [<ffffffffc1bb4f1a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
May 16 22:17:26 c8mds1 kernel: [<ffffffffc1b5988b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
May 16 22:17:26 c8mds1 kernel: [<ffffffffc1b5d1f4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
May 16 22:17:26 c8mds1 kernel: [<ffffffff904c5c21>] kthread+0xd1/0xe0
May 16 22:17:26 c8mds1 kernel: [<ffffffff90b93de4>] ret_from_fork_nospec_begin+0xe/0x21
May 16 22:17:26 c8mds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
May 16 22:17:26 c8mds1 kernel: LNet: Service thread pid 10758 was inactive for 204.22s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
May 16 22:17:26 c8mds1 kernel: LNet: Skipped 86 previous similar messages
May 16 22:19:01 c8mds1 kernel: LustreError: 10731:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1652735641, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lustrefs8-MDT0000_UUID lock: ffff97e004e2c140/0xdb7c7cbad3be1cf9 lrc: 3/0,1 mode: --/PW res: [0x200009119:0x327f:0x0].0x0 bits 0x40/0x0 rrc: 199 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 10731 timeout: 0 lvb_type: 0
May 16 22:19:01 c8mds1 kernel: LustreError: 10731:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) Skipped 82 previous similar messages
May 16 22:19:13 c8mds1 kernel: LNet: Service thread pid 10712 was inactive for 212.17s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
May 16 22:19:13 c8mds1 kernel: LNet: Skipped 79 previous similar messages
May 16 22:19:13 c8mds1 kernel: LustreError: dumping log to /tmp/lustre-log.1652735953.10712
May 16 22:20:41 c8mds1 kernel: LustreError: 10712:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1652735740, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lustrefs8-MDT0000_UUID lock: ffff97d828631680/0xdb7c7cbad3be255e lrc: 3/0,1 mode: --/PW res: [0x200009119:0x327f:0x0].0x0 bits 0x40/0x0 rrc: 199 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 10712 timeout: 0 lvb_type: 0
May 16 22:20:41 c8mds1 kernel: LustreError: 10712:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) Skipped 6 previous similar messages
May 16 22:23:56 c8mds1 kernel: Lustre: 10741:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply#012  req at ffff978ac2779b00 x1724294641533504/t0(0) o101->70387729-37e6-0b1d-21e1-7361ddded0de at 172.18.182.20@o2ib:471/0 lens 608/0 e 24 to 0 dl 1652736241 ref 2 fl New:/0/ffffffff rc 0/-1
May 16 22:23:56 c8mds1 kernel: Lustre: 10741:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
May 16 22:24:02 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Client 5167509f-95db-2f03-61ce-247954999671 (at 172.18.181.64 at o2ib) reconnecting
May 16 22:24:02 c8mds1 kernel: Lustre: Skipped 3 previous similar messages
May 16 22:24:02 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Connection restored to e1d625a8-a6a8-baed-7c20-96366af6623c (at 172.18.181.26 at o2ib)
May 16 22:24:03 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Connection restored to 4d39a1d1-ecbe-d67d-ca2a-415e2a07a1bb (at 172.18.182.90 at o2ib)
May 16 22:24:03 c8mds1 kernel: Lustre: Skipped 63 previous similar messages
May 16 22:24:14 c8mds1 kernel: Lustre: 10741:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply#012  req at ffff97dfcf718480 x1724302352520256/t0(0) o101->d2a8736a-59c6-e6f5-15db-d12a0417e37e at 172.18.182.36@o2ib:489/0 lens 376/0 e 22 to 0 dl 1652736259 ref 2 fl New:/0/ffffffff rc 0/-1
May 16 22:24:14 c8mds1 kernel: Lustre: 10741:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 183 previous similar messages
May 16 22:24:20 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Connection restored to 6827f755-94c1-722c-22ec-777c3630efe9 (at 172.18.182.36 at o2ib)
May 16 22:24:20 c8mds1 kernel: Lustre: Skipped 1 previous similar message
May 16 22:24:23 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Connection restored to 02863bb5-0efe-30af-b758-5ea6d9c8f49e (at 172.18.180.22 at o2ib)
May 16 22:24:28 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Connection restored to 21ce8ec3-244d-5053-20fa-257438c22965 (at 172.18.182.24 at o2ib)
May 16 22:24:41 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Client 8a587b98-f20f-7e04-716d-07a70d0ddc59 (at 172.18.182.66 at o2ib) reconnecting
May 16 22:24:41 c8mds1 kernel: Lustre: Skipped 63 previous similar messages
May 16 22:24:41 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Connection restored to d2aff88b-77e8-31ce-8eff-d4eaf8ebc25f (at 172.18.182.66 at o2ib)
May 16 22:25:04 c8mds1 kernel: Lustre: 10711:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply#012  req at ffff98a4b61fe780 x1724308360403904/t0(0) o101->b8eae58d-c4c3-797e-0993-98d43037a3f6 at 172.18.182.9@o2ib:539/0 lens 2112/0 e 7 to 0 dl 1652736309 ref 2 fl New:/0/ffffffff rc 0/-1
May 16 22:25:04 c8mds1 kernel: Lustre: 10711:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 3 previous similar messages
May 16 22:25:10 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Connection restored to c16da7b9-791e-3f2f-c782-3c0d58b634ff (at 172.18.182.9 at o2ib)
May 16 22:25:44 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Connection restored to 266a4d82-5148-c97a-6950-49a1a34de653 (at 172.18.182.46 at o2ib)
May 16 22:25:44 c8mds1 kernel: Lustre: Skipped 1 previous similar message
May 16 22:25:46 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Client c91bd11c-13d6-526f-b70b-f3cdf9dad9f5 (at 172.18.181.249 at o2ib) reconnecting
May 16 22:25:46 c8mds1 kernel: Lustre: Skipped 3 previous similar messages
May 16 22:26:27 c8mds1 kernel: Lustre: 10711:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply#012  req at ffff9769d8be0480 x1726170743340288/t0(0) o101->91ea24e9-1911-fbf5-7956-efb46ee6e0ca at 172.18.181.244@o2ib:622/0 lens 376/0 e 3 to 0 dl 1652736392 ref 2 fl New:/0/ffffffff rc 0/-1
May 16 22:26:27 c8mds1 kernel: Lustre: 10711:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 3 previous similar messages
May 16 22:26:57 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Connection restored to 6a5cfbfd-cf9d-6fb5-e6f4-e22032e176a7 (at 172.18.181.245 at o2ib)
May 16 22:26:57 c8mds1 kernel: Lustre: Skipped 2 previous similar messages
May 16 22:28:13 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Client 214637ef-4511-539a-9833-362852da07d8 (at 172.18.182.8 at o2ib) reconnecting
May 16 22:28:13 c8mds1 kernel: Lustre: Skipped 3 previous similar messages
May 16 22:29:19 c8mds1 kernel: Lustre: 10762:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply#012  req at ffff98cd090c5e80 x1724301745017408/t0(0) o101->a6a02916-73da-e7fe-c9ba-5beb38eb7ef9 at 172.18.182.74@o2ib:39/0 lens 2096/0 e 1 to 0 dl 1652736564 ref 2 fl New:/0/ffffffff rc 0/-1
May 16 22:29:19 c8mds1 kernel: Lustre: 10762:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 6 previous similar messages
May 16 22:29:25 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Connection restored to bbc75da8-afe7-2412-e2b0-3aa0ce2c9280 (at 172.18.182.74 at o2ib)
May 16 22:29:25 c8mds1 kernel: Lustre: Skipped 3 previous similar messages
May 16 22:33:20 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Client c2742dcd-9705-b811-3cf8-59feadd3ef65 (at 172.18.181.58 at o2ib) reconnecting
May 16 22:33:20 c8mds1 kernel: Lustre: Skipped 4 previous similar messages
May 16 22:33:57 c8mds1 kernel: Lustre: 10741:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply#012  req at ffff97e095943a80 x1724312024844544/t0(0) o101->d4404db7-77c2-9b93-4062-23fa777df3f5 at 172.18.181.17@o2ib:317/0 lens 2120/0 e 0 to 0 dl 1652736842 ref 2 fl New:/2/ffffffff rc 0/-1
May 16 22:33:57 c8mds1 kernel: Lustre: 10741:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 4 previous similar messages
May 16 22:34:03 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Connection restored to e1d625a8-a6a8-baed-7c20-96366af6623c (at 172.18.181.26 at o2ib)
May 16 22:34:03 c8mds1 kernel: Lustre: Skipped 3 previous similar messages
May 16 22:42:23 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Client 0d1b5384-3f3e-fd7e-4e8f-5a942b2cf1c3 (at 172.18.182.12 at o2ib) reconnecting
May 16 22:42:23 c8mds1 kernel: Lustre: Skipped 86 previous similar messages
May 16 22:42:39 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Connection restored to 0d5dbd26-a3f4-83a8-6867-3ab763b89119 (at 172.18.181.58 at o2ib)
May 16 22:42:39 c8mds1 kernel: Lustre: Skipped 83 previous similar messages
May 16 22:42:52 c8mds1 kernel: Lustre: 10759:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply#012  req at ffff98c5db275100 x1724396809193344/t0(0) o101->4cd977ab-9c50-e1bb-5ddf-184d7f219c8f at 172.18.182.79@o2ib:97/0 lens 376/0 e 0 to 0 dl 1652737377 ref 2 fl New:/0/ffffffff rc 0/-1
May 16 22:42:52 c8mds1 kernel: Lustre: 10759:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 111 previous similar messages
May 16 22:52:42 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Client 38b54f7b-fdb8-09cc-c169-0d63b4bed5fd (at 172.18.182.13 at o2ib) reconnecting
May 16 22:52:42 c8mds1 kernel: Lustre: Skipped 101 previous similar messages
May 16 22:52:42 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Connection restored to 469e12ba-269a-69f4-056b-d22dc3ac5737 (at 172.18.182.13 at o2ib)
May 16 22:52:42 c8mds1 kernel: Lustre: Skipped 101 previous similar messages
May 16 22:53:59 c8mds1 kernel: Lustre: 10762:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply#012  req at ffff987345d20000 x1724286364171584/t0(0) o36->362ed117-2a64-a5a9-88ff-dd3c1fa876b4 at 172.18.181.60@o2ib:9/0 lens 488/0 e 0 to 0 dl 1652738044 ref 2 fl New:/2/ffffffff rc 0/-1
May 16 22:53:59 c8mds1 kernel: Lustre: 10762:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 125 previous similar messages
May 16 22:59:41 c8mds1 kernel: Lustre: 113553:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1652738375/real 1652738375]  req at ffff988780e1ba80 x1724210116907968/t0(0) o39->lustrefs8-MDT0000-lwp-MDT0000 at 0@lo:12/10 lens 224/224 e 0 to 1 dl 1652738381 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
May 16 22:59:41 c8mds1 kernel: Lustre: 113553:0:(client.c:2146:ptlrpc_expire_one_request()) Skipped 28 previous similar messages
May 16 22:59:42 c8mds1 kernel: Lustre: Failing over lustrefs8-MDT0000
May 16 22:59:42 c8mds1 kernel: LNet: Service thread pid 10731 completed after 2740.20s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
May 16 22:59:42 c8mds1 kernel: Lustre: 10731:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (106:2635s); client may timeout.  req at ffff9769d8bcc380 x1724290358529536/t0(0) o101->233a14d0-0bff-ec92-6b0f-433ad8b53245 at 172.18.181.11@o2ib:732/0 lens 568/192 e 4 to 0 dl 1652735747 ref 1 fl Interpret:/0/0 rc -19/-19
May 16 22:59:42 c8mds1 kernel: LustreError: 10696:0:(ldlm_lockd.c:1351:ldlm_handle_enqueue0()) ### lock on destroyed export ffff9769eaf46c00 ns: mdt-lustrefs8-MDT0000_UUID lock: ffff97e03ac62880/0xdb7c7cbad3be1d0e lrc: 3/0,0 mode: PW/PW res: [0x200009119:0x327f:0x0].0x0 bits 0x40/0x0 rrc: 196 type: IBT flags: 0x50200400000020 nid: 172.18.181.11 at o2ib remote: 0xe62e31610edfb839 expref: 13 pid: 10696 timeout: 0 lvb_type: 0
May 16 22:59:42 c8mds1 kernel: LustreError: 10696:0:(ldlm_lockd.c:1351:ldlm_handle_enqueue0()) Skipped 5 previous similar messages
May 16 22:59:42 c8mds1 kernel: LustreError: 10738:0:(service.c:2128:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-172.18.181.11 at o2ib: deadline 100:2542s ago#012  req at ffff97b0899b9b00 x1724290358535616/t0(0) o38-><?>@<?>:0/0 lens 520/0 e 0 to 0 dl 1652735840 ref 1 fl Interpret:/0/ffffffff rc 0/-1
May 16 22:59:42 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Not available for connect from 172.18.181.11 at o2ib (stopping)
May 16 22:59:42 c8mds1 kernel: Lustre: Skipped 1 previous similar message
May 16 22:59:42 c8mds1 kernel: Lustre: 10731:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 927 previous similar messages
May 16 22:59:42 c8mds1 kernel: LustreError: 10642:0:(client.c:1187:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at ffff98b1c1922400 x1724210116914432/t0(0) o41->lustrefs8-MDT0002-osp-MDT0000 at 172.18.185.3@o2ib:24/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
May 16 22:59:42 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Not available for connect from 172.18.182.100 at o2ib (stopping)
May 16 22:59:42 c8mds1 kernel: Lustre: Skipped 128 previous similar messages
May 16 22:59:43 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Not available for connect from 172.18.121.96 at o2ib (stopping)
May 16 22:59:43 c8mds1 kernel: Lustre: Skipped 33 previous similar messages
May 16 22:59:44 c8mds1 kernel: LustreError: 69266:0:(ldlm_lockd.c:2366:ldlm_cancel_handler()) ldlm_cancel from 172.18.185.7 at o2ib arrived at 1652738384 with bad export cookie 15815652844318675611
May 16 22:59:44 c8mds1 kernel: LustreError: 69266:0:(ldlm_lockd.c:2366:ldlm_cancel_handler()) Skipped 1 previous similar message
May 16 22:59:45 c8mds1 kernel: LustreError: 137-5: lustrefs8-MDT0000_UUID: not available for connect from 172.18.181.68 at o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
May 16 22:59:45 c8mds1 kernel: LustreError: Skipped 1 previous similar message
May 16 22:59:45 c8mds1 kernel: LustreError: 137-5: lustrefs8-MDT0000_UUID: not available for connect from 172.18.121.44 at o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
May 16 22:59:45 c8mds1 kernel: LustreError: Skipped 18 previous similar messages
May 16 22:59:46 c8mds1 kernel: LustreError: 137-5: lustrefs8-MDT0000_UUID: not available for connect from 172.18.181.205 at o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
May 16 22:59:46 c8mds1 kernel: LustreError: Skipped 31 previous similar messages
May 16 22:59:48 c8mds1 kernel: LustreError: 73512:0:(ldlm_lockd.c:2366:ldlm_cancel_handler()) ldlm_cancel from 172.18.185.6 at o2ib arrived at 1652738388 with bad export cookie 15815652844318675548
May 16 22:59:48 c8mds1 kernel: LustreError: 73512:0:(ldlm_lockd.c:2366:ldlm_cancel_handler()) Skipped 5 previous similar messages
May 16 22:59:48 c8mds1 kernel: LustreError: 137-5: lustrefs8-MDT0000_UUID: not available for connect from 172.18.122.191 at o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
May 16 22:59:48 c8mds1 kernel: LustreError: Skipped 64 previous similar messages
May 16 22:59:51 c8mds1 kernel: Lustre: server umount lustrefs8-MDT0000 complete
May 16 23:00:10 c8mds1 kernel: LDISKFS-fs (drbd0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
May 16 23:00:11 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Imperative Recovery not enabled, recovery window 300-900
May 16 23:00:11 c8mds1 kernel: LustreError: 0-0: lustrefs8-MDT0000: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
May 16 23:00:11 c8mds1 kernel: Lustre: lustrefs8-MDT0000: in recovery but waiting for the first client to connect
May 16 23:00:12 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Will be in recovery for at least 5:00, or until 864 clients reconnect
May 16 23:00:17 c8mds1 kernel: LustreError: 113800:0:(tgt_handler.c:526:tgt_filter_recovery_request()) @@@ not permitted during recovery  req at ffff97df902a1f80 x1724236285455488/t0(0) o601->lustrefs8-MDT0000-lwp-OST0022_UUID at 172.18.185.9@o2ib:382/0 lens 336/0 e 0 to 0 dl 1652739172 ref 1 fl Interpret:/0/ffffffff rc 0/-1
May 16 23:00:17 c8mds1 kernel: LustreError: 113800:0:(tgt_handler.c:526:tgt_filter_recovery_request()) Skipped 2 previous similar messages
May 16 23:00:18 c8mds1 kernel: LustreError: 113688:0:(ldlm_lockd.c:2366:ldlm_cancel_handler()) ldlm_cancel from 172.18.185.9 at o2ib arrived at 1652738418 with bad export cookie 15815652844318675674
May 16 23:00:18 c8mds1 kernel: LustreError: 113688:0:(ldlm_lockd.c:2366:ldlm_cancel_handler()) Skipped 1 previous similar message
May 16 23:00:19 c8mds1 kernel: LustreError: 114119:0:(tgt_handler.c:526:tgt_filter_recovery_request()) @@@ not permitted during recovery  req at ffff97d6edba4c80 x1724236285460864/t0(0) o601->lustrefs8-MDT0000-lwp-OST001d_UUID at 172.18.185.9@o2ib:384/0 lens 336/0 e 0 to 0 dl 1652739174 ref 1 fl Interpret:/0/ffffffff rc 0/-1
May 16 23:00:19 c8mds1 kernel: LustreError: 114119:0:(tgt_handler.c:526:tgt_filter_recovery_request()) Skipped 4 previous similar messages
May 16 23:00:23 c8mds1 kernel: LustreError: 114128:0:(tgt_handler.c:526:tgt_filter_recovery_request()) @@@ not permitted during recovery  req at ffff97e0c04a1050 x1724236285465344/t0(0) o601->lustrefs8-MDT0000-lwp-OST001e_UUID at 172.18.185.9@o2ib:387/0 lens 336/0 e 0 to 0 dl 1652739177 ref 1 fl Interpret:/0/ffffffff rc 0/-1
May 16 23:00:23 c8mds1 kernel: LustreError: 114128:0:(tgt_handler.c:526:tgt_filter_recovery_request()) Skipped 4 previous similar messages
May 16 23:00:24 c8mds1 kernel: LustreError: 113691:0:(ldlm_lockd.c:2366:ldlm_cancel_handler()) ldlm_cancel from 172.18.185.7 at o2ib arrived at 1652738424 with bad export cookie 15815652844318675625
May 16 23:00:25 c8mds1 kernel: LustreError: 113802:0:(tgt_handler.c:526:tgt_filter_recovery_request()) @@@ not permitted during recovery  req at ffff98a7398e0480 x1724236297724160/t0(0) o601->lustrefs8-MDT0000-lwp-OST0026_UUID at 172.18.185.10@o2ib:390/0 lens 336/0 e 0 to 0 dl 1652739180 ref 1 fl Interpret:/0/ffffffff rc 0/-1
May 16 23:00:25 c8mds1 kernel: LustreError: 113802:0:(tgt_handler.c:526:tgt_filter_recovery_request()) Skipped 8 previous similar messages
May 16 23:00:29 c8mds1 kernel: LustreError: 114128:0:(tgt_handler.c:526:tgt_filter_recovery_request()) @@@ not permitted during recovery  req at ffff97d821e2f500 x1724235485488064/t0(0) o601->lustrefs8-MDT0000-lwp-OST0007_UUID at 172.18.185.6@o2ib:394/0 lens 336/0 e 0 to 0 dl 1652739184 ref 1 fl Interpret:/0/ffffffff rc 0/-1
May 16 23:00:29 c8mds1 kernel: LustreError: 114128:0:(tgt_handler.c:526:tgt_filter_recovery_request()) Skipped 35 previous similar messages
May 16 23:00:34 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Denying connection for new client 233a14d0-0bff-ec92-6b0f-433ad8b53245 (at 172.18.181.11 at o2ib), waiting for 864 known clients (581 recovered, 0 in progress, and 0 evicted) to recover in 14:12
May 16 23:00:34 c8mds1 kernel: LustreError: 113690:0:(ldlm_lockd.c:2366:ldlm_cancel_handler()) ldlm_cancel from 172.18.185.7 at o2ib arrived at 1652738434 with bad export cookie 15815652844318675632
May 16 23:00:34 c8mds1 kernel: LustreError: 113690:0:(ldlm_lockd.c:2366:ldlm_cancel_handler()) Skipped 3 previous similar messages
May 16 23:00:38 c8mds1 kernel: LustreError: 114121:0:(tgt_handler.c:526:tgt_filter_recovery_request()) @@@ not permitted during recovery  req at ffff97dfb1884800 x1724236032786624/t0(0) o601->lustrefs8-MDT0000-lwp-OST0019_UUID at 172.18.185.8@o2ib:403/0 lens 336/0 e 0 to 0 dl 1652739193 ref 1 fl Interpret:/0/ffffffff rc 0/-1
May 16 23:00:38 c8mds1 kernel: LustreError: 114121:0:(tgt_handler.c:526:tgt_filter_recovery_request()) Skipped 87 previous similar messages
May 16 23:00:59 c8mds1 kernel: LustreError: 113803:0:(tgt_handler.c:526:tgt_filter_recovery_request()) @@@ not permitted during recovery  req at ffff9870a787a880 x1724235698119168/t0(0) o601->lustrefs8-MDT0000-lwp-OST0001_UUID at 172.18.185.5@o2ib:424/0 lens 336/0 e 0 to 0 dl 1652739214 ref 1 fl Interpret:/0/ffffffff rc 0/-1
May 16 23:00:59 c8mds1 kernel: LustreError: 113803:0:(tgt_handler.c:526:tgt_filter_recovery_request()) Skipped 102 previous similar messages
May 16 23:00:59 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Denying connection for new client 233a14d0-0bff-ec92-6b0f-433ad8b53245 (at 172.18.181.11 at o2ib), waiting for 864 known clients (860 recovered, 1 in progress, and 0 evicted) to recover in 13:47
May 16 23:01:24 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Denying connection for new client 233a14d0-0bff-ec92-6b0f-433ad8b53245 (at 172.18.181.11 at o2ib), waiting for 864 known clients (860 recovered, 1 in progress, and 0 evicted) to recover in 13:22
May 16 23:01:30 c8mds1 kernel: Lustre: lustrefs8-MDT0000: Recovery over after 1:18, of 864 clients 864 recovered and 0 were evicted.
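If it helps with analysis, the binary debug dumps referenced above 
(e.g. /tmp/lustre-log.1652735842.10783) should be convertible to text with 
lctl, something like (output filename is just an example):

  lctl debug_file /tmp/lustre-log.1652735842.10783 /tmp/lustre-log.1652735842.10783.txt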

Thanks,
Alastair.

