[lustre-discuss] lustre-discuss Digest, Vol 226, Issue 22

Ms. Megan Larko dobsonunit at gmail.com
Wed Jan 22 18:19:48 PST 2025


I would recommend trying a new or different known-good Omni-Path cable. That should be a pretty easy & quick test.
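
To tell whether the swap actually changed anything, it may help to count the client_bulk_callback errors per hour before and after. Here is a rough Python sketch. It assumes the journal has been saved with one message per line (not wrapped as in the mail below). The function name `bulk_errors_per_hour` and the regular expression are my own illustration, not a Lustre tool.

```python
import re
from collections import Counter

# Captures "Mon DD HH" from journal lines that report a
# client_bulk_callback() LustreError; other lines are ignored.
BULK_RE = re.compile(
    r"^(\w{3} \d{2} \d{2}):\d{2}:\d{2} \S+ kernel: LustreError: "
    r".*client_bulk_callback\(\)"
)

def bulk_errors_per_hour(lines):
    """Return a Counter mapping 'Mon DD HH' to the number of bulk errors."""
    counts = Counter()
    for line in lines:
        match = BULK_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Three error lines and one unrelated line taken from the quoted log.
sample = [
    "Jan 06 02:04:44 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5c7ee30400",
    "Jan 06 02:03:55 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5b37db7800",
    "Jan 06 01:58:45 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5b752c0c00",
    "Jan 06 02:02:38 node052. kernel: Lustre: Skipped 14 previous similar messages",
]

for hour, count in sorted(bulk_errors_per_hour(sample).items()):
    print(hour, count)
```

If the hourly counts drop to zero on node052 after the swap and show up on the neighbouring node instead, that points at the cable.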
Cheers,
Megan


> On Jan 21, 2025, at 5:38 AM, lustre-discuss-request at lists.lustre.org wrote:
> 
> Today's Topics:
> 
>   1. Errors in logs from one of our nodes (James A Allsopp)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Tue, 21 Jan 2025 10:33:33 +0000
> From: James A Allsopp <j.a.allsopp at sheffield.ac.uk>
> To: lustre-discuss at lists.lustre.org
> Subject: [lustre-discuss] Errors in logs from one of our nodes
> Message-ID:
>    <CAK2FDNP+u1Ami_HjKhQx_0x9f-+y2BdBmhyCOhFokz1Vy1scoQ at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
> Hello,
> We've been having intermittent problems with our cluster. One node is
> reporting Lustre-related errors. The node connects to the Lustre filesystem
> over Omni-Path, and we suspect the link is the cause of the errors. The
> link status shows everything is fine, but since the problem is
> intermittent, that doesn't tell us much. None of our other nodes suffer
> from this problem. Could anyone help me make sense of these errors before
> I swap Omni-Path cables with the node next to it and see if the error
> moves? I've included ~50 minutes of logs, stripped down to the
> Lustre-relevant section.
> 
> Thanks for taking the time to look at this,
> James
> 
> Jan 06 02:04:44 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5c7ee30400
> Jan 06 02:03:55 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5b37db7800
> Jan 06 02:03:01 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5c7ded2400
> Jan 06 02:02:38 node052. kernel: Lustre: Skipped 14 previous similar messages
> Jan 06 02:02:38 node052. kernel: Lustre: pscratch-OST000d-osc-ffff8f7c7a874000: Connection restored to 10.12.3.4@o2ib (at 10.12.3.4@o2ib)
> Jan 06 02:02:38 node052. kernel: Lustre: Skipped 14 previous similar messages
> Jan 06 02:02:38 node052. kernel: Lustre: pscratch-OST000d-osc-ffff8f7c7a874000: Connection to pscratch-OST000d (at 10.12.3.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> Jan 06 02:02:38 node052. kernel: Lustre: 2612:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
> Jan 06 02:02:38 node052. kernel: Lustre: 2612:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1736128955/real 1736128955] req@ffff8f5708379b00 x1818069513346688/t0(0) o3->pscratch-OST000d-osc-ffff8f7c7a874000@10.12.3.4@o2ib:6/4 lens 488/440 e 0 to 1 dl 1736128966 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
> Jan 06 02:02:38 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5a32f37c00
> Jan 06 02:02:19 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f58ff316c00
> Jan 06 01:58:45 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5b752c0c00
> Jan 06 01:57:50 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5af82b7400
> Jan 06 01:56:08 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5b752c3800
> Jan 06 01:55:17 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f58714ccc00
> Jan 06 01:54:00 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5a9429a000
> Jan 06 01:53:13 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5c6d0eb000
> Jan 06 01:52:30 node052. kernel: Lustre: Skipped 18 previous similar messages
> Jan 06 01:52:30 node052. kernel: Lustre: pscratch-OST000d-osc-ffff8f7c7a874000: Connection restored to 10.12.3.4@o2ib (at 10.12.3.4@o2ib)
> Jan 06 01:52:30 node052. kernel: Lustre: Skipped 19 previous similar messages
> Jan 06 01:52:30 node052. kernel: Lustre: pscratch-OST000d-osc-ffff8f7c7a874000: Connection to pscratch-OST000d (at 10.12.3.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> Jan 06 01:52:30 node052. kernel: Lustre: 2596:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
> Jan 06 01:52:30 node052. kernel: Lustre: 2596:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1736128306/real 1736128306] req@ffff8f47171e5580 x1818069513186304/t0(0) o3->pscratch-OST000d-osc-ffff8f7c7a874000@10.12.3.4@o2ib:6/4 lens 488/440 e 0 to 1 dl 1736128350 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
> Jan 06 01:51:46 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f576e192400
> Jan 06 01:50:54 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5ad7efa000
> Jan 06 01:50:07 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5c7ded0c00
> Jan 06 01:49:49 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f58ff011800
> Jan 06 01:48:44 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5c7aa1ec00
> Jan 06 01:47:14 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5a428d7800
> Jan 06 01:45:53 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f7c67fb9c00
> Jan 06 01:45:04 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f7c67fb9c00
> Jan 06 01:44:14 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f7c67fb9c00
> Jan 06 01:43:24 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f7c67fb9c00
> Jan 06 01:42:00 node052. kernel: Lustre: Skipped 20 previous similar messages
> Jan 06 01:42:00 node052. kernel: Lustre: pscratch-OST0011-osc-ffff8f7c7a874000: Connection restored to 10.12.3.5@o2ib (at 10.12.3.5@o2ib)
> Jan 06 01:42:00 node052. kernel: Lustre: Skipped 20 previous similar messages
> Jan 06 01:42:00 node052. kernel: Lustre: pscratch-OST0011-osc-ffff8f7c7a874000: Connection to pscratch-OST0011 (at 10.12.3.5@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> Jan 06 01:42:00 node052. kernel: Lustre: 2618:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
> Jan 06 01:42:00 node052. kernel: Lustre: 2618:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1736127672/real 1736127672] req@ffff8f4d5a6cb180 x1818069512989632/t0(0) o3->pscratch-OST0011-osc-ffff8f7c7a874000@10.12.3.5@o2ib:6/4 lens 488/440 e 0 to 1 dl 1736127719 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
> Jan 06 01:41:12 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5a42875800
> Jan 06 01:40:06 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5c6d0ee000
> Jan 06 01:39:54 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5c6d0ee000
> Jan 06 01:38:55 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5b0be37800
> Jan 06 01:38:04 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f7c7d710800
> Jan 06 01:36:21 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f4a24fa8800
> Jan 06 01:35:34 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f4a24fa8800
> Jan 06 01:34:48 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f4a24fa8800
> Jan 06 01:33:27 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5c7aa1fc00
> Jan 06 01:32:36 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5a653c8400
> Jan 06 01:31:49 node052. kernel: Lustre: Skipped 15 previous similar messages
> Jan 06 01:31:49 node052. kernel: Lustre: pscratch-OST0011-osc-ffff8f7c7a874000: Connection restored to 10.12.3.5@o2ib (at 10.12.3.5@o2ib)
> Jan 06 01:31:49 node052. kernel: Lustre: Skipped 15 previous similar messages
> Jan 06 01:31:49 node052. kernel: Lustre: pscratch-OST0011-osc-ffff8f7c7a874000: Connection to pscratch-OST0011 (at 10.12.3.5@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> Jan 06 01:31:49 node052. kernel: Lustre: 2608:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 15 previous similar messages
> Jan 06 01:31:49 node052. kernel: Lustre: 2608:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1736127107/real 1736127107] req@ffff8f54f0657980 x1818069512865792/t0(0) o3->pscratch-OST0011-osc-ffff8f7c7a874000@10.12.3.5@o2ib:6/4 lens 504/440 e 0 to 1 dl 1736127153 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
> Jan 06 01:31:49 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f7c7e714800
> Jan 06 01:30:04 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f58e01d8800
> Jan 06 01:29:12 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f4f40868800
> Jan 06 01:28:07 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5c6d0ebc00
> Jan 06 01:26:52 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5a653c9000
> Jan 06 01:25:55 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f7c7e714800
> Jan 06 01:23:58 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f7c7e34cc00
> Jan 06 01:23:04 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f58f1633000
> Jan 06 01:21:01 node052. kernel: Lustre: Skipped 24 previous similar messages
> Jan 06 01:21:01 node052. kernel: Lustre: pscratch-OST0001-osc-ffff8f7c7a874000: Connection restored to 10.12.3.1@o2ib (at 10.12.3.1@o2ib)
> Jan 06 01:21:01 node052. kernel: Lustre: Skipped 25 previous similar messages
> Jan 06 01:21:01 node052. kernel: Lustre: pscratch-OST0001-osc-ffff8f7c7a874000: Connection to pscratch-OST0001 (at 10.12.3.1@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> Jan 06 01:21:01 node052. kernel: Lustre: 2604:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 25 previous similar messages
> Jan 06 01:21:01 node052. kernel: Lustre: 2604:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1736126457/real 1736126457] req@ffff8f55359ef500 x1818069512692032/t0(0) o3->pscratch-OST0001-osc-ffff8f7c7a874000@10.12.3.1@o2ib:6/4 lens 488/440 e 0 to 1 dl 1736126504 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
> Jan 06 01:21:01 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5a32f30400
> Jan 06 01:20:00 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5c7dfb9c00
> Jan 06 01:15:34 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5a32f34000
> Jan 06 01:14:40 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f58e01d9800
> Jan 06 01:14:05 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f4a24faa000
> Jan 06 01:13:56 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f4a24faa000
> Jan 06 01:13:17 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5c6d0d7400
> Jan 06 01:12:56 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f59fb1d2c00
> Jan 06 01:12:28 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5753e80c00
> Jan 06 01:12:12 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5753e80c00
> Jan 06 01:11:56 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5753e80c00
> Jan 06 01:11:34 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5b37db2c00
> Jan 06 01:11:17 node052. kernel: LustreError: 2552:0:(events.c:205:client_bulk_callback()) event type 2, status -103, desc ffff8f5b37db2c00
> Jan 06 01:11:00 node052. kernel: Lustre: Skipped 25 previous similar messages
> Jan 06 01:11:00 node052. kernel: Lustre: pscratch-OST0011-osc-ffff8f7c7a874000: Connection restored to 10.12.3.5@o2ib (at 10.12.3.5@o2ib)
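
For what it's worth, the numbers in the expired-request lines above can be decoded with a few lines of Python. This is only a sketch. It assumes that `sent` and `dl` are Unix epoch seconds (send time and deadline) and that status -103 is a negated Linux errno (103 is ECONNABORTED in the generic Linux errno table). The sample line is the 02:02:38 "network error" record from the log, with the "@" characters that the list archive rewrites as " at " put back. The names `describe_rpc`, `errno_name` and `LINUX_ERRNO` are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Hard-coded subset of the generic Linux errno table, so the result does
# not depend on the platform this script runs on.
LINUX_ERRNO = {103: "ECONNABORTED", 110: "ETIMEDOUT"}

# Picks out the send time and the deadline ("dl") from an expired-request line.
RPC_RE = re.compile(r"\[sent (\d+)/real \d+\].* dl (\d+) ")

def describe_rpc(line):
    """Return send time (UTC) and allowed seconds, or None if no match."""
    match = RPC_RE.search(line)
    if match is None:
        return None
    sent, deadline = int(match.group(1)), int(match.group(2))
    sent_utc = datetime.fromtimestamp(sent, tz=timezone.utc)
    return {
        "sent_utc": sent_utc.strftime("%Y-%m-%d %H:%M:%S"),
        "timeout_s": deadline - sent,
    }

def errno_name(status):
    """Map a negative kernel status such as -103 to its errno name."""
    return LINUX_ERRNO.get(-status, "unknown")

sample = (
    "Jan 06 02:02:38 node052. kernel: Lustre: 2612:0:"
    "(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has failed "
    "due to network error: [sent 1736128955/real 1736128955] "
    "req@ffff8f5708379b00 x1818069513346688/t0(0) "
    "o3->pscratch-OST000d-osc-ffff8f7c7a874000@10.12.3.4@o2ib:6/4 "
    "lens 488/440 e 0 to 1 dl 1736128966 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1"
)

print(describe_rpc(sample))
print(errno_name(-103))
```

For that record the request was sent at 02:02:35 UTC with an 11-second deadline. If the node logs in UTC, it was reported as failed three seconds later, before the deadline, which fits the "network error" wording rather than a slow server.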
> 
> 
> 
> 
> --
> Dr James Allsopp | Research Platforms team
> IT Services | University of Sheffield

