<div dir="ltr"><div>I couldn't say exactly, but here are a few things to check:</div><ul><li style="margin-left:15px">Your net is o2ib1. Is there also an o2ib0?</li><li style="margin-left:15px">Are you routing? If so, is it LNet routing or IB routing? Are there any issues with the routers or the routing?</li><li style="margin-left:15px">Have you verified the stability of LNet and of the fabric path between the client and server named in the messages above, using a tool like lnet_selftest?</li><li style="margin-left:15px">Verify the fabric: check the error counters on the switch and HCA ports involved, and use non-Lustre IB tools (ib_send_bw, etc.) to test the fabric.</li></ul><div>Lustre can, and will, tell you when LNet issues arise, but it cannot tell you anything about the actual network layer it is riding on. It is usually a good idea to certify that the network layer works before delving into "what LBUG is ruining my weekend plans?"</div><div><br></div><div>I hope that helps,</div><div><br></div><div>--Jeff<br><br>(resent to list in hopes of being beneficial to others)</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Oct 5, 2023 at 9:34 AM Alastair Basden via lustre-discuss &lt;<a href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
<br>
Lustre 2.12.2.<br>
<br>
We are seeing lots of errors on the servers such as:<br>
Oct 5 11:16:48 oss04 kernel: LNetError: 6414:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-172.19.171.15@o2ib1: -125<br>
Oct 5 11:16:48 oss04 kernel: LustreError: 6414:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff8fe066bb9400<br>
<br>
and<br>
Oct 4 14:59:48 oss04 kernel: LustreError: 6383:0:(events.c:305:request_in_callback()) event type 2, status -103, service ost_io<br>
<br>
and<br>
Oct 5 11:18:06 oss04 kernel: LustreError: 6388:0:(events.c:305:request_in_callback()) event type 2, status -5, service ost_io<br>
Oct 5 11:18:06 oss04 kernel: LNet: 6412:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 172.19.171.15@o2ib1<br>
<br>
and on the clients:<br>
m7: Oct 5 14:46:59 m7132 kernel: LustreError: 2466:0:(events.c:200:client_bulk_callback()) event type 2, status -103, desc ffff9a251fc14400<br>
<br>
and<br>
m7: Oct 5 11:18:34 m7086 kernel: LustreError: 2495:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc ffff9a39ad668000<br>
<br>
Does anyone have any ideas about what could be causing this?<br>
<br>
Thanks,<br>
Alastair.<br>
</blockquote></div><br clear="all"><div><br></div><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr">------------------------------<br>Jeff Johnson<br>Co-Founder<br>Aeon Computing<br><br><a href="mailto:jeff.johnson@aeoncomputing.com" target="_blank">jeff.johnson@aeoncomputing.com</a><br><a href="http://www.aeoncomputing.com" target="_blank">www.aeoncomputing.com</a><br>t: 858-412-3810 x1001 f: 858-412-3845<br>m: 619-204-9061<br><br>4170 Morena Boulevard, Suite C - San Diego, CA 92117<div><br></div><div>High-Performance Computing / Lustre Filesystems / Scale-out Storage</div></div></div></div></div>
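<div>For anyone reading the quoted log lines later: the status values in them (-125, -103, -5) look like negated Linux errno codes, which is the usual kernel convention. The sketch below is a minimal, hypothetical helper (the names <code>LINUX_ERRNO</code>, <code>STATUS_RE</code> and <code>decode_status</code> are mine, not part of Lustre) that pulls those values out of a log line and names them. The table is hardcoded from the generic Linux errno numbering and covers only the three codes seen in this thread, so treat it as an assumption to be extended rather than a complete mapping.</div>

```python
import re

# Generic Linux errno numbering, hardcoded so the result does not depend on the
# platform running this script. Only the codes seen in this thread are listed.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    103: ("ECONNABORTED", "Software caused connection abort"),
    125: ("ECANCELED", "Operation canceled"),
}

# Matches "status -103" (LustreError lines) and "...@o2ib1: -125" (LNetError lines).
STATUS_RE = re.compile(r"(?:status|@\S+:)\s*(-\d+)")


def decode_status(line):
    """Return a list of (code, name, description) for each status found in line."""
    results = []
    for raw in STATUS_RE.findall(line):
        code = int(raw)
        name, text = LINUX_ERRNO.get(-code, ("UNKNOWN", "not in table"))
        results.append((code, name, text))
    return results


if __name__ == "__main__":
    sample = [
        "LNetError: 6414:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) "
        "Error sending PUT to 12345-172.19.171.15@o2ib1: -125",
        "LustreError: 6383:0:(events.c:305:request_in_callback()) "
        "event type 2, status -103, service ost_io",
        "LustreError: 2495:0:(events.c:200:client_bulk_callback()) "
        "event type 2, status -5, desc ffff9a39ad668000",
    ]
    for entry in sample:
        for code, name, text in decode_status(entry):
            print(f"{code}: {name} ({text})")
```

<div>Canceled sends, aborted connections and I/O errors are consistent with connections being torn down underneath LNet, which is why checking the fabric first, as suggested above, seems like the sensible order.</div>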