<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
Hi,<br>
<br>
We started having the same issue after upgrading servers from
2.12.9 to 2.15.5 and clients from 2.15.3 to 2.15.5. Only a couple of
older OSSs had the issue. They used ConnectX-3 FDR cards and the mlx4
driver. After replacing those cards with newer ConnectX-4 cards, which use
the mlx5 driver, we haven't had the issue so far. We still have FDR/mlx4
clients using the file system.<br>
<br>
The drivers are the ones provided by the OS (Rocky 8 on servers and
Rocky 9 on clients).<br>
<br>
Are you using IB cards that use the mlx4 driver on the OSSs?<br>
<br>
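In case it helps with checking, here is a minimal sketch (my own, not
part of Lustre or any vendor tool) that lists which kernel driver each
IB device is bound to. It assumes the usual Linux sysfs layout, where
/sys/class/infiniband/DEVICE/device/driver is a symlink to the PCI
driver (for example mlx4_core or mlx5_core). The function names are
made up for illustration.<br>
<pre>
```python
import os


def ib_drivers(sysfs_root="/sys/class/infiniband"):
    """Map each InfiniBand device name to the kernel driver bound to it."""
    drivers = {}
    if not os.path.isdir(sysfs_root):
        # No IB stack loaded, or not a Linux host.
        return drivers
    for dev in sorted(os.listdir(sysfs_root)):
        link = os.path.join(sysfs_root, dev, "device", "driver")
        if os.path.islink(link):
            # The symlink target ends in the driver name, e.g. mlx4_core.
            drivers[dev] = os.path.basename(os.path.realpath(link))
        else:
            drivers[dev] = None
    return drivers


def mlx4_devices(drivers):
    """Return the devices whose driver name starts with mlx4."""
    return sorted(dev for dev, drv in drivers.items()
                  if drv and drv.startswith("mlx4"))


if __name__ == "__main__":
    found = ib_drivers()
    for dev, drv in found.items():
        print(dev, drv or "unknown")
    print("mlx4 devices:", ", ".join(mlx4_devices(found)) or "none")
```
</pre>
Running it on each OSS should show whether any of them is still on
mlx4.<br>
<br>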
Cheers,<br>
Hans Henrik<br>
<br>
<div class="moz-cite-prefix">On 04/09/2024 19.50, Alastair Basden
via lustre-discuss wrote:<br>
</div>
<blockquote type="cite"
cite="mid:83bbfbcb-37a3-b585-73b1-257098facb3@durham.ac.uk">Hi
Makia, <br>
<br>
Yes, sorry, that should be:<br>
<br>
From the client (172.18.178.216): <br>
lnetctl ping 172.18.185.8@o2ib <br>
manage: <br>
- ping: <br>
errno: -1 <br>
descr: failed to ping 172.18.185.8@o2ib: Input/output
error <br>
<br>
<br>
From the server (172.18.185.8): <br>
lnetctl ping 172.18.178.216@o2ib <br>
manage: <br>
- ping: <br>
errno: -1 <br>
descr: failed to ping 172.18.178.216@o2ib: Input/output
error <br>
<br>
<br>
<br>
And yet a standard ping works. <br>
<br>
Pinging to/from other clients and other OSSs works. i.e. the file
system is fully functional and in production, just this client and
one or two others are having problems. <br>
<br>
One of the core-edge links is down on the edge switch that this
client is attached to. Since a standard ping works, connectivity is
there, but perhaps there is some RDMA issue? <br>
<br>
Cheers, <br>
Alastair. <br>
<br>
On Wed, 4 Sep 2024, Makia Minich wrote: <br>
<br>
<blockquote type="cite">
The IP for the nid in your “net show” isn’t any of the nids you
pinged. Is an address misconfigured somewhere? <br>
<br>
<blockquote type="cite">On Sep 4, 2024, at 2:52 AM, Alastair
Basden via lustre-discuss <a class="moz-txt-link-rfc2396E"
href="mailto:lustre-discuss@lists.lustre.org"
moz-do-not-send="true">&lt;lustre-discuss@lists.lustre.org&gt;</a>
wrote: <br>
<br>
Hi, <br>
<br>
We are having some LNet issues and wonder if anyone can
advise. <br>
<br>
Client is 2.15.5, server is 2.12.6. <br>
<br>
Fabric is IB. <br>
<br>
The file system mounts, but OSTs on a couple of OSSs are not
contactable. <br>
<br>
Client and servers can ping each other over the IB network. <br>
<br>
However, an lnetctl ping fails in both directions between the bad
OSSs and this client. To other clients it's all fine. <br>
<br>
i.e. for most of the clients it is working well, just one or
two not so. <br>
<br>
Server to client: <br>
lnetctl ping 172.18.178.201@o2ib <br>
manage: <br>
- ping: <br>
errno: -1 <br>
descr: failed to ping 172.18.178.201@o2ib:
Input/output error <br>
<br>
Client to server: <br>
manage: <br>
- ping: <br>
errno: -1 <br>
descr: failed to ping 172.18.185.10@o2ib:
Input/output error <br>
<br>
<br>
<br>
And the o2ib network is noted as down: <br>
lnetctl net show --net o2ib --verbose <br>
net: <br>
- net type: o2ib <br>
local NI(s): <br>
- nid: 172.18.178.216@o2ib <br>
status: down <br>
interfaces: <br>
0: ibs1f0 <br>
statistics: <br>
send_count: 45032 <br>
recv_count: 45030 <br>
drop_count: 0 <br>
tunables: <br>
peer_timeout: 100 <br>
peer_credits: 32 <br>
peer_buffer_credits: 0 <br>
credits: 256 <br>
lnd tunables: <br>
peercredits_hiw: 16 <br>
map_on_demand: 1 <br>
concurrent_sends: 32 <br>
fmr_pool_size: 512 <br>
fmr_flush_trigger: 384 <br>
fmr_cache: 1 <br>
ntx: 512 <br>
conns_per_peer: 1 <br>
dev cpt: 0 <br>
CPT: "[0,1]" <br>
<br>
<br>
<br>
Could this be a hardware error, even though the IB is working?
<br>
<br>
Could it be related to <a class="moz-txt-link-freetext"
href="https://jira.whamcloud.com/browse/LU-16378"
moz-do-not-send="true">https://jira.whamcloud.com/browse/LU-16378</a>
? <br>
<br>
Are there any suggestions on how to bring up the lnet network
or fix the problems? <br>
<br>
Thanks, <br>
Alastair. <br>
_______________________________________________ <br>
lustre-discuss mailing list <br>
<a class="moz-txt-link-abbreviated moz-txt-link-freetext"
href="mailto:lustre-discuss@lists.lustre.org"
moz-do-not-send="true">lustre-discuss@lists.lustre.org</a> <br>
<a class="moz-txt-link-freetext"
href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org"
moz-do-not-send="true">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a>
<br>
</blockquote>
<br>
</blockquote>
</blockquote>
<br>
</body>
</html>