[lustre-discuss] LNET issues
Alastair Basden
a.g.basden at durham.ac.uk
Wed Sep 4 01:50:29 PDT 2024
Hi,
We are having some LNet issues and wonder if anyone can advise.
Client is 2.15.5, server is 2.12.6.
Fabric is IB.
The file system mounts, but OSTs on a couple of OSSs are not contactable.
Client and servers can ping each other over the IB network.
However, an lnetctl ping between the bad OSSs and this client fails in
both directions. To other clients it's all fine, i.e. it works well for
most of the clients and fails for just one or two.
Server to client:
lnetctl ping 172.18.178.201@o2ib
manage:
    - ping:
          errno: -1
          descr: failed to ping 172.18.178.201@o2ib: Input/output error
Client to server:
manage:
    - ping:
          errno: -1
          descr: failed to ping 172.18.185.10@o2ib: Input/output error
And the o2ib network is noted as down:
lnetctl net show --net o2ib --verbose
net:
    - net type: o2ib
      local NI(s):
        - nid: 172.18.178.216@o2ib
          status: down
          interfaces:
              0: ibs1f0
          statistics:
              send_count: 45032
              recv_count: 45030
              drop_count: 0
          tunables:
              peer_timeout: 100
              peer_credits: 32
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 16
              map_on_demand: 1
              concurrent_sends: 32
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          dev cpt: 0
          CPT: "[0,1]"
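To check the same thing quickly on the other clients and servers, the net show output can be filtered for NIs reported as down. This is only a minimal sketch: it assumes the output keeps the "- nid:" / "status:" layout shown above, with NIDs in the usual address@network form, and down_nids is a hypothetical helper name, not part of lnetctl.

```shell
# Print the NID of every local NI that `lnetctl net show` reports as down.
# Reads lnetctl's YAML-style output on stdin.
# Assumption: each NI appears as a "- nid: <nid>" line followed by its
# "status: <up|down>" line, as in the output quoted above.
down_nids() {
    awk '
        # Remember the most recent NID seen.
        $1 == "-" && $2 == "nid:" { nid = $3 }
        # Report it if its status line says "down".
        $1 == "status:" && $2 == "down" { print nid }
    '
}

# Typical use on a live node (needs lnetctl and the lnet module loaded):
#   lnetctl net show --net o2ib --verbose | down_nids
```

On the client above this would be expected to print 172.18.178.216@o2ib, and to print nothing on a node whose NIs are all up.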
Could this be a hardware error, even though the IB is working?
Could it be related to https://jira.whamcloud.com/browse/LU-16378 ?
Are there any suggestions on how to bring the LNet network back up or
otherwise fix the problem?
Thanks,
Alastair.