[lustre-discuss] LNET issues

Alastair Basden a.g.basden at durham.ac.uk
Wed Sep 4 01:50:29 PDT 2024


Hi,

We are having some LNet issues, and wonder if anyone can advise.

Client is 2.15.5, server is 2.12.6.

Fabric is IB.

The file system mounts, but OSTs on a couple of OSSs are not contactable.

Client and servers can ping each other over the IB network.

However, an lnetctl ping between the bad OSSs and this client fails in 
both directions. Between those OSSs and other clients it's all fine.

i.e. most of the clients are working well; only one or two are affected.

Server to client:
lnetctl ping 172.18.178.201@o2ib
manage:
     - ping:
           errno: -1
           descr: failed to ping 172.18.178.201@o2ib: Input/output error

Client to server:
manage:
     - ping:
           errno: -1
           descr: failed to ping 172.18.185.10@o2ib: Input/output error



And the o2ib network is noted as down:
lnetctl net show --net o2ib --verbose
net:
     - net type: o2ib
       local NI(s):
         - nid: 172.18.178.216@o2ib
           status: down
           interfaces:
               0: ibs1f0
           statistics:
               send_count: 45032
               recv_count: 45030
               drop_count: 0
           tunables:
               peer_timeout: 100
               peer_credits: 32
               peer_buffer_credits: 0
               credits: 256
           lnd tunables:
               peercredits_hiw: 16
               map_on_demand: 1
               concurrent_sends: 32
               fmr_pool_size: 512
               fmr_flush_trigger: 384
               fmr_cache: 1
               ntx: 512
               conns_per_peer: 1
           dev cpt: 0
           CPT: "[0,1]"
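
In case it helps anyone check the same thing across several nodes, here is a 
minimal Python sketch that scans saved "lnetctl net show --verbose" output and 
lists the NIDs whose status is down. The function name down_nids and the SAMPLE 
text are illustrative assumptions only, and the line-based matching assumes the 
output layout shown above rather than being a full YAML parse.

```python
import re


def down_nids(net_show_text):
    """Return NIDs whose 'status:' line reads 'down'.

    Assumes the layout of `lnetctl net show --verbose` output shown
    above: each '- nid:' line is followed by its own 'status:' line.
    """
    down = []
    current_nid = None
    for line in net_show_text.splitlines():
        nid_match = re.match(r"\s*-\s*nid:\s*(.*\S)", line)
        if nid_match:
            # Remember the NID until its status line is seen.
            current_nid = nid_match.group(1)
            continue
        status_match = re.match(r"\s*status:\s*(\S+)", line)
        if status_match and current_nid is not None:
            if status_match.group(1).lower() == "down":
                down.append(current_nid)
            current_nid = None
    return down


# Hypothetical sample, trimmed from the output above.
SAMPLE = """\
net:
     - net type: o2ib
       local NI(s):
         - nid: 172.18.178.216@o2ib
           status: down
           interfaces:
               0: ibs1f0
"""

if __name__ == "__main__":
    for nid in down_nids(SAMPLE):
        print("down:", nid)
```

To use it on a real node, the text passed to down_nids would be the captured 
output of the lnetctl command above instead of SAMPLE.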



Could this be a hardware error, even though the IB is working?

Could it be related to https://jira.whamcloud.com/browse/LU-16378 ?

Are there any suggestions on how to bring the LNet network back up or 
otherwise fix the problem?

Thanks,
Alastair.

