[lustre-discuss] LNET issues

Alastair Basden a.g.basden at durham.ac.uk
Wed Sep 4 10:50:45 PDT 2024


Hi Makia,

Yes, sorry, that should be:

From the client (172.18.178.216):
lnetctl ping 172.18.185.8@o2ib
manage:
     - ping:
           errno: -1
           descr: failed to ping 172.18.185.8@o2ib: Input/output error


From the server (172.18.185.8):
lnetctl ping 172.18.178.216@o2ib
manage:
     - ping:
           errno: -1
           descr: failed to ping 172.18.178.216@o2ib: Input/output error



And yet a standard ping works.
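
For anyone wanting to reproduce the comparison, here is a minimal Python sketch that runs both checks against one address. The function name compare_reachability and the injectable run parameter are my own, for illustration only. It assumes Linux ping and lnetctl are on the PATH, and that a failed lnetctl ping either exits non-zero or prints a negative errno line like the output above.

```python
import subprocess

def compare_reachability(ip, net="o2ib", run=subprocess.run):
    """Return (icmp_ok, lnet_ok) for one peer address."""
    nid = f"{ip}@{net}"
    # ICMP ping: one packet, 2 second wait (Linux iputils flags).
    icmp = run(["ping", "-c", "1", "-W", "2", ip],
               capture_output=True, text=True)
    # LNet ping: goes through the LND (o2iblnd here), not plain IP.
    lnet = run(["lnetctl", "ping", nid],
               capture_output=True, text=True)
    # Assumption: failure shows up as a non-zero exit status or as a
    # negative "errno:" line in the YAML that lnetctl prints.
    output = (lnet.stdout or "") + (lnet.stderr or "")
    lnet_ok = lnet.returncode == 0 and "errno: -" not in output
    return icmp.returncode == 0, lnet_ok

if __name__ == "__main__":
    print(compare_reachability("172.18.185.8"))
```

A result of (True, False) would match what we see here: IP connectivity is fine but the LNet-level ping fails.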

Pinging to/from other clients and other OSSs works.  i.e. the file system 
is fully functional and in production, just this client and one or two 
others are having problems.

We are currently one link down on the core-edge connection of the edge 
switch that this client is attached to.  Given that a standard ping works, 
IP connectivity is there.  But perhaps there is some RDMA issue?

Cheers,
Alastair.

On Wed, 4 Sep 2024, Makia Minich wrote:

> The IP for the nid in your “net show” isn’t any of the nids you pinged. Is an address misconfigured somewhere?
>
>> On Sep 4, 2024, at 2:52 AM, Alastair Basden via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
>>
>> Hi,
>>
>> We are having some LNet issues, and wonder if anyone can advise.
>>
>> Client is 2.15.5, server is 2.12.6.
>>
>> Fabric is IB.
>>
>> The file system mounts, but OSTs on a couple of OSSs are not contactable.
>>
>> Client and servers can ping each other over the IB network.
>>
>> However, an lnetctl ping fails between the bad OSSs and this client, in both directions.  To other clients it's all fine.
>>
>> i.e. for most of the clients it is working well, just one or two not so.
>>
>> Server to client:
>> lnetctl ping 172.18.178.201@o2ib
>> manage:
>>    - ping:
>>          errno: -1
>>          descr: failed to ping 172.18.178.201@o2ib: Input/output error
>>
>> Client to server:
>> lnetctl ping 172.18.185.10@o2ib
>> manage:
>>    - ping:
>>          errno: -1
>>          descr: failed to ping 172.18.185.10@o2ib: Input/output error
>>
>>
>>
>> And the o2ib network is noted as down:
>> lnetctl net show --net o2ib --verbose
>> net:
>>    - net type: o2ib
>>      local NI(s):
>>        - nid: 172.18.178.216@o2ib
>>          status: down
>>          interfaces:
>>              0: ibs1f0
>>          statistics:
>>              send_count: 45032
>>              recv_count: 45030
>>              drop_count: 0
>>          tunables:
>>              peer_timeout: 100
>>              peer_credits: 32
>>              peer_buffer_credits: 0
>>              credits: 256
>>          lnd tunables:
>>              peercredits_hiw: 16
>>              map_on_demand: 1
>>              concurrent_sends: 32
>>              fmr_pool_size: 512
>>              fmr_flush_trigger: 384
>>              fmr_cache: 1
>>              ntx: 512
>>              conns_per_peer: 1
>>          dev cpt: 0
>>          CPT: "[0,1]"
>>
>>
>>
>> Could this be a hardware error, even though the IB is working?
>>
>> Could it be related to https://jira.whamcloud.com/browse/LU-16378 ?
>>
>> Are there any suggestions on how to bring up the lnet network or fix the problems?
>>
>> Thanks,
>> Alastair.
>
