[lustre-discuss] LNET issues

Hans Henrik Happe happe at nbi.dk
Wed Sep 11 23:46:23 PDT 2024


Hi,

We started having the same issue after upgrading servers from 2.12.9 to 
2.15.5 and clients from 2.15.3 to 2.15.5. Only a couple of older OSSs had 
the issue. They used ConnectX-3 FDR cards and the mlx4 driver. After 
replacing the cards with newer ConnectX-4 ones, which use the mlx5 driver, 
we haven't had the issue so far. We still have FDR/mlx4 clients using the 
file system.

These are the drivers provided by the OS (Rocky 8 on servers and Rocky 9 
on clients).

Are you using IB cards that use the mlx4 driver on the OSSs?
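
One way to answer that on a given node is to look at which kernel driver 
each RDMA device is bound to. The following is only a minimal sketch: it 
assumes the usual Linux sysfs layout, where each entry under 
/sys/class/infiniband has a device/driver symlink pointing at the PCI 
driver (typically mlx4_core or mlx5_core for these cards). The function 
names are made up for illustration.

```python
import os


def ib_device_drivers(sysfs_root="/sys/class/infiniband"):
    """Map each RDMA device name to the kernel driver bound to it.

    Assumes the usual sysfs layout: <root>/<dev>/device/driver is a
    symlink whose last path component is the driver name.
    """
    drivers = {}
    if not os.path.isdir(sysfs_root):
        return drivers  # no RDMA devices, or not a Linux host
    for dev in sorted(os.listdir(sysfs_root)):
        link = os.path.join(sysfs_root, dev, "device", "driver")
        if os.path.islink(link):
            drivers[dev] = os.path.basename(os.readlink(link))
        else:
            drivers[dev] = None  # driver could not be determined
    return drivers


def uses_mlx4(drivers):
    """Return the device names whose driver name starts with 'mlx4'."""
    return sorted(d for d, drv in drivers.items() if drv and drv.startswith("mlx4"))


if __name__ == "__main__":
    found = ib_device_drivers()
    for dev, drv in found.items():
        print(f"{dev}: {drv}")
    print("mlx4 devices:", uses_mlx4(found) or "none")
```

Run on each OSS, this lists the devices and flags any that are handled 
by an mlx4 driver.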

Cheers,
Hans Henrik

On 04/09/2024 19.50, Alastair Basden via lustre-discuss wrote:
> Hi Makie,
>
> Yes, sorry, that should be:
>
> From the client (172.18.178.216):
> lnetctl ping 172.18.185.8@o2ib
> manage:
>     - ping:
>           errno: -1
>           descr: failed to ping 172.18.185.8@o2ib: Input/output error
>
>
> From the server (172.18.185.8):
> lnetctl ping 172.18.178.216@o2ib
> manage:
>     - ping:
>           errno: -1
>           descr: failed to ping 172.18.178.216@o2ib: Input/output error
>
>
>
> And yet a standard ping works.
>
> Pinging to/from other clients and other OSSs works, i.e. the file 
> system is fully functional and in production; only this client and one 
> or two others are having problems.
>
> We are one link down on the core-edge uplink of the edge switch that 
> this client is attached to.  Given that a standard ping works, 
> connectivity is there.  But perhaps there is some RDMA issue?
>
> Cheers,
> Alastair.
>
> On Wed, 4 Sep 2024, Makia Minich wrote:
>
>> The IP for the nid in your “net show” isn’t any of the nids you 
>> pinged. Is an address misconfigured somewhere?
>>
>>> On Sep 4, 2024, at 2:52 AM, Alastair Basden via lustre-discuss 
>>> <lustre-discuss at lists.lustre.org> wrote:
>>>
>>> Hi,
>>>
>>> We are having some LNet issues and wonder if anyone can advise.
>>>
>>> Client is 2.15.5, server is 2.12.6.
>>>
>>> Fabric is IB.
>>>
>>> The file system mounts, but OSTs on a couple of OSSs are not 
>>> contactable.
>>>
>>> Client and servers can ping each other over the IB network.
>>>
>>> However, an lnetctl ping fails between the bad OSSs and this client.  
>>> To other clients it's all fine.
>>>
>>> i.e. for most of the clients it is working well, just not for one or 
>>> two.
>>>
>>> Server to client:
>>> lnetctl ping 172.18.178.201@o2ib
>>> manage:
>>>    - ping:
>>>          errno: -1
>>>          descr: failed to ping 172.18.178.201@o2ib: Input/output error
>>>
>>> Client to server:
>>> manage:
>>>    - ping:
>>>          errno: -1
>>>          descr: failed to ping 172.18.185.10@o2ib: Input/output error
>>>
>>>
>>>
>>> And the o2ib network is noted as down:
>>> lnetctl net show --net o2ib --verbose
>>> net:
>>>    - net type: o2ib
>>>      local NI(s):
>>>        - nid: 172.18.178.216@o2ib
>>>          status: down
>>>          interfaces:
>>>              0: ibs1f0
>>>          statistics:
>>>              send_count: 45032
>>>              recv_count: 45030
>>>              drop_count: 0
>>>          tunables:
>>>              peer_timeout: 100
>>>              peer_credits: 32
>>>              peer_buffer_credits: 0
>>>              credits: 256
>>>          lnd tunables:
>>>              peercredits_hiw: 16
>>>              map_on_demand: 1
>>>              concurrent_sends: 32
>>>              fmr_pool_size: 512
>>>              fmr_flush_trigger: 384
>>>              fmr_cache: 1
>>>              ntx: 512
>>>              conns_per_peer: 1
>>>          dev cpt: 0
>>>          CPT: "[0,1]"
>>>
>>>
>>>
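
When checking many nodes, the "status: down" line in output like the 
above can be picked out with a small script. This is a rough sketch 
only: it scans the text line by line rather than using a real YAML 
parser, it assumes the field layout shown in this thread, and the 
function names are hypothetical. Note that NIDs are written with "@", 
which the list archive renders as " at ".

```python
def ni_status(net_show_text):
    """Return {nid: status} from `lnetctl net show` style output.

    Simple line scan: each "- nid:" line starts a new interface, and the
    next "status:" line that follows is recorded for it.
    """
    statuses = {}
    current = None
    for raw in net_show_text.splitlines():
        line = raw.strip()
        if line.startswith("- nid:"):
            current = line.split(":", 1)[1].strip()
            statuses[current] = None
        elif line.startswith("status:") and current is not None:
            statuses[current] = line.split(":", 1)[1].strip()
    return statuses


def down_nis(net_show_text):
    """List the NIDs whose reported status is anything other than 'up'."""
    return [nid for nid, st in ni_status(net_show_text).items() if st != "up"]


# Abbreviated sample, taken from the output quoted in this thread.
SAMPLE = """\
net:
    - net type: o2ib
      local NI(s):
        - nid: 172.18.178.216@o2ib
          status: down
          interfaces:
              0: ibs1f0
"""

if __name__ == "__main__":
    for nid in down_nis(SAMPLE):
        print("NI not up:", nid)
```

In practice the text would come from running the command on the node 
instead of from the SAMPLE string.
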
>>> Could this be a hardware error, even though the IB is working?
>>>
>>> Could it be related to https://jira.whamcloud.com/browse/LU-16378 ?
>>>
>>> Are there any suggestions on how to bring up the lnet network or fix 
>>> the problems?
>>>
>>> Thanks,
>>> Alastair.
>>> _______________________________________________
>>> lustre-discuss mailing list
>>> lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>