[lustre-discuss] LNET issues
Alastair Basden
a.g.basden at durham.ac.uk
Wed Sep 4 10:50:45 PDT 2024
Hi Makia,
Yes, sorry, that should be:
From the client (172.18.178.216):
lnetctl ping 172.18.185.8@o2ib
manage:
    - ping:
          errno: -1
          descr: failed to ping 172.18.185.8@o2ib: Input/output error
From the server (172.18.185.8):
lnetctl ping 172.18.178.216@o2ib
manage:
    - ping:
          errno: -1
          descr: failed to ping 172.18.178.216@o2ib: Input/output error
And yet a standard ping works.
Pinging to/from other clients and other OSSs works, i.e. the file system
is fully functional and in production; just this client and one or two
others are having problems.
One of the core-edge links is currently down on the edge switch that this
client is attached to. Given that a standard ping works, IP connectivity is
there. But perhaps there is some RDMA issue?
Cheers,
Alastair.
On Wed, 4 Sep 2024, Makia Minich wrote:
> The IP for the nid in your “net show” isn’t any of the nids you pinged. Is an address misconfigured somewhere?
>
>> On Sep 4, 2024, at 2:52 AM, Alastair Basden via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
>>
>> Hi,
>>
>> We are having some Lnet issues, and wonder if anyone can advise.
>>
>> Client is 2.15.5, server is 2.12.6.
>>
>> Fabric is IB.
>>
>> The file system mounts, but OSTs on a couple of OSSs are not contactable.
>>
>> Client and servers can ping each other over the IB network.
>>
>> However, an lnetctl ping between the bad OSSs and this client fails in both directions. To other clients it's all fine.
>>
>> i.e. for most of the clients it is working well, just one or two not so.
>>
>> Server to client:
>> lnetctl ping 172.18.178.201@o2ib
>> manage:
>>     - ping:
>>           errno: -1
>>           descr: failed to ping 172.18.178.201@o2ib: Input/output error
>>
>> Client to server:
>> lnetctl ping 172.18.185.10@o2ib
>> manage:
>>     - ping:
>>           errno: -1
>>           descr: failed to ping 172.18.185.10@o2ib: Input/output error
>>
>>
>>
>> And the o2ib network is noted as down:
>> lnetctl net show --net o2ib --verbose
>> net:
>>     - net type: o2ib
>>       local NI(s):
>>         - nid: 172.18.178.216@o2ib
>>           status: down
>>           interfaces:
>>               0: ibs1f0
>>           statistics:
>>               send_count: 45032
>>               recv_count: 45030
>>               drop_count: 0
>>           tunables:
>>               peer_timeout: 100
>>               peer_credits: 32
>>               peer_buffer_credits: 0
>>               credits: 256
>>           lnd tunables:
>>               peercredits_hiw: 16
>>               map_on_demand: 1
>>               concurrent_sends: 32
>>               fmr_pool_size: 512
>>               fmr_flush_trigger: 384
>>               fmr_cache: 1
>>               ntx: 512
>>               conns_per_peer: 1
>>           dev cpt: 0
>>           CPT: "[0,1]"
>>
>>
>>
>> Could this be a hardware error, even though the IB is working?
>>
>> Could it be related to https://jira.whamcloud.com/browse/LU-16378 ?
>>
>> Are there any suggestions on how to bring up the lnet network or fix the problems?
>>
>> Thanks,
>> Alastair.