[lustre-discuss] LNet connectivity issues in virtualized environments
John Bent
johnbent at gmail.com
Mon Mar 3 07:42:55 PST 2025
Dear Lustre Users,
I'm currently troubleshooting an LNet connectivity issue within a
virtualized Lustre cluster and would appreciate any guidance.
*Cluster Setup:*
- Multiple virtual machines (VMs) distributed across several physical
  hosts.
- Each VM has two network interfaces:
  - A local interface for intra-host communication.
  - An enp2s0 interface utilizing WireGuard tunneling for inter-host
    communication.
*Issue Description:*
- LNet communications function correctly between VMs residing on the
same physical host.
- LNet communications fail between VMs on different physical hosts.
github.com <https://github.com/open-mpi/ompi/issues/12232>
*Diagnostic Observations:*
From an OSS on a different physical node:
- ping 192.68.11.35 (MGS address) succeeds.
- lctl ping 192.68.11.35@tcp results in:
  - failed to ping 192.68.11.35@tcp: Input/output error
In the debug log, I see:
    00000400:00000100:1.0:1741014384.744631:0:27198:0:(acceptor.c:109:lnet_connect_console_error())
    Connection to 192.68.11.35@tcp at host 192.68.11.35:988 took too long:
    that node may be hung or experiencing high load.
    00000400:00000200:1.0:1741014384.744636:0:27198:0:(router.c:1739:lnet_notify())
    192.68.11.4@tcp notifying 192.68.11.35@tcp: down
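Since ICMP ping succeeds while the log shows the connection to port 988 timing out, one way to narrow this down is to test plain TCP reachability of that port from the OSS. The following is a minimal sketch, not part of Lustre; the helper name tcp_port_reachable and the 5-second timeout are assumptions chosen for illustration. It only checks whether a TCP handshake completes and says nothing about the LNet protocol itself.

```python
import socket


def tcp_port_reachable(host: str, port: int = 988, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout.

    Hypothetical helper for illustration; 988 is the port shown in the
    debug log above.
    """
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (including timeouts and refusals) if it cannot complete.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling tcp_port_reachable("192.68.11.35") from the OSS would then show whether TCP traffic to the MGS on port 988 gets through the tunnel at all. A False result while ping succeeds would point at TCP to that port being dropped or filtered somewhere along the path, rather than at LNet configuration.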
The output of 'lnetctl net show' on the OSS:
    net:
        - net type: lo
          local NI(s):
            - nid: 0@lo
              status: up
        - net type: tcp
          local NI(s):
            - nid: 192.68.11.4@tcp
              status: up
              interfaces:
                  0: enp2s0
The contents of /etc/modprobe.d/lustre.conf are:
    options lnet networks=tcp(enp2s0)
One last potentially relevant detail: I have a comparable system working
with BeeGFS, but testing it with IOR under OpenMPI failed because OpenMPI
does strict subnet checking and did not accept the way I built the virtual
network across the WireGuard tunnels. Instead of figuring that out, I
switched to MPICH, which works fine. I have very little experience in
networking, so please forgive me if I'm missing something obvious here.
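For reference, the kind of subnet comparison involved can be sketched as follows. This is only an illustration of what "same subnet" means for two addresses, not OpenMPI's actual logic; the function name same_subnet is hypothetical, and the prefix lengths are assumptions because the netmask on enp2s0 is not stated above.

```python
import ipaddress


def same_subnet(addr_a: str, addr_b: str, prefix_len: int) -> bool:
    """Return True if both IPv4 addresses fall in the same network
    under the given prefix length (hypothetical helper)."""
    net_a = ipaddress.ip_interface(f"{addr_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{prefix_len}").network
    return net_a == net_b


# OSS and MGS addresses from the report; prefix lengths are assumed.
print(same_subnet("192.68.11.4", "192.68.11.35", 24))
print(same_subnet("192.68.11.4", "192.68.11.35", 27))
```

With an assumed /24 the two addresses share a network, while with an assumed /27 they do not, so the netmask configured on each VM determines whether a strict check treats the peers as directly reachable.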
Thanks in advance for any help I might get from this awesome community and
please let me know if there is more info I can provide!
Thanks,
John