[lustre-discuss] LNet connectivity issues in virtualized environments

John Bent johnbent at gmail.com
Mon Mar 3 07:42:55 PST 2025


Dear Lustre Users,

I'm currently troubleshooting an LNet connectivity issue within a
virtualized Lustre cluster and would appreciate any guidance.

*Cluster Setup:*

   - Multiple virtual machines (VMs) distributed across several physical
   hosts.
   - Each VM has two network interfaces:
      - A local interface for intra-host communication.
      - An enp2s0 interface utilizing WireGuard tunneling for inter-host
      communication.

*Issue Description:*

   - LNet communications function correctly between VMs residing on the
   same physical host.
   - LNet communications fail between VMs on different physical hosts.
   github.com <https://github.com/open-mpi/ompi/issues/12232>

*Diagnostic Observations:*

From an OSS on a different physical node:

   - ping 192.68.11.35 (MGS address) succeeds.
   - lctl ping 192.68.11.35@tcp results in:
      - failed to ping 192.68.11.35@tcp: Input/output error
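
Because ICMP ping succeeds while lctl ping fails, one quick check is whether a plain TCP connection to the LNet port that appears in the debug log (988) can be opened from the OSS at all. The following is a minimal Python sketch of that check, not a Lustre tool; the helper name tcp_port_open and the 5-second default timeout are assumptions made for illustration.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port is established within timeout."""
    try:
        # create_connection performs the full TCP handshake, unlike ICMP ping.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling tcp_port_open("192.68.11.35", 988) from the OSS would test the MGS address used in this message. A False result while ICMP ping succeeds would suggest that TCP traffic to that port is being dropped somewhere on the path, rather than a problem in the LNet configuration itself.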

In the debug log, I see:

    00000400:00000100:1.0:1741014384.744631:0:27198:0:(acceptor.c:109:lnet_connect_console_error())
    Connection to 192.68.11.35@tcp at host 192.68.11.35:988 took too long:
    that node may be hung or experiencing high load.
    00000400:00000200:1.0:1741014384.744636:0:27198:0:(router.c:1739:lnet_notify())
    192.68.11.4@tcp notifying 192.68.11.35@tcp: down
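
To make the log header easier to read, here is a small Python sketch that splits one header line into fields and converts the epoch timestamp to UTC. The field labels (subsystem, mask, cpu, stack, pid, ext_pid) are assumptions about the debug log layout, and parse_header is a hypothetical helper, not part of Lustre.

```python
import re
from datetime import datetime, timezone

# Assumed layout: subsystem:mask:cpu:sec.usec:stack:pid:ext_pid:(file:line:func())
HEADER = re.compile(
    r"^(?P<subsystem>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[^:]+):"
    r"(?P<secs>\d+)\.(?P<usecs>\d{1,6}):(?P<stack>\d+):(?P<pid>\d+):"
    r"(?P<ext_pid>\d+):\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)"
)


def parse_header(text: str):
    """Return a dict of header fields, or None if the line does not match."""
    match = HEADER.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Build the timestamp from integer parts to avoid float rounding.
    usecs = int(fields["usecs"].ljust(6, "0"))
    fields["time_utc"] = datetime.fromtimestamp(
        int(fields["secs"]), tz=timezone.utc
    ).replace(microsecond=usecs)
    return fields
```

Applied to the first header line in the log, this yields the source location acceptor.c:109 in lnet_connect_console_error and a timestamp of 2025-03-03 15:06:24 UTC.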


The output of 'lnetctl net show' on the OSS:

    net:
        - net type: lo
          local NI(s):
            - nid: 0@lo
              status: up
        - net type: tcp
          local NI(s):
            - nid: 192.68.11.4@tcp
              status: up
              interfaces:
                  0: enp2s0

The contents of /etc/modprobe.d/lustre.conf are:

    options lnet networks=tcp(enp2s0)

One last potentially relevant piece of info: I have a comparable system
working with BeeGFS, but testing it with IOR using Open MPI didn't work
because Open MPI has strict subnet checking and somehow didn't like how I
created the virtual network across WireGuard tunnels. Instead of figuring
that out, I switched to MPICH, which works fine. I have very little
experience in networking, so please forgive me if I'm just missing
something very obvious here.
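
For readers unfamiliar with what a subnet check involves, the following Python sketch tests whether two addresses fall in the same network and whether an address lies in the RFC 1918 private ranges. It does not reproduce Open MPI's actual logic. The /24 prefix length is an assumption, since the netmask is not stated in this message, and both helper names are hypothetical.

```python
import ipaddress

# The three IPv4 private ranges defined by RFC 1918.
RFC1918_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def same_subnet(addr_a: str, addr_b: str, prefix_len: int) -> bool:
    """Return True if addr_b lies in the network of addr_a/prefix_len."""
    # strict=False lets a host address stand in for its network.
    net = ipaddress.ip_network(f"{addr_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(addr_b) in net


def is_rfc1918(addr: str) -> bool:
    """Return True if addr is inside one of the RFC 1918 private ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918_NETS)
```

With an assumed /24, same_subnet("192.68.11.4", "192.68.11.35", 24) is True, so the OSS and MGS addresses share a subnet. Note that is_rfc1918("192.68.11.35") is False, because 192.68.0.0 is outside 192.168.0.0/16; whether that has any bearing on the behavior described here is not established.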

Thanks in advance for any help I might get from this awesome community and
please let me know if there is more info I can provide!

Thanks,

John