[lustre-discuss] Cannot do a ping with LNet over Infiniband

Vinícius Ferrão ferrao at versatushpc.com.br
Mon Jan 18 20:39:18 PST 2021


Hello,

I’ve been scratching my head for three days now, but I cannot do a simple ping over InfiniBand using LNet. To be honest, I have no idea what may be happening. LNet over TCP (on Ethernet) seems to work fine. The only LNet ping that works over InfiniBand is a node pinging itself:

[root@mds1 ~]# lctl ping 10.148.0.20@o2ib1
12345-0@lo
12345-10.24.2.12@tcp1
12345-10.148.0.20@o2ib1

Everything else just fails:

[root@mds1 ~]# lctl ping 10.148.0.21@o2ib1
failed to ping 10.148.0.21@o2ib1: Input/output error
[root@mds1 ~]# dmesg -T | tail -n 2
[Tue Jan 19 01:26:01 2021] LNet: 2424:0:(o2iblnd_cb.c:3405:kiblnd_check_conns()) Timed out tx for 10.148.0.21@o2ib1: 5095 seconds
[Tue Jan 19 01:26:01 2021] LNetError: 2362:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.148.0.21@o2ib1: -125
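
When comparing the identifiers in the ping output with the one in the error line, it can help to split them into their parts. The following is only a minimal sketch. It assumes the identifiers have the form PID-ADDRESS@NET, as the pasted output suggests, and parse_lnet_id is a hypothetical helper name, not something provided by lctl or lnetctl:

```shell
#!/usr/bin/env bash
# parse_lnet_id is a hypothetical helper, not part of lctl or lnetctl.
# It assumes the identifier looks like PID-ADDRESS@NET.
parse_lnet_id() {
    local id=$1
    local pid=${id%%-*}      # text before the first "-"
    local nid=${id#*-}       # text after the first "-"
    local addr=${nid%@*}     # text before the "@"
    local net=${nid##*@}     # text after the "@"
    printf 'pid=%s addr=%s net=%s\n' "$pid" "$addr" "$net"
}

parse_lnet_id "12345-10.148.0.21@o2ib1"
```

This uses only bash parameter expansion, so it runs on any node without Lustre installed.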

I can confirm that the IPoIB network is working as expected:

[root@mds1 ~]# ping 10.148.0.21
PING 10.148.0.21 (10.148.0.21) 56(84) bytes of data.
64 bytes from 10.148.0.21: icmp_seq=1 ttl=64 time=2.52 ms
64 bytes from 10.148.0.21: icmp_seq=2 ttl=64 time=0.085 ms

The configuration seems to match between the two example machines:

[root@mds1 ~]# ifconfig ib0 | head -n 2
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65520
        inet 10.148.0.20  netmask 255.255.0.0  broadcast 10.148.255.255

[root@mds2 ~]# ifconfig ib0 | head -n 2
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65520
        inet 10.148.0.21  netmask 255.255.0.0  broadcast 10.148.255.255
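
As a sanity check on the two ifconfig outputs, the addresses can be compared under the shown netmask. This is just a sketch in plain bash. The function names ip_to_int and same_subnet are made up for illustration, and the check only confirms that the IPoIB addressing is consistent. It says nothing about the RDMA path that o2ib uses:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local IFS=.
    local -a o
    read -r -a o <<< "$1"
    echo $(( (o[0] << 24) | (o[1] << 16) | (o[2] << 8) | o[3] ))
}

# Succeed when both addresses fall in the same network under the netmask.
same_subnet() {
    local a b m
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    m=$(ip_to_int "$3")
    (( (a & m) == (b & m) ))
}

# Addresses and netmask taken from the ifconfig output of mds1 and mds2.
if same_subnet 10.148.0.20 10.148.0.21 255.255.0.0; then
    echo "same subnet"
else
    echo "different subnets"
fi
```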

Here’s the LNet network configuration:
[root@mds1 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp1
      local NI(s):
        - nid: 10.24.2.12@tcp1
          status: up
          interfaces:
              0: bond0
    - net type: o2ib1
      local NI(s):
        - nid: 10.148.0.20@o2ib1
          status: up
          interfaces:
              0: ib0
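
To compare the configured NIDs across nodes without reading the whole listing, the "nid:" lines can be pulled out with awk. This is a rough sketch that assumes the output layout shown above, with NIDs written in their usual address@network form. list_nids is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# list_nids is a hypothetical helper: it reads `lnetctl net show` style
# output on stdin and prints one NID per line.
list_nids() {
    awk '$1 == "-" && $2 == "nid:" { print $3 }'
}

# Sample input copied from the o2ib1 part of the listing.
list_nids <<'EOF'
net:
    - net type: o2ib1
      local NI(s):
        - nid: 10.148.0.20@o2ib1
          status: up
          interfaces:
              0: ib0
EOF
```

On a live node the same function could be fed directly, as in `lnetctl net show | list_nids`, and the result compared between mds1 and mds2.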

The modules seem to be loaded:
[root@mds1 ~]# lsmod | egrep "mlx|mlnx|lnet|rdma|ko2iblnd"
lnet_selftest         274357  0 
ko2iblnd              238469  1 
lnet                  595358  4 ko2iblnd,lnet_selftest,ksocklnd
libcfs                415577  4 lnet,ko2iblnd,lnet_selftest,ksocklnd
rdma_ucm               26931  0 
rdma_cm                64252  2 ko2iblnd,rdma_ucm
iw_cm                  43918  1 rdma_cm
ib_cm                  53015  3 rdma_cm,ib_ucm,ib_ipoib
mlx4_en               142468  0 
mlx4_ib               220791  0 
mlx4_core             361489  2 mlx4_en,mlx4_ib
mlx5_ib               398193  0 
ib_uverbs             134646  3 mlx5_ib,ib_ucm,rdma_ucm
ib_core               379808  11 rdma_cm,ib_cm,iw_cm,ko2iblnd,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
mlx5_core            1113637  1 mlx5_ib
mlxfw                  18227  1 mlx5_core
devlink                60067  4 mlx4_en,mlx4_ib,mlx4_core,mlx5_core
mlx_compat             47141  15 rdma_cm,ib_cm,iw_cm,ko2iblnd,mlx4_en,mlx4_ib,mlx5_ib,ib_ucm,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
ptp                    23551  3 i40e,mlx4_en,mlx5_core

Both systems are running CentOS 7.9, Lustre 2.12.6 (IB Branch) and Mellanox OFED 4.9-2.2.4.0.

The only error messages that I’ve found are the ones pasted at the start of this message: the dmesg output and the Input/output error from lctl ping.

Any help is greatly appreciated.
Thanks,
Vinícius.




