[lustre-discuss] Cannot do a ping with LNet over Infiniband
Vinícius Ferrão
ferrao at versatushpc.com.br
Mon Jan 18 20:39:18 PST 2021
Hello,
I’ve been scratching my head for three days now, but I cannot get a simple ping to work over Infiniband using LNet. To be honest, I have no idea what may be happening. LNet over TCP (on Ethernet) seems to work fine. The only LNet ping that works over Infiniband is a node pinging itself:
[root@mds1 ~]# lctl ping 10.148.0.20@o2ib1
12345-0@lo
12345-10.24.2.12@tcp1
12345-10.148.0.20@o2ib1
Everything else just fails:
[root@mds1 ~]# lctl ping 10.148.0.21@o2ib1
failed to ping 10.148.0.21@o2ib1: Input/output error
[root@mds1 ~]# dmesg -T | tail -n 2
[Tue Jan 19 01:26:01 2021] LNet: 2424:0:(o2iblnd_cb.c:3405:kiblnd_check_conns()) Timed out tx for 10.148.0.21@o2ib1: 5095 seconds
[Tue Jan 19 01:26:01 2021] LNetError: 2362:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.148.0.21@o2ib1: -125
I can confirm that the IPoIB network is working as expected:
[root@mds1 ~]# ping 10.148.0.21
PING 10.148.0.21 (10.148.0.21) 56(84) bytes of data.
64 bytes from 10.148.0.21: icmp_seq=1 ttl=64 time=2.52 ms
64 bytes from 10.148.0.21: icmp_seq=2 ttl=64 time=0.085 ms
The configuration seems to match between the two example machines:
[root@mds1 ~]# ifconfig ib0 | head -n 2
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65520
        inet 10.148.0.20  netmask 255.255.0.0  broadcast 10.148.255.255
[root@mds2 ~]# ifconfig ib0 | head -n 2
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65520
        inet 10.148.0.21  netmask 255.255.0.0  broadcast 10.148.255.255
Here’s the output of the LNet network configuration:
[root@mds1 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp1
      local NI(s):
        - nid: 10.24.2.12@tcp1
          status: up
          interfaces:
              0: bond0
    - net type: o2ib1
      local NI(s):
        - nid: 10.148.0.20@o2ib1
          status: up
          interfaces:
              0: ib0
The modules seem to be loaded:
[root@mds1 ~]# lsmod | egrep "mlx|mlnx|lnet|rdma|ko2iblnd"
lnet_selftest 274357 0
ko2iblnd 238469 1
lnet 595358 4 ko2iblnd,lnet_selftest,ksocklnd
libcfs 415577 4 lnet,ko2iblnd,lnet_selftest,ksocklnd
rdma_ucm 26931 0
rdma_cm 64252 2 ko2iblnd,rdma_ucm
iw_cm 43918 1 rdma_cm
ib_cm 53015 3 rdma_cm,ib_ucm,ib_ipoib
mlx4_en 142468 0
mlx4_ib 220791 0
mlx4_core 361489 2 mlx4_en,mlx4_ib
mlx5_ib 398193 0
ib_uverbs 134646 3 mlx5_ib,ib_ucm,rdma_ucm
ib_core 379808 11 rdma_cm,ib_cm,iw_cm,ko2iblnd,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
mlx5_core 1113637 1 mlx5_ib
mlxfw 18227 1 mlx5_core
devlink 60067 4 mlx4_en,mlx4_ib,mlx4_core,mlx5_core
mlx_compat 47141 15 rdma_cm,ib_cm,iw_cm,ko2iblnd,mlx4_en,mlx4_ib,mlx5_ib,ib_ucm,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
ptp 23551 3 i40e,mlx4_en,mlx5_core
Both systems are running CentOS 7.9, Lustre 2.12.6 (IB branch), and Mellanox OFED 4.9-2.2.4.0.
The only error messages I’ve found are the dmesg output pasted at the start of this message and the I/O error from lctl ping.
Any help is greatly appreciated.
Thanks,
Vinícius.