[lustre-discuss] Cannot do a ping with LNet over Infiniband

Michael Di Domenico mdidomenico4 at gmail.com
Thu Jan 21 09:40:45 PST 2021


It may seem a silly question, but since you don't show any output from
the .21 host: is that node also configured and up from an LNet perspective?
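
As a sketch (assuming the host that owns 10.148.0.21 is mds2, and the standard lctl/lnetctl commands from Lustre 2.12), something like:

```shell
# On the host that owns 10.148.0.21 (presumably mds2):
lnetctl net show        # is there an o2ib1 NID, and is its status "up"?
lctl list_nids          # should include 10.148.0.21@o2ib1

# Try the LNet ping in the opposite direction:
lctl ping 10.148.0.20@o2ib1

# And confirm the IB port itself is up on both ends:
ibstat                  # State: Active, Physical state: LinkUp
```

If the reverse ping fails the same way, the problem is likely symmetric (fabric or ko2iblnd config) rather than something specific to mds1.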

On Mon, Jan 18, 2021 at 11:39 PM Vinícius Ferrão
<ferrao at versatushpc.com.br> wrote:
>
> Hello,
>
> I’ve been scratching my head for three days now, but I cannot do a simple ping over InfiniBand using LNet. To be honest, I have no idea what may be happening. LNet over TCP (on Ethernet) seems to work fine. The only LNet ping that works is the node pinging itself:
>
> [root@mds1 ~]# lctl ping 10.148.0.20@o2ib1
> 12345-0@lo
> 12345-10.24.2.12@tcp1
> 12345-10.148.0.20@o2ib1
>
> Everything else just fails:
>
> [root@mds1 ~]# lctl ping 10.148.0.21@o2ib1
> failed to ping 10.148.0.21@o2ib1: Input/output error
> [root@mds1 ~]# dmesg -T | tail -n 2
> [Tue Jan 19 01:26:01 2021] LNet: 2424:0:(o2iblnd_cb.c:3405:kiblnd_check_conns()) Timed out tx for 10.148.0.21@o2ib1: 5095 seconds
> [Tue Jan 19 01:26:01 2021] LNetError: 2362:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.148.0.21@o2ib1: -125
>
> I can confirm that the IPoIB network is working as expected:
>
> [root@mds1 ~]# ping 10.148.0.21
> PING 10.148.0.21 (10.148.0.21) 56(84) bytes of data.
> 64 bytes from 10.148.0.21: icmp_seq=1 ttl=64 time=2.52 ms
> 64 bytes from 10.148.0.21: icmp_seq=2 ttl=64 time=0.085 ms
>
> Configuration seems to match between the two machines:
>
> [root@mds1 ~]# ifconfig ib0 | head -n 2
> Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
> ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65520
>         inet 10.148.0.20  netmask 255.255.0.0  broadcast 10.148.255.255
>
> [root@mds2 ~]# ifconfig ib0 | head -n 2
> Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
> ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65520
>         inet 10.148.0.21  netmask 255.255.0.0  broadcast 10.148.255.255
>
> Here’s the LNet network configuration:
> [root@mds1 ~]# lnetctl net show
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0@lo
>           status: up
>     - net type: tcp1
>       local NI(s):
>         - nid: 10.24.2.12@tcp1
>           status: up
>           interfaces:
>               0: bond0
>     - net type: o2ib1
>       local NI(s):
>         - nid: 10.148.0.20@o2ib1
>           status: up
>           interfaces:
>               0: ib0
>
> Modules seem to be loaded:
> [root@mds1 ~]# lsmod | egrep "mlx|mlnx|lnet|rdma|ko2iblnd"
> lnet_selftest         274357  0
> ko2iblnd              238469  1
> lnet                  595358  4 ko2iblnd,lnet_selftest,ksocklnd
> libcfs                415577  4 lnet,ko2iblnd,lnet_selftest,ksocklnd
> rdma_ucm               26931  0
> rdma_cm                64252  2 ko2iblnd,rdma_ucm
> iw_cm                  43918  1 rdma_cm
> ib_cm                  53015  3 rdma_cm,ib_ucm,ib_ipoib
> mlx4_en               142468  0
> mlx4_ib               220791  0
> mlx4_core             361489  2 mlx4_en,mlx4_ib
> mlx5_ib               398193  0
> ib_uverbs             134646  3 mlx5_ib,ib_ucm,rdma_ucm
> ib_core               379808  11 rdma_cm,ib_cm,iw_cm,ko2iblnd,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
> mlx5_core            1113637  1 mlx5_ib
> mlxfw                  18227  1 mlx5_core
> devlink                60067  4 mlx4_en,mlx4_ib,mlx4_core,mlx5_core
> mlx_compat             47141  15 rdma_cm,ib_cm,iw_cm,ko2iblnd,mlx4_en,mlx4_ib,mlx5_ib,ib_ucm,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
> ptp                    23551  3 i40e,mlx4_en,mlx5_core
>
> Both systems are running CentOS 7.9, Lustre 2.12.6 (IB branch) and Mellanox OFED 4.9-2.2.4.0.
>
> The only error messages I’ve found are the dmesg lines pasted at the start of this message and the I/O error from lctl ping.
>
> Any help is greatly appreciated.
> Thanks,
> Vinícius.
>
>
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

