[lustre-discuss] Lustre client LNET problem from a novice
Yau Hing Tuen, Bill
billyau_hpc at hku.hk
Thu Apr 29 00:23:51 PDT 2021
Dear All,
I need some advice on the following situation: one of my servers
(a Lustre client only) can no longer connect to the Lustre server.
I suspect a problem with the LNET configuration, but I am too new to
Lustre and have no further clue on how to troubleshoot it.
Kernel version: Linux 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18
17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Lustre version: 2.14.0 (pulled from git)
Lustre debs built with GCC 9.3.0 on the server.
modprobe does not complete cleanly because the static lnet configuration does not work:
# modprobe -v lustre
insmod /lib/modules/5.4.0-65-generic/updates/kernel/net/libcfs.ko
insmod /lib/modules/5.4.0-65-generic/updates/kernel/net/lnet.ko networks="o2ib0(ibp225s0f0)"
insmod /lib/modules/5.4.0-65-generic/updates/kernel/fs/obdclass.ko
insmod /lib/modules/5.4.0-65-generic/updates/kernel/fs/ptlrpc.ko
modprobe: ERROR: could not insert 'lustre': Network is down
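The networks= parameter in the insmod line above comes from the static module options. A minimal sketch for locating where it is set, assuming the options live in a file under /etc/modprobe.d (the post does not say which file; the function name lnet_static_options is made up for illustration):

```shell
# Print every "options lnet ..." line found under a modprobe.d-style directory.
# The default directory is the conventional location (an assumption here);
# pass another directory as $1 to search elsewhere.
lnet_static_options() {
    dir="${1:-/etc/modprobe.d}"
    grep -rhE '^[[:space:]]*options[[:space:]]+lnet([[:space:]]|$)' "$dir"
}
```

Calling lnet_static_options with no argument searches the default directory; "modprobe -c | grep lnet" should show the same options as modprobe itself resolves them.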
So I resorted to dynamic lnet configuration:
# lctl net up
LNET configure error 100: Network is down
# lnetctl net show
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
# lnetctl net add --net o2ib0 --if ibp225s0f0
add:
- net:
errno: -100
descr: "cannot add network: Network is down"
dmesg shows the following error messages after the above "lnetctl net
add" command:
[265979.237735] LNet: 3893180:0:(config.c:1564:lnet_inet_enumerate()) lnet: Ignoring interface enxeeeb676d0232: it's down
[265979.237738] LNet: 3893180:0:(config.c:1564:lnet_inet_enumerate()) Skipped 9 previous similar messages
[265979.238395] LNetError: 3893180:0:(o2iblnd.c:2655:kiblnd_hdev_get_attr()) Invalid mr size: 0x1000000
[265979.267372] LNetError: 3893180:0:(o2iblnd.c:2869:kiblnd_dev_failover()) Can't get device attributes: -22
[265979.298129] LNetError: 3893180:0:(o2iblnd.c:3353:kiblnd_startup()) ko2iblnd: Can't initialize device: rc = -22
[265980.353643] LNetError: 105-4: Error -100 starting up LNI o2ib
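The negative numbers in these messages are standard Linux errno values: 22 is EINVAL and 100 is ENETDOWN. Going by the timestamps, the -22 from the o2iblnd device setup is logged first and the -100 "Network is down" follows it. A minimal sketch for pulling out the error records and naming the two codes seen here (errno_name and lnet_errors are illustrative helper names, not Lustre tools, and only these two codes are mapped):

```shell
# Map the two errno values that appear in this log to their symbolic names.
# Anything else is passed through unchanged rather than guessed at.
errno_name() {
    case "${1#-}" in
        22)  echo "EINVAL (Invalid argument)" ;;
        100) echo "ENETDOWN (Network is down)" ;;
        *)   echo "errno $1" ;;
    esac
}

# Read kernel log text on stdin and keep only the LNetError records.
lnet_errors() {
    grep 'LNetError'
}
```

Typical use would be "dmesg | lnet_errors" followed by "errno_name -22" for any code of interest.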
Initial Diagnosis:
# ip link show ibp225s0f0
41: ibp225s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 256
    link/infiniband 00:00:11:08:fe:80:00:00:00:00:00:00:0c:42:a1:03:00:79:99:1c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
# ip address show ibp225s0f0
41: ibp225s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:11:08:fe:80:00:00:00:00:00:00:0c:42:a1:03:00:79:99:1c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.10.10.3/16 brd 10.10.255.255 scope global ibp225s0f0
       valid_lft forever preferred_lft forever
    inet6 fe80::e42:a103:79:991c/64 scope link
       valid_lft forever preferred_lft forever
# ifconfig ibp225s0f0
ibp225s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
inet 10.10.10.3 netmask 255.255.0.0 broadcast 10.10.255.255
inet6 fe80::e42:a103:79:991c prefixlen 64 scopeid 0x20<link>
unspec 00-00-11-08-FE-80-00-00-00-00-00-00-00-00-00-00
txqueuelen 256 (UNSPEC)
RX packets 14363998 bytes 1440476592 (1.4 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 88 bytes 6648 (6.6 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
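The ip and ifconfig output above describes the IPoIB network interface; the state of the underlying InfiniBand port is exposed separately under /sys/class/infiniband. A minimal sketch that reads both from sysfs, assuming the standard sysfs layout (SYSFS_ROOT is a hypothetical override added only so the sketch can be exercised against a fake tree, and the function names are illustrative):

```shell
# Print the kernel's operational state ("up", "down", ...) for one interface.
iface_operstate() {
    cat "${SYSFS_ROOT:-/sys}/class/net/$1/operstate"
}

# Print "<device> port <n>: <state>" for every InfiniBand port found in sysfs.
ib_port_states() {
    root="${SYSFS_ROOT:-/sys}/class/infiniband"
    for f in "$root"/*/ports/*/state; do
        [ -r "$f" ] || continue          # skip the unexpanded glob if no device exists
        dev=${f#"$root"/}; dev=${dev%%/*}
        port=${f%/state}; port=${port##*/}
        printf '%s port %s: %s\n' "$dev" "$port" "$(cat "$f")"
    done
}
```

For this host the call would be "iface_operstate ibp225s0f0" followed by "ib_port_states".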
# lsmod | grep ib
ko2iblnd 233472 0
lnet 552960 3 ko2iblnd,obdclass
libcfs 487424 3 lnet,ko2iblnd,obdclass
ib_umad 28672 0
ib_ipoib 110592 0
rdma_cm 61440 2 ko2iblnd,rdma_ucm
ib_cm 57344 2 rdma_cm,ib_ipoib
mlx5_ib 307200 0
mlx_compat 65536 1 ko2iblnd
ib_uverbs 126976 2 rdma_ucm,mlx5_ib
ib_core 311296 9 rdma_cm,ib_ipoib,ko2iblnd,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx5_core 933888 1 mlx5_ib
libcrc32c 16384 4 nf_conntrack,nf_nat,btrfs,raid456
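To see at a glance which modules hold a reference on a given module in a listing like the one above (for example, that ko2iblnd is the only user of mlx_compat), the "Used by" column can be extracted. A minimal sketch, where module_users is an illustrative helper name and the input is assumed to be unwrapped lsmod output:

```shell
# Read lsmod output on stdin and print the "Used by" list for module $1.
# Prints an empty line if the module is loaded but has no users.
module_users() {
    awk -v m="$1" '$1 == m { print $4 }'
}
```

Typical use would be "lsmod | module_users mlx_compat".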
I also tested ping, ibping and rping, and all passed. I have no clue
what is happening, as this server was previously able to connect to Lustre.
Regards,
Bill Yau
University of Hong Kong