[lustre-discuss] Lustre client LNET problem from a novice

Yau Hing Tuen, Bill billyau_hpc at hku.hk
Thu Apr 29 00:23:51 PDT 2021


Dear All,

     I need some advice on the following situation: one of my servers
(a Lustre client only) can no longer connect to the Lustre server.
I suspect a problem with the LNET configuration, but I am too new to
Lustre and have no clue how to troubleshoot it.

Kernel version: Linux 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Lustre version: 2.14.0 (pulled from git)
Lustre debs built with GCC 9.3.0 on the server.

modprobe does not complete cleanly because the static LNET configuration does not work:
# modprobe -v lustre
insmod /lib/modules/5.4.0-65-generic/updates/kernel/net/libcfs.ko
insmod /lib/modules/5.4.0-65-generic/updates/kernel/net/lnet.ko networks="o2ib0(ibp225s0f0)"
insmod /lib/modules/5.4.0-65-generic/updates/kernel/fs/obdclass.ko
insmod /lib/modules/5.4.0-65-generic/updates/kernel/fs/ptlrpc.ko
modprobe: ERROR: could not insert 'lustre': Network is down

     So I resorted to trying dynamic LNET configuration:

# lctl net up
LNET configure error 100: Network is down

# lnetctl net show
net:
     - net type: lo
       local NI(s):
         - nid: 0@lo
           status: up

# lnetctl net add --net o2ib0 --if ibp225s0f0
add:
     - net:
           errno: -100
           descr: "cannot add network: Network is down"

    These error messages appeared in dmesg after the above "lnetctl net
add" command:
[265979.237735] LNet: 3893180:0:(config.c:1564:lnet_inet_enumerate()) lnet: Ignoring interface enxeeeb676d0232: it's down
[265979.237738] LNet: 3893180:0:(config.c:1564:lnet_inet_enumerate()) Skipped 9 previous similar messages
[265979.238395] LNetError: 3893180:0:(o2iblnd.c:2655:kiblnd_hdev_get_attr()) Invalid mr size: 0x1000000
[265979.267372] LNetError: 3893180:0:(o2iblnd.c:2869:kiblnd_dev_failover()) Can't get device attributes: -22
[265979.298129] LNetError: 3893180:0:(o2iblnd.c:3353:kiblnd_startup()) ko2iblnd: Can't initialize device: rc = -22
[265980.353643] LNetError: 105-4: Error -100 starting up LNI o2ib
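
To separate the actual failures from the informational lines in a kernel log like the one above, a small shell filter may help. This is only a sketch: lnet_errors is a hypothetical helper name (not part of Lustre), and it assumes each kernel message sits on one line and starts with a bracketed timestamp.

```shell
# Hypothetical helper (not a Lustre tool): read kernel log text on stdin,
# keep only the LNetError lines, and strip the leading [timestamp].
# Assumes one message per line, as printed by dmesg itself.
lnet_errors() {
    grep 'LNetError:' | sed -E 's/^\[ *[0-9]+\.[0-9]+\] *//'
}

# Usage on a live system (may need root to read the kernel log):
#   dmesg | lnet_errors
```

Reading the filtered lines top to bottom shows the order in which the failures were logged, which makes it easier to see which error came first.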

Initial Diagnosis:
# ip link show ibp225s0f0
41: ibp225s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 256
     link/infiniband 00:00:11:08:fe:80:00:00:00:00:00:00:0c:42:a1:03:00:79:99:1c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff

# ip address show ibp225s0f0
41: ibp225s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
     link/infiniband 00:00:11:08:fe:80:00:00:00:00:00:00:0c:42:a1:03:00:79:99:1c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
     inet 10.10.10.3/16 brd 10.10.255.255 scope global ibp225s0f0
        valid_lft forever preferred_lft forever
     inet6 fe80::e42:a103:79:991c/64 scope link
        valid_lft forever preferred_lft forever

# ifconfig ibp225s0f0
ibp225s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
         inet 10.10.10.3  netmask 255.255.0.0  broadcast 10.10.255.255
         inet6 fe80::e42:a103:79:991c  prefixlen 64  scopeid 0x20<link>
         unspec 00-00-11-08-FE-80-00-00-00-00-00-00-00-00-00-00  txqueuelen 256  (UNSPEC)
         RX packets 14363998  bytes 1440476592 (1.4 GB)
         RX errors 0  dropped 0  overruns 0  frame 0
         TX packets 88  bytes 6648 (6.6 KB)
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

# lsmod | grep ib
ko2iblnd              233472  0
lnet                  552960  3 ko2iblnd,obdclass
libcfs                487424  3 lnet,ko2iblnd,obdclass
ib_umad                28672  0
ib_ipoib              110592  0
rdma_cm                61440  2 ko2iblnd,rdma_ucm
ib_cm                  57344  2 rdma_cm,ib_ipoib
mlx5_ib               307200  0
mlx_compat             65536  1 ko2iblnd
ib_uverbs             126976  2 rdma_ucm,mlx5_ib
ib_core               311296  9 rdma_cm,ib_ipoib,ko2iblnd,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx5_core             933888  1 mlx5_ib
libcrc32c              16384  4 nf_conntrack,nf_nat,btrfs,raid456

     I also tested ping, ibping and rping, and all passed. I have no clue
what is happening, as this server was previously able to connect to Lustre.

   Regards,
Bill Yau
University of Hong Kong

