[lustre-discuss] lustre-discuss Digest, Vol 181, Issue 22
Sid Young
sid.young at gmail.com
Thu Apr 29 15:04:25 PDT 2021
Three things:
1. Can you send your /etc/lnet.conf file?
2. Can you also send /etc/modprobe.d/lnet.conf?
3. Does a "systemctl restart lnet" produce an error?
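
If it helps, something like the following would gather all of that in one go. It is only a rough sketch: show_conf and restart_lnet are hypothetical helper names made up for this example, and it assumes a systemd host where the unit really is called lnet and where you can run the restart as root.

```shell
#!/bin/sh
# Hypothetical helper: print each file given as an argument under a
# header, or note that it is missing or unreadable.
show_conf() {
    for f in "$@"; do
        echo "== $f =="
        if [ -r "$f" ]; then
            cat "$f"
        else
            echo "(missing or unreadable)"
        fi
    done
}

# Hypothetical helper: restart the lnet unit and, if the restart fails,
# print the last 50 journal lines for that unit.
# Assumes systemd and root privileges.
restart_lnet() {
    if command -v systemctl >/dev/null 2>&1; then
        systemctl restart lnet || journalctl -u lnet -n 50 --no-pager
    else
        echo "systemctl not found on this host"
    fi
}

# Typical use, run as root on the affected client:
#   show_conf /etc/lnet.conf /etc/modprobe.d/lnet.conf
#   restart_lnet
```

Pasting the output of both calls into a reply would cover the three points above.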
Sid
On Fri, Apr 30, 2021 at 6:27 AM <lustre-discuss-request at lists.lustre.org>
wrote:
> ---------- Forwarded message ----------
> From: "Yau Hing Tuen, Bill" <billyau_hpc at hku.hk>
> To: lustre-discuss at lists.lustre.org
> Cc:
> Bcc:
> Date: Thu, 29 Apr 2021 15:23:51 +0800
> Subject: [lustre-discuss] Lustre client LNET problem from a novice
> Dear All,
>
> Need some advice on the following situation: one of my servers
> (Lustre client only) could no longer connect to the Lustre server.
> I suspect a problem with the LNet configuration, but I am too new to
> Lustre and have no further clue how to troubleshoot it.
>
> Kernel version: Linux 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18
> 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> Lustre version: 2.14.0 (pulled from git)
> Lustre debs built with GCC 9.3.0 on the server.
>
> Modprobe does not complete cleanly, as the static LNet configuration
> does not work:
> # modprobe -v lustre
> insmod /lib/modules/5.4.0-65-generic/updates/kernel/net/libcfs.ko
> insmod /lib/modules/5.4.0-65-generic/updates/kernel/net/lnet.ko networks="o2ib0(ibp225s0f0)"
> insmod /lib/modules/5.4.0-65-generic/updates/kernel/fs/obdclass.ko
> insmod /lib/modules/5.4.0-65-generic/updates/kernel/fs/ptlrpc.ko
> modprobe: ERROR: could not insert 'lustre': Network is down
>
> So I resorted to trying dynamic LNet configuration:
>
> # lctl net up
> LNET configure error 100: Network is down
>
> # lnetctl net show
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0@lo
>           status: up
>
> # lnetctl net add --net o2ib0 --if ibp225s0f0
> add:
>     - net:
>           errno: -100
>           descr: "cannot add network: Network is down"
>
> The following error messages appear in dmesg after the above
> "lnetctl net add" command:
> [265979.237735] LNet: 3893180:0:(config.c:1564:lnet_inet_enumerate())
> lnet: Ignoring interface enxeeeb676d0232: it's down
> [265979.237738] LNet: 3893180:0:(config.c:1564:lnet_inet_enumerate())
> Skipped 9 previous similar messages
> [265979.238395] LNetError:
> 3893180:0:(o2iblnd.c:2655:kiblnd_hdev_get_attr()) Invalid mr size:
> 0x1000000
> [265979.267372] LNetError:
> 3893180:0:(o2iblnd.c:2869:kiblnd_dev_failover()) Can't get device
> attributes: -22
> [265979.298129] LNetError: 3893180:0:(o2iblnd.c:3353:kiblnd_startup())
> ko2iblnd: Can't initialize device: rc = -22
> [265980.353643] LNetError: 105-4: Error -100 starting up LNI o2ib
>
> Initial Diagnosis:
> # ip link show ibp225s0f0
> 41: ibp225s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq
> state UP mode DEFAULT group default qlen 256
> link/infiniband
> 00:00:11:08:fe:80:00:00:00:00:00:00:0c:42:a1:03:00:79:99:1c brd
> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>
> # ip address show ibp225s0f0
> 41: ibp225s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq
> state UP group default qlen 256
> link/infiniband
> 00:00:11:08:fe:80:00:00:00:00:00:00:0c:42:a1:03:00:79:99:1c brd
> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
> inet 10.10.10.3/16 brd 10.10.255.255 scope global ibp225s0f0
> valid_lft forever preferred_lft forever
> inet6 fe80::e42:a103:79:991c/64 scope link
> valid_lft forever preferred_lft forever
>
> # ifconfig ibp225s0f0
> ibp225s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
> inet 10.10.10.3 netmask 255.255.0.0 broadcast 10.10.255.255
> inet6 fe80::e42:a103:79:991c prefixlen 64 scopeid 0x20<link>
> unspec 00-00-11-08-FE-80-00-00-00-00-00-00-00-00-00-00
> txqueuelen 256 (UNSPEC)
> RX packets 14363998 bytes 1440476592 (1.4 GB)
> RX errors 0 dropped 0 overruns 0 frame 0
> TX packets 88 bytes 6648 (6.6 KB)
> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
>
> # lsmod | grep ib
> ko2iblnd 233472 0
> lnet 552960 3 ko2iblnd,obdclass
> libcfs 487424 3 lnet,ko2iblnd,obdclass
> ib_umad 28672 0
> ib_ipoib 110592 0
> rdma_cm 61440 2 ko2iblnd,rdma_ucm
> ib_cm 57344 2 rdma_cm,ib_ipoib
> mlx5_ib 307200 0
> mlx_compat 65536 1 ko2iblnd
> ib_uverbs 126976 2 rdma_ucm,mlx5_ib
> ib_core 311296 9
> rdma_cm,ib_ipoib,ko2iblnd,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
> mlx5_core 933888 1 mlx5_ib
> libcrc32c 16384 4 nf_conntrack,nf_nat,btrfs,raid456
>
> I also tested ping, ibping and rping, and all passed. I have no clue
> what is happening, as the server was previously able to connect to Lustre.
>
> Regards,
> Bill Yau
> University of Hong Kong
>