[lustre-discuss] lustre-discuss Digest, Vol 181, Issue 22

Sid Young sid.young at gmail.com
Thu Apr 29 15:04:25 PDT 2021


Three things:

Can you send your /etc/lnet.conf file?
Can you also send /etc/modprobe.d/lnet.conf?
And does "systemctl restart lnet" produce an error?
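
Something along these lines would gather all three in one go. It is only a rough sketch: show_conf and collect_lnet_info are made-up helper names, and it assumes LNet is managed through the lnet systemd unit on your system.

```shell
#!/bin/sh
# show_conf and collect_lnet_info are illustrative helper names,
# not Lustre tools.

# Print a file under a header, or say that it is missing.
show_conf() {
    if [ -r "$1" ]; then
        printf '=== %s ===\n' "$1"
        cat "$1"
    else
        printf '=== %s: missing or unreadable ===\n' "$1"
    fi
}

# Gather both config files, then restart LNet and report the exit status.
collect_lnet_info() {
    show_conf /etc/lnet.conf
    show_conf /etc/modprobe.d/lnet.conf
    systemctl restart lnet
    rc=$?
    printf 'systemctl restart lnet exited with status %s\n' "$rc"
}
```

Running "collect_lnet_info 2>&1 | tee lnet-info.txt" as root should leave everything in one file that can be pasted into a reply.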


Sid

On Fri, Apr 30, 2021 at 6:27 AM <lustre-discuss-request at lists.lustre.org>
wrote:

> Today's Topics:
>
>    1. Lustre client LNET problem from a novice (Yau Hing Tuen, Bill)
>
>
>
> ---------- Forwarded message ----------
> From: "Yau Hing Tuen, Bill" <billyau_hpc at hku.hk>
> To: lustre-discuss at lists.lustre.org
> Date: Thu, 29 Apr 2021 15:23:51 +0800
> Subject: [lustre-discuss] Lustre client LNET problem from a novice
> Dear All,
>
>      I need some advice on the following situation: one of my servers
> (a Lustre client only) can no longer connect to the Lustre server.
> I suspect a problem with the LNet configuration, but I am too new to
> Lustre and have no further clue how to troubleshoot it.
>
> Kernel version: Linux 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> Lustre version: 2.14.0 (pulled from git)
> Lustre debs built with GCC 9.3.0 on the server.
>
> modprobe does not complete cleanly because the static LNet configuration does not work:
> # modprobe -v lustre
> insmod /lib/modules/5.4.0-65-generic/updates/kernel/net/libcfs.ko
> insmod /lib/modules/5.4.0-65-generic/updates/kernel/net/lnet.ko networks="o2ib0(ibp225s0f0)"
> insmod /lib/modules/5.4.0-65-generic/updates/kernel/fs/obdclass.ko
> insmod /lib/modules/5.4.0-65-generic/updates/kernel/fs/ptlrpc.ko
> modprobe: ERROR: could not insert 'lustre': Network is down
>
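
The networks= parameter on the lnet.ko insmod line above is the kind of value that normally comes from an "options lnet" line in /etc/modprobe.d/lnet.conf. As a sketch of what that line would look like (an assumption about your file, not its confirmed contents; lnet_modprobe_line is a made-up helper name):

```shell
# lnet_modprobe_line is a hypothetical helper: it prints the modprobe
# options line for a given LNet network name and interface.
lnet_modprobe_line() {
    printf 'options lnet networks="%s(%s)"\n' "$1" "$2"
}

# Network and interface names taken from the insmod output above.
lnet_modprobe_line o2ib0 ibp225s0f0
```

Comparing that with the actual file would show whether the static configuration still names the right interface.
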
>      So I resorted to trying dynamic LNet configuration:
>
> # lctl net up
> LNET configure error 100: Network is down
>
> # lnetctl net show
> net:
>      - net type: lo
>        local NI(s):
>          - nid: 0 at lo
>            status: up
>
> # lnetctl net add --net o2ib0 --if ibp225s0f0
> add:
>      - net:
>            errno: -100
>            descr: "cannot add network: Network is down"
>
>     The following error messages appear in dmesg after the above
> "lnetctl net add" command:
> [265979.237735] LNet: 3893180:0:(config.c:1564:lnet_inet_enumerate()) lnet: Ignoring interface enxeeeb676d0232: it's down
> [265979.237738] LNet: 3893180:0:(config.c:1564:lnet_inet_enumerate()) Skipped 9 previous similar messages
> [265979.238395] LNetError: 3893180:0:(o2iblnd.c:2655:kiblnd_hdev_get_attr()) Invalid mr size: 0x1000000
> [265979.267372] LNetError: 3893180:0:(o2iblnd.c:2869:kiblnd_dev_failover()) Can't get device attributes: -22
> [265979.298129] LNetError: 3893180:0:(o2iblnd.c:3353:kiblnd_startup()) ko2iblnd: Can't initialize device: rc = -22
> [265980.353643] LNetError: 105-4: Error -100 starting up LNI o2ib
>
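
On Linux, -22 is EINVAL and -100 is ENETDOWN. Read that way, the "Network is down" reported by lnetctl looks like a knock-on effect of ko2iblnd failing to initialise the device (the "Invalid mr size" line), rather than the link itself being down. That is my reading of the log, not a confirmed diagnosis. A tiny lookup for the two codes, with errno_name being a made-up helper name:

```shell
# errno_name is a hypothetical helper: it maps the two Linux errno values
# seen in the dmesg output above to their symbolic names.
errno_name() {
    case "${1#-}" in
        22)  echo "EINVAL (Invalid argument)" ;;
        100) echo "ENETDOWN (Network is down)" ;;
        *)   echo "unknown (check errno.h)" ;;
    esac
}

errno_name -22
errno_name -100
```
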
> Initial Diagnosis:
> # ip link show ibp225s0f0
> 41: ibp225s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 256
>      link/infiniband 00:00:11:08:fe:80:00:00:00:00:00:00:0c:42:a1:03:00:79:99:1c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>
> # ip address show ibp225s0f0
> 41: ibp225s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
>      link/infiniband 00:00:11:08:fe:80:00:00:00:00:00:00:0c:42:a1:03:00:79:99:1c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>      inet 10.10.10.3/16 brd 10.10.255.255 scope global ibp225s0f0
>         valid_lft forever preferred_lft forever
>      inet6 fe80::e42:a103:79:991c/64 scope link
>         valid_lft forever preferred_lft forever
>
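
Since the interface is up with an IPv4 address, the NID that LNet should end up with on this client is that address followed by @o2ib. A small sketch to derive it from the "ip address show" output, so it can be compared with "lnetctl net show" once the network comes up (nid_from_ip_output is a made-up helper name):

```shell
# nid_from_ip_output is a hypothetical helper: it reads `ip address show`
# output on stdin and prints the first IPv4 address as an o2ib NID.
nid_from_ip_output() {
    awk '$1 == "inet" { sub(/\/.*/, "", $2); print $2 "@o2ib"; exit }'
}
```

Usage would be "ip address show ibp225s0f0 | nid_from_ip_output".
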
> # ifconfig ibp225s0f0
> ibp225s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
>          inet 10.10.10.3  netmask 255.255.0.0  broadcast 10.10.255.255
>          inet6 fe80::e42:a103:79:991c  prefixlen 64  scopeid 0x20<link>
>          unspec 00-00-11-08-FE-80-00-00-00-00-00-00-00-00-00-00  txqueuelen 256  (UNSPEC)
>          RX packets 14363998  bytes 1440476592 (1.4 GB)
>          RX errors 0  dropped 0  overruns 0  frame 0
>          TX packets 88  bytes 6648 (6.6 KB)
>          TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
> # lsmod | grep ib
> ko2iblnd              233472  0
> lnet                  552960  3 ko2iblnd,obdclass
> libcfs                487424  3 lnet,ko2iblnd,obdclass
> ib_umad                28672  0
> ib_ipoib              110592  0
> rdma_cm                61440  2 ko2iblnd,rdma_ucm
> ib_cm                  57344  2 rdma_cm,ib_ipoib
> mlx5_ib               307200  0
> mlx_compat             65536  1 ko2iblnd
> ib_uverbs             126976  2 rdma_ucm,mlx5_ib
> ib_core               311296  9 rdma_cm,ib_ipoib,ko2iblnd,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
> mlx5_core             933888  1 mlx5_ib
> libcrc32c              16384  4 nf_conntrack,nf_nat,btrfs,raid456
>
>      I also tested ping, ibping and rping; all passed. I have no clue
> what's happening, as the server was previously able to connect to Lustre.
>
>    Regards,
> Bill Yau
> University of Hong Kong
>