[lustre-discuss] Lustre client LNET problem from a novice

Yau Hing Tuen, Bill billyau_hpc at hku.hk
Sun May 2 20:20:25 PDT 2021


Dear Sid,

Thanks a lot for pointing me in the right direction:

/etc/lnet.conf does not exist, and the "systemctl restart" produces an 
error accordingly:
descr: Failed to open file: /etc/lnet.conf

I do not remember creating, editing, or removing that file. Is it part of a 
default installation?

      I think /etc/modprobe.d/lnet.conf is not important, as I was 
attempting dynamic lnet (my plan was to find a working setting with 
dynamic lnet). The first line was commented out and the second line 
was added when I started to get confused:

# cat /etc/modprobe.d/lnet.conf
#options lnet networks="o2ib0(ibp225s0f0)" routes="tcp 10.10.200.[4.5]@o2ib0"
options lnet networks="o2ib0(ibp225s0f0)"
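
To double-check which "options" lines in that file are actually active, 
a small sketch like the one below could be used. It is only an 
illustration: the function name classify_options and the SAMPLE text 
(the file above, with the mail-wrapped first line joined) are my own 
assumptions, not part of any Lustre tool.

```python
# Sketch: split the "options ..." lines of a modprobe.d file into
# active lines and commented-out lines. Hypothetical helper, not a Lustre tool.

def classify_options(text):
    """Return (active, disabled) lists of 'options' lines found in text."""
    active, disabled = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            # Strip the comment marker and see whether an option is hidden there.
            body = line.lstrip("#").strip()
            if body.startswith("options "):
                disabled.append(body)
        elif line.startswith("options "):
            active.append(line)
    return active, disabled


# Sample content mirroring the file shown above (assumed to be one line each).
SAMPLE = '''\
#options lnet networks="o2ib0(ibp225s0f0)" routes="tcp 10.10.200.[4.5]@o2ib0"
options lnet networks="o2ib0(ibp225s0f0)"
'''

if __name__ == "__main__":
    active, disabled = classify_options(SAMPLE)
    print("active:  ", active)
    print("disabled:", disabled)
```

On a real system the text would be read from /etc/modprobe.d/lnet.conf 
instead of SAMPLE.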

     Should I now create an /etc/lnet.conf and retry?

   Regards,
Bill Yau
University of Hong Kong

On 30/4/2021 2:45 pm, lustre-discuss-request at lists.lustre.org wrote:
> Send lustre-discuss mailing list submissions to
> 	lustre-discuss at lists.lustre.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> or, via email, send a message with subject or body 'help' to
> 	lustre-discuss-request at lists.lustre.org
>
> You can reach the person managing the list at
> 	lustre-discuss-owner at lists.lustre.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of lustre-discuss digest..."
>
>
> Today's Topics:
>
>     1. Re: lustre-discuss Digest, Vol 181, Issue 22 (Sid Young)
>     2. Re: [EXTERNAL] Re:  OST mount issue (Mohr, Rick)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 30 Apr 2021 08:04:25 +1000
> From: Sid Young <sid.young at gmail.com>
> To: lustre-discuss <lustre-discuss at lists.lustre.org>
> Subject: Re: [lustre-discuss] lustre-discuss Digest, Vol 181, Issue 22
> Message-ID:
> 	<CAEZ+gOwhCdaWGk=w3tSnh4p8uBVq00GE5OGG2UTumgQHNQ9OTg at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> 3 things....
>
> Can you send your /etc/lnet.conf file
> Can you also send /etc/modprobe.d/lnet.conf
> and does a systemctl restart lnet produce an error?
>
>
> Sid
>
> On Fri, Apr 30, 2021 at 6:27 AM <lustre-discuss-request at lists.lustre.org>
> wrote:
>
>> Today's Topics:
>>
>>     1. Lustre client LNET problem from a novice (Yau Hing Tuen, Bill)
>>
>>
>>
>> ---------- Forwarded message ----------
>> From: "Yau Hing Tuen, Bill" <billyau_hpc at hku.hk>
>> To: lustre-discuss at lists.lustre.org
>> Cc:
>> Bcc:
>> Date: Thu, 29 Apr 2021 15:23:51 +0800
>> Subject: [lustre-discuss] Lustre client LNET problem from a novice
>> Dear All,
>>
>>       I need some advice on the following situation: one of my servers
>> (Lustre client only) can no longer connect to the Lustre server. I
>> suspect a problem with the LNET configuration, but I am too new to
>> Lustre and have no clue how to troubleshoot it.
>>
>> Kernel version: Linux 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18
>> 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>> Lustre version: 2.14.0 (pulled from git)
>> Lustre debs built with GCC 9.3.0 on the server.
>>
>> Modprobe does not complete cleanly because the static lnet
>> configuration does not work:
>> # modprobe -v lustre
>> insmod /lib/modules/5.4.0-65-generic/updates/kernel/net/libcfs.ko
>> insmod /lib/modules/5.4.0-65-generic/updates/kernel/net/lnet.ko
>> networks="o2ib0(ibp225s0f0)"
>> insmod /lib/modules/5.4.0-65-generic/updates/kernel/fs/obdclass.ko
>> insmod /lib/modules/5.4.0-65-generic/updates/kernel/fs/ptlrpc.ko
>> modprobe: ERROR: could not insert 'lustre': Network is down
>>
>>       So I resorted to trying dynamic lnet configuration:
>>
>> # lctl net up
>> LNET configure error 100: Network is down
>>
>> # lnetctl net show
>> net:
>>       - net type: lo
>>         local NI(s):
>>           - nid: 0 at lo
>>             status: up
>>
>> # lnetctl net add --net o2ib0 --if ibp225s0f0
>> add:
>>       - net:
>>             errno: -100
>>             descr: "cannot add network: Network is down"
>>
>>      dmesg shows these error messages after the above "lnetctl net
>> add" command:
>> [265979.237735] LNet: 3893180:0:(config.c:1564:lnet_inet_enumerate())
>> lnet: Ignoring interface enxeeeb676d0232: it's down
>> [265979.237738] LNet: 3893180:0:(config.c:1564:lnet_inet_enumerate())
>> Skipped 9 previous similar messages
>> [265979.238395] LNetError:
>> 3893180:0:(o2iblnd.c:2655:kiblnd_hdev_get_attr()) Invalid mr size:
>> 0x1000000
>> [265979.267372] LNetError:
>> 3893180:0:(o2iblnd.c:2869:kiblnd_dev_failover()) Can't get device
>> attributes: -22
>> [265979.298129] LNetError: 3893180:0:(o2iblnd.c:3353:kiblnd_startup())
>> ko2iblnd: Can't initialize device: rc = -22
>> [265980.353643] LNetError: 105-4: Error -100 starting up LNI o2ib
>>
>> Initial Diagnosis:
>> # ip link show ibp225s0f0
>> 41: ibp225s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq
>> state UP mode DEFAULT group default qlen 256
>>       link/infiniband
>> 00:00:11:08:fe:80:00:00:00:00:00:00:0c:42:a1:03:00:79:99:1c brd
>> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>
>> # ip address show ibp225s0f0
>> 41: ibp225s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq
>> state UP group default qlen 256
>>       link/infiniband
>> 00:00:11:08:fe:80:00:00:00:00:00:00:0c:42:a1:03:00:79:99:1c brd
>> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>       inet 10.10.10.3/16 brd 10.10.255.255 scope global ibp225s0f0
>>          valid_lft forever preferred_lft forever
>>       inet6 fe80::e42:a103:79:991c/64 scope link
>>          valid_lft forever preferred_lft forever
>>
>> # ifconfig ibp225s0f0
>> ibp225s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
>>           inet 10.10.10.3  netmask 255.255.0.0  broadcast 10.10.255.255
>>           inet6 fe80::e42:a103:79:991c  prefixlen 64  scopeid 0x20<link>
>>           unspec 00-00-11-08-FE-80-00-00-00-00-00-00-00-00-00-00
>> txqueuelen 256  (UNSPEC)
>>           RX packets 14363998  bytes 1440476592 (1.4 GB)
>>           RX errors 0  dropped 0  overruns 0  frame 0
>>           TX packets 88  bytes 6648 (6.6 KB)
>>           TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>
>> # lsmod | grep ib
>> ko2iblnd              233472  0
>> lnet                  552960  3 ko2iblnd,obdclass
>> libcfs                487424  3 lnet,ko2iblnd,obdclass
>> ib_umad                28672  0
>> ib_ipoib              110592  0
>> rdma_cm                61440  2 ko2iblnd,rdma_ucm
>> ib_cm                  57344  2 rdma_cm,ib_ipoib
>> mlx5_ib               307200  0
>> mlx_compat             65536  1 ko2iblnd
>> ib_uverbs             126976  2 rdma_ucm,mlx5_ib
>> ib_core               311296  9
>> rdma_cm,ib_ipoib,ko2iblnd,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
>> mlx5_core             933888  1 mlx5_ib
>> libcrc32c              16384  4 nf_conntrack,nf_nat,btrfs,raid456
>>
>>       I also tested ping, ibping and rping; all passed. I have no clue
>> what is happening, as the server was previously able to connect to
>> Lustre.
>>
>>     Regards,
>> Bill Yau
>> University of Hong Kong
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>
> ------------------------------
>
> Message: 2
> Date: Fri, 30 Apr 2021 06:45:14 +0000
> From: "Mohr, Rick" <mohrrf at ornl.gov>
> To: Steve Thompson <smt at vgersoft.com>
> Cc: "lustre-discuss at lists.lustre.org"
> 	<lustre-discuss at lists.lustre.org>
> Subject: Re: [lustre-discuss] [EXTERNAL] Re:  OST mount issue
> Message-ID: <D9E6F2CF-0269-48C3-98CB-A75C68F3DFCB at ornl.gov>
> Content-Type: text/plain; charset="utf-8"
>
> One thing you could do would be to verify that all the kernel modules are identical.  You can try running 'lsmod' to check that the servers have loaded the same set of modules, run 'modinfo' to verify the path to the module that was loaded, and then compute a checksum of the kernel module to compare.
>
> -Rick
>
> On 4/26/21, 12:27 PM, "lustre-discuss on behalf of Steve Thompson" <lustre-discuss-bounces at lists.lustre.org on behalf of smt at vgersoft.com> wrote:
>
>      Yes, I believe that something must be different; I just cannot find it. I
>      now have six OST systems. All were installed the same way; two work fine
>      and four do not. The rpm list:
>
>      # rpm -qa | grep lustre
>      lustre-osd-zfs-mount-2.12.6-1.el7.x86_64
>      lustre-2.12.6-1.el7.x86_64
>      lustre-zfs-dkms-2.12.6-1.el7.noarch
>
>      # the mount command example:
>      # grep lustre /etc/fstab
>      fs1/ost1        /mnt/fs1/ost1   lustre defaults,_netdev_  0 0
>
>      and all are the same on all six systems. I currently have ZFS 0.8.5
>      installed, but I have tried with ZFS 0.7.13, and the results are
>      the same.
>
>      Steve
>      --
>      ----------------------------------------------------------------------------
>      Steve Thompson                 E-mail:      smt AT vgersoft DOT com
>      Voyager Software LLC           Web:         http://www DOT vgersoft DOT com
>      3901 N Charles St              VSW Support: support AT vgersoft DOT com
>      Baltimore MD 21218
>         "186,282 miles per second: it's not just a good idea, it's the law"
>      ----------------------------------------------------------------------------
>
>
> ------------------------------
>
> End of lustre-discuss Digest, Vol 181, Issue 23
> ***********************************************
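
For reference, the return codes in my original report quoted above 
(-22 from kiblnd_startup() and -100 when starting the o2ib LNI) look 
like ordinary negative Linux errno values. In the dmesg output the -22 
from o2iblnd is logged before the -100, so "Network is down" may just 
be how the earlier device initialization failure is reported, but that 
is only my reading. A minimal sketch to decode them follows; the table 
is hard-coded by me with the values I believe Linux uses on x86_64, and 
the names LINUX_ERRNO and describe are made up for illustration.

```python
# Sketch: decode the negative return codes seen in the LNet messages.
# Assumption: these are standard Linux errno numbers (values as on x86_64).
LINUX_ERRNO = {
    22: ("EINVAL", "Invalid argument"),
    100: ("ENETDOWN", "Network is down"),
}


def describe(rc):
    """Map a kernel-style return code such as -22 to 'rc: NAME (message)'."""
    name, message = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return f"{rc}: {name} ({message})"


if __name__ == "__main__":
    # The two codes that appear in the quoted dmesg and lnetctl output.
    for rc in (-22, -100):
        print(describe(rc))
```

A code that is not in the table is reported as UNKNOWN rather than 
guessed.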

-- 
Bill Yau
Research Computing
Information Technology Services
The University of Hong Kong

E-mail: billyau_hpc at hku.hk
Tel: (+852) 3917 5185
