[lustre-discuss] Disabling multi-rail dynamic discovery

Riccardo Veraldi riccardo.veraldi at cnaf.infn.it
Mon Sep 13 15:25:07 PDT 2021


I assume you removed /etc/modprobe.d/lustre.conf completely.

I only have the lnet service enabled at startup; I do not start any 
lustre3 service. Also, I am running Lustre 2.12.0 (not 2.14, sorry), so 
something might be different.
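
In my case that just means enabling the lnet systemd unit that ships 
with the Lustre packages, along these lines (if I remember correctly it 
basically runs "lnetctl lnet configure" and then imports /etc/lnet.conf 
at boot):

systemctl enable lnet
systemctl start lnet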

Did you start over with a clean configuration?

Did you reboot your system to make sure it picks up the new config? At 
least for me, the lnet module sometimes does not unload correctly.
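
If a reboot is not practical, something along these lines usually gets 
LNet torn down cleanly before the new configuration is loaded (a rough 
sketch; lustre_rmmod is the cleanup helper shipped with the Lustre 
packages):

umount -a -t lustre          # unmount all lustre filesystems first
lnetctl lnet unconfigure     # tear down the LNet configuration
lustre_rmmod                 # unload the lustre/lnet kernel modules
lsmod | grep lnet            # should print nothing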

Also, I should mention that in my setup I disabled discovery on the 
OSSes as well, not only on the client side.

Generally it is not advisable to disable Multi-Rail unless you have 
backward-compatibility issues with older Lustre peers.

But note that disabling discovery will also disable Multi-Rail.

You can try with

lnetctl set discovery 0

as you already did,

then you do

lnetctl -b export > /etc/lnet.conf

Check that discovery is set to 0 in the file; if it is not, edit it and 
set it to 0.
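
After that, the exported file should end up with a global section like 
this (the only line that really matters here is discovery):

global:
    discovery: 0

On 2.12 you should also be able to check the live value with 
"lnetctl global show", if I remember correctly.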

Reboot and see if things change.

In any case, if you did not define any tcp interface in lnet.conf, you 
should not see any tcp peers.
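
A quick way to double-check on a running client is to list what LNet 
actually knows about; only o2ib NIDs should show up:

lnetctl net show
lnetctl peer show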


On 9/13/21 2:59 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
Inc.] wrote:
>
> Thanks Rick.  I removed my lnet modprobe options and adapted my 
> lnet.conf file to:
>
> # cat /etc/lnet.conf
> ip2nets:
>  - net-spec: o2ib1
>    interfaces:
>       0: ib0
> global:
>     discovery: 0
> #
>
> Now "lnetctl export" doesn't have any reference to NIDs on the other 
> networks, so that's good.  However, I'm still seeing some values that 
> concern me:
>
> # lnetctl export | grep -e Multi -e discover | sort -u
>
>     discovery: 1
>
> Multi-Rail: True
>
> #
>
> Any idea why discovery is still 1 if I'm setting it to 0 in the 
> lnet.conf file?  I'm a little concerned that with Multi-Rail still 
> True and discovery on, the client could still find its way back to the 
> TCP route.
>
> From: Riccardo Veraldi <riccardo.veraldi at cnaf.infn.it>
> Date: Monday, September 13, 2021 at 3:16 PM
> To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" 
> <darby.vicker-1 at nasa.gov>, "lustre-discuss at lists.lustre.org" 
> <lustre-discuss at lists.lustre.org>
> Subject: [EXTERNAL] Re: [lustre-discuss] Disabling multi-rail 
> dynamic discovery
>
> I would use the configuration in /etc/lnet.conf and no longer use the 
> older-style configuration in
>
> /etc/modprobe.d/lustre.conf
>
> For example, in my /etc/lnet.conf configuration I have:
>
> ip2nets:
>  - net-spec: o2ib
>    interfaces:
>       0: ib0
>  - net-spec: tcp
>    interfaces:
>       0: enp24s0f0
> global:
>     discovery: 0
>
> This is how I disabled auto discovery.
>
> Regarding ko2iblnd, you can just use /etc/modprobe.d/ko2iblnd.conf.
>
> Mine looks like this:
>
> options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 
> ntx=2048 map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 
> fmr_cache=1 conns_per_peer=4
>
> Hope it helps.
>
> Rick
>
> On 9/13/21 1:53 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
> Inc.] via lustre-discuss wrote:
>
>     Hello,
>
>     I would like to know how to turn off auto discovery of peers on a
>     client.  This seems like it should be straightforward, but we
>     can't get it to work.  Please fill me in on what I'm missing.
>
>     We recently upgraded our servers to 2.14.  Our servers are
>     multi-homed (1 tcp network and 2 separate IB networks) but we want
>     them to be single rail.  On one of our clusters we are still using
>     the 2.12.6 client and it uses one of the IB networks for lustre. 
>     The modprobe file from one of the client nodes:
>
>     # cat /etc/modprobe.d/lustre.conf
>
>     options lnet networks=o2ib1(ib0)
>
>     options ko2iblnd map_on_demand=32
>
>     #
>
>     The client does have a route to the TCP network.  This is intended
>     to allow jobs on the compute nodes to access license servers, not
>     for any serious I/O.  We recently discovered that due to some
>     instability in the IB fabric, the client was trying to fail over
>     to tcp:
>
>     # dmesg | grep Lustre
>
>     [  250.205912] Lustre: Lustre: Build Version: 2.12.6
>
>     [  255.886086] Lustre: Mounted scratch-client
>
>     [  287.247547] Lustre:
>     3472:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request
>     sent has timed out for sent delay: [sent 1630699139/real 0] 
>     req at ffff98deb9358480 x1709911947878336/t0(0)
>     o9->hpfs-fsl-OST0001-osc-ffff9880cfb80000 at 192.52.98.33@tcp:28/4
>     lens 224/224 e 0 to 1 dl 1630699145 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
>
>     [  739.832744] Lustre:
>     3526:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request
>     sent has timed out for sent delay: [sent 1630699591/real 0] 
>     req at ffff98deb935da00 x1709911947883520/t0(0)
>     o400->scratch-MDT0000-mdc-ffff98b0f1fc0800 at 192.52.98.31@tcp:12/10
>     lens 224/224 e 0 to 1 dl 1630699598 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
>
>     [  739.832755] Lustre:
>     3526:0:(client.c:2146:ptlrpc_expire_one_request()) Skipped 5
>     previous similar messages
>
>     [  739.832762] LustreError: 166-1: MGC10.150.100.30 at o2ib1:
>     Connection to MGS (at 192.52.98.30 at tcp) was lost; in progress
>     operations using this service will fail
>
>     [  739.832769] Lustre: hpfs-fsl-MDT0000-mdc-ffff9880cfb80000:
>     Connection to hpfs-fsl-MDT0000 (at 192.52.98.30 at tcp) was lost; in
>     progress operations using this service will wait for recovery to
>     complete
>
>     [ 1090.978619] LustreError: 167-0:
>     scratch-MDT0000-mdc-ffff98b0f1fc0800: This client was evicted by
>     scratch-MDT0000; in progress operations using this service will fail.
>
>     I'm pretty sure this is due to the auto discovery.  Again, from a
>     client:
>
>     # lnetctl export | grep -e Multi -e discover | sort -u
>
>          discovery: 0
>
>            Multi-Rail: True
>
>     #
>
>     We want to restrict Lustre to only the IB NID, but it's not clear
>     exactly how to do that.
>
>     Here is one attempt:
>
>
>     [root at r1i1n18 lnet]# service lustre3 stop
>
>     Shutting down lustre mounts
>
>     Lustre modules successfully unloaded
>
>     [root at r1i1n18 lnet]# lsmod | grep lnet
>
>     [root at r1i1n18 lnet]# cat /etc/lnet.conf
>
>     global:
>
>     discovery: 0
>
>     [root at r1i1n18 lnet]# service lustre3 start
>
>     Mounting /ephemeral... done.
>
>     Mounting /nobackup... done.
>
>     [root at r1i1n18 lnet]# lnetctl export | grep -e Multi -e discover |
>     sort -u
>
>     discovery: 1
>
>     Multi-Rail: True
>
>     [root at r1i1n18 lnet]#
>
>     And a similar attempt (same lnet.conf file), but trying to turn
>     off the discovery before doing the mounts:
>
>     [root at r1i1n18 lnet]# service lustre3 stop
>
>     Shutting down lustre mounts
>
>     Lustre modules successfully unloaded
>
>     [root at r1i1n18 lnet]# modprobe lnet
>
>     [root at r1i1n18 lnet]# lnetctl set discovery 0
>
>     [root at r1i1n18 lnet]# service lustre3 start
>
>     Mounting /ephemeral... done.
>
>     Mounting /nobackup... done.
>
>     [root at r1i1n18 lnet]# lnetctl export | grep -e Multi -e discover | sort -u
>
>          discovery: 0
>
>            Multi-Rail: True
>
>     [root at r1i1n18 lnet]#
>
>     If someone can point me in the right direction, I'd appreciate it.
>
>     Thanks,
>
>     Darby
>
>
>
>     _______________________________________________
>
>     lustre-discuss mailing list
>
>     lustre-discuss at lists.lustre.org
>
>     http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>