[lustre-discuss] Disabling multi-rail dynamic discovery

Riccardo Veraldi riccardo.veraldi at cnaf.infn.it
Mon Sep 13 14:15:46 PDT 2021


I would configure LNet in /etc/lnet.conf and stop using the older style
configuration in

/etc/modprobe.d/lustre.conf

For example, my /etc/lnet.conf contains:

ip2nets:
  - net-spec: o2ib
    interfaces:
       0: ib0
  - net-spec: tcp
    interfaces:
       0: enp24s0f0
global:
     discovery: 0

The global section is there because I disabled auto discovery.
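
To load the file by hand and confirm the setting took effect, something
like the sketch below may help. It assumes no Lustre mounts are active, that
it runs as root, and that "lnetctl export" prints a "discovery:" line as in
the output quoted further down. The function names are hypothetical helpers,
not part of Lustre.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of Lustre.

# Print the discovery value found in lnetctl-export-style YAML on stdin.
discovery_value() {
    awk '$1 == "discovery:" { print $2; exit }'
}

# Load LNet, import the YAML config, and report whether discovery is off.
# Assumption: run as root with no Lustre file systems mounted.
apply_lnet_conf() {
    conf="${1:-/etc/lnet.conf}"
    modprobe lnet || return 1
    lnetctl lnet configure || return 1
    lnetctl import "$conf" || return 1
    value="$(lnetctl export | discovery_value)"
    if [ "$value" = "0" ]; then
        echo "discovery disabled"
    else
        echo "discovery still enabled (value: ${value:-unknown})"
        return 1
    fi
}
```

Calling "apply_lnet_conf /etc/lnet.conf" before mounting should then show
whether the imported value stuck.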

Regarding ko2iblnd, you can just use /etc/modprobe.d/ko2iblnd.conf

Mine looks like this:

options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 ntx=2048 map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
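
These options are read when the module loads, so after a reload it may be
worth checking that the running values match the file. A rough sketch,
assuming the parameters are exposed under /sys/module/ko2iblnd/parameters
(the helper names are made up for illustration):

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of Lustre.

# Split "options ko2iblnd k=v ..." lines from stdin into one k=v per line.
ko2iblnd_options() {
    awk '$1 == "options" && $2 == "ko2iblnd" {
        for (i = 3; i <= NF; i++) print $i
    }'
}

# Compare each configured value with the value the running module reports.
# Assumption: parameters are readable in sysfs; the directory can be
# overridden with a second argument.
check_ko2iblnd() {
    conf="${1:-/etc/modprobe.d/ko2iblnd.conf}"
    sysdir="${2:-/sys/module/ko2iblnd/parameters}"
    rc=0
    for kv in $(ko2iblnd_options < "$conf"); do
        name="${kv%%=*}"
        want="${kv#*=}"
        have="$(cat "$sysdir/$name" 2>/dev/null)"
        if [ "$have" != "$want" ]; then
            echo "$name: configured $want, running ${have:-unreadable}"
            rc=1
        fi
    done
    return $rc
}
```

Running "check_ko2iblnd" with no arguments prints one line per mismatch and
returns non-zero if any parameter differs.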

Hope it helps.

Rick


On 9/13/21 1:53 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
Inc.] via lustre-discuss wrote:
>
> Hello,
>
> I would like to know how to turn off auto discovery of peers on a 
> client.  This seems like it should be straightforward but we can't 
> get it to work. Please fill me in on what I'm missing.
>
> We recently upgraded our servers to 2.14.  Our servers are multi-homed 
> (1 tcp network and 2 separate IB networks) but we want them to be 
> single rail.  On one of our clusters we are still using the 2.12.6 
> client and it uses one of the IB networks for lustre.  The modprobe 
> file from one of the client nodes:
>
> # cat /etc/modprobe.d/lustre.conf
>
> options lnet networks=o2ib1(ib0)
>
> options ko2iblnd map_on_demand=32
>
> #
>
> The client does have a route to the TCP network.  This is intended to 
> allow jobs on the compute nodes to access license servers, not for 
> any serious I/O.  We recently discovered that due to some instability 
> in the IB fabric, the client was trying to fail over to tcp:
>
> # dmesg | grep Lustre
>
> [ 250.205912] Lustre: Lustre: Build Version: 2.12.6
>
> [ 255.886086] Lustre: Mounted scratch-client
>
> [ 287.247547] Lustre: 
> 3472:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent 
> has timed out for sent delay: [sent 1630699139/real 0]  
> req@ffff98deb9358480 x1709911947878336/t0(0) 
> o9->hpfs-fsl-OST0001-osc-ffff9880cfb80000@192.52.98.33@tcp:28/4 lens 
> 224/224 e 0 to 1 dl 1630699145 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
>
> [ 739.832744] Lustre: 
> 3526:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent 
> has timed out for sent delay: [sent 1630699591/real 0]  
> req@ffff98deb935da00 x1709911947883520/t0(0) 
> o400->scratch-MDT0000-mdc-ffff98b0f1fc0800@192.52.98.31@tcp:12/10 lens 
> 224/224 e 0 to 1 dl 1630699598 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
>
> [ 739.832755] Lustre: 
> 3526:0:(client.c:2146:ptlrpc_expire_one_request()) Skipped 5 previous 
> similar messages
>
> [ 739.832762] LustreError: 166-1: MGC10.150.100.30@o2ib1: Connection 
> to MGS (at 192.52.98.30@tcp) was lost; in progress operations using 
> this service will fail
>
> [ 739.832769] Lustre: hpfs-fsl-MDT0000-mdc-ffff9880cfb80000: 
> Connection to hpfs-fsl-MDT0000 (at 192.52.98.30@tcp) was lost; in 
> progress operations using this service will wait for recovery to complete
>
> [ 1090.978619] LustreError: 167-0: 
> scratch-MDT0000-mdc-ffff98b0f1fc0800: This client was evicted by 
> scratch-MDT0000; in progress operations using this service will fail.
>
> I'm pretty sure this is due to the auto discovery.  Again, from a client:
>
> # lnetctl export | grep -e Multi -e discover | sort -u
>      discovery: 0
>        Multi-Rail: True
> #
>
> We want to restrict lustre to only the IB NID but it's not clear 
> exactly how to do that.
>
> Here is one attempt:
>
>
> [root@r1i1n18 lnet]# service lustre3 stop
> Shutting down lustre mounts
> Lustre modules successfully unloaded
> [root@r1i1n18 lnet]# lsmod | grep lnet
> [root@r1i1n18 lnet]# cat /etc/lnet.conf
> global:
>     discovery: 0
> [root@r1i1n18 lnet]# service lustre3 start
> Mounting /ephemeral... done.
> Mounting /nobackup... done.
> [root@r1i1n18 lnet]# lnetctl export | grep -e Multi -e discover | sort -u
>     discovery: 1
>       Multi-Rail: True
> [root@r1i1n18 lnet]#
>
> And a similar attempt (same lnet.conf file), but trying to turn off 
> the discovery before doing the mounts:
>
> [root@r1i1n18 lnet]# service lustre3 stop
> Shutting down lustre mounts
> Lustre modules successfully unloaded
> [root@r1i1n18 lnet]# modprobe lnet
> [root@r1i1n18 lnet]# lnetctl set discovery 0
> [root@r1i1n18 lnet]# service lustre3 start
> Mounting /ephemeral... done.
> Mounting /nobackup... done.
> [root@r1i1n18 lnet]# lnetctl export | grep -e Multi -e discover | sort -u
>      discovery: 0
>        Multi-Rail: True
> [root@r1i1n18 lnet]#
>
> If someone can point me in the right direction, I'd appreciate it.
>
> Thanks,
>
> Darby
>
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org