[lustre-discuss] Disabling multi-rail dynamic discovery
Riccardo Veraldi
riccardo.veraldi at cnaf.infn.it
Mon Sep 13 14:15:46 PDT 2021
I would configure this in /etc/lnet.conf and no longer use the
older-style configuration in
/etc/modprobe.d/lustre.conf
For example, my /etc/lnet.conf configuration looks like this:
ip2nets:
  - net-spec: o2ib
    interfaces:
      0: ib0
  - net-spec: tcp
    interfaces:
      0: enp24s0f0
global:
  discovery: 0
That is how I disabled auto discovery.
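Note that on most installs /etc/lnet.conf is only applied when the lnet
service imports it at boot; if it is not being picked up, you can apply
and verify it by hand (a sketch -- service and file locations are
assumptions, check your distribution):

    modprobe lnet
    lnetctl lnet configure
    lnetctl import /etc/lnet.conf
    lnetctl global show    # should report discovery: 0

An alternative (also an assumption -- confirm the parameter exists with
"modinfo lnet" on your version) is to disable discovery at module load
time, so it is already off before anything mounts:

    options lnet lnet_peer_discovery_disabled=1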
Regarding ko2iblnd, you can just use /etc/modprobe.d/ko2iblnd.conf
Mine looks like this:
options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024
ntx=2048 map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512
fmr_cache=1 conns_per_peer=4
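To confirm those ko2iblnd values actually took effect after the module
loads, you can read them back from sysfs (standard module-parameter
paths, shown here as a sketch):

    cat /sys/module/ko2iblnd/parameters/peer_credits
    cat /sys/module/ko2iblnd/parameters/conns_per_peer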
Hope it helps.
Rick
On 9/13/21 1:53 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology,
Inc.] via lustre-discuss wrote:
>
> Hello,
>
> I would like to know how to turn off auto discovery of peers on a
> client. This seems like it should be straightforward, but we can't
> get it to work. Please fill me in on what I'm missing.
>
> We recently upgraded our servers to 2.14. Our servers are multi-homed
> (1 tcp network and 2 separate IB networks) but we want them to be
> single rail. On one of our clusters we are still using the 2.12.6
> client and it uses one of the IB networks for lustre. The modprobe
> file from one of the client nodes:
>
> # cat /etc/modprobe.d/lustre.conf
>
> options lnet networks=o2ib1(ib0)
>
> options ko2iblnd map_on_demand=32
>
> #
>
> The client does have a route to the TCP network. This is intended to
> allow jobs on the compute nodes to access license servers, not for
> any serious I/O. We recently discovered that, due to some instability
> in the IB fabric, the client was trying to fail over to tcp:
>
> # dmesg | grep Lustre
>
> [ 250.205912] Lustre: Lustre: Build Version: 2.12.6
>
> [ 255.886086] Lustre: Mounted scratch-client
>
> [ 287.247547] Lustre:
> 3472:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent
> has timed out for sent delay: [sent 1630699139/real 0]
> req@ffff98deb9358480 x1709911947878336/t0(0)
> o9->hpfs-fsl-OST0001-osc-ffff9880cfb80000@192.52.98.33@tcp:28/4 lens
> 224/224 e 0 to 1 dl 1630699145 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
>
> [ 739.832744] Lustre:
> 3526:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent
> has timed out for sent delay: [sent 1630699591/real 0]
> req@ffff98deb935da00 x1709911947883520/t0(0)
> o400->scratch-MDT0000-mdc-ffff98b0f1fc0800@192.52.98.31@tcp:12/10 lens
> 224/224 e 0 to 1 dl 1630699598 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
>
> [ 739.832755] Lustre:
> 3526:0:(client.c:2146:ptlrpc_expire_one_request()) Skipped 5 previous
> similar messages
>
> [ 739.832762] LustreError: 166-1: MGC10.150.100.30@o2ib1: Connection
> to MGS (at 192.52.98.30@tcp) was lost; in progress operations using
> this service will fail
>
> [ 739.832769] Lustre: hpfs-fsl-MDT0000-mdc-ffff9880cfb80000:
> Connection to hpfs-fsl-MDT0000 (at 192.52.98.30@tcp) was lost; in
> progress operations using this service will wait for recovery to complete
>
> [ 1090.978619] LustreError: 167-0:
> scratch-MDT0000-mdc-ffff98b0f1fc0800: This client was evicted by
> scratch-MDT0000; in progress operations using this service will fail.
>
> I'm pretty sure this is due to the auto discovery. Again, from a client:
>
> # lnetctl export | grep -e Multi -e discover | sort -u
> discovery: 0
> Multi-Rail: True
> #
>
> We want to restrict lustre to only the IB NID, but it's not clear
> exactly how to do that.
>
> Here is one attempt:
>
>
> [root@r1i1n18 lnet]# service lustre3 stop
> Shutting down lustre mounts
> Lustre modules successfully unloaded
> [root@r1i1n18 lnet]# lsmod | grep lnet
> [root@r1i1n18 lnet]# cat /etc/lnet.conf
> global:
>     discovery: 0
> [root@r1i1n18 lnet]# service lustre3 start
> Mounting /ephemeral... done.
> Mounting /nobackup... done.
> [root@r1i1n18 lnet]# lnetctl export | grep -e Multi -e discover | sort -u
>     discovery: 1
>     Multi-Rail: True
> [root@r1i1n18 lnet]#
>
> And a similar attempt (same lnet.conf file), but trying to turn off
> the discovery before doing the mounts:
>
> [root@r1i1n18 lnet]# service lustre3 stop
> Shutting down lustre mounts
> Lustre modules successfully unloaded
> [root@r1i1n18 lnet]# modprobe lnet
> [root@r1i1n18 lnet]# lnetctl set discovery 0
> [root@r1i1n18 lnet]# service lustre3 start
> Mounting /ephemeral... done.
> Mounting /nobackup... done.
> [root@r1i1n18 lnet]# lnetctl export | grep -e Multi -e discover | sort -u
>     discovery: 0
>     Multi-Rail: True
> [root@r1i1n18 lnet]#
>
> If someone can point me in the right direction, I'd appreciate it.
>
> Thanks,
>
> Darby
>
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org