[lustre-discuss] how to enforce traffic to OSS on o2ib1 only ?

Stephane Thiell sthiell at stanford.edu
Tue Sep 28 11:56:18 PDT 2021


Hi Riccardo,

I would check if the OSTs on this OSS have been registered with the correct NIDs (o2ib1) on the MGS:

$ lctl --device MGS llog_print <fsname>-client

and look for the NIDs in setup/add_conn for the OSTs in question.
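
On a busy file system the config log is long, so it can help to keep only the records that mention this OSS. The sketch below is only an illustration: the helper name oss_nid_records is made up, the two NIDs are the ones quoted later in this thread, and it assumes llog_print writes NIDs in the usual address@network form. If the records for the affected OSTs show only the tcp1 NID, that would suggest the OSTs were registered with it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): read config-log text on stdin
# and keep only the lines that mention either NID of the problem OSS.
# Both NIDs are taken from the original post; -F matches them literally.
oss_nid_records() {
    grep -F -e '172.21.164.116@o2ib1' -e '172.21.156.102@tcp1'
}

# Usage on the MGS (substitute the real file system name for <fsname>):
#   lctl --device MGS llog_print <fsname>-client | oss_nid_records
```

The filter only narrows the output; the matching records still need to be read by eye to see which NID each OST connection uses.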

Best,

Stephane



> On Sep 28, 2021, at 9:52 AM, Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it> wrote:
> 
> Hello.
> 
> I have a Lustre setup where the MDS (172.21.156.112) is on tcp1 while the OSSes are on o2ib1.
> 
> I am using Lustre 2.12.7 on RHEL 7.9
> 
> All the clients can see the MDS correctly as a tcp1 peer:
> 
> peer:
>     - primary nid: 172.21.156.112@tcp1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.156.112@tcp1
>           state: NA
> 
> 
> This is by design because the MDS has no IB interface, so MDS-to-OSS and MDS-to-client traffic is on tcp1, while client-to-OSS traffic is meant to be on o2ib1.
> 
> I have 1 MDS (tcp1), 12 OSSes (tcp1, o2ib1), and 20 clients (tcp1, o2ib1).
> 
> All is fine except for one of the OSSes (172.21.164.116@o2ib1, 172.21.156.102@tcp1).
> 
> Even though it is configured the same as all the other ones, traffic only goes through tcp1 and not o2ib1.
> 
> Even if I force the peer settings to use o2ib1, they are ignored and the tcp1 peer is added anyway.
> 
> This is lnet.conf on the MDS:
> 
> ip2nets:
>  - net-spec: o2ib1
>    interfaces:
>       0: ib0
>  - net-spec: tcp1
>    interfaces:
>       0: eno1
> global:
>     discovery: 0
> 
> 
> 
> This is lnet.conf on the OSSes:
> 
> ip2nets:
>  - net-spec: o2ib1
>    interfaces:
>       0: ib0
>  - net-spec: tcp1
>    interfaces:
>       0: enp1s0f0
> global:
>     discovery: 0
> 
> 
> 
> I also tried this on the lustre clients side:
> 
> peer:
>     - primary nid: 172.21.164.116@o2ib1
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.164.116@o2ib1
> 
> enforcing the peer settings to o2ib1.
> 
> This is ignored and the peer is added by its tcp1 LNET interface.
> 
>     - primary nid: 172.21.156.102@tcp1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.156.102@tcp1
>           state: NA
> 
> All of the hosts involved have discovery set to 0.
> 
> Nevertheless, the peer setting for that specific OSS uses tcp1 and not o2ib1.
> 
> This is disruptive because traffic to that specific OSS goes over tcp1, which is of course slower than IB.
> 
> I had to deactivate the OSTs on that specific OSS.
> 
> How can I fix this issue?
> 
> Here is the complete peer list from the Lustre client side. As you can see, that specific OSS is included as a tcp1 peer.
> 
> Even if I run "lnetctl peer del --nid 172.21.156.102@tcp1 --prim_nid 172.21.156.102@tcp1", the entry is added back automatically after a while.
> 
> lnetctl peer show
> peer:
>     - primary nid: 172.21.156.112@tcp1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.156.112@tcp1
>           state: NA
>     - primary nid: 172.21.164.111@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.111@o2ib1
>           state: NA
>     - primary nid: 172.21.164.117@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.117@o2ib1
>           state: NA
>     - primary nid: 172.21.164.112@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.112@o2ib1
>           state: NA
>     - primary nid: 172.21.164.119@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.119@o2ib1
>           state: NA
>     - primary nid: 172.21.164.114@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.114@o2ib1
>           state: NA
>     - primary nid: 172.21.164.120@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.120@o2ib1
>           state: NA
>     - primary nid: 172.21.156.102@tcp1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.156.102@tcp1
>           state: NA
>     - primary nid: 172.21.164.116@o2ib1
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.164.116@o2ib1
>           state: NA
>     - primary nid: 172.21.164.110@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.110@o2ib1
>           state: NA
>     - primary nid: 172.21.164.115@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.115@o2ib1
>           state: NA
>     - primary nid: 172.21.164.118@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.118@o2ib1
>           state: NA
>     - primary nid: 172.21.164.113@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.113@o2ib1
>           state: NA
>     - primary nid: 172.21.164.121@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.121@o2ib1
>           state: NA
> 
> 
> thanks for looking at this.
> 
> Rick
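
With a peer list this long, the tcp1 entries are easy to miss. A minimal sketch for picking them out on a client follows. The function name primaries_on_net is hypothetical, and it assumes the peer list has the layout quoted above, with each NID written as address@network on a "primary nid:" line.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): read "lnetctl peer show"
# output on stdin and print every primary NID on the given LNet network.
primaries_on_net() {
    awk -v net="$1" '
        /primary nid:/ {
            # The NID is the last field; split it into address and network.
            n = split($NF, parts, "@")
            if (n == 2 && parts[2] == net) print $NF
        }'
}

# Usage on a client:
#   lnetctl peer show | primaries_on_net tcp1
```

In the setup described in this thread, only the MDS would be expected on tcp1, so any other NID printed points at a server being reached over Ethernet.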