[lustre-discuss] Lustre 2.12.0 and locking problems
Riccardo Veraldi
Riccardo.Veraldi at cnaf.infn.it
Wed Mar 6 12:15:10 PST 2019
On 3/6/19 11:29 AM, Amir Shehata wrote:
> The reason for the load being split across the tcp and o2ib0 for the
> 2.12 client, is because the MR code sees both interfaces and realizes
> it can use both of them and so it does.
> To disable this behavior you can disable discovery on the 2.12 client.
> I think that should just get the client to only use the single
> interface it's told to.
Thank you very much, this worked out well.
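For anyone landing on this thread later, here is a minimal sketch of the runtime change on a 2.12 client. `lnetctl set discovery 0` is the command Amir gave for the servers further down in this thread. The wrapper name `disable_lnet_discovery` and the follow-up check with `lnetctl global show` are my own assumptions, and as far as I know the setting does not survive an LNet reload unless it is also made persistent.

```shell
# Hypothetical helper; run as root on the client with LNet already configured.
disable_lnet_discovery() {
    # Turn off dynamic peer discovery at runtime (command from this thread).
    lnetctl set discovery 0 || return 1
    # Print the global settings; a "discovery: 0" line is expected afterwards.
    lnetctl global show
}
```

Call `disable_lnet_discovery` once after LNet is up, then remount or re-check the peers with `lnetctl peer show`.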
> We're currently working on a feature (UDSP) which will allow the
> specification of a "preferred" network. In your case you can set the
> o2ib to be the preferred network. It'll always be used unless it
> becomes unavailable. You get two benefits this way: 1) your preference
> is adhered to. 2) reliability, since the tcp network will be used if
> the o2ib network becomes unavailable.
This feature (UDSP) would be really great.
>
> Let me know if disabling discovery on your 2.12 clients works.
Yes, after disabling discovery on the client side, the situation is much
better.
Thank you very much.
>
> thanks
> amir
>
> On Tue, 5 Mar 2019 at 18:49, Riccardo Veraldi
> <Riccardo.Veraldi at cnaf.infn.it>
> wrote:
>
> Hello Amir, I answer in-line.
>
> On 3/5/19 3:42 PM, Amir Shehata wrote:
>> It looks like the ping is passing. Did you try it several times
>> to make sure it always pings successfully?
>>
>> The way it works is the MDS (2.12) discovers all the interfaces
>> on the peer. There is a concept of the primary NID for the peer.
>> That's the first interface configured on the peer. In your case
>> it's the o2ib NID. So when you do lnetctl peer show you'll see
>> Primary NID: <nid>@o2ib.
>>
>>     - primary nid: 172.21.52.88@o2ib
>>       Multi-Rail: True
>>       peer ni:
>>         - nid: 172.21.48.250@tcp
>>           state: NA
>>         - nid: 172.21.52.88@o2ib
>>           state: NA
>>         - nid: 172.21.48.250@tcp1
>>           state: NA
>>         - nid: 172.21.48.250@tcp2
>>           state: NA
>>
>> On the MDS it uses the primary_nid to identify the peer. So you
>> can ping using the Primary NID. LNet will resolve the Primary NID
>> to the tcp NID. As you can see in the logs, it never actually
>> talks over o2ib. It ends up talking to the peer on its TCP NID,
>> which is what you want to do.
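To check which networks peers are known on, the `- nid:` lines of `lnetctl peer show` can be reduced to their network suffixes. This is a rough sketch that assumes the YAML layout quoted above; `list_peer_nets` is a made-up name, not part of Lustre.

```shell
# Hypothetical helper: reads `lnetctl peer show` output on stdin and
# prints each network (the part after "@" in a peer NI's NID) once.
# Lines starting with "- primary nid:" are skipped on purpose.
list_peer_nets() {
    awk '$1 == "-" && $2 == "nid:" { split($3, parts, "@"); print parts[2] }' | sort -u
}
```

Usage: `lnetctl peer show | list_peer_nets`. Note that it aggregates over every peer in the output, so it is most useful on a node with a single peer of interest.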
>>
>> I think the problem you're seeing is caused by the combination of
>> 2.12 and 2.10.x.
>> From what I understand your servers are 2.12 and your clients are
>> 2.10.x.
> My clients are 2.10.5, but this problem also arises with one 2.12.0
> client. Anyway, the combination of 2.10.0 clients and 2.12.0 is not
> working right.
>>
>> Can you try disabling dynamic discovery on your servers:
>> lnetctl set discovery 0
>
> I did this on the MDS and OSS. I did not disable discovery on the
> client side.
>
> Now on the MDS side lnetctl peer show looks right.
>
> Anyway, on the client side, where I have both IB and tcp, if I write
> to the Lustre filesystem (OSS), the write operation is split
> (load-balanced) between IB and tcp (Ethernet), and I do not want
> this. I would like only IB to be used when the client writes data
> to the OSS. But both peer NIs (o2ib, tcp) are seen from the 2.12.0
> client and traffic goes to both of them, which reduces performance
> because IB is not fully used. This does not happen with a 2.10.5
> client writing to the same 2.12.0 OSS.
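One way to confirm where the traffic goes is to compare per-network send counters on the client before and after a test write. This is a rough sketch. It assumes that `lnetctl net show -v` prints a `- net type:` line per network and a `send_count:` line per local NI, which is my understanding of the output and should be checked against your version. `sum_send_counts` is a made-up name.

```shell
# Hypothetical helper: reads `lnetctl net show -v` output on stdin and
# prints "<net> <total send_count>" for each network, sorted by name.
sum_send_counts() {
    awk '
        $1 == "-" && $2 == "net" && $3 == "type:" { net = $4 }
        $1 == "send_count:"                      { sent[net] += $2 }
        END { for (n in sent) print n, sent[n] }
    ' | sort
}
```

Usage: run `lnetctl net show -v | sum_send_counts` before and after the write. If only the o2ib counter grows, the tcp interface is no longer carrying data.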
>