[lustre-discuss] Lustre 2.12.0 and locking problems

Amir Shehata amir.shehata.whamcloud at gmail.com
Wed Mar 6 13:44:03 PST 2019


no problem

On Wed, 6 Mar 2019 at 12:15, Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it>
wrote:

> On 3/6/19 11:29 AM, Amir Shehata wrote:
>
> The reason the load is split across tcp and o2ib0 on the 2.12 client is
> that the MR code sees both interfaces, realizes it can use both of them,
> and so it does.
> To disable this behavior you can disable discovery on the 2.12 client. I
> think that should just get the client to only use the single interface it's
> told to.
>
> thank you very much, this worked out well.
>
> We're currently working on a feature (UDSP) which will allow the
> specification of a "preferred" network. In your case you can set the o2ib
> to be the preferred network. It'll always be used unless it becomes
> unavailable. You get two benefits this way: 1) your preference is adhered
> to. 2) reliability, since the tcp network will be used if the o2ib network
> becomes unavailable.
>
> this feature (UDSP) would be really great.
>
>
> Let me know if disabling discovery on your 2.12 clients works.
>
> yes after disabling discovery on the client side, the situation is much
> better
>
>
> thank you very much
>
>
>
> thanks
> amir
>
> On Tue, 5 Mar 2019 at 18:49, Riccardo Veraldi <
> Riccardo.Veraldi at cnaf.infn.it> wrote:
>
>> Hello Amir, I answer in-line
>>
>> On 3/5/19 3:42 PM, Amir Shehata wrote:
>>
>> It looks like the ping is passing. Did you try it several times to make
>> sure it always pings successfully?
>>
>> The way it works is the MDS (2.12) discovers all the interfaces on the
>> peer. There is a concept of the primary NID for the peer. That's the first
>> interface configured on the peer. In your case it's the o2ib NID. So when
>> you do lnetctl peer show you'll see Primary NID: <nid>@o2ib.
>>
>>     - primary nid: 172.21.52.88@o2ib
>>        Multi-Rail: True
>>        peer ni:
>>          - nid: 172.21.48.250@tcp
>>            state: NA
>>          - nid: 172.21.52.88@o2ib
>>            state: NA
>>          - nid: 172.21.48.250@tcp1
>>            state: NA
>>          - nid: 172.21.48.250@tcp2
>>            state: NA
>>
>> On the MDS it uses the primary_nid to identify the peer. So you can ping
>> using the Primary NID. LNet will resolve the Primary NID to the tcp NID. As
>> you can see in the logs, it never actually talks over o2ib. It ends up
>> talking to the peer on its TCP NID, which is what you want to do.
>>
>> I think the problem you're seeing is caused by the combination of 2.12
>> and 2.10.x.
>> From what I understand your servers are 2.12 and your clients are 2.10.x.
>>
>> my clients are 2.10.5, but this problem also arises with one 2.12.0 client;
>> anyway, the combination of 2.10.0 clients and 2.12.0 is not working right
>>
>>
>> Can you try disabling dynamic discovery on your servers:
>> lnetctl set discovery 0
>>
>> I did this on the MDS and OSS. I did not disable discovery on the client
>> side.
>>
>> now on the MDS side lnetctl peer show looks right.
>>
>> Anyway, on the client side, where I have both IB and tcp, if I write to the
>> Lustre filesystem (OSS), what happens is that the write operation is
>> split/load-balanced between IB and tcp (Ethernet), and I do not want this.
>> I would like only IB to be used when the client writes data to the
>> OSS, but both peer NIs (o2ib, tcp) are seen from the 2.12.0 client and
>> traffic goes to both of them, thus reducing performance because IB is not
>> fully used. This does not happen with a 2.10.5 client writing to the same
>> 2.12.0 OSS.
>>
>
>
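
For anyone checking whether a node is in the same situation, here is a minimal
sketch of a helper that lists, for each peer, the networks its NIs sit on. It
reads `lnetctl peer show`-style output on stdin and assumes NIDs are written as
<address>@<network>, as lnetctl prints them. The function name peer_networks is
made up for illustration and is not part of Lustre.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): summarize which LNet networks
# each peer is reachable on, given `lnetctl peer show`-style text on stdin.
peer_networks() {
    awk '
        # A "primary nid:" line starts a new peer record; flush the previous one.
        /primary nid:/ {
            if (peer != "") print peer ":" nets
            peer = $NF
            nets = ""
            next
        }
        # Each "- nid: <address>@<network>" line contributes one network name.
        /- nid:/ {
            split($NF, parts, "@")
            nets = nets " " parts[2]
        }
        # Flush the last peer record at end of input.
        END { if (peer != "") print peer ":" nets }
    '
}
```

On a live node this would be used as `lnetctl peer show | peer_networks`. A peer
listed with both an o2ib and a tcp network is one where a 2.12 node with
discovery enabled may spread traffic over both; `lnetctl set discovery 0`, as
discussed above, is what stopped that here.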