[lustre-discuss] Multiple IB Interfaces

Alastair Basden a.g.basden at durham.ac.uk
Fri Mar 12 01:32:03 PST 2021


Hi all,

Thanks for the replies.  The issue as I see it is with sending data from 
an OST to the client, avoiding the inter-CPU link.

So, if I have:
cpu1 - IB card 1 (10.0.0.1), nvme1 (OST1)
cpu2 - IB card 2 (10.0.0.2), nvme2 (OST2)

Both IB cards on the same subnet.  Therefore, by default, packets will be 
routed out of the server over the preferred card, say IB card 1 (I could 
be wrong, but this is my current understanding, and seems to be what the 
Lustre manual says).
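
The layout above (which IB card and which NVMe device hang off which CPU) 
can be checked from sysfs before worrying about routing.  A minimal 
sketch, assuming the usual Linux sysfs layout in which each PCI-backed 
device exposes a device/numa_node file; the function names are my own, 
and a value of -1 means the kernel reports no NUMA affinity:

```python
from pathlib import Path


def numa_node(device_dir: Path) -> int:
    """Return the NUMA node recorded for a PCI-backed device, or -1 if unknown."""
    node_file = device_dir / "device" / "numa_node"
    try:
        return int(node_file.read_text().strip())
    except (OSError, ValueError):
        return -1


def devices_by_numa_node(sysfs_root: str = "/sys") -> dict:
    """Group InfiniBand HCAs and NVMe controllers by the NUMA node they report."""
    root = Path(sysfs_root)
    layout = {}
    for class_name in ("infiniband", "nvme"):
        class_dir = root / "class" / class_name
        if not class_dir.is_dir():
            continue  # this class of device is not present on the host
        for dev in sorted(class_dir.iterdir()):
            layout.setdefault(numa_node(dev), []).append(f"{class_name}:{dev.name}")
    return layout


if __name__ == "__main__":
    for node, devices in sorted(devices_by_numa_node().items()):
        print(f"NUMA node {node}: {', '.join(devices)}")
```

On the server described here, I would expect one IB device and one NVMe 
controller listed under each NUMA node.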

Data coming in (being written to the OST) is not a problem.  The client 
will know the IP address of the card to which the OST is closest.  So, 
to write to OST2, it will use the 10.0.0.2 address (since this will be 
the IP address given in mkfs.lustre for that OST).

The slight complication here is pinning.  The server thread handling the 
write may run on cpu1, in which case the data has to traverse the 
inter-CPU link twice.  However, I am assuming that this won't happen - 
i.e. the kernel or Lustre is clever enough to place this thread on cpu2.  
As far as I am aware, this should just work, though please correct me if 
I'm wrong.  Perhaps I have to manually specify pinning - how does one do 
that with Lustre?

Reading is more problematic.  A request from a client (say 10.0.0.100) for 
data on OST2 will come in via card 2 (10.0.0.2).  A thread on CPU2 
(hopefully) will then read the data from OST2, and send it out to the 
client, 10.0.0.100.  However, here, Linux will route the packet through 
the first card on this subnet, so it will go over the inter-cpu link, and 
out of IB card 1.  And this will be the case even if the thread is pinned 
on CPU2.
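
To make the counting in the write and read cases explicit, here is a toy 
model of the two-socket server.  It is only a sketch: it assumes data 
passes through memory local to the CPU running the service thread, which 
real DMA/RDMA buffer placement may not respect, and the function name is 
made up for illustration:

```python
def link_crossings(source_cpu: int, thread_cpu: int, sink_cpu: int) -> int:
    """Count inter-CPU link traversals for data moving source -> thread -> sink.

    Toy model: one traversal if the source device and the thread sit on
    different CPUs, plus one more if the thread and the sink device do.
    """
    return int(source_cpu != thread_cpu) + int(thread_cpu != sink_cpu)


# Write to OST2: data arrives on IB card 2 (cpu2) and lands on nvme2 (cpu2).
print(link_crossings(source_cpu=2, thread_cpu=2, sink_cpu=2))  # 0, thread on cpu2
print(link_crossings(source_cpu=2, thread_cpu=1, sink_cpu=2))  # 2, thread on cpu1

# Read from OST2: data comes from nvme2 (cpu2), thread pinned on cpu2.
print(link_crossings(source_cpu=2, thread_cpu=2, sink_cpu=1))  # 1, out via IB card 1
print(link_crossings(source_cpu=2, thread_cpu=2, sink_cpu=2))  # 0, out via IB card 2
```

The third case is the one I am worried about: pinning the thread does not 
help if the reply still leaves through IB card 1.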

The question then is whether there is a way to configure Lustre to use IB 
card 2 when sending out data from OST2.

Cheers,
Alastair.

On Wed, 10 Mar 2021, Ms. Megan Larko wrote:

> Greetings Alastair,
>
> Bonding is supported on InfiniBand, but I believe that it is only active/passive.
> I think what you might be looking for WRT avoiding data travel through the inter-cpu link is cpu "affinity" AKA cpu "pinning".
>
> Cheers,
> megan
>
> WRT = "with regards to"
> AKA = "also known as"
>

