<div dir="ltr">To build up on this email trail, the Scope and Requirements document is now available here:<div><br></div><div><a href="http://wiki.lustre.org/Multi-Rail_LNet">http://wiki.lustre.org/Multi-Rail_LNet</a><br></div><div><br></div><div>Exact link:<br><br></div><div><a href="http://wiki.lustre.org/images/7/73/Multi-Rail%2BScope%2Band%2BRequirements%2BDocument.pdf">http://wiki.lustre.org/images/7/73/Multi-Rail%2BScope%2Band%2BRequirements%2BDocument.pdf</a><br></div><div><br></div><div>thanks</div><div>amir</div></div><div class="gmail_extra"><br><div class="gmail_quote">On 16 October 2015 at 03:14, Olaf Weber <span dir="ltr"><<a href="mailto:olaf@sgi.com" target="_blank">olaf@sgi.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">As some of you know, I held a presentation at the LAD'15 developers<br>

meeting describing a proposal for implementing multi-rail networking<br>

for Lustre. Some discussion on this list has referenced that talk.<br>

The slides I used can be found here (1MB PDF):<br>

<br>

  <a href="http://wiki.lustre.org/images/7/79/LAD15_Lustre_Interface_Bonding_Final.pdf" rel="noreferrer" target="_blank">http://wiki.lustre.org/images/7/79/LAD15_Lustre_Interface_Bonding_Final.pdf</a><br>

<br>

Since I grew up when a slide deck was a pile of transparencies, and<br>

the rule was that you'd used too much text if people could get more<br>

than the bare gist of your talk from the slides, the rest of this mail<br>

is a slide-by-slide paraphrase of the talk. (There is no recording:<br>

unless the other attendees weigh in this is the best you'll get.) A<br>

few points are starred to indicate that they are after-the-talk<br>

additions.<br>

<br>

<br>

Slide 1: Lustre Interface Bonding<br>

<br>

Title - boring.<br>

<br>

<br>

Slide 2: Interface Bonding<br>

<br>

Under various names multi-rail has been a longstanding wishlist<br>

item. The various names do imply technical differences in how people<br>

think about the problem and the solutions they propose. Despite the<br>

title of the presentation, this proposal is best characterized by<br>

"multi-rail", which is the term we've been using internally at SGI.<br>

<br>

Fujitsu contributed an implementation at the level of the infiniband<br>

LND. It wasn't landed, and some of the reviewers felt that an<br>

LNet-level solution should be investigated instead. This code was a big<br>

influence on how I ended up approaching the problem.<br>

<br>

The current proposal is a collaboration between SGI and Intel. There<br>

is in fact a contract and resources have been committed. Whether the<br>

implementation will match this proposal is still an open question:<br>

these are the early days, and your feedback is welcome and can be<br>

taken into account.<br>

<br>

The end goal is general availability.<br>

<br>

<br>

Slide 3: Why Multi-Rail?<br>

<br>

At SGI we care because our big systems can have tens of terabytes of<br>

memory, and therefore need a fat connection to the rest of a Lustre<br>

cluster.<br>

<br>

An additional complication is that big systems have an internal<br>

"network" (NUMAlink in SGI's case) and it can matter a lot for<br>

performance whether memory is close or remote to an interface. So what<br>

we want is to have multiple interfaces spread throughout a system, and<br>

then be able to use whichever will be most efficient for a particular<br>

I/O operation.<br>

<br>

<br>

Slide 4: Design Constraints<br>

<br>

These are a couple of constraints (or requirements if you prefer) that<br>

the design tries to satisfy.<br>

<br>

Mixed-version clusters: it should not be a requirement to update an<br>

entire cluster because of a few multi-rail capable nodes. Moreover, in<br>

mixed-vendor situations, it may not be possible to upgrade an entire<br>

cluster in one fell swoop.<br>

<br>

Simple configuration: creating and distributing configuration files,<br>

and then keeping them in sync across a cluster, becomes tricky once<br>

clusters get bigger. So I look for ways to have the systems configure<br>

themselves.<br>

<br>

Adaptable: autoconfiguration is nice, but there are always cases where<br>

it doesn't get things quite right. There have to be ways to fine-tune<br>

a system or cluster, or even to completely override the<br>

autoconfiguration.<br>

<br>

LNet-level implementation: there are three levels at which you can<br>

reasonably implement multi-rail: LND, LNet, and PortalRPC. An<br>

LND-level solution has as its main disadvantage that you cannot<br>

balance I/O load between LNDs. A PortalRPC-level solution would<br>

certainly follow a common design tenet in networking: "the upper<br>

layers will take care of that". The upper layers just want a reliable<br>

network, thankyouverymuch. LNet seems like the right spot for<br>

something like this. It allows the implementation to be reasonably<br>

self-contained within the LNet subsystem.<br>

<br>

<br>

Slide 5: Example Lustre Cluster<br>

<br>

A simple cluster, used to guide the discussion. Missing in the picture<br>

is the connecting fabric. Note that the UV client is much bigger than<br>

the other nodes.<br>

<br>

<br>

Slide 6: Mono-rail Single Fabric<br>

<br>

The kind of fabric we have today. The UV is starved for I/O.<br>

<br>

<br>

Slide 7: LNets in a Single Fabric<br>

<br>

You can make additional interfaces in the UV useful by defining<br>

multiple LNets in the fabric, and then carefully setting up aliases on<br>

the nodes with only a single interface. This can be done today, but<br>

setting this up correctly is a bit tricky, and involves cluster-wide<br>

configuration. It is not something you'd like to have to retrofit to<br>

an existing cluster.<br>

<br>

<br>

Slide 8: Multi-rail Single Fabric<br>

<br>

An example of a fabric topology that we want to work well. Some nodes<br>

have multiple interfaces, and when they do they can all be used to<br>

talk to the other nodes.<br>

<br>

<br>

Slide 9: Multi-rail Dual Fabric<br>

<br>

Similar to previous slide, but now with more LNets. Here too the goal<br>

is active-active use of the LNets and all interfaces.<br>

<br>

<br>

Slide 10: Mixed-Version Clusters<br>

<br>

This section of the presentation expands on the first item of Slide 4.<br>

<br>

<br>

Slide 11: A Single Multi-Rail Node<br>

<br>

Assume we install multi-rail capable Lustre only on the UV. Would that<br>

work? It turns out that it should actually work, though there are some<br>

limits to the functionality. In particular, the MGS/MDS/OSS nodes will<br>

not be aware that they know the UV by a number of NIDs, and it may be<br>

best to avoid this by ensuring that the UV always uses the same<br>

interface to talk to a particular node. This gives us the same<br>

functionality as the multiple LNet example of Slide 7, but with a much<br>

less complicated configuration.<br>
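<br>
One way to make the UV always use the same interface for a particular node is a stable hash of the peer's NID. This is a minimal sketch in Python, not LNet code: the function name, the example NIDs, and the use of CRC32 are assumptions made for illustration.<br>
<br>

```python
import zlib


def pick_sticky_interface(peer_nid, local_nids):
    """Map a given peer NID to the same local interface every time."""
    if not local_nids:
        raise ValueError("no local interfaces configured")
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    index = zlib.crc32(peer_nid.encode("ascii")) % len(local_nids)
    return local_nids[index]


# Hypothetical NIDs: two local interfaces on the UV, one remote server.
local = ["10.0.0.1@o2ib", "10.0.0.2@o2ib"]
choice = pick_sticky_interface("10.0.1.7@o2ib", local)
```

<br>
The mapping only stays stable while the list of local interfaces is unchanged, so a real implementation would probably have to remember its choice per peer.<br>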

<br>

<br>

Slide 12: Peer Version Discovery<br>

<br>

A multi-rail capable node would like to know if any peer node is also<br>

multi-rail capable. The LNet protocol itself isn't properly versioned,<br>

but the LNet ping protocol (not to be confused with the ptlrpc<br>

pinger!) does transport a feature flags field. There are enough bits<br>

available in that field that we can just steal one and use it to<br>

indicate multi-rail capability in a ping reply.<br>

<br>

Note that a ping request does not carry any information beyond the<br>

source NID of the requesting node. In particular, it cannot carry<br>

version information to the node being pinged.<br>

<br>

<br>

Slide 13: Peer Version Discovery<br>

<br>

A simple version discovery protocol can be built on LNet ping.<br>

<br>

   1) LNet keeps track of all known peers<br>

   2) On first communication, do an LNet ping<br>

   3) The node now knows the peer version<br>

<br>

And we get a list of the peer's interfaces for free.<br>
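<br>
The three steps above can be sketched roughly as follows. This is illustrative Python, not the LNet implementation: the bit value for the multi-rail flag, the PingReply and PeerTable names, and the ping callback are all hypothetical.<br>
<br>

```python
from dataclasses import dataclass, field

# Hypothetical bit: the proposal only says that one spare feature bit is used.
FEATURE_MULTI_RAIL = 0x10


@dataclass
class PingReply:
    features: int   # feature flags field carried in the ping reply
    nids: list      # NIDs of all interfaces the peer reports


@dataclass
class PeerTable:
    peers: dict = field(default_factory=dict)

    def first_contact(self, nid, ping):
        """Ping a peer the first time we talk to it and cache the result."""
        if nid in self.peers:
            return self.peers[nid]
        reply = ping(nid)
        info = {
            "multi_rail": bool(reply.features & FEATURE_MULTI_RAIL),
            "nids": list(reply.nids),
        }
        # Record the peer under every NID it reported, plus the one we used.
        for alias in reply.nids:
            self.peers[alias] = info
        self.peers[nid] = info
        return info
```

<br>
Because the reply lists all of the peer's NIDs, a later message to any of those aliases does not need another ping.<br>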

<br>

<br>

Slide 14: Easy Configuration<br>

<br>

This section of the presentation expands on the second item of Slide 4.<br>

<br>

<br>

Slide 15: Peer Interface Discovery<br>

<br>

The list of interfaces of a peer is all we need to know for the simple<br>

cases. With that we know the peer under all its aliases, and can<br>

determine whether any of the other local interfaces (for example those<br>

on different LNets) can talk to the same peer.<br>

<br>

Now the peer also needs to know the node's interfaces. It would be<br>

nice if there was a reliable way to get the peer to issue an LNet ping<br>

to the node. For the most basic situation this works, but once I<br>

looked at more complex situations it became clear that this cannot be<br>

done reliably. So instead I propose to just have the node push a list<br>

of its interfaces to the peer.<br>

<br>

<br>

Slide 16: Peer Interface Discovery<br>

<br>

The push of the list is much like an LNet ping, except it does an<br>

LNetPut() instead of an LNetGet().<br>

<br>

This should be safe on several grounds. An LNet router doesn't do deep<br>

inspection of Put/Get requests, so even a downrev router will be able<br>

to forward them. If such a Put somehow ends up at a downrev peer, the<br>

peer will silently drop the message. (The slide says a protocol error<br>

will be returned, which is wrong.)<br>

<br>

<br>

Slide 17: Configuring Interfaces on a Node<br>

<br>

How does a node know its own interfaces? This can be done in a way<br>

similar to the current methods: kernel module parameters and/or<br>

DLC. These use the same in-kernel parser, so the syntax is similar in<br>

either case.<br>

<br>

    networks=o2ib(ib0,ib1)<br>

<br>

This is an example where two interfaces are used in the same LNet.<br>

<br>

    networks=o2ib(ib0[2],ib1[6])[2,6]<br>

<br>

The same example annotated with CPT information. This refers back to<br>

Slide 3: on a big NUMA system it matters to be able to place the<br>

helper threads for an interface close to that interface.<br>

<br>

* Of course that information is also available in the kernel, and with<br>

   a few extensions to the CPT mechanism, the kernel could itself find<br>

   the node to which an interface is connected, then find the/a CPT<br>

   that contains CPUs on that node.<br>
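<br>
To make the two networks= examples on this slide concrete, here is a small parser for that syntax. It is a sketch in Python under the assumption that the grammar is exactly what the two examples show. It is not the in-kernel parser, and the function names and the returned layout are made up for illustration.<br>
<br>

```python
import re

# net(iface[cpts],iface[cpts],...)[cpts], with every [cpts] part optional.
_NET_RE = re.compile(r"^(\w+)\(([^()]*)\)(?:\[([\d,]+)\])?$")
_IFACE_RE = re.compile(r"(\w+)(?:\[([\d,]+)\])?")


def _cpt_list(text):
    """Turn '2,6' into [2, 6]; a missing annotation becomes an empty list."""
    return [int(n) for n in text.split(",")] if text else []


def parse_network(spec):
    """Parse the value of a networks= setting for a single LNet."""
    match = _NET_RE.match(spec.replace(" ", ""))
    if match is None:
        raise ValueError("unrecognised network spec: " + spec)
    net, iface_text, net_cpts = match.groups()
    found = list(_IFACE_RE.finditer(iface_text))
    # Reject input that the interface pattern only partly covers.
    if ",".join(m.group(0) for m in found) != iface_text:
        raise ValueError("unrecognised interface list: " + iface_text)
    interfaces = [(m.group(1), _cpt_list(m.group(2))) for m in found]
    return {"net": net, "cpts": _cpt_list(net_cpts), "interfaces": interfaces}


parsed = parse_network("o2ib(ib0[2],ib1[6])[2,6]")
```

<br>
Here parsed["interfaces"] holds one (name, CPT list) pair per interface, and parsed["cpts"] holds the CPTs given for the LNet as a whole.<br>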

<br>

<br>

Slide 18: Configuring Interfaces on a Node<br>

<br>

LNet uses credits to determine whether a node can send something<br>

across an interface or to a peer. These credits are assigned<br>

per-interface, for both local and peer credits. So more interfaces<br>

means more credits overall. The defaults for credit-related tunables<br>

can stay the same. On LNet routers, which do have multiple interfaces,<br>

these tunables are already interpreted per interface.<br>

<br>

<br>

Slide 19: Dynamic Configuration<br>

<br>

There is some scope for supporting hot-plugging interfaces. When<br>

adding an interface, enable then push. When removing an interface,<br>

push then disable.<br>
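<br>
The ordering can be sketched as below. This is illustrative Python with a hypothetical Node class and push callback. My reading of the ordering, which is an assumption, is that peers should never hold a NID in their list that is not enabled on the node.<br>
<br>

```python
class Node:
    """Tracks the enabled interfaces and pushes the list to peers."""

    def __init__(self, push):
        self.active = []      # NIDs of the currently enabled interfaces
        self._push = push     # callable that sends an interface list to peers

    def add_interface(self, nid):
        # Enable first, so the NID is usable by the time peers learn of it.
        self.active.append(nid)
        self._push(list(self.active))

    def remove_interface(self, nid):
        # Push the reduced list first, so peers stop using the NID
        # while it is still enabled; only then disable it.
        remaining = [n for n in self.active if n != nid]
        self._push(remaining)
        self.active = remaining
```

<br>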

<br>

Note that removing the interface with the NID by which a node is known<br>

to the MGS (MDS/...) might not be a good idea. If additional<br>

interfaces are present then existing connections can remain active,<br>

but establishing new ones becomes a problem.<br>

<br>

* This is a weakness of this proposal.<br>

<br>

<br>

Slide 20: Adaptable<br>

<br>

This section of the presentation expands on the third item of Slide 4.<br>

<br>

<br>

Slide 21: Interface Selection<br>

<br>

Selecting a local interface to send from, and a peer interface to send<br>

to can use a number of rules.<br>

<br>

- Direct connection preferred: by default, don't go through an LNet<br>

   router unless there is no other path. Note that today an LNet router<br>

   will refuse to forward traffic if it believes there is a direct<br>

   connection between the node and the peer.<br>

<br>

- LNet network type: since using TCP only is the default, it also<br>

   makes sense to have a default rule that if a non-TCP network has been<br>

   configured, then that network should be used first. (As with all such<br>

   rules, it must be possible to override this default.)<br>

<br>

- NUMA criteria: pick a local interface that (i) can reach the peer,<br>

   (ii) is close to the memory used for the I/O, and (iii) close to the<br>

   CPU driving the I/O.<br>

<br>

- Local credits: pick a local interface depending on the availability<br>

   of credits. Credits are a useful indicator for how busy an interface<br>

   is. Systematically choosing the interface with the most available<br>

   credits should get you something resembling a round-robin<br>

   strategy. And this can even be used to balance across heterogeneous<br>

   interfaces/fabrics.<br>

<br>

- Peer credits: pick a peer interface depending on the availability of<br>

   peer credits. Then pick a local interface that connects to this peer<br>

   interface.<br>

<br>

- Other criteria, namely...<br>

<br>

<br>
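As a rough illustration of how some of the rules for local interface selection could combine, here is a sketch in Python. The order in which the rules are applied (non-TCP first, then NUMA distance, then available credits), the LocalInterface fields, and the numa_distance metric are assumptions, not part of the proposal.<br>
<br>

```python
from dataclasses import dataclass


@dataclass
class LocalInterface:
    nid: str
    net_type: str        # for example "tcp" or "o2ib"
    credits: int         # send credits currently available
    numa_distance: int   # hypothetical distance to the memory used for the I/O


def select_local(interfaces):
    """Prefer non-TCP, then NUMA-close, then the most available credits."""
    if not interfaces:
        raise ValueError("no interface can reach the peer")
    return min(
        interfaces,
        key=lambda i: (i.net_type == "tcp", i.numa_distance, -i.credits),
    )


# Two equivalent interfaces: each send consumes a credit, so the choice
# alternates between them, which resembles round-robin.
a = LocalInterface("a", "o2ib", credits=8, numa_distance=0)
b = LocalInterface("b", "o2ib", credits=8, numa_distance=0)
order = []
for _ in range(4):
    chosen = select_local([a, b])
    chosen.credits -= 1
    order.append(chosen.nid)
```

<br>
In this toy run credits are never returned, so order ends up alternating between the two interfaces. With real traffic the credits come back as sends complete, and a busier or slower interface would be picked less often.<br>
<br>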

Slide 22: Routing Enhancements<br>

<br>

The fabric connecting nodes in a cluster can have a complicated<br>

topology. So there can be cases where a node has two interfaces N1,N2, and<br>

a peer has two interfaces P1,P2, all on the same LNet, yet N1-P1 and<br>

N2-P2 are preferred paths, while N1-P2 and N2-P1 should be avoided.<br>

<br>

So there should be ways to define preferred point-to-point connections<br>

within an LNet. This solves the N1-P1 problem mentioned above.<br>

<br>

There also need to be ways to define a preference for using one LNet<br>

over another, possibly for a subset of NIDs. This is the mechanism by<br>

which the "anything but TCP" default can be overruled.<br>

<br>

The existing syntax for LNet routing can easily(?) be extended to<br>

cover these cases.<br>
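<br>
The N1-P1 example can be expressed as a ranking of interface pairs. This is a sketch in Python. The rank_pairs function and the set of preferred pairs are hypothetical, and it says nothing about what the extended routing syntax would look like.<br>
<br>

```python
from itertools import product


def rank_pairs(local_nids, peer_nids, preferred):
    """Return all (local, peer) pairs, preferred pairs first."""
    pairs = product(local_nids, peer_nids)
    # sorted() is stable, so pairs keep their original order within each group.
    return sorted(pairs, key=lambda pair: pair not in preferred)


ranked = rank_pairs(
    ["N1", "N2"],
    ["P1", "P2"],
    preferred={("N1", "P1"), ("N2", "P2")},
)
```

<br>
The non-preferred pairs stay in the list as a last resort rather than being forbidden outright.<br>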

<br>

<br>

Slide 23: Extra Considerations<br>

<br>

As you may have noticed, I'm looking for ways to be NUMA friendly. But<br>

one thing I want to avoid is having Lustre nodes know too much about<br>

the topology of their peers. How much is too much? I draw the line at<br>

them knowing anything at all.<br>

<br>

At the PortalRPC level each RPC is a request/response pair. (This in<br>

contrast to the LNet level put/ack and get/reply pairs that make up<br>

the request and the response.)<br>

<br>

The PortalRPC layer is told the originating interface of a request. It<br>

then sends the response to that same interface. The node sending the<br>

request is usually a client -- especially when a large data transfer is<br>

involved -- and this is a simple way to ensure that whatever NUMA-aware<br>

policies it used to select the originating interface are also honored<br>

when the response arrives.<br>

<br>

<br>

Slide 24: Extra Considerations<br>

<br>

If for some reason the peer cannot send a message to the originating<br>

interface, then any other interface will do. This is an event worth<br>

logging, as it indicates a malfunction somewhere, and after that just<br>

keeping the cluster going should be the prime concern.<br>

<br>

Trying all local-remote interface pairs might not be a good idea:<br>

there can be too many combinations and the cumulative timeouts become<br>

a problem.<br>
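<br>
A sketch of this reply policy in Python: try the originating interface first, fall back to a bounded number of other interfaces, and log when the fallback is used. The function names, the send callback, and the max_attempts limit are assumptions for illustration.<br>
<br>

```python
import logging

log = logging.getLogger("lnet_sketch")


def reply_targets(origin_nid, peer_nids, max_attempts=2):
    """Originating NID first, then other NIDs, capped at max_attempts."""
    others = [n for n in peer_nids if n != origin_nid]
    return ([origin_nid] + others)[:max_attempts]


def send_reply(origin_nid, peer_nids, send, max_attempts=2):
    """Return the NID the reply went to, or None if every attempt failed."""
    for nid in reply_targets(origin_nid, peer_nids, max_attempts):
        if send(nid):
            if nid != origin_nid:
                # Worth logging: it points at a malfunction somewhere.
                log.warning("reply for %s diverted to %s", origin_nid, nid)
            return nid
    return None
```

<br>
Capping the number of attempts keeps the cumulative timeout bounded, at the price of giving up while some untried interface might still have worked.<br>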

<br>

To avoid timeouts at the PortalRPC level, LNet may already need to<br>

start resending a message long before the "official" below-LND-level<br>

timeout for message arrival has expired.<br>

<br>

The added network resiliency is limited. As noted for Slide 19, if the<br>

interface that fails has the NID by which a node is primarily<br>

known, establishing new connections to that node becomes impossible.<br>

<br>

<br>

Slide 25: Extra Considerations<br>

<br>

Failing nodes can be used to construct some very creative<br>

scenarios. For example, if a peer reboots with downrev software, LNet on<br>

a node will not be able to tell by itself. But in this case the<br>

PortalRPC layer can signal to LNet that it needs to re-check the peer.<br>

<br>

NID reuse by different nodes is also a scenario that introduces a lot of<br>

complications. (Arguably it does do this already today.)<br>

<br>

If needed, it might be possible to sneak a 32 bit identifying cookie<br>

into the NID each node reports on the loopback network. Whether this<br>

would actually be useful (and for that matter how such cookies would<br>

be assigned) is not clear.<br>

<br>

<br>

Slide 26: LNet-level Implementation<br>

<br>

This section of the presentation expands on the fourth item of Slide 4.<br>

<br>

<br>

Slides 27-29: Implementation Notes<br>

<br>

A staccato of notes on how to implement bits and pieces of the above.<br>

There's too much text in the slides already, so I'm not paraphrasing.<br>

<br>

<br>

Slide 30: Implementation Notes<br>

<br>

This slide gives a plausible way to cut the work into smaller pieces<br>

that can be implemented as self-contained bits.<br>

<br>

    1) Split lnet_ni<br>

    2) Local interface selection<br>

    *) Routing enhancements for local interface selection<br>

    3) Split lnet_peer<br>

    4) Ping on connect<br>

    5) Implement push<br>

    6) Peer interface selection<br>

    7) Resending on failure<br>

    8) Routing enhancements<br>

<br>

There's of course no guarantee that this division will survive the<br>

actual coding. But if it does, then note that after step 2 is<br>

implemented, the configuration of Slide 11 (single multi-rail node)<br>

should already be working.<br>

<br>

<br>

Slide 31: Feedback & Discussion<br>

<br>

Looking forward to further feedback & discussion here.<br>

<br>

<br>

Slide 32:<br>

<br>

End title - also boring.<span class="HOEnZb"><font color="#888888"><br>

<br>

<br>

Olaf<br>

<br>

-- <br>

Olaf Weber                 SGI               Phone:  <a href="tel:%2B31%280%2930-6696796" value="+31306696796" target="_blank">+31(0)30-6696796</a><br>

                           Veldzigt 2b       Fax:    <a href="tel:%2B31%280%2930-6696799" value="+31306696799" target="_blank">+31(0)30-6696799</a><br>

Sr Software Engineer       3454 PW de Meern  Vnet:   955-6796<br>

Storage Software           The Netherlands   Email:  <a href="mailto:olaf@sgi.com" target="_blank">olaf@sgi.com</a><br>

_______________________________________________<br>

lustre-devel mailing list<br>

<a href="mailto:lustre-devel@lists.lustre.org" target="_blank">lustre-devel@lists.lustre.org</a><br>

<a href="http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org" rel="noreferrer" target="_blank">http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org</a><br>

</font></span></blockquote></div><br></div>