<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jan 22, 2016 at 5:31 PM, Olaf Weber <span dir="ltr"><<a href="mailto:olaf@sgi.com" target="_blank">olaf@sgi.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class="">On 22-01-16 10:08, Alexey Lyashkov wrote:<br>

</span><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class="">

<br>

<br>

On Thu, Jan 21, 2016 at 11:30 PM, Olaf Weber <<a href="mailto:olaf@sgi.com" target="_blank">olaf@sgi.com</a>> wrote:<br></span><span class="">

<br>

    On 21-01-16 20:16, Alexey Lyashkov wrote:<br>

</span></blockquote>

<br>

[...]<span class=""><br>

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

In Lustre terms, each mount point is a separate client. It has its own cache and its own<br>

structures, and is completely separate from every other mount point.<br>

The one exception is the ldlm cache, which lives in a global object ID space.<br>

</blockquote>

<br></span>

Another exception is flock deadlock detection, which is always a global operation. This is why ldlm_flock_deadlock() inspects c_peer.nid.<br>

<br></blockquote><div>flock is part of ldlm.</div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

[...]<span class=""><br>

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

The whole Lustre stack operates on UUIDs, and it makes no difference where a<br>

UUID lives. We may migrate a service or client from one network address to<br>

another without a logical reconnect. This is my main objection to your ideas.<br>

If a node has several addresses, LNet should be responsible for reliable<br>

delivery of one-way requests, which are logically connected to PtlRPC. If a node<br>

needs to use different routing and different NIDs for communication,<br>

that should be hidden in LNet, and LNet should provide as high-level an API as possible.<br>

</blockquote>

<br></span>

The basic idea behind the multi-rail design is that LNet figures out how to send a message to a peer. But the user of LNet can provide a hint to indicate that for a specific message a specific path is preferred.<br>

<br></blockquote><div>It's a good idea, but routing is part of that idea. Routing may change at any time, but PtlRPC should avoid resending a request, because a resend needs a full logical reconnect, and that operation is neither fast nor cheap for a server.</div><div>What hint do you want from PtlRPC? It has a single responsibility: send some buffer to some UUID. </div><div>PtlRPC (or some upper-layer code) may send a QoS hint, and perhaps a hint about local memory to avoid access from another NUMA node. What else?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

One of our goals is to keep changes to the LNet API small.</blockquote><div><br></div><div>And to avoid that, you add one more conversion: UUID -> abstract NID -> ... real NID,</div><div>while a direct conversion UUID -> .. real NID is possible and gives a better result when you need to hide network topology changes.</div><div><br></div><div><br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class=""><br>
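</span></blockquote><div>A minimal sketch of the direct UUID -> real NID conversion argued for above, with address changes hidden below the caller. Everything in it is hypothetical and for illustration only: <code>PeerTable</code>, its methods, and the UUID and NID values are assumptions, not Lustre or LNet APIs.</div>

```python
from typing import Dict, List, Set


class PeerTable:
    """Hypothetical table: one UUID maps to every NID the peer can be reached on."""

    def __init__(self) -> None:
        self.nids_by_uuid: Dict[str, List[str]] = {}
        self.unreachable: Set[str] = set()

    def add_peer(self, uuid: str, nids: List[str]) -> None:
        # NIDs are kept in preference order, fastest first (an assumption).
        self.nids_by_uuid[uuid] = list(nids)

    def mark_unreachable(self, nid: str) -> None:
        # A topology change is recorded here, below the caller.
        self.unreachable.add(nid)

    def resolve(self, uuid: str) -> str:
        """Direct UUID -> real NID: return the first NID that is still reachable."""
        for nid in self.nids_by_uuid.get(uuid, []):
            if nid not in self.unreachable:
                return nid
        raise LookupError("no reachable NID for " + uuid)


table = PeerTable()
table.add_peer("OST0001_UUID", ["10.0.0.1@o2ib", "10.0.1.1@tcp"])
first = table.resolve("OST0001_UUID")      # preferred NID
table.mark_unreachable("10.0.0.1@o2ib")    # the preferred path goes away
second = table.resolve("OST0001_UUID")     # same UUID, different NID
print(first, second)
```

<div>In this sketch the caller keeps addressing the same UUID while the NID underneath it changes, so nothing above the table has to reconnect.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class="">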

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

        I expect you know about the situation where one DNS name has several<br>

        addresses,<br>

        like several 'A' records in a DNS zone file.<br>

<br>

<br>

    Sure, but when one name points to several machines, it does not help me<br>

    balance traffic over the interfaces of just one machine.<br>

<br>

<br>

Simple balancing may be DNS-based: just round robin, as we have now on IB /<br>

sock LND. Isn't that balancing?<br>

</blockquote>

> If you are talking about something more serious, you should start with good flow control<br>

> between nodes. Ideas from the RIP and LACK protocols will probably help.<br>

<br></span>

There is bonding/balancing in socklnd. There is none in o2iblnd.<br>

<br></blockquote><div>You are one of the inspectors for <a href="http://review.whamcloud.com/#/c/14625/">http://review.whamcloud.com/#/c/14625/</a>, and I don't see any architectural objections there, apart from requests to fix some bugs in the code. So I may say that it exists, and Fujitsu uses it in production (it was presented at the LAD devel summit).</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

[...]<span class=""><br>

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

    A PtlRPC RPC has structure. The first LNetPut() transmits just the<br>

    header information. Then one or more LNetPut() or LNetGet() messages are<br>

    done to transmit the rest of the request. Then the response follows,<br>

    which also consists of several LNetPut() or LNetGet() messages.<br>

<br>

That's wrong. It looks like you are mixing up RPCs and bulk transfers.<br>

</blockquote>

<br></span>

Difference in terminology: I tend to think of an RPC as a request/response pair (if there is a response), and these in turn include all traffic related to the RPC, including any bulk transfers.</blockquote><div><br></div><div>"The first LNetPut() transmits just the header information." What do you mean by "header"? If we are talking about the bulk transfer protocol, the first LNetPut() transfers an RPC body, which is used to set up the bulk transfer.<br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

[...]<span class=""><br>

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

    The lustre_uuid_to_peer() function enumerates all NIDs associated with<br>

    the UUID. This includes the primary NID, but also includes the other<br>

    NIDs. So we find a preferred peer NID based on that. Then we modify the<br>

    code like this:<br>

<br>

Why should PtlRPC know these low-level details? Currently we have a<br>

problem: when one of the destination NIDs is unreachable, the transfer<br>

initiator needs a full PtlRPC reconnect to resend to a different NID. But as<br>

you should be have a resend<br>

</blockquote>

<br></span>

Within LNet a resend can be triggered from lnet_finalize() after a failed attempt to send the message has been decommitted. (Otherwise multiple send attempts will need to be tracked at the same time.)</blockquote><div>PtlRPC and LNet may currently have different timeout windows, so the PtlRPC timeout can be smaller than the LNet LND timeout.</div><div>In that case you will have a lot of TXs in flight, since a new TX is allocated each time. That already is tracking multiple send attempts at the same time.</div><div>It is not good from a request latency perspective, but for now it is an easy way to change the routing.</div><div>With a single "Primary NID" you will lack this functionality or need to implement something similar.</div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class=""><br>
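</span></blockquote><div>A toy model of the timeout mismatch described above. It assumes, hypothetically, that the upper layer resends every <code>rpc_timeout</code> seconds, that each resend allocates a new TX, that each TX is held until its own <code>lnd_timeout</code> expires, and that the peer never answers. The function name and the numbers are illustrative only and are not taken from PtlRPC or any LND.</div>

```python
def tx_in_flight(rpc_timeout: int, lnd_timeout: int, now: int) -> int:
    """Count TXs still held at time `now` in the toy model (times in seconds)."""
    if rpc_timeout <= 0:
        raise ValueError("rpc_timeout must be positive")
    held = 0
    start = 0
    while start <= now:
        # The TX allocated at `start` is released at start + lnd_timeout.
        if now < start + lnd_timeout:
            held += 1
        # The next resend allocates another TX one rpc_timeout later.
        start += rpc_timeout
    return held


# Upper-layer timeout shorter than the LND timeout: attempts pile up.
short_rpc = tx_in_flight(rpc_timeout=10, lnd_timeout=50, now=100)
# Upper-layer timeout longer than the LND timeout: one attempt at a time.
long_rpc = tx_in_flight(rpc_timeout=50, lnd_timeout=10, now=100)
print(short_rpc, long_rpc)
```

<div>Under these assumptions the first call counts several TXs for one request and the second counts only one, so the shorter upper-layer timeout already means tracking several send attempts at once.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class="">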

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

    The call of LNetPrimaryNID() gives the primary peer NID for the peer<br>

    NID. For this to work a handful of calls to LNetPrimaryNID() must be<br>

    added. After that it is up to LNet to find the best route.<br>

<br>

<br>

Per your comment, the PrimaryNID will be changed after we find the best one. Do you<br>

think this loop is useful if you replace the loop result in any case?<br>

On the other hand, ptlrpc_uuid_to_peer is called only in a few cases; the rest of the time<br>

ptlrpc has the results cached in the ptlrpc connection info.<br>

</blockquote>

<br></span>

The main benefit of the loop becomes detecting whether the node is sending to itself, in which case the loopback interface must be used. Though I do worry about degenerate or bad configurations where not all the IP addresses belong to the same node.<div class=""><div class="h5"><br></div></div></blockquote><div>You can work without the loopback driver; this was checked with socklnd and works fine. But currently this loop is used to sort the connections in the right order, with fast connections on top. It assumes, however, that the network topology does not change at run time.</div><div><br></div><div><br></div></div><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr">Alexey Lyashkov <strong>·</strong> Technical lead for a Morpheus team<br>

Seagate Technology, LLC<br>

<a href="http://www.seagate.com" target="_blank">www.seagate.com</a><br><div><a href="http://www.lustre.org" target="_blank">www.lustre.org</a></div></div></div>

</div></div>