[lustre-devel] Multi-rail networking for Lustre
Olaf Weber
olaf at sgi.com
Fri Jan 22 06:31:14 PST 2016
On 22-01-16 10:08, Alexey Lyashkov wrote:
>
>
> On Thu, Jan 21, 2016 at 11:30 PM, Olaf Weber <olaf at sgi.com
> <mailto:olaf at sgi.com>> wrote:
>
> On 21-01-16 20:16, Alexey Lyashkov wrote:
[...]
> In Lustre terms each mount point is a separate client. It has its own
> cache and its own structures, and is completely separated from the others.
> One exception is the LDLM cache, which lives in a global object ID space.
Another exception is flock deadlock detection, which is always a global
operation. This is why ldlm_flock_deadlock() inspects c_peer.nid.
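The reason the peer NID matters here can be sketched in a few lines: a lock owner is only unambiguous as a (pid, nid) pair, and deadlock detection follows "waits for" links looking for a cycle back to the requester. The structure and names below are invented for illustration; this is not the actual ldlm_flock_deadlock() code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model: each blocked owner is identified by (pid, nid),
 * which is why the peer NID (c_peer.nid) must be part of the check --
 * the same pid can exist on many client nodes. */
struct flock_owner {
	unsigned pid;
	unsigned long long nid;              /* peer NID */
	const struct flock_owner *waits_for; /* owner of the blocking lock */
};

static bool owners_equal(const struct flock_owner *a,
			 const struct flock_owner *b)
{
	return a->pid == b->pid && a->nid == b->nid;
}

/* Deadlock if following the blocked-on chain returns to the requester.
 * The hop cap guards against cycles that do not include the requester. */
static bool flock_deadlock(const struct flock_owner *req)
{
	const struct flock_owner *cur = req->waits_for;
	int hops = 0;

	while (cur != NULL && hops++ < 1024) {
		if (owners_equal(cur, req))
			return true;
		cur = cur->waits_for;
	}
	return false;
}
```

Note that two owners with the same pid but different NIDs are distinct, which is exactly what makes the operation global.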
[...]
> The whole Lustre stack operates on UUIDs, and it makes no difference
> where a UUID lives. We can migrate a service/client from one network
> address to another without a logical reconnect. That is my main objection
> to your ideas. If a node has several addresses, LNet should be responsible
> for reliably delivering the one-way requests that logically connect to
> PtlRPC. If a node needs to use different routing and different NIDs for
> communication, that should be hidden in LNet, and LNet should provide as
> high-level an API as possible.
The basic idea behind the multi-rail design is that LNet figures out how to
send a message to a peer. But the user of LNet can provide a hint to
indicate that for a specific message a specific path is preferred.
One of our goals is to keep changes to the LNet API small.
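The "LNet decides, the caller may hint" idea can be sketched without touching the LNet API surface at all. The names below (mr_peer, mr_select_nid) are invented for this sketch and are not the actual multi-rail implementation: a hint is honoured only when it names a known interface of the peer, otherwise selection falls back to round-robin.

```c
#include <stddef.h>

typedef unsigned long long lnet_nid_t;   /* stand-in for the real type */

/* Hypothetical per-peer state: the NIDs the peer is reachable on,
 * plus a round-robin cursor. */
struct mr_peer {
	lnet_nid_t nids[4];
	int        nid_count;
	int        next;
};

/* Pick a destination NID for one message.  A hint of 0 means "no
 * preference"; a hint that does not match any known peer NID is
 * ignored rather than trusted. */
static lnet_nid_t mr_select_nid(struct mr_peer *p, lnet_nid_t hint)
{
	int i;

	for (i = 0; i < p->nid_count; i++)
		if (hint != 0 && p->nids[i] == hint)
			return hint;

	i = p->next % p->nid_count;
	p->next++;
	return p->nids[i];
}
```

The point of the design is visible in the signature: the caller passes at most one optional NID, and everything else stays inside LNet.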
> I expect you know about the situation where one DNS name has several
> addresses, like several 'A' records in a DNS zone file.
>
>
> Sure, but when one name points to several machines, it does not help me
> balance traffic over the interfaces of just one machine.
>
>
> Simple balancing may be DNS based - just round robin, as we have now in
> the IB/socket LNDs. Isn't that balancing?
> If you want something more serious, you should start from good flow
> control between nodes. Ideas from the RIP and LACP protocols will
> probably help.
There is bonding/balancing in socklnd. There is none in o2iblnd.
[...]
> A PtlRPC RPC has structure. The first LNetPut() transmits just the
> header information. Then one or more LNetPut() or LNetGet() messages are
> done to transmit the rest of the request. Then the response follows,
> which also consists of several LNetPut() or LNetGet() messages.
>
> That's wrong. It looks like you are mixing up RPCs and bulk transfers.
Difference in terminology: I tend to think of an RPC as a request/response
pair (if there is a response), and these in turn include all traffic related
to the RPC, including any bulk transfers.
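Under that terminology, the message ordering described above (header put first, then zero or more bulk puts/gets, then one or more response messages) behaves like a small state machine. The enum values and helper below are invented for this sketch:

```c
/* Illustrative states for the LNet traffic making up one PtlRPC RPC:
 * the first LNetPut() carries the header, then bulk LNetPut()/LNetGet()
 * messages, then the response messages. */
enum rpc_msg   { MSG_HEADER, MSG_BULK, MSG_REPLY };
enum rpc_state { ST_START, ST_HEADER_SENT, ST_REPLYING, ST_BAD };

static enum rpc_state rpc_step(enum rpc_state s, enum rpc_msg m)
{
	switch (s) {
	case ST_START:
		/* The header put must come first. */
		return m == MSG_HEADER ? ST_HEADER_SENT : ST_BAD;
	case ST_HEADER_SENT:
		if (m == MSG_BULK)
			return ST_HEADER_SENT;   /* zero or more bulk msgs */
		if (m == MSG_REPLY)
			return ST_REPLYING;
		return ST_BAD;
	case ST_REPLYING:
		/* The response itself may span several messages. */
		return m == MSG_REPLY ? ST_REPLYING : ST_BAD;
	default:
		return ST_BAD;
	}
}
```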
[...]
> The lustre_uuid_to_peer() function enumerates all NIDs associated with
> the UUID. This includes the primary NID, but also includes the other
> NIDs. So we find a preferred peer NID based on that. Then we modify the
> code like this:
>
> Why should PtlRPC know those low-level details? Currently we have a
> problem: when one of the destination NIDs is unreachable, the transfer
> initiator needs a full PtlRPC reconnect to resend to a different NID.
> But for that you should have a resend mechanism.
Within LNet a resend can be triggered from lnet_finalize() after a failed
attempt to send the message has been decommitted. (Otherwise multiple send
attempts will need to be tracked at the same time.)
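The decommit-before-resend rule keeps the bookkeeping simple: at any moment at most one attempt is committed. A minimal model of that invariant, with invented names and structure (not the actual lnet_finalize() code), looks like this:

```c
#include <stdbool.h>

/* Hypothetical per-message state: "committed" means the attempt is
 * counted against an in-flight credit/slot. */
struct msg {
	bool committed;
	int  attempts;
};

/* Modelled on the rule above: the failed attempt is decommitted
 * first, and only then is a fresh attempt committed, so two attempts
 * never need to be tracked at the same time.  Returns true when a
 * resend was issued. */
static bool msg_finalize(struct msg *m, bool send_failed)
{
	m->committed = false;            /* decommit the finished attempt */
	if (send_failed) {
		m->attempts++;
		m->committed = true;     /* commit exactly one new attempt */
		return true;
	}
	return false;                    /* success: message is done */
}
```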
> The call of LNetPrimaryNID() gives the primary peer NID for the peer
> NID. For this to work a handful of calls to LNetPrimaryNID() must be
> added. After that it is up to LNet to find the best route.
>
>
> Per our comment, the primary NID will change after we find the best one.
> Do you think the loop is useful if you replace the loop result in any
> case? From another point of view, ptlrpc_uuid_to_peer() is called only in
> a few cases; the rest of the time PtlRPC has the results cached in the
> PtlRPC connection info.
The main benefit of the loop becomes detecting whether the node is sending
to itself, in which case the loopback interface must be used. Though I do
worry about degenerate or bad configurations where not all the IP addresses
belong to the same node.
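That self-send check reduces to comparing the peer's NIDs against the local NIDs. The sketch below uses invented names and a stand-in loopback NID value (the real constant is different); it shows the shape of the loop, including why a bad configuration where only some of the peer's addresses are local is awkward, since a single match already forces the loopback path.

```c
#include <stddef.h>

typedef unsigned long long lnet_nid_t;   /* stand-in for the real type */

#define LOOPBACK_NID 0ULL                /* illustrative loopback value */

/* If any NID of the destination peer is also a local NID, the node is
 * sending to itself and the loopback interface must be used; otherwise
 * leave the choice of path to LNet. */
static lnet_nid_t pick_dst_nid(const lnet_nid_t *peer_nids, size_t npeer,
			       const lnet_nid_t *local_nids, size_t nlocal)
{
	size_t i, j;

	for (i = 0; i < npeer; i++)
		for (j = 0; j < nlocal; j++)
			if (peer_nids[i] == local_nids[j])
				return LOOPBACK_NID;

	return peer_nids[0];             /* LNet routes the rest */
}
```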
--
Olaf Weber
Sr Software Engineer, Storage Software
SGI, Veldzigt 2b, 3454 PW de Meern, The Netherlands
Phone: +31(0)30-6696796  Fax: +31(0)30-6696799  Vnet: 955-6796
Email: olaf at sgi.com