[lustre-discuss] LNET Routing Question

Makia Minich makia at systemfabricworks.com
Mon May 28 10:35:24 PDT 2018


Thanks a ton for this information, extremely helpful.

--

Makia Minich
Principal Architect
System Fabric Works
"Fabric Computing that Works"

Mobile: (865) 964-7939
Office: (303) 335-9684

"Oh, I don't know. I think everything is just as it should be, y'know?"
- Frank Fairfield

> On May 23, 2018, at 2:06 PM, Chris Horn <hornc at cray.com> wrote:
> 
> Hello,
>  
> I agree with what others have stated: we would not expect the loss of a router to significantly affect I/O destined for filesystems served by other routers, nor would we expect I/O destined for non-routed filesystems to be affected. However, we have seen bugs in this area in the past where the loss of a remote filesystem (the servers, not the routers serving that filesystem) did affect access to other filesystems. If I recall correctly, the issue was that resources were being consumed on the routers in trying to communicate with the lost filesystem, and that resource consumption caused I/O destined for other filesystems to back up. I’m not aware of any outstanding issues like this, and I’ll stress that this sort of behavior would certainly be considered a bug. So please let us know if you see any issues.
>  
> Regarding check_routers_before_use, this parameter affects how the LNet router checker behaves upon startup. The router checker on an LNet peer works by periodically sending an LNet ping to each known router. If the peer receives a response from a router within a timeout period, that router is considered alive; otherwise it is considered dead, and routes hosted by that router are removed from the routing table (until it starts responding to the pings again). By default, all routers are initially considered to be up (alive), and all routes are immediately eligible for sends. When check_routers_before_use is enabled (set to “1”), all routers are instead initially considered down (dead), and each router must first respond to an LNet-level ping before its routes become eligible for sends.
>  
> The use of this parameter should not affect the scenarios you describe. Traffic destined for local networks is not affected by the up or down (alive or dead) states of routers.
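
[Illustrative note: the router-checker behavior described above can be sketched as a toy Python model. This is not LNet code. The class, the method names, and the network and router identifiers are all hypothetical; only the parameter name check_routers_before_use comes from the thread. The model assumes a routed network is eligible for sends when at least one of its routers is currently considered alive.]

```python
class RoutingTableModel:
    """Toy model of router liveness as described in the thread (not LNet code)."""

    def __init__(self, local_nets, routes, check_routers_before_use=False):
        # local_nets: networks this peer is attached to directly.
        # routes: maps a remote network name to the routers that serve it.
        self.local_nets = set(local_nets)
        self.routes = {net: list(routers) for net, routers in routes.items()}
        # Default: every router starts out alive. With the option enabled,
        # every router starts out dead until it answers a ping.
        initially_alive = not check_routers_before_use
        self.alive = {
            router: initially_alive
            for routers in self.routes.values()
            for router in routers
        }

    def record_ping(self, router, responded):
        # A reply within the timeout marks the router alive; no reply marks it dead.
        self.alive[router] = responded

    def can_send(self, net):
        # Traffic for a local network never consults router state.
        if net in self.local_nets:
            return True
        # Assumption: a remote network needs at least one live router.
        return any(self.alive[router] for router in self.routes.get(net, []))


# Hypothetical Ethernet client with one router toward an InfiniBand network.
routes = {"o2ib0": ["10.0.0.1@tcp0"]}
default = RoutingTableModel(["tcp0"], routes)
strict = RoutingTableModel(["tcp0"], routes, check_routers_before_use=True)
```

In this model, default.can_send("o2ib0") is true immediately, while strict.can_send("o2ib0") stays false until strict.record_ping("10.0.0.1@tcp0", True) is called. can_send("tcp0") is true in both cases regardless of router state.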
>  
> Chris Horn
>  
> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Makia Minich <makia at systemfabricworks.com>
> Date: Wednesday, May 9, 2018 at 8:51 AM
> To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
> Subject: [lustre-discuss] LNET Routing Question
>  
> Hello all,
>  
> I have an LNET routing question. I’ve attached a quick diagram of the current setup, but basically I have two core networks (one InfiniBand and one Ethernet) with a set of LNET routers in between. There are storage and clients on both sides of these routers, and all clients need to see all or most of the storage. All connections and configurations are working.
>  
> The question is: if an LNET router goes down (which does cause some amount of reconnecting or remapping for any clients attempting to use those routes), would this cause any issues or delays for a client’s connection to non-routed storage? Put slightly differently, if a job on the Ethernet clients is actively using Ethernet storage and the LNET routers go down, will the job be affected? What about a new job just launching while that LNET router is down?
>  
> In addition, what does “check_routers_before_use” actually do, and does it change the scenarios I mentioned? (E.g., if an Ethernet client has “check_routers_before_use” set, would every file request start with a ping to the routers even if it’s not leaving its core network?)
>  
> Thanks!
>  
> <image001.png>
>  
> Makia Minich
> Principal Architect
> System Fabric Works
> "Fabric Computing that Works"
> 
> "Oh, I don't know. I think everything is just as it should be, y'know?"
> - Frank Fairfield
>  
