[lustre-discuss] avoid_asym_router_failure

Matt Rásó-Barnett matt at rasobarnett.com
Wed Jun 27 04:29:00 PDT 2018


Hi all,

I just experienced our first asymmetric router failure today: one 
interface was down on a subset of our LNET routers, but only the clients 
connected directly via that interface detected it. Our Lustre servers, 
which reach the routers via a different, functional interface, didn't 
mark these routers as down. I thought our configuration mitigated this 
problem, since all our clients and servers set the LNET module parameter:

avoid_asym_router_failure=1

As I understand it, the router pinger on our clients and servers should 
detect that one of the router's NIDs is down in this scenario and then 
mark the router down.
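
To make that expectation concrete, here is a toy sketch in Python of the 
behaviour I was expecting. It is only a model of my understanding, not 
LNet's actual implementation: the RouterView structure, the 
expect_router_down function and the example NID statuses are all made up 
for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RouterView:
    """What a pinging node knows about one router (hypothetical structure)."""
    reachable: bool                                  # did the router answer the ping?
    nid_status: dict = field(default_factory=dict)   # NID -> "up" or "down"


def expect_router_down(view, avoid_asym_router_failure):
    """Return True if, by my understanding, the router should be marked down."""
    if not view.reachable:
        # A router that does not answer at all is down either way.
        return True
    if not avoid_asym_router_failure:
        # Without the parameter, only reachability on the local side counts.
        return False
    # With the parameter, any interface reported down should take the router out.
    return any(status == "down" for status in view.nid_status.values())


# Illustrative case: reachable from the servers on o2ib1, far side down.
router = RouterView(
    reachable=True,
    nid_status={"10.47.240.161@o2ib1": "up", "10.44.240.161@o2ib2": "down"},
)
print(expect_router_down(router, avoid_asym_router_failure=True))
```

In this model the servers would still reach the router on o2ib1, yet 
should mark it down because its other interface is reported down.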

Does this parameter also need to be set on the routers to be effective? 
From my understanding of what the parameter does, that doesn't seem 
necessary.

In my case, the router's IB interface was showing state 'DOWN' in 
ibstatus and wasn't flapping, so I would have expected the router pinger 
on the servers to detect this.

My lnet configurations are below for further information:

-------------------------------------------
Servers
-------
lustre-2.7.21 (Not quite the latest IEEL 3.X release)

options lnet networks="o2ib1(ib0)" routes="tcp2 1 
10.47.240.[161-168]@o2ib1; tcp4 1 10.47.240.[161-168]@o2ib1; o2ib0 1 
10.47.240.[165-168]@o2ib1; o2ib2 1 10.47.240.[161-168]@o2ib1" 
auto_down=1 avoid_asym_router_failure=1 check_routers_before_use=1 
dead_router_check_interval=60 live_router_check_interval=60 
router_ping_timeout=60


Routers
-------
lustre-2.7.21

options lnet networks="o2ib1(ib0), o2ib2(ib1), tcp2(em1.43), 
tcp4(em1.40)" 


Clients
-------
lustre-client-2.10.3-1

options lnet networks=o2ib2(ib0) routes="o2ib1 1 
10.44.240.[161-168]@o2ib2; o2ib0 1 10.44.240.[165-168]@o2ib2" 
auto_down=1 avoid_asym_router_failure=1 check_routers_before_use=1 
dead_router_check_interval=60 live_router_check_interval=60 
router_ping_timeout=60

-------------------------------------------
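
For anyone wanting to check the gateways by hand, here is a rough Python 
sketch that expands the bracketed ranges in the server routes string 
above into individual gateway NIDs and prints one 'lctl ping' command per 
NID. The helper names (expand_nids, gateways_by_network) are my own, and 
the parser only handles the simple "<net> <hops> <gateway>" entries with 
a single [lo-hi] range used in this post, not the full LNet routes syntax.

```python
import re

# Routes string copied from the server configuration above.
SERVER_ROUTES = (
    "tcp2 1 10.47.240.[161-168]@o2ib1; tcp4 1 10.47.240.[161-168]@o2ib1; "
    "o2ib0 1 10.47.240.[165-168]@o2ib1; o2ib2 1 10.47.240.[161-168]@o2ib1"
)

RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_nids(pattern):
    """Expand a single [lo-hi] range in a NID pattern into individual NIDs."""
    match = RANGE.search(pattern)
    if match is None:
        return [pattern]
    lo, hi = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [prefix + str(n) + suffix for n in range(lo, hi + 1)]


def gateways_by_network(routes):
    """Map each remote network to the gateway NIDs listed for it."""
    result = {}
    for entry in routes.split(";"):
        fields = entry.split()
        if not fields:
            continue
        # Assumed entry form: "<remote net> <hops> <gateway pattern>".
        net, _hops, pattern = fields
        result.setdefault(net, []).extend(expand_nids(pattern))
    return result


if __name__ == "__main__":
    gateways = gateways_by_network(SERVER_ROUTES)
    unique_nids = sorted({nid for nids in gateways.values() for nid in nids})
    for nid in unique_nids:
        print("lctl ping " + nid)
```

Running the printed commands from a server only tests the o2ib1 side of 
each router, so on its own it would not reveal a failure on the far side.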

Has anyone else experienced this problem before and has this parameter 
worked properly for you?

Could anyone suggest ways I could observe what the router pinger is 
doing? Would that be looking for messages in lustre debug logs - perhaps 
just with 'net' set in the debug mask?

Thanks,
Matt

-- 
Matt Rásó-Barnett
Research Computing Platforms
University Information Services
University of Cambridge

