[lustre-discuss] Lustre routing help needed

LOPEZ, ALEXANDRE alexandre.lopez at atos.net
Mon Oct 30 06:09:33 PDT 2017


Hi Kevin,

Just wild-guessing here. Have you tried playing with the live_router_check_interval, dead_router_check_interval and router_ping_timeout LNet parameters?
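
As a rough sketch of where those would go: as far as I know the router checker runs on the nodes that have routes configured, so the parameters would be added to the existing lnet options line on the client (and likewise on the servers). The /etc/modprobe.d path and the values below are assumptions for illustration, not tested recommendations. The intervals and the timeout are in seconds.

```
# /etc/modprobe.d/lnet.conf on the client (path assumed; values are illustrative only)
# live_router_check_interval: how often routers currently marked up are pinged
# dead_router_check_interval: how often routers currently marked down are re-checked
# router_ping_timeout: how long to wait for a ping reply before marking a router down
options lnet networks="tcp2(p4p1.104)" routes="o2ib1 10.10.104.[2-3]@tcp2" live_router_check_interval=60 dead_router_check_interval=10 router_ping_timeout=60
```

The lnet module has to be reloaded for a changed options line to take effect.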

HTH,
Alejandro

From: lustre-discuss [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Kevin M. Hildebrand
Sent: Monday, October 30, 2017 1:47 PM
To: lustre-discuss at lists.lustre.org
Subject: [lustre-discuss] Lustre routing help needed

Hello, I'm trying to set up some new Lustre routers between a set of InfiniBand-connected Lustre servers and a few hosts connected to an external 100G Ethernet network. The problem I'm having is that the routers work just fine for a minute or two, and then they're marked as 'down' and all traffic stops. If I unload and reload the Lustre modules on the router, it works again for a short time and then stops again. The router shows errors like:
[236528.801275] LNetError: 54389:0:(lib-move.c:2120:lnet_parse_get()) 10.10.104.2@tcp2: Unable to send REPLY for GET from 12345-10.10.104.201@tcp2: -113
My Lustre router has a Mellanox ConnectX-3 interface connecting to the Lustre servers, and a Mellanox ConnectX-5 100G interface connecting to a 100G switch to which my test client is connected. On the InfiniBand side, I've got lnet configured as o2ib1, and on the Ethernet side, as tcp2.

Clients and servers are all running Lustre 2.8.  The Lustre router at the moment is running Lustre 2.10.1, because of software dependencies to support the 100G card.

I've verified that I have stable network connectivity on both the IB and Ethernet sides.

At the moment, I have very simple lnet configurations, using the built-in defaults. lnet.conf on the server:
options lnet ip2nets="o2ib1(ib0) 192.168.[64-95].*; tcp1 10.103.[128-159].*" routes="tcp0 192.168.64.[78-79]@o2ib1; tcp2 192.168.64.[78-79]@o2ib1"

On the lustre router:
options lnet networks="o2ib1(ib0),tcp2(p1p1.104)" "forwarding=enabled"

And on the client:
options lnet networks="tcp2(p4p1.104)" routes="o2ib1 10.10.104.[2-3]@tcp2"

All of the hosts (client, server, router) have the following in ko2iblnd.conf:

alias ko2iblnd-opa ko2iblnd
options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

install ko2iblnd /usr/sbin/ko2iblnd-probe


Does anyone see anything I've missed, or have any thoughts on where I should look next?

Thanks,
Kevin

--
Kevin Hildebrand
University of Maryland, College Park