[lustre-discuss] Lustre routing help needed

Kevin M. Hildebrand kevin at umd.edu
Mon Oct 30 10:30:31 PDT 2017


I received a reply from Alejandro suggesting that I check
live_router_check_interval, dead_router_check_interval and
router_ping_timeout.
I had those set to the defaults, which I assume are 60, 60, and 50 seconds
respectively.  I did just try setting those values explicitly, and I'm not
seeing any better behavior.
From watching /proc/sys/lnet/routers on the client, I see that the client
is indeed sending router pings every 60 seconds.  On the router itself,
watching /proc/sys/lnet/peers immediately after doing 'lctl net down; lctl
net up', I see the 'last' column for my test client count from 0 up to
around 180, at which point the client is marked 'down'.  (For the other
peers, all of which are servers, the values count from 0 to around 180 and
then reset to 0, remaining 'up'.)
Is the 'last' column reflecting the last time the router has received a
'ping' from that peer?  If so, why do the numbers count to 180 instead of
60, which is the frequency they're being sent?
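
For anyone who wants to watch the same thing without staring at the file by
hand, here is a rough sketch of how the 'last' column could be pulled out of
the peers table. It assumes the table starts with a header row that names
'nid', 'state' and 'last' columns; the function names and the 60-second
threshold are only illustrative and are not part of any Lustre tool.

```python
def parse_peers(text):
    """Return {nid: (state, last)} from the text of an LNet peers table.

    Assumption: the first non-blank line is a header naming the columns,
    including 'nid', 'state' and 'last'.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    # Locate the columns by name rather than by fixed position.
    nid_i = header.index("nid")
    state_i = header.index("state")
    last_i = header.index("last")
    peers = {}
    for ln in lines[1:]:
        cols = ln.split()
        if len(cols) <= max(nid_i, state_i, last_i):
            continue  # skip short or malformed rows
        try:
            last = int(cols[last_i])
        except ValueError:
            continue  # skip rows whose 'last' value is not an integer
        peers[cols[nid_i]] = (cols[state_i], last)
    return peers


def stale_peers(peers, ping_interval=60):
    """Return NIDs that are 'down' or whose 'last' exceeds ping_interval."""
    return sorted(
        nid
        for nid, (state, last) in peers.items()
        if state == "down" or last > ping_interval
    )
```

On the router this could be run in a loop against the contents of
/proc/sys/lnet/peers, for example
stale_peers(parse_peers(open("/proc/sys/lnet/peers").read())), to log when
a peer's 'last' value passes the ping interval.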

Thanks,
Kevin

On Mon, Oct 30, 2017 at 8:47 AM, Kevin M. Hildebrand <kevin at umd.edu> wrote:

> Hello, I'm trying to set up some new Lustre routers between a set of
> Infiniband connected Lustre servers and a few hosts connected to an
> external 100G Ethernet network.   The problem I'm having is that the
> routers work just fine for a minute or two, and then shortly thereafter
> they're marked as 'down' and all traffic stops.  If I unload/reload the
> lustre modules on the router, it'll work again for a short time and then
> stop again.  The router shows errors like:
> [236528.801275] LNetError: 54389:0:(lib-move.c:2120:lnet_parse_get())
> 10.10.104.2@tcp2: Unable to send REPLY for GET from
> 12345-10.10.104.201@tcp2: -113
>
> My Lustre router has a Mellanox ConnectX-3 interface connecting to the
> Lustre servers, and a Mellanox ConnectX-5 100G interface connecting to a
> 100G switch to which my test client is connected.  On the Infiniband
> side, I've got lnet configured as o2ib1, and on the Ethernet side, as
> tcp2.
>
> Clients and servers are all running Lustre 2.8.  The Lustre router at the
> moment is running Lustre 2.10.1, because of software dependencies to
> support the 100G card.
>
> I've verified that I have stable network connectivity on both the IB and
> Ethernet sides.
>
> At the moment, I have very simple lnet configurations, using the built in
> defaults.  lnet.conf on the server:
> options lnet ip2nets="o2ib1(ib0) 192.168.[64-95].*; tcp1
> 10.103.[128-159].*" routes="tcp0 192.168.64.[78-79]@o2ib1; tcp2
> 192.168.64.[78-79]@o2ib1"
>
> On the lustre router:
> options lnet networks="o2ib1(ib0),tcp2(p1p1.104)" "forwarding=enabled"
>
> And on the client:
> options lnet networks="tcp2(p4p1.104)" routes="o2ib1 10.10.104.[2-3]@tcp2"
>
> All of the hosts (client, server, router) have the following in
> ko2iblnd.conf:
>
> alias ko2iblnd-opa ko2iblnd
> options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024
> concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048
> fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
>
> install ko2iblnd /usr/sbin/ko2iblnd-probe
>
>
> Does anyone see anything I've missed, or have any thoughts on where I
> should look next?
>
> Thanks,
> Kevin
>
> --
> Kevin Hildebrand
> University of Maryland, College Park
>