[lustre-discuss] Lustre routing help needed

Kevin M. Hildebrand kevin at umd.edu
Mon Oct 30 05:47:20 PDT 2017


Hello, I'm trying to set up some new Lustre routers between a set of
InfiniBand-connected Lustre servers and a few hosts connected to an
external 100G Ethernet network. The problem I'm having is that the
routers work just fine for a minute or two, and then they're marked as
'down' and all traffic stops. If I unload and reload the Lustre modules
on the router, it works again for a short time and then stops again.
The router shows errors like:
[236528.801275] LNetError: 54389:0:(lib-move.c:2120:lnet_parse_get()) 10.10.104.2@tcp2: Unable to send REPLY for GET from 12345-10.10.104.201@tcp2: -113
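
For anyone reading along, the trailing number in that message is a negated kernel errno. The following is only a minimal sketch of how such a line could be picked apart: it assumes Linux errno numbering, assumes the NIDs are written in the usual address@network form, and the names decode_lnet_error, LINE_RE and LINUX_ERRNO are hypothetical helpers, not part of Lustre.

```python
import re

# Assumption: Linux errno numbering; only a few network-related values listed.
LINUX_ERRNO = {
    110: ("ETIMEDOUT", "Connection timed out"),
    111: ("ECONNREFUSED", "Connection refused"),
    113: ("EHOSTUNREACH", "No route to host"),
}

# Matches the "Unable to send REPLY for GET" form of the message only.
LINE_RE = re.compile(
    r"(?P<local>\S+@\w+): Unable to send REPLY for GET from "
    r"(?P<pid>\d+)-(?P<peer>\S+@\w+): (?P<rc>-?\d+)"
)


def decode_lnet_error(line):
    """Return the local NID, peer NID and errno name, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    rc = abs(int(m.group("rc")))
    name, meaning = LINUX_ERRNO.get(rc, ("UNKNOWN", "not in this table"))
    return {
        "local": m.group("local"),
        "peer": m.group("peer"),
        "errno": name,
        "meaning": meaning,
    }


sample = (
    "[236528.801275] LNetError: 54389:0:(lib-move.c:2120:lnet_parse_get()) "
    "10.10.104.2@tcp2: Unable to send REPLY for GET from "
    "12345-10.10.104.201@tcp2: -113"
)
print(decode_lnet_error(sample))
```

Under those assumptions, -113 corresponds to EHOSTUNREACH ("No route to host"), reported by the router's tcp2 NID for a request that came from 10.10.104.201@tcp2.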

My Lustre router has a Mellanox ConnectX-3 interface connecting to the
Lustre servers, and a Mellanox ConnectX-5 100G interface connecting to a
100G switch to which my test client is connected. On the InfiniBand side,
I've got lnet configured as o2ib1, and on the Ethernet side, as tcp2.

Clients and servers are all running Lustre 2.8.  The Lustre router at the
moment is running Lustre 2.10.1, because of software dependencies to
support the 100G card.

I've verified that I have stable network connectivity on both the IB and
Ethernet sides.

At the moment, I have very simple lnet configurations, using the built-in
defaults.  lnet.conf on the server:
options lnet ip2nets="o2ib1(ib0) 192.168.[64-95].*; tcp1 10.103.[128-159].*" routes="tcp0 192.168.64.[78-79]@o2ib1; tcp2 192.168.64.[78-79]@o2ib1"

On the lustre router:
options lnet networks="o2ib1(ib0),tcp2(p1p1.104)" "forwarding=enabled"

And on the client:
options lnet networks="tcp2(p4p1.104)" routes="o2ib1 10.10.104.[2-3]@tcp2"
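
To make the bracketed gateway ranges in the routes options above easier to compare against the NIDs in the error message, here is a small sketch that expands them into individual NIDs. It handles only simple [lo-hi] numeric ranges (not the * wildcards or comma lists that ip2nets also accepts), and expand_nids is a hypothetical helper name, not an LNet tool.

```python
import re
from itertools import product

# Matches a simple numeric range such as [2-3] or [78-79].
RANGE_RE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_nids(pattern):
    """Expand every [lo-hi] range in a NID pattern into explicit NIDs."""
    ranges = [
        range(int(lo), int(hi) + 1) for lo, hi in RANGE_RE.findall(pattern)
    ]
    # Replace each range with a positional placeholder, then fill them in.
    template = RANGE_RE.sub("{}", pattern)
    return [template.format(*combo) for combo in product(*ranges)]


# Gateways the client uses to reach o2ib1 (from the client's routes option).
print(expand_nids("10.10.104.[2-3]@tcp2"))
# Gateways the server uses to reach tcp2 (from the server's routes option).
print(expand_nids("192.168.64.[78-79]@o2ib1"))
```

With the values from this post, the client's gateway list expands to 10.10.104.2@tcp2 and 10.10.104.3@tcp2, the first of which is the router NID that appears in the error above.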

All of the hosts (client, server, router) have the following in
ko2iblnd.conf:

alias ko2iblnd-opa ko2iblnd
options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

install ko2iblnd /usr/sbin/ko2iblnd-probe


Does anyone see anything I've missed, or have any thoughts on where I
should look next?

Thanks,
Kevin

--
Kevin Hildebrand
University of Maryland, College Park