[lustre-discuss] Lustre routing help needed

Dilger, Andreas andreas.dilger at intel.com
Mon Oct 30 14:47:04 PDT 2017


The 2.10 release added support for multi-rail LNet, which may be causing problems here. I would suggest installing an older LNet version on your routers to match your clients and servers.

You may need to build your own RPMs for your new kernel, but you can pass --disable-server to configure to simplify things.

Cheers, Andreas

On Oct 31, 2017, at 04:45, Kevin M. Hildebrand <kevin at umd.edu> wrote:

Thanks, I completely missed that.  Indeed the ko2iblnd parameters were different between the servers and the router.  I've updated the parameters on the router to match those on the server, and things haven't gotten any better.  (The problem appears to be on the Ethernet side anyway, so you've probably helped me fix a problem I didn't know I had...)
I don't see much discussion about configuring LNet parameters for Ethernet networks; I assume that's using ksocklnd.  On that side, it appears that all of the ksocklnd parameters match between the router and clients.  Interestingly, peer_timeout is 180, which is almost exactly when my client gets marked down on the router.

Server (and now router) ko2iblnd parameters:
peer_credits 8
peer_credits_hiw 4
credits 256
concurrent_sends 8
ntx 512
map_on_demand 0
fmr_pool_size 512
fmr_flush_trigger 384
fmr_cache 1

Client and router ksocklnd:
peer_timeout 180
peer_credits 8
keepalive 30
sock_timeout 50
credits 256
rx_buffer_size 0
tx_buffer_size 0
keepalive_idle 30
round_robin 1

Thanks,
Kevin


On Mon, Oct 30, 2017 at 4:16 PM, Mohr Jr, Richard Frank (Rick Mohr) <rmohr at utk.edu> wrote:

> On Oct 30, 2017, at 8:47 AM, Kevin M. Hildebrand <kevin at umd.edu> wrote:
>
> All of the hosts (client, server, router) have the following in ko2iblnd.conf:
>
> alias ko2iblnd-opa ko2iblnd
> options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
>
> install ko2iblnd /usr/sbin/ko2iblnd-probe

Those parameters will only get applied to Omni-Path interfaces (which you don’t have), so everything you have should just be running with default parameters.  Since your LNet routers run a different version of Lustre than your servers and clients, the default values for the ko2iblnd parameters may differ between the two versions.  You can always check this by looking at the values in the files under /sys/module/ko2iblnd/parameters.  It might be worthwhile to compare those values on the LNet routers with the values on the servers to see whether there is a difference that could affect the behavior.
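
For anyone who wants to script that comparison, here is a minimal Python sketch. It assumes only the standard sysfs layout (/sys/module/<module>/parameters/<name>, one value per file). The function names read_module_params and diff_params are made up for illustration and are not part of any Lustre tool.

```python
from pathlib import Path


def read_module_params(module, sysfs_root="/sys/module"):
    """Return {name: value} for each parameter file of a loaded kernel module.

    Returns an empty dict if the module is not loaded or exposes no parameters.
    """
    param_dir = Path(sysfs_root) / module / "parameters"
    if not param_dir.is_dir():
        return {}
    params = {}
    for entry in sorted(param_dir.iterdir()):
        try:
            params[entry.name] = entry.read_text().strip()
        except OSError:
            # Some parameter files may not be readable by the current user.
            params[entry.name] = "<unreadable>"
    return params


def diff_params(left, right):
    """Return {name: (left_value, right_value)} for parameters that differ.

    A parameter present on only one side is reported with None on the other.
    """
    names = sorted(set(left) | set(right))
    return {
        name: (left.get(name), right.get(name))
        for name in names
        if left.get(name) != right.get(name)
    }
```

One way to use it would be to call read_module_params("ko2iblnd") on a router and on a server, save each result (for example as JSON), and then feed the two dictionaries to diff_params. The same approach should work for ksocklnd on the Ethernet side.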

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

