[lustre-discuss] Lustre routing help needed

Kevin M. Hildebrand kevin at umd.edu
Mon Oct 30 13:44:52 PDT 2017


Thanks, I completely missed that.  Indeed, the ko2iblnd parameters were
different between the servers and the router.  I've updated the parameters
on the router to match those on the servers, but things haven't gotten any
better.  (The problem appears to be on the Ethernet side anyway, so you've
probably helped me fix a problem I didn't know I had...)

I don't see much discussion about configuring lnet parameters for Ethernet
networks; I assume that side uses ksocklnd.  There, all of the ksocklnd
parameters appear to match between the router and clients.  Interestingly,
peer_timeout is 180 seconds, which is almost exactly when my client gets
marked down on the router.

Server (and now router) ko2iblnd parameters:
peer_credits 8
peer_credits_hiw 4
credits 256
concurrent_sends 8
ntx 512
map_on_demand 0
fmr_pool_size 512
fmr_flush_trigger 384
fmr_cache 1

Client and router ksocklnd:
peer_timeout 180
peer_credits 8
keepalive 30
credits 256
rx_buffer_size 0
tx_buffer_size 0
keepalive_idle 30
round_robin 1
sock_timeout 50
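
For anyone who wants to repeat the comparison Rick suggested, here is a
minimal sketch in Python. It assumes the module parameters are readable as
plain files under /sys/module/<module>/parameters, as described in the
quoted message. The function names (read_module_params, parse_listing,
diff_params) are made up for illustration and are not part of any Lustre
tool.

```python
from pathlib import Path


def read_module_params(module, sysfs_root="/sys/module"):
    """Return {name: value} for each readable file under
    <sysfs_root>/<module>/parameters (empty dict if the module is not loaded)."""
    params = {}
    param_dir = Path(sysfs_root) / module / "parameters"
    if not param_dir.is_dir():
        return params
    for entry in sorted(param_dir.iterdir()):
        try:
            params[entry.name] = entry.read_text().strip()
        except OSError:
            # Some parameters may not be readable by the current user.
            continue
    return params


def parse_listing(text):
    """Parse 'name value' lines (the layout used in this message) into a dict."""
    params = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            params[parts[0]] = parts[1].strip()
    return params


def diff_params(a, b):
    """Return {name: (value_in_a, value_in_b)} for parameters that differ.
    A parameter missing on one side shows up with None on that side."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


if __name__ == "__main__":
    # Print the local values in the same 'name value' layout, prefixed by module.
    for module in ("ko2iblnd", "ksocklnd"):
        for name, value in read_module_params(module).items():
            print(f"{module} {name} {value}")
```

Running the script on each host and saving the output gives listings that
can be fed back through parse_listing and diff_params, so any parameter
that differs between a router and a server or client shows up with both
values side by side.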

Thanks,
Kevin


On Mon, Oct 30, 2017 at 4:16 PM, Mohr Jr, Richard Frank (Rick Mohr) <rmohr at utk.edu> wrote:

>
> > On Oct 30, 2017, at 8:47 AM, Kevin M. Hildebrand <kevin at umd.edu> wrote:
> >
> > All of the hosts (client, server, router) have the following in ko2iblnd.conf:
> >
> > alias ko2iblnd-opa ko2iblnd
> > options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
> >
> > install ko2iblnd /usr/sbin/ko2iblnd-probe
>
> Those parameters will only get applied to omnipath interfaces (which you
> don’t have), so everything you have should just be running with default
> parameters.  Since your lnet routers have a different version of lustre
> than your servers/clients, it might be possible that the default values for
> the ko2iblnd parameters are different between the two versions.  You can
> always check this by looking at the values in the files under
> /sys/module/ko2iblnd/parameters.  It might be worthwhile to compare those
> values on the lnet routers to the values on the servers to see if maybe
> there is a difference that could affect the behavior.
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
>
>