[lustre-discuss] Lustre routing help needed

Kevin M. Hildebrand kevin at umd.edu
Wed Nov 1 13:50:23 PDT 2017


So apparently the issue is indeed with the combination of using a Lustre
2.10.1 router with 2.8 servers and clients.  Downgrading the router to 2.9
seems to have solved the problem.
(I can't run 2.8 on the router, because I'm running MOFED 4.1 for the
Mellanox ConnectX-5, and I can't get 2.8 to build with that version...)

Thanks, everyone, for your assistance!
Kevin


On Mon, Oct 30, 2017 at 5:47 PM, Dilger, Andreas <andreas.dilger at intel.com> wrote:

> The 2.10 release added support for multi-rail LNet, which may be causing
> problems here. I would suggest installing an older LNet version on your
> routers to match your clients and servers.
>
> You may need to build your own RPMs for your new kernel, but can use
> --disable-server for configure to simplify things.
>
> Cheers, Andreas
>
> On Oct 31, 2017, at 04:45, Kevin M. Hildebrand <kevin at umd.edu> wrote:
>
> Thanks, I completely missed that.  Indeed the ko2iblnd parameters were
> different between the servers and the router.  I've updated the parameters
> on the router to match those on the server, and things haven't gotten any
> better.  (The problem appears to be on the Ethernet side anyway, so you've
> probably helped me fix a problem I didn't know I had...)
> I don't see much discussion about configuring LNet parameters for Ethernet
> networks; I assume that's using ksocklnd.  On that side, it appears that
> all of the ksocklnd parameters match between the router and clients.
> Interesting that peer_timeout is 180, which is almost exactly when my
> client gets marked down on the router.
>
> Server (and now router) ko2iblnd parameters:
> peer_credits 8
> peer_credits_hiw 4
> credits 256
> concurrent_sends 8
> ntx 512
> map_on_demand 0
> fmr_pool_size 512
> fmr_flush_trigger 384
> fmr_cache 1
>
> Client and router ksocklnd:
> peer_timeout 180
> peer_credits 8
> keepalive 30
> sock_timeout 50
> credits 256
> rx_buffer_size 0
> tx_buffer_size 0
> keepalive_idle 30
> round_robin 1
>
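Parameter listings in the "name value" form shown above are easy to compare mechanically. The following is a minimal sketch, not anything from the thread itself: `parse_param_dump` and `diff_params` are hypothetical helper names, the server values are copied from the ko2iblnd listing above, and the router values are invented purely to exercise the comparison.

```python
def parse_param_dump(text):
    """Parse lines of 'name value' (optionally '>'-quoted) into a dict."""
    params = {}
    for raw in text.splitlines():
        # Drop any leading email quote markers and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        if not line:
            continue
        name, _, value = line.partition(" ")
        params[name] = value.strip()
    return params


def diff_params(left, right):
    """Return {name: (left_value, right_value)} for every mismatch."""
    names = sorted(set(left) | set(right))
    return {n: (left.get(n), right.get(n))
            for n in names if left.get(n) != right.get(n)}


# Server ko2iblnd values as posted in this message.
server = parse_param_dump("""
> peer_credits 8
> peer_credits_hiw 4
> credits 256
> concurrent_sends 8
> ntx 512
> map_on_demand 0
> fmr_pool_size 512
> fmr_flush_trigger 384
> fmr_cache 1
""")

# HYPOTHETICAL router values, invented only to demonstrate the diff.
router = dict(server, peer_credits="16", map_on_demand="32")

print(diff_params(server, router))
```

Values are kept as strings so that nothing is lost if a parameter is not a plain integer.
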
> Thanks,
> Kevin
>
>
> On Mon, Oct 30, 2017 at 4:16 PM, Mohr Jr, Richard Frank (Rick Mohr) <rmohr at utk.edu> wrote:
>
>>
>> > On Oct 30, 2017, at 8:47 AM, Kevin M. Hildebrand <kevin at umd.edu> wrote:
>> >
>> > All of the hosts (client, server, router) have the following in ko2iblnd.conf:
>> >
>> > alias ko2iblnd-opa ko2iblnd
>> > options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
>> >
>> > install ko2iblnd /usr/sbin/ko2iblnd-probe
>>
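To compare what a modprobe configuration requests against what a host actually runs, the `options` line quoted above can be split into name/value pairs. This is a rough sketch only: `parse_modprobe_options` is a hypothetical helper, and it assumes a single-line `options <module> name=value ...` entry with no quoting or line continuations.

```python
def parse_modprobe_options(line):
    """Split 'options <module> k=v ...' into (module, {k: v})."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "options":
        raise ValueError("not a modprobe 'options' line")
    module = tokens[1]
    params = {}
    for token in tokens[2:]:
        name, sep, value = token.partition("=")
        # A bare flag with no '=' is recorded with a value of None.
        params[name] = value if sep else None
    return module, params


# The options line from the ko2iblnd.conf quoted in this thread.
line = ("options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 "
        "credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 "
        "fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 "
        "conns_per_peer=4")

module, params = parse_modprobe_options(line)
print(module)
print(params["peer_credits"])
```

Note that the module name recovered here is the alias `ko2iblnd-opa`, not `ko2iblnd` itself.
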
>> Those parameters will only get applied to Omni-Path interfaces (which you
>> don’t have), so everything you have should just be running with default
>> parameters.  Since your LNet routers run a different version of Lustre
>> than your servers/clients, the default values for the ko2iblnd parameters
>> may differ between the two versions.  You can check this by looking at
>> the values in the files under /sys/module/ko2iblnd/parameters.  It might
>> be worthwhile to compare those values on the LNet routers to the values
>> on the servers to see whether a difference could affect the behavior.
>>
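Collecting those values by hand on several hosts gets tedious, so here is a small sketch that reads every file under a module's `parameters` directory into a dictionary. The function name `read_module_params` and its `sysfs_root` argument are assumptions made for illustration; only the `/sys/module/ko2iblnd/parameters` path comes from the message above.

```python
from pathlib import Path


def read_module_params(module, sysfs_root="/sys/module"):
    """Return {parameter_name: value} for a loaded kernel module."""
    param_dir = Path(sysfs_root) / module / "parameters"
    params = {}
    for entry in sorted(param_dir.iterdir()):
        if not entry.is_file():
            continue
        try:
            params[entry.name] = entry.read_text().strip()
        except OSError:
            # Some parameters may be unreadable without root privileges.
            params[entry.name] = None
    return params
```

Running `read_module_params("ko2iblnd")` on a router and on a server, and comparing the two dictionaries, shows any default that changed between Lustre versions. The same call with `"ksocklnd"` covers the Ethernet side.
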
>> --
>> Rick Mohr
>> Senior HPC System Administrator
>> National Institute for Computational Sciences
>> http://www.nics.tennessee.edu
>>
>>