[lustre-discuss] mlx4 and mlx5 mix environment

Ms. Megan Larko dobsonunit at gmail.com
Wed Jun 24 13:32:27 PDT 2020


On 22 Jun 2020 "guru.novice" wrote:
Hi, all
We set up a cluster using a mix of mlx4 and mlx5 drivers, and everything works well.
Later I found something in the wiki
http://wiki.lustre.org/Infiniband_Configuration_Howto and
http://lists.onebuilding.org/pipermail/lustre-devel-lustre.org/2016-May/003842.html
which was last edited in 2016.
So do I need to change the LNet configuration as described on this page?
Or has the problem been resolved in newer versions (like 2.12.x)?
Also, where can I find more details?

Any suggestions would be appreciated.
Thanks!

Hello guru.novice,
Lustre 2.12.x has some nice LNet configuration abilities.  The old
/etc/modprobe.d/ config files have been superseded by /etc/lnet.conf.  An
install of Lustre 2.12.x provides a sample of this file (with the lines
commented out).  In our experience not all of its lines are necessary;
edit to suit.

Lustre 2.12.x has Multi-Rail (MR) on by default, so Lustre will attempt
to automatically find active and viable LNet paths to use.  This should
have no issue with your mixed mlx4/mlx5 environment; we have some mixed IB
and Ethernet setups that work.  To explicitly use MR, one may set
"Multi-Rail: true" in the "peer" NID section of the /etc/lnet.conf file, but
that was not necessary for us.  We used a simple /etc/lnet.conf for MR systems:
File stub: /etc/lnet.conf
net:
   - net type: o2ib0
     local NI(s):
        - interfaces:
             0: ib0
   - net type: o2ib777
     local NI(s):
        - interfaces:
             0: ib0:1
This allowed LNet to use any NID on o2ib0 and o2ib777.

Whatever is placed in the /etc/lnet.conf file is loaded into the kernel
modules by the Lustre startup mechanism (on CentOS, the systemd units under
/usr/lib/systemd/system).  Because we chose _not_ to use MR on a
different box, we explicitly defined the available routes in its
/etc/lnet.conf using the lines:
route:
   - net: tcp
     gateway: 10.10.10.101@o2ib11111
   - net: tcp
     gateway: 10.10.10.102@o2ib11111
And so on up to 10.10.10.116@o2ib11111
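
Writing sixteen near-identical route entries by hand invites typos, so
here is a minimal sketch of generating them instead.  It is not part of
Lustre; the function name route_stanza and its parameters are made up for
illustration, and it assumes every gateway sits on o2ib11111 (the network
named in the first route above) and that the indentation of the stub above
is what you want.

```python
def route_stanza(dest_net, gateway_net, prefix, first, last):
    """Return the 'route:' section of /etc/lnet.conf as plain text.

    dest_net    -- remote network reached through the gateways (e.g. "tcp")
    gateway_net -- LNet network the gateways live on (assumed: "o2ib11111")
    prefix      -- first three octets shared by all gateways
    first, last -- inclusive range of the final octet
    """
    lines = ["route:"]
    for host in range(first, last + 1):
        # One list entry per gateway, same layout as the hand-written stub.
        lines.append(f"   - net: {dest_net}")
        lines.append(f"     gateway: {prefix}.{host}@{gateway_net}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the stanza so it can be reviewed, then pasted into /etc/lnet.conf.
    print(route_stanza("tcp", "o2ib11111", "10.10.10", 101, 116), end="")
```

Review the output before merging it into a live /etc/lnet.conf; the script
only formats text and knows nothing about which gateways are actually up.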

In CentOS 7, the /usr/lib/systemd/system/lnet.service file is reproduced
below (details: lustre-2.12.4-1 with Mellanox OFED version 4.7-1.0.0.1
and kernel 3.10.0-957.27.2.el7).
File lnet.service:
[Unit]
Description=lnet management
Requires=network-online.target
After=network-online.target openibd.service rdma.service opa.service
ConditionPathExists=!/proc/sys/lnet/

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/sbin/modprobe lnet
ExecStart=/usr/sbin/lnetctl lnet configure
# Do NOT use the next line if you want MR functionality
ExecStart=/usr/sbin/lnetctl set discovery 0
# The file with router, credit and similar info
ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf
# Omit --non_mr if you want to use MR
ExecStart=/usr/sbin/lnetctl peer add --nid 10.10.10.[101-116]@o2ib11111 --non_mr
ExecStop=/usr/sbin/lustre_rmmod ptlrpc
ExecStop=/usr/sbin/lnetctl lnet unconfigure
ExecStop=/usr/sbin/lustre_rmmod libcfs ldiskfs

[Install]
WantedBy=multi-user.target
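
The bracketed range in the "peer add" line stands for sixteen separate
NIDs.  As a rough illustration of what that expression covers, here is a
small sketch that expands it.  The helper expand_nids is hypothetical, not
an lnetctl feature, and it handles only a single [low-high] range; the
real lnetctl syntax accepts more forms than this.

```python
import re


def expand_nids(expr):
    """Expand one bracketed range, e.g. '10.10.10.[101-116]@o2ib11111'.

    Returns a list of NID strings.  An expression without a range is
    returned unchanged as a one-element list.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if match is None:
        return [expr]
    head, low, high, tail = match.groups()
    # The range is inclusive on both ends, as in the service file above.
    return [f"{head}{n}{tail}" for n in range(int(low), int(high) + 1)]


if __name__ == "__main__":
    for nid in expand_nids("10.10.10.[101-116]@o2ib11111"):
        print(nid)
```

The resulting list can be compared by eye against the output of
"lnetctl peer show" to confirm that every gateway was added as a peer.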

I hope this info points you in the right direction.

Cheers,
megan
PS. [I am willing to add/contribute to the
http://wiki.lustre.org/Infiniband_Configuration_Howto but I think my
account for wiki editing has expired (at least the one I thought I had did
not work).
Our site had issues with Multi-Rail "not socially distancing appropriately"
from other LNet networks so in our particular case we disabled MR.  (An
entirely different experience.) ]