[lustre-discuss] Lustre 2.7 deployment issues

Ray Muno muno at umn.edu
Fri Dec 4 06:48:38 PST 2015


As I mentioned, I am doing a test install to see what I want to run for 
deployment.  We have run a couple Lustre installs, one 1.8.x based and a 
current production one that is 2.3. The Lustre 2.3 server set has been 
up for 750 days and has been very solid.  This test replaces the old 1.8 
setup and I need to come up with a consistent set of servers and clients 
that I can run on our clusters. The cluster (Rocks based) will get 
upgraded, most likely, once we have a working set.  I have a set of 
compute nodes that will be set up to run either CentOS 6.6 or 6.7.

I started with 2.7 since that is what I got pointed to when I went to 
the lustre.org download page. The "Most Recent Release" points me at the 
2.7.0 tree.  If I follow the path to download source on that page,

git clone git://git.hpdd.intel.com/fs/lustre-release.git

it is not even apparent from the downloaded tree which version I would 
be building. The Changelog file mentions 2.8 and 2.7. Everything on the 
Lustre Download page seems to indicate I should be downloading 2.7.

Since I started with a clean install of a RHEL 6.6 on my server set, I 
had the expectation that the pre-compiled server binaries would give me 
a working set to test. That is when the frustration started. I tried 
searching for clues by looking at errors that I saw, but I did not find 
much that duplicated what I was seeing. I just saw some odd mentions 
about IB having issues in 2.6.32-504.8.1.  This did not directly 
correlate with my issues but I figured I would try a later kernel. That 
is why I pulled the nightly build off of build.hpdd.intel.com and found 
I could at least establish a set of servers that would talk to each other.

That is where I am at now. I am trying to wrap my head around where my 
issues lie. Is the problem specific to my Qlogic InfiniPath_QLE7240 
cards?  Is it the underlying OS provided IB drivers?  I guess I am just 
really surprised that the distribution pointed to on the download page 
fails out of the box on a set of servers with a clean install of the 
specified OS. I just figured I must be doing something wrong (which may 
still be the case).

At this point, it looks like I should back out 2.7 and build this 
with the current 2.5 release.

Before I do that, however, I would like to gain some understanding as to 
what I am seeing right now.  I have the server set built with 2.7.0 and 
the 2.6.32-573.8.1.el6_lustre.g8438f2a.x86_64 kernel on RHEL 6.6 (SL 6.6).


I rebuilt the 2.7.0 Lustre client on a RHEL (CentOS) 6.6 client, and I 
could not mount the file system. It will mount my production Lustre file 
system from another server set (2.3.0) without a problem.  I also tried 
with a RHEL 6.7 install, with the 2.7 Lustre client rebuilt for the 
kernel (2.6.32-573.8.1.el6.x86_64). The client will not mount the 2.7 
Lustre file system and I cannot even "lctl ping" the server from the client.

On the client

[root@athena-head ~]# lctl ping  172.19.120.29@o2ib
failed to ping 172.19.120.29@o2ib: Input/output error

In dmesg

LNetError: 1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected()) 
172.19.120.29@o2ib rejected: incompatible # of RDMA fragments 32, 256

On the Lustre MDS server.

Dec  3 18:14:08 lustre-mds kernel: LNet: 
1493:0:(o2iblnd_cb.c:2278:kiblnd_passive_connect()) Can't accept conn 
from 172.19.120.2@o2ib (version 12): max_frags 256 too large (32 wanted)

Trying to mount on the client

[root@athena-head ~]# uname -a
Linux athena-head.aem.umn.edu 2.6.32-573.8.1.el6.x86_64 #1 SMP Tue Nov 
10 18:01:38 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

[root@athena-head ~]# mount -t lustre  172.19.120.29@o2ib:/ltest /ltest
mount.lustre: mount 172.19.120.29@o2ib:/ltest at /ltest failed: 
Input/output error
Is the MGS running?

Dec  3 18:21:16 athena-head kernel: LNetError: 
1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected()) 172.19.120.29@o2ib 
rejected: incompatible # of RDMA fragments 32, 256
Dec  3 18:21:16 athena-head kernel: Lustre: 
6091:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has 
failed due to network error: [sent 1449188476/real 1449188476] 
req@ffff88002f810c80 x1519567173058612/t0(0) 
o250->MGC172.19.120.29@o2ib@172.19.120.29@o2ib:26/25 lens 400/544 e 0 to 
1 dl 1449188481 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec  3 18:21:41 athena-head kernel: LNetError: 
1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected()) 172.19.120.29@o2ib 
rejected: incompatible # of RDMA fragments 32, 256
Dec  3 18:21:41 athena-head kernel: Lustre: 
6091:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has 
failed due to network error: [sent 1449188501/real 1449188501] 
req@ffff88021e742c80 x1519567173058628/t0(0) 
o250->MGC172.19.120.29@o2ib@172.19.120.29@o2ib:26/25 lens 400/544 e 0 to 
1 dl 1449188511 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec  3 18:21:53 athena-head kernel: LustreError: 15c-8: 
MGC172.19.120.29@o2ib: The configuration from log 'ltest-client' failed 
(-5). This may be the result of communication errors between this node 
and the MGS, a bad configuration, or other errors. See the syslog for 
more information.
Dec  3 18:21:53 athena-head kernel: Lustre: Unmounted ltest-client
Dec  3 18:21:53 athena-head kernel: LustreError: 
7346:0:(obd_mount.c:1339:lustre_fill_super()) Unable to mount  (-5)

On the server

Dec  3 18:21:41 lustre-mds kernel: LNet: 
1493:0:(o2iblnd_cb.c:2278:kiblnd_passive_connect()) Can't accept conn 
from 172.19.120.2@o2ib (version 12): max_frags 256 too large (32 wanted)
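
The two log messages above carry the same pair of numbers: the client
offers max_frags 256 and the server wants 32. A minimal shell sketch,
assuming POSIX sh and sed, for pulling that pair out of a rejection line so
the two sides can be compared quickly. The function name frag_mismatch is
made up for illustration, and the sample line is modeled on the client
dmesg output joined onto one line. The commented sysfs path assumes that
the ko2iblnd module exposes its map_on_demand parameter there and that this
parameter is related to the mismatch; neither is confirmed in this thread.

```shell
#!/bin/sh
# Sketch only: extract the two fragment counts from a kiblnd rejection line.

frag_mismatch() {
    # Reads log lines on stdin and prints "<first> <second>" for every line
    # containing "incompatible # of RDMA fragments A, B".
    sed -n 's/.*incompatible # of RDMA fragments \([0-9][0-9]*\), \([0-9][0-9]*\).*/\1 \2/p'
}

# Sample line modeled on the client-side dmesg message.
sample='LNetError: 1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected()) 172.19.120.29@o2ib rejected: incompatible # of RDMA fragments 32, 256'

printf '%s\n' "$sample" | frag_mismatch    # prints: 32 256

# On a live node the same filter can be fed from the kernel log:
#   dmesg | frag_mismatch
# ASSUMPTION (not confirmed here): the loaded ko2iblnd module exposes the
# setting to compare on client and server at
#   /sys/module/ko2iblnd/parameters/map_on_demand
```

If the two printed values differ, the next step would be to compare the
o2ib LND module options on the client and the server, since the nightly
server build and the rebuilt client may not ship the same defaults.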



On 12/04/2015 06:49 AM, jerome.becot at inserm.fr wrote:
> Hi,
>
>
> I honestly don't know if the compiled versions available here are meant
> to be used by everyone, but they are publicly browsable on Intel's Jenkins:
>
> https://build.hpdd.intel.com
>
> but as the source is publicly available from the whamcloud git, imo
> there should not be any problem.
>
> If you are in production, stick to 2.5.
>
> Regards
>
>
> On 04-12-2015 12:18, Jon Tegner wrote:
>> Hi,
>>
>> Where do you find the 2.7.x-releases? I thought fixes were only
>> released for the Intel maintenance version?
>>
>> Regards,
>>
>> /jon
>>
>> On 12/04/2015 11:43 AM, jerome.becot at inserm.fr wrote:
>>> Hello Ray,
>>>
>>> One consideration first: you are trying the 2.7 version, which is not the
>>> production one (aka 2.5). From this perspective, whether you run 2.7.0
>>> or 2.7.x won't make any big difference; it is the development release.
>>>
>>> Then, if I understand correctly, the problem comes from the InfiniBand
>>> driver module, which is buggy in the 2.6.32-504.8.1 kernel, meaning that
>>> you have to update the kernel to fix it. Doing this may mean that the
>>> 2.7.0 version on the site, compiled against an older kernel version, will
>>> refuse to load (because kernel modules, i.e. the Lustre ones here, rely
>>> on features that may change between kernel versions, making them
>>> incompatible).
>>>
>>> In any case you can try to rebuild the 2.7.0 version from source
>>> for your new kernel. The procedure is quite easy:
>>>
>>> https://wiki.hpdd.intel.com/display/PUB/Rebuilding+the+Lustre-client+rpms+for+a+new+kernel
>>> It will regenerate the 2.7.0 client against your newer kernel with the
>>> working InfiniBand modules, but stability is not guaranteed as the
>>> 2.7 branch is under development anyway.
>>>
>>> Or use a precompiled one on the build site if you can't (some nasty
>>> bugs in the base 2.x.0 version are fixed in the latest builds)
>>>
>>> The one requirement is to run the very same version on the MDS and OSS,
>>> and the same or a newer version on the clients.
>>>
>>> Regards
>>>
>>> On 03-12-2015 16:13, Ray Muno wrote:
>>>> I am trying to set up a test deployment of Lustre 2.7.
>>>>
>>>> I pulled RPMS from http://lustre.org/download/ and installed them on a
>>>> set of servers running Scientific Linux 6.6 which seems to be a proper
>>>> OS for deployment.  Everything installs and I can format the
>>>> filesystems on the MDS (1) and OSS (2) servers. When I try and mount
>>>> the OST file systems, I get communication errors. I can "lctl ping"
>>>> the servers from each other, but cannot establish communication
>>>> between the MDS and OSS.
>>>>
>>>> The installation is on servers connected over Infiniband (Qlogic DDR
>>>> 4X).
>>>>
>>>> In trying to diagnose the issues related to the error messages, I
>>>> found mention in some list discussions that o2ib is broken in the
>>>> 2.6.32-504.8.1 kernel.
>>>>
>>>> After much frustration, I pulled a nightly build from
>>>> build.hpdd.intel.com (kernel
>>>> 2.6.32-573.8.1.el6_lustre.g8438f2a.x86_64) and tried the same set up.
>>>> Everything worked as I expected.
>>>>
>>>> Am I missing something? Is the default release pointed to at
>>>> https://downloads.hpdd.intel.com/ for 2.7 broken in some way? Is it
>>>> just the hardware I am trying to deploy against?
>>>>
>>>> I can provide specifics about the errors I see, I am just posting this
>>>> to make sure I am pulling the Lustre RPMs from the proper source.
>>> _______________________________________________
>>> lustre-discuss mailing list
>>> lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


-- 

  Ray Muno
  Computer Systems Administrator
  e-mail:   muno at aem.umn.edu
  Phone:   (612) 625-9531
  FAX:     (612) 626-1558

                           University of Minnesota
  Aerospace Engineering and Mechanics         Mechanical Engineering
  110 Union St. S.E.                          111 Church Street SE
  Minneapolis, MN 55455                       Minneapolis, MN 55455


