[lustre-discuss] Lustre 2.7 deployment issues
Ray Muno
muno at umn.edu
Fri Dec 4 06:48:38 PST 2015
As I mentioned, I am doing a test install to see what I want to run for
deployment. We have run a couple Lustre installs, one 1.8.x based and a
current production one that is 2.3. The Lustre 2.3 server set has been
up for 750 days and has been very solid. This test replaces the old 1.8
setup and I need to come up with a consistent set of servers and clients
that I can run on our clusters. The cluster (Rocks based) will get
upgraded, most likely, once we have a working set. I have a set of
compute nodes that will be set up to run either CentOS 6.6 or 6.7.
I started with 2.7 since that is what I got pointed to when I went to
the lustre.org download page. The "Most Recent Release" points me at the
2.7.0 tree. If I follow the path to download source on that page,
    git clone git://git.hpdd.intel.com/fs/lustre-release.git
it is not even apparent from the downloaded tree which version I would
be building. The Changelog file mentions 2.8 and 2.7. Everything on the
Lustre Download page seems to indicate I should be downloading 2.7.
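One way to see what a checkout corresponds to is to ask git for the nearest tag. This is only a sketch: tree_version is a helper name made up for illustration, and it assumes the release tags actually came down with the clone.

```shell
#!/bin/sh
# tree_version DIR: print the nearest tag reachable from HEAD in DIR,
# or an abbreviated commit hash if the clone carries no tags.
# (Hypothetical helper, not part of Lustre.)
tree_version() {
    git -C "$1" describe --tags --always
}

# Example, assuming the clone lives in ./lustre-release:
#   tree_version lustre-release
#   git -C lustre-release tag -l    # list the tags available to check out
```

If the output is a tag followed by a commit count and hash, the tree is that many commits past the tag, which would explain a Changelog that mentions more than one release.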
Since I started with a clean install of RHEL 6.6 on my server set, I
had the expectation that the pre-compiled server binaries would give me
a working set to test. That is when the frustration started. I tried
searching for clues by looking at errors that I saw, but I did not find
much that duplicated what I was seeing. I just saw some odd mentions
about IB having issues in 2.6.32-504.8.1. This did not directly
correlate with my issues but I figured I would try a later kernel. That
is why I pulled the nightly build off of build.hpdd.intel.com and found
I could at least establish a set of servers that would talk to each other.
That is where I am at now. I am trying to wrap my head around where my
issues lie. Is the problem specific to my Qlogic InfiniPath_QLE7240
cards? Is it the underlying OS provided IB drivers? I guess I am just
really surprised that the distribution pointed to on the download page
fails out of the box on a set of servers with a clean install of the
specified OS. I just figured I must be doing something wrong (which may
still be the case).
At this point, it looks like I should back out 2.7 and build this
with the current 2.5 release.
Before I do that, however, I would like to gain some understanding as to
what I am seeing right now. I have the server set built with 2.7.0 and
the 2.6.32-573.8.1.el6_lustre.g8438f2a.x86_64 kernel on RHEL 6.6 (SL 6.6).
I rebuilt the 2.7.0 Lustre client on a RHEL (CentOS) 6.6 client, and I
could not mount the file system. It will mount my production Lustre file
system from another server set (2.3.0) without a problem. I also tried
with a RHEL 6.7 install, with the 2.7 Lustre client rebuilt for the
kernel (2.6.32-573.8.1.el6.x86_64). The client will not mount the 2.7
Lustre file system and I cannot even (lctl ping) the server from the client.
On the client
[root@athena-head ~]# lctl ping 172.19.120.29@o2ib
failed to ping 172.19.120.29@o2ib: Input/output error
In dmesg
LNetError: 1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected())
172.19.120.29@o2ib rejected: incompatible # of RDMA fragments 32, 256
On the Lustre MDS server.
Dec 3 18:14:08 lustre-mds kernel: LNet:
1493:0:(o2iblnd_cb.c:2278:kiblnd_passive_connect()) Can't accept conn
from 172.19.120.2@o2ib (version 12): max_frags 256 too large (32 wanted)
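The two numbers in these messages look like they come from the o2ib LND settings on each side. A sketch for comparing them follows, assuming the ko2iblnd module is loaded and exposes its parameters under /sys/module. The ko2iblnd_param helper and its optional root argument are made up for illustration. map_on_demand is a ko2iblnd option, but whether it is the setting behind the 32 vs 256 mismatch here is an assumption, not something confirmed.

```shell
#!/bin/sh
# ko2iblnd_param NAME [ROOT]: print the current value of a ko2iblnd
# module parameter as exposed in sysfs. ROOT is an optional prefix
# (empty by default) so the helper can be tried against a copy of the tree.
# (Hypothetical helper, not part of Lustre.)
ko2iblnd_param() {
    f="${2:-}/sys/module/ko2iblnd/parameters/$1"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unavailable"
        return 1
    fi
}

# Run on both the client and the server and compare:
#   ko2iblnd_param map_on_demand
# Any options set at module load time would show up here:
#   grep -r ko2iblnd /etc/modprobe.d/
```

If the values differ between the nightly-build servers and the rebuilt 2.7.0 client, that would at least be consistent with the rejection messages.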
Trying to mount on the client
[root@athena-head ~]# uname -a
Linux athena-head.aem.umn.edu 2.6.32-573.8.1.el6.x86_64 #1 SMP Tue Nov
10 18:01:38 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@athena-head ~]# mount -t lustre 172.19.120.29@o2ib:/ltest /ltest
mount.lustre: mount 172.19.120.29@o2ib:/ltest at /ltest failed:
Input/output error
Is the MGS running?
Dec 3 18:21:16 athena-head kernel: LNetError:
1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected()) 172.19.120.29@o2ib
rejected: incompatible # of RDMA fragments 32, 256
Dec 3 18:21:16 athena-head kernel: Lustre:
6091:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has
failed due to network error: [sent 1449188476/real 1449188476]
req@ffff88002f810c80 x1519567173058612/t0(0)
o250->MGC172.19.120.29@o2ib@172.19.120.29@o2ib:26/25 lens 400/544 e 0 to
1 dl 1449188481 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 3 18:21:41 athena-head kernel: LNetError:
1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected()) 172.19.120.29@o2ib
rejected: incompatible # of RDMA fragments 32, 256
Dec 3 18:21:41 athena-head kernel: Lustre:
6091:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has
failed due to network error: [sent 1449188501/real 1449188501]
req@ffff88021e742c80 x1519567173058628/t0(0)
o250->MGC172.19.120.29@o2ib@172.19.120.29@o2ib:26/25 lens 400/544 e 0 to
1 dl 1449188511 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 3 18:21:53 athena-head kernel: LustreError: 15c-8:
MGC172.19.120.29@o2ib: The configuration from log 'ltest-client' failed
(-5). This may be the result of communication errors between this node
and the MGS, a bad configuration, or other errors. See the syslog for
more information.
Dec 3 18:21:53 athena-head kernel: Lustre: Unmounted ltest-client
Dec 3 18:21:53 athena-head kernel: LustreError:
7346:0:(obd_mount.c:1339:lustre_fill_super()) Unable to mount (-5)
On the server
Dec 3 18:21:41 lustre-mds kernel: LNet:
1493:0:(o2iblnd_cb.c:2278:kiblnd_passive_connect()) Can't accept conn
from 172.19.120.2@o2ib (version 12): max_frags 256 too large (32 wanted)
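To summarize the rejections across a long log, something like the following could be used. It is a sketch only: frag_mismatch is a name invented here, and it assumes raw kernel log text with one message per line and NIDs in their usual address@o2ib form (not wrapped as in this mail). It prints the peer and the two fragment counts in the order they appear in the message, without claiming which side each number belongs to.

```shell
#!/bin/sh
# frag_mismatch: read kernel log text on stdin and print one line per
# distinct o2ib fragment rejection, as "PEER FIRST SECOND".
# (Hypothetical helper, not part of Lustre.)
frag_mismatch() {
    sed -n 's/.*[[:space:]]\([^[:space:]]*@o2ib[0-9]*\) rejected: incompatible # of RDMA fragments \([0-9]*\), \([0-9]*\).*/\1 \2 \3/p' | sort -u
}

# Example:
#   dmesg | frag_mismatch
```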
On 12/04/2015 06:49 AM, jerome.becot at inserm.fr wrote:
> Hi,
>
>
> I honestly don't know if the compiled versions available here are meant
> to be used by everyone but they are publicly browsable on Intel Jenkins :
>
> https://build.hpdd.intel.com
>
> but as the source is publicly available from the whamcloud git, there
> imo might not be any problem
>
> If you are in production stick to the 2.5.
>
> Regards
>
>
> Le 04-12-2015 12:18, Jon Tegner a écrit :
>> Hi,
>>
>> Where do you find the 2.7.x-releases? I thought fixes were only
>> released for the Intel maintenance version?
>>
>> Regards,
>>
>> /jon
>>
>> On 12/04/2015 11:43 AM, jerome.becot at inserm.fr wrote:
>>> Hello Ray,
>>>
>>> One consideration first: you are trying the 2.7 version, which is not the
>>> production one (aka 2.5). From this perspective, whether you run 2.7.0
>>> or 2.7.x won't make any big difference; it is the development release.
>>>
>>> Then, if I understand correctly, the problem comes from the InfiniBand
>>> driver module, which is buggy in the 2.6.32-504.8.1 kernel, meaning that
>>> you have to update the kernel to fix it. Doing this may mean that the
>>> 2.7.0 version on the site, compiled against an older kernel version, will
>>> then refuse to load (because kernel modules, i.e. the Lustre ones
>>> here, rely on features that may change between kernel versions,
>>> making them incompatible).
>>>
>>> In any case you can try to rebuild the 2.7.0 version from the source
>>> to your new kernel. The procedure is quite easy :
>>>
>>> https://wiki.hpdd.intel.com/display/PUB/Rebuilding+the+Lustre-client+rpms+for+a+new+kernel
>>> It will regenerate the 2.7.0 client against your newer kernel with the
>>> working InfiniBand modules, but stability is not guaranteed as the
>>> 2.7 branch is under development anyway.
>>>
>>> Or use a precompiled one on the build site if you can't (some nasty
>>> bugs in the base 2.x.0 version are fixed in the latest builds)
>>>
>>> The only thing is to stick to the very same version on mds and oss
>>> and at least the same or newer version for the clients.
>>>
>>> Regards
>>>
>>> Le 03-12-2015 16:13, Ray Muno a écrit :
>>>> I am trying to set up a test deployment of Lustre 2.7.
>>>>
>>>> I pulled RPMS from http://lustre.org/download/ and installed them on a
>>>> set of servers running Scientific Linux 6.6 which seems to be a proper
>>>> OS for deployment. Everything installs and I can format the
>>>> filesystems on the MDS (1) and OSS (2) servers. When I try and mount
>>>> the OST file systems, I get communication errors. I can "lctl ping"
>>>> the servers from each other, but cannot establish communication
>>>> between the MDS and OSS.
>>>>
>>>> The installation is on servers connected over Infiniband (Qlogic DDR
>>>> 4X).
>>>>
>>>> In trying to diagnose the issues related to the error messages, I
>>>> found mention in some list discussions that o2ib is broken in the
>>>> 2.6.32-504.8.1 kernel.
>>>>
>>>> After much frustration, I pulled a nightly build from
>>>> build.hpdd.intel.com (kernel
>>>> 2.6.32-573.8.1.el6_lustre.g8438f2a.x86_64) and tried the same set up.
>>>> Everything worked as I expected.
>>>>
>>>> Am I missing something? Is the default release pointed to at
>>>> https://downloads.hpdd.intel.com/ for 2.7 broken in some way? Is it
>>>> just the hardware I am trying to deploy against?
>>>>
>>>> I can provide specifics about the errors I see, I am just posting this
>>>> to make sure I am pulling the Lustre RPM's from the proper source.
>>> _______________________________________________
>>> lustre-discuss mailing list
>>> lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
--
Ray Muno
Computer Systems Administrator
e-mail: muno at aem.umn.edu
Phone: (612) 625-9531
FAX: (612) 626-1558
University of Minnesota
Aerospace Engineering and Mechanics    Mechanical Engineering
110 Union St. S.E.                     111 Church Street SE
Minneapolis, MN 55455                  Minneapolis, MN 55455