[lustre-discuss] Lustre 2.7 deployment issues

Ray Muno muno at umn.edu
Fri Dec 4 07:52:46 PST 2015


The client was rebuilt locally from the source RPMs. I thought I had
built it from the client source from the nightly build, but I can see
now it was the 2.7.0 source:

lustre-client-2.7.0-2.6.32_504.8.1.el6.x86_64.src.rpm

The client kernel is the OS-provided kernel.

At this point I have ripped out all of the 2.7.0-based install and 
rebuilt everything with the current 2.5.3 pre-built RPMs for the 
server. The test client is RHEL 6.7, so I built the client locally 
against the current kernel. I can now mount the filesystem, at least.
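For anyone hitting the same confusion, a quick way to confirm which Lustre version a node is actually running is to check both the installed packages and what the loaded modules report (a generic sketch; package names vary by distribution):

```shell
# Show which Lustre packages are installed on this node.
rpm -qa | grep -i lustre | sort
# On a node with the Lustre modules loaded, ask the modules themselves:
lctl get_param version
```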



On 12/04/2015 09:24 AM, jerome.becot at inserm.fr wrote:
> Ok,
>
> I am not using IB here, but it looks obvious that the max_frag value
> differs between the MGS and the client.
>
> Are you using the same Lustre version on the MGS/OSS AND the client,
> built against the same kernel version
> (i.e. lustre*-KERNEL_VERSION-LUSTRE_VERSION)?
>
> Did you try it with the latest nightly build?
>
> If so, I'll let the developers answer, or maybe you can open a bug.
>
> Regards
>
> Le 04-12-2015 15:48, Ray Muno a écrit :
>> As I mentioned, I am doing a test install to see what I want to run
>> for deployment. We have run a couple of Lustre installs, one 1.8.x
>> based and a current production one that is 2.3. The Lustre 2.3 server
>> set has been up for 750 days and has been very solid. This test
>> replaces the old 1.8 setup, and I need to come up with a consistent
>> set of servers and clients that I can run on our clusters. The
>> cluster (Rocks based) will, most likely, get upgraded once we have a
>> working set. I have a set of compute nodes that will be set up to run
>> either CentOS 6.6 or 6.7.
>>
>> I started with 2.7 since that is what I got pointed to when I went to
>> the lustre.org download page. The "Most Recent Release" points me at
>> the 2.7.0 tree.  If I follow the path to download source on that page,
>>
>> git clone git://git.hpdd.intel.com/fs/lustre-release.git
>>
>> It is not even apparent from the downloaded tree which version I would
>> be building. The ChangeLog file mentions 2.8 and 2.7. Everything on
>> the Lustre download page seems to indicate I should be downloading
>> 2.7.
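For what it's worth, the version a git checkout would build can usually be read from the tree itself rather than guessed (a sketch; Lustre's release tags follow the v2_X_Y convention):

```shell
cd lustre-release
# Show the nearest release tag to the current HEAD:
git describe --tags
# Or build an exact release by checking out its tag, e.g.:
git checkout v2_7_0
```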
>>
>> Since I started with a clean install of RHEL 6.6 on my server set, I
>> had the expectation that the pre-compiled server binaries would give
>> me a working set to test. That is when the frustration started. I
>> tried searching for clues by looking at the errors I saw, but I did
>> not find much that duplicated what I was seeing. I just saw some odd
>> mentions of IB having issues in 2.6.32-504.8.1. This did not
>> directly correlate with my issues, but I figured I would try a later
>> kernel. That is why I pulled the nightly build off of
>> build.hpdd.intel.com and found I could at least establish a set of
>> servers that would talk to each other.
>>
>> That is where I am at now. I am trying to wrap my head around where my
>> issues lie. Is the problem specific to my Qlogic InfiniPath_QLE7240
>> cards? Is it the underlying OS-provided IB drivers? I guess I am
>> just really surprised that the distribution pointed to on the download
>> page fails out of the box on a set of servers with a clean install of
>> the specified OS. I just figured I must be doing something wrong
>> (which may still be the case).
>>
>> At this point, it looks like I should be backing out 2.7 and building
>> this with the current 2.5 release.
>>
>> Before I do that, however, I would like to gain some understanding of
>> what I am seeing right now. I have the server set built with 2.7.0
>> and the 2.6.32-573.8.1.el6_lustre.g8438f2a.x86_64 kernel on RHEL 6.6
>> (SL 6.6).
>>
>>
>> I rebuilt the 2.7.0 Lustre client on a RHEL (CentOS) 6.6 client, and I
>> could not mount the file system. It will mount my production Lustre
>> file system from another server set (2.3.0) without a problem. I
>> also tried with a RHEL 6.7 install, with the 2.7 Lustre client rebuilt
>> for the kernel (2.6.32-573.8.1.el6.x86_64). The client will not mount
>> the 2.7 Lustre file system, and I cannot even "lctl ping" the server
>> from the client.
>>
>> On the client
>>
>> [root@athena-head ~]# lctl ping  172.19.120.29@o2ib
>> failed to ping 172.19.120.29@o2ib: Input/output error
>>
>> In dmesg
>>
>> LNetError: 1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected())
>> 172.19.120.29@o2ib rejected: incompatible # of RDMA fragments 32, 256
>>
>> On the Lustre MDS server.
>>
>> Dec  3 18:14:08 lustre-mds kernel: LNet:
>> 1493:0:(o2iblnd_cb.c:2278:kiblnd_passive_connect()) Can't accept conn
>> from 172.19.120.2@o2ib (version 12): max_frags 256 too large (32
>> wanted)
>>
>> Trying to mount on the client
>>
>> [root at athena-head ~]# uname -a
>> Linux athena-head.aem.umn.edu 2.6.32-573.8.1.el6.x86_64 #1 SMP Tue Nov
>> 10 18:01:38 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>
>> [root@athena-head ~]# mount -t lustre  172.19.120.29@o2ib:/ltest /ltest
>> mount.lustre: mount 172.19.120.29@o2ib:/ltest at /ltest failed:
>> Input/output error
>> Is the MGS running?
>>
>> Dec  3 18:21:16 athena-head kernel: LNetError:
>> 1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected()) 172.19.120.29@o2ib
>> rejected: incompatible # of RDMA fragments 32, 256
>> Dec  3 18:21:16 athena-head kernel: Lustre:
>> 6091:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent
>> has failed due to network error: [sent 1449188476/real 1449188476]
>> req@ffff88002f810c80 x1519567173058612/t0(0)
>> o250->MGC172.19.120.29@o2ib@172.19.120.29@o2ib:26/25 lens 400/544 e 0
>> to 1 dl 1449188481 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>> Dec  3 18:21:41 athena-head kernel: LNetError:
>> 1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected()) 172.19.120.29@o2ib
>> rejected: incompatible # of RDMA fragments 32, 256
>> Dec  3 18:21:41 athena-head kernel: Lustre:
>> 6091:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent
>> has failed due to network error: [sent 1449188501/real 1449188501]
>> req@ffff88021e742c80 x1519567173058628/t0(0)
>> o250->MGC172.19.120.29@o2ib@172.19.120.29@o2ib:26/25 lens 400/544 e 0
>> to 1 dl 1449188511 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>> Dec  3 18:21:53 athena-head kernel: LustreError: 15c-8:
>> MGC172.19.120.29@o2ib: The configuration from log 'ltest-client'
>> failed (-5). This may be the result of communication errors between
>> this node and the MGS, a bad configuration, or other errors. See the
>> syslog for more information.
>> Dec  3 18:21:53 athena-head kernel: Lustre: Unmounted ltest-client
>> Dec  3 18:21:53 athena-head kernel: LustreError:
>> 7346:0:(obd_mount.c:1339:lustre_fill_super()) Unable to mount  (-5)
>>
>> On the server
>>
>> Dec  3 18:21:41 lustre-mds kernel: LNet:
>> 1493:0:(o2iblnd_cb.c:2278:kiblnd_passive_connect()) Can't accept conn
>> from 172.19.120.2@o2ib (version 12): max_frags 256 too large (32
>> wanted)
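The paired "incompatible # of RDMA fragments 32, 256" / "max_frags 256 too large" messages suggest the two ends are advertising different o2iblnd fragment counts at connect time when mixing Lustre versions. One possible workaround, an untested assumption on my part rather than anything confirmed for 2.7.0, is to force the side advertising 256 fragments down to the smaller count via a ko2iblnd module option:

```shell
# Sketch (untested assumption): make o2iblnd advertise fewer RDMA fragments.
# Place the line in /etc/modprobe.d/ko2iblnd.conf on the node advertising 256
# frags, then unload and reload the lustre/lnet modules (or reboot).
options ko2iblnd map_on_demand=32
```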
>>
>>
>>
>> On 12/04/2015 06:49 AM, jerome.becot at inserm.fr wrote:
>>> Hi,
>>>
>>>
>>> I honestly don't know if the compiled versions available here are meant
>>> to be used by everyone, but they are publicly browsable on Intel's
>>> Jenkins:
>>>
>>> https://build.hpdd.intel.com
>>>
>>> But as the source is publicly available from the Whamcloud git, IMO
>>> there should not be any problem.
>>>
>>> If you are in production, stick to 2.5.
>>>
>>> Regards
>>>
>>>
>>> Le 04-12-2015 12:18, Jon Tegner a écrit :
>>>> Hi,
>>>>
>>>> Where do you find the 2.7.x releases? I thought fixes were only
>>>> released for the Intel maintenance version?
>>>>
>>>> Regards,
>>>>
>>>> /jon
>>>>
>>>> On 12/04/2015 11:43 AM, jerome.becot at inserm.fr wrote:
>>>>> Hello Ray,
>>>>>
>>>>> One consideration first: you are trying the 2.7 version, which is
>>>>> not the production one (aka 2.5). From this perspective, whether you
>>>>> run 2.7.0 or 2.7.x won't make any big difference; it is the
>>>>> development release.
>>>>>
>>>>> Then, if I understand correctly, the problem comes from the
>>>>> InfiniBand driver module, which is buggy in the 2.6.32-504.8.1
>>>>> kernel, meaning that you have to update the kernel to fix it. Doing
>>>>> this may mean that the 2.7.0 build on the site, compiled against an
>>>>> older kernel version, will then refuse to load (because kernel
>>>>> modules - i.e. the Lustre ones here - rely on features that may
>>>>> change between kernel versions, making them incompatible).
>>>>>
>>>>> In any case, you can try to rebuild the 2.7.0 version from source
>>>>> against your new kernel. The procedure is quite easy:
>>>>>
>>>>> https://wiki.hpdd.intel.com/display/PUB/Rebuilding+the+Lustre-client+rpms+for+a+new+kernel
>>>>>
>>>>> It will regenerate the 2.7.0 client against your newer kernel with
>>>>> the working InfiniBand modules, but stability is not guaranteed, as
>>>>> the 2.7 branch is under development anyway.
>>>>>
>>>>> Or use a precompiled one from the build site if you can't (some nasty
>>>>> bugs in the base 2.x.0 versions are fixed in the latest builds).
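For reference, the linked rebuild procedure amounts to roughly the following (a sketch; the exact SRPM filename, build-dependency package names, and rpmbuild options may differ on your system):

```shell
# Install build prerequisites for the running kernel (typical RHEL 6 package
# names; adjust for your distribution).
yum install -y rpm-build gcc make libtool kernel-devel-$(uname -r)
# Rebuild client-only RPMs from the source RPM against the running kernel.
rpmbuild --rebuild --without servers lustre-client-2.7.0-*.src.rpm
# Install the freshly built client packages.
yum localinstall ~/rpmbuild/RPMS/x86_64/lustre-client-*.rpm
```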
>>>>>
>>>>> The only thing is to stick to the very same version on the MDS and
>>>>> OSS, and at least the same or a newer version for the clients.
>>>>>
>>>>> Regards
>>>>>
>>>>> Le 03-12-2015 16:13, Ray Muno a écrit :
>>>>>> I am trying to set up a test deployment of Lustre 2.7.
>>>>>>
>>>>>> I pulled RPMs from http://lustre.org/download/ and installed them
>>>>>> on a set of servers running Scientific Linux 6.6, which seems to be
>>>>>> a proper OS for deployment. Everything installs, and I can format
>>>>>> the filesystems on the MDS (1) and OSS (2) servers. When I try to
>>>>>> mount the OST filesystems, I get communication errors. I can "lctl
>>>>>> ping" the servers from each other, but cannot establish
>>>>>> communication between the MDS and OSS.
>>>>>>
>>>>>> The installation is on servers connected over Infiniband (Qlogic DDR
>>>>>> 4X).
>>>>>>
>>>>>> In trying to diagnose the issues related to the error messages, I
>>>>>> found mention in some list discussions that o2ib is broken in the
>>>>>> 2.6.32-504.8.1 kernel.
>>>>>>
>>>>>> After much frustration, I pulled a nightly build from
>>>>>> build.hpdd.intel.com (kernel
>>>>>> 2.6.32-573.8.1.el6_lustre.g8438f2a.x86_64) and tried the same set up.
>>>>>> Everything worked as I expected.
>>>>>>
>>>>>> Am I missing something? Is the default release pointed to at
>>>>>> https://downloads.hpdd.intel.com/ for 2.7 broken in some way? Is it
>>>>>> just the hardware I am trying to deploy against?
>>>>>>
>>>>>> I can provide specifics about the errors I see; I am just posting
>>>>>> this to make sure I am pulling the Lustre RPMs from the proper
>>>>>> source.
>>>>> _______________________________________________
>>>>> lustre-discuss mailing list
>>>>> lustre-discuss at lists.lustre.org
>>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>>
>>
>>
>> --
>>
>>  Ray Muno
>>  Computer Systems Administrator
>>  e-mail:   muno at aem.umn.edu
>>  Phone:   (612) 625-9531
>>  FAX:     (612) 626-1558
>>
>>                           University of Minnesota
>>  Aerospace Engineering and Mechanics         Mechanical Engineering
>>  110 Union St. S.E.                          111 Church Street SE
>>  Minneapolis, MN 55455                       Minneapolis, MN 55455
>>


-- 

  Ray Muno
  Computer Systems Administrator
  e-mail:   muno at aem.umn.edu
  Phone:   (612) 625-9531
  FAX:     (612) 626-1558

                           University of Minnesota
  Aerospace Engineering and Mechanics         Mechanical Engineering
  110 Union St. S.E.                          111 Church Street SE
  Minneapolis, MN 55455                       Minneapolis, MN 55455


