[Lustre-discuss] [wc-discuss] Lustre 2.2 production experience

Wojciech Turek wjt27 at cam.ac.uk
Wed Jul 11 04:26:27 PDT 2012


Answering my own question:
We have been running Lustre 2.1.2 on a production system (768 nodes, a
university-wide cluster) for about 2 weeks and I have mixed feelings
about it. The server side seems to be handling the load OK, however on
the client side we have been experiencing crashes. It turns out that
most of them are related to the statahead code, and since we disabled
it (last weekend) things have been running more smoothly. I am
surprised that Lustre 2.1, which is supposed to be a stable production
version, has buggy code enabled by default. There is a JIRA ticket
which states that the statahead code is buggy and should not be used
in 2.1:
http://jira.whamcloud.com/browse/LU-1216

Another issue which does not fill me with confidence is that the client
logs contain quite a lot of "DEADLOCK POSSIBLE!" messages, which sound
potentially dangerous. However, from
http://jira.whamcloud.com/browse/LU-1418 it seems that they come from
harmless leftover code and that the deadlock cannot actually occur. If
that is the case, the messages should be removed from the code.

Lustre 2.1 comes with panic_on_lbug turned on by default, which is
different from the default in Lustre 1.8. I think it would be useful to
list the changes in default settings between Lustre versions (I could
not find such a list in the Lustre changelog). This particular setting
is annoying on compute nodes and login nodes, where we would rather let
a process hang with an LBUG and keep the node operational than crash
it.

Other than that it seems to be running OK after implementing
workarounds for the issues described above.
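
For reference, a minimal sketch of how the two client-side workarounds
(disabling statahead and turning off panic_on_lbug) might be applied.
It assumes the usual lctl tunables llite.*.statahead_max and
panic_on_lbug; please check the parameter names against your installed
Lustre version. The helper functions run and apply_client_workarounds
and the APPLY variable are made up for illustration. By default the
script only prints the commands.

```shell
#!/bin/sh
# Sketch only: parameter names are assumptions to verify on your system.
# Dry-run by default; set APPLY=1 (as root, on a Lustre client) to execute.

run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

apply_client_workarounds() {
    # Disable statahead on all mounted Lustre file systems of this client.
    run lctl set_param "llite.*.statahead_max=0"
    # Let a thread that hits an LBUG hang instead of panicking the node.
    run lctl set_param panic_on_lbug=0
}

apply_client_workarounds
```

As far as I know, values set with lctl set_param do not survive a
remount or reboot, so they would have to be reapplied, for example from
a boot or mount script.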




On 15 June 2012 10:53, Andreas Dilger <adilger at whamcloud.com> wrote:
> On 2012-06-14, at 8:48 PM, Nathan Rutman wrote:
>> I wasn't complaining, just asking ;)
>
> I wasn't feeling put-upon, but just explaining (mostly to the other readers of these lists) the reasons why we don't necessarily make every release a maintenance release.
>
>> On Jun 14, 2012, at 6:27 PM, "Andreas Dilger" <adilger at whamcloud.com> wrote:
>>
>>> I think the stability of 2.2.0 is comparable to 2.1.0.
>>>
>>> One issue is about the number of separate maintenance releases that can be tested. If there are many maintenance releases, then each of those branches would get correspondingly less testing time before release.
>>>
>>> Secondly, there is a limit on the amount of time that can be spent on porting patches to each maintenance release.
>>>
>>> This system of maintenance vs. feature releases is similar to what is done for Ubuntu "Long Term Support" (LTS) vs. regular releases, and Fedora vs. RHEL. While there is a desire to make each release as reliable as possible, the resources needed to maintain all of the releases for a long time would be very high.
>>>
>>> Cheers, Andreas
>>>
>>> On 2012-06-14, at 17:48, "Nathan Rutman" <Nathan_Rutman at xyratex.com> wrote:
>>>
>>>> Is there a belief that Lustre 2.2 is any less stable than Lustre 2.1.0?  IOW, are the new features introduced in 2.2 believed to introduce more risk?
>>>>
>>>> On Jun 9, 2012, at 3:20 PM, Andreas Dilger wrote:
>>>>
>>>>> I guess the new Lustre release process is similar to how Ubuntu is released. While we do our best to make each release as stable as possible, there is a different expectation for long-term updates of the feature releases and the maintenance releases.
>>>>>
>>>>> Cheers, Andreas
>>>>>
>>>>> On 2012-06-09, at 16:05, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
>>>>>
>>>>>> Thanks for the quick reply, Andreas. I slightly misunderstood the
>>>>>> Lustre release process and thought that the next stable/production
>>>>>> version was 2.2.
>>>>>>
>>>>>> I am then interested in the experience of people running Lustre 2.1
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> Wojciech
>>>>>>
>>>>>> On 9 June 2012 21:52, Andreas Dilger <adilger at whamcloud.com> wrote:
>>>>>>> I think you'll find that there are not yet (m)any production deployments of 2.2. There are a number of production 2.1 deployments, and this is the current maintenance stream from Whamcloud.
>>>>>>>
>>>>>>> Cheers, Andreas
>>>>>>>
>>>>>>> On 2012-06-09, at 14:33, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
>>>>>>>
>>>>>>>> I am building a 1.5PB storage system which will employ Lustre as the
>>>>>>>> main file system. The storage system will be extended at a later
>>>>>>>> stage beyond 2PB.  I am considering using Lustre 2.2 for the
>>>>>>>> production environment. This Lustre storage system will replace our
>>>>>>>> older 300TB system, which is currently running Lustre 1.8.8. I am
>>>>>>>> quite happy with Lustre 1.8.8, however for the new system Lustre 2.2
>>>>>>>> seems to be a better match.  The storage system will be attached to
>>>>>>>> a university-wide cluster (800 nodes), hence there will be quite a
>>>>>>>> large range of applications using the filesystem. Could people with
>>>>>>>> production deployments of Lustre 2.2 share their experience please?
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Wojciech Turek
>>>>>>>> _______________________________________________
>>>>>>>> Lustre-discuss mailing list
>>>>>>>> Lustre-discuss at lists.lustre.org
>>>>>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
> Cheers, Andreas
> --
> Andreas Dilger                       Whamcloud, Inc.
> Principal Lustre Engineer            http://www.whamcloud.com/
>
>
>


