[Lustre-discuss] Lustre 1.6.3 - where are the bug fixes?

Charles Taylor taylor at hpc.ufl.edu
Fri Oct 12 02:49:44 PDT 2007


Hmmm.   This is the approach that has always frightened us away from  
Lustre.   Is there one version of Lustre for paying support customers
and another for those who, for whatever reason, can't or don't want
to pay for support?   What do you mean when you say "assist you more
effectively"?   Does that mean you will apply the patches in the  
bugzilla database for him?   How does that help if he can do it  
himself?   The point Nikke is making is that a release labeled  
"production" should run for more than a few hours to a day without  
crashing.   And, if you have a "production" release that crashes that  
readily AND there are known fixes, it seems like they should be  
readily available in a format that is easy to apply (as opposed to  
digging through bug reports and applying code fixes by hand).

For a lot of very good reasons, we would like to go to Lustre.
There is much to like about it.  However, we run OFED 1.2 on our
cluster and need Lustre 1.6.2+ for OFED support.   So far, our
attempts to test this version of Lustre on our 400+ node IB cluster
have resulted in impressive performance and scalability and...lots of
crashes (mballoc) and a corrupt file system that neither e2fsck nor
lfsck could fix.   It is too bad because it seems that Lustre is just
a few fixes away from being one of the most amazing open source
packages in the Linux universe.

We wish you the best and hope that we will be able to use Lustre in  
the near future.

Charlie Taylor
UF HPC Center


On Oct 12, 2007, at 3:20 AM, Kevin Canady wrote:

> Nikke,
> Are you a customer of Lustre Support? I don't have you listed as a
> supported customer.  Maybe we should arrange a discussion about how
> we could assist you more effectively.
>
> Best regards,
> Kevin
> -- 
> P. Kevin Canady
> Director, Business Development
> Lustre Group (Formerly CFS)
> Sun Microsystems, Inc.
> O: 415.928.3633
> C: 415.505.7701
>
>
> On 10/11/07 11:53 PM, "Niklas Edmundsson"  
> <Niklas.Edmundsson at hpc2n.umu.se>
> wrote:
>
>>
>> OK, I know that there is supposedly some QA before lustre releases
>> and that it might be the reason for fixes taking such a long time to
>> propagate, but still: It takes too long for fixes to end up in a
>> released version...
>>
>> During our rather limited testing on Ubuntu Dapper (using the Debian
>> 2.6.18 kernel on servers and pkg-lustre packaging) we've run into
>> a couple of bugs, most of them with the typical "fix in bugzilla".
>>
>> The pkg-lustre packaging has six fixes from bugzilla applied. They
>> seem to have munged the bug numbers, but it seems that only three of
>> them are in the 1.6.3 changelog.
>>
>> We have locally applied fixes from bug 13438 (lustre is totally
>> useless without it due to servers OOPS:ing) and 13614. Neither of them
>> seems to be in the 1.6.3 changelog.
>>
>> So, I'd suggest that CFS gets their act together and starts releasing
>> versions more often. If they'd done this during 1.6 development, we
>> wouldn't now be installing production releases that you can crash
>> after a day of testing.
>>
>> If QA is the argument for not doing releases more often, consider the
>> fact that known broken releases that you have to patch yourself with
>> patches hidden in bugzilla aren't much better.
>>
>> In reality, I think that doing non-QA'd snapshot releases might be
>> the way to go. That is, releases with the useful more-or-less trivial
>> fixes that avoid crashes etc. and that will be included in the next
>> QA'd release. They would not be suitable for production, but at least
>> you can rather easily download the latest snapshot and try it on your
>> test cluster and see if it fixes the problem(s) you've encountered.
>> And if it does, we can bug CFS until they get their act together and
>> get a release out with the fix included.
>>
>> In the end, you have to realise that when you have a production
>> system you don't want to wait for weeks and months for a new release
>> that might fix a crash-inducing bug you're hitting. I say might here,
>> because obviously having a fix hidden in bugzilla is no guarantee
>> that it's included in a released version.
>>
>> In our case we're not at production yet because of these problems
>> with getting fixes out quickly enough. So far we've always been able
>> to crash lustre 1.6 within days, and that's after waiting for 1.6 for
>> well over a year.
>>
>> So, I'd like to challenge CFS to get a version of lustre 1.6 (or 1.8,
>> whatever) out that proves stable on our small lustre test setup.
>> Without patches. In the year of 2007.
>>
>> Since the "internal QA only" approach obviously isn't working, I'd
>> suggest that you embrace "release early, release often" to get there.
>> That means one release per week as long as you have fixes pending to
>> get a decent churn on things.
>>
>>
>> /Nikke
>
>



