[Lustre-discuss] Lustre 1.6.3 - where are the bug fixes?
Charles Taylor
taylor at hpc.ufl.edu
Fri Oct 12 02:49:44 PDT 2007
Hmmm. This is the approach that has always frightened us away from
Lustre. Is there one version of Lustre for paying support customers
and another for those who, for whatever reason, can't or don't want
to pay for support? What do you mean when you say "assist you more
effectively"? Does that mean you will apply the patches in the
bugzilla database for him? How does that help if he can do it
himself? The point Nikke is making is that a release labeled
"production" should run for more than a few hours to a day without
crashing. And, if you have a "production" release that crashes that
readily AND there are known fixes, it seems like they should be
readily available in a format that is easy to apply (as opposed to
digging through bug reports and applying code fixes by hand).
For a lot of very good reasons, we would like to go to Lustre.
There is much to like about it. However, we run OFED 1.2 on our
cluster and need Lustre 1.6.2+ for OFED support. So far, our
attempts to test this version of Lustre on our 400+ node IB cluster
have resulted in impressive performance and scalability and...lots of
crashes (mballoc) and a corrupt file system that neither e2fsck nor
lfsck could fix. It is too bad because it seems that Lustre is just
a few fixes away from being one of the most amazing open source
packages in the Linux universe.
We wish you the best and hope that we will be able to use Lustre in
the near future.
Charlie Taylor
UF HPC Center
On Oct 12, 2007, at 3:20 AM, Kevin Canady wrote:
> Nikke,
> Are you a customer of Lustre Support? I don't have you listed as a
> supported customer. Maybe we should arrange a discussion about how we
> could assist you more effectively.
>
> Best regards,
> Kevin
> --
> P. Kevin Canady
> Director, Business Development
> Lustre Group (Formerly CFS)
> Sun Microsystems, Inc.
> O: 415.928.3633
> C: 415.505.7701
>
>
> On 10/11/07 11:53 PM, "Niklas Edmundsson"
> <Niklas.Edmundsson at hpc2n.umu.se> wrote:
>
>>
>> OK, I know that there is supposedly some QA before lustre releases and
>> that it might be the reason for fixes taking such a long time to
>> propagate, but still: It takes too long for fixes to end up in a
>> released version...
>>
>> During our rather limited testing on Ubuntu Dapper (using the Debian
>> 2.6.18 kernel on servers and pkg-lustre packaging) we've run into
>> a couple of bugs, most of them with the typical "fix in bugzilla".
>>
>> The pkg-lustre packaging has six fixes from bugzilla applied. They
>> seem to have munged the bug numbers, but it seems that only three of
>> them are in the 1.6.3 changelog.
>>
>> We have locally applied fixes from bug 13438 (lustre is totally
>> useless without it due to servers OOPS:ing) and 13614. Neither of them
>> seems to be in the 1.6.3 changelog.
>>
>> So, I'd suggest that CFS gets their act together and starts releasing
>> versions more often. If they'd done this during 1.6 development we
>> wouldn't be installing production releases that you can crash after a
>> day of testing now.
>>
>> If QA is the argument for not doing releases more often, consider the
>> fact that known broken releases that you have to patch yourself with
>> patches hidden in bugzilla aren't much better.
>>
>> In reality, I think that doing non-QA'd snapshot releases might be the
>> way to go. That is, releases with the useful more-or-less trivial
>> fixes that avoid crashes etc. and that will be included in the next
>> QA'd release. They would not be suitable for production, but at least
>> you can rather easily download the latest snapshot and try it on your
>> test cluster and see if it fixes the problem(s) you've encountered.
>> And if it does, we can bug CFS until they get their act together and
>> get a release out with the fix included.
>>
>> In the end, you have to realise that when you have a production system
>> you don't want to wait for weeks and months for a new release that
>> might fix a crash-inducing bug you're hitting. I say might here,
>> because obviously having a fix hidden in bugzilla is no guarantee that
>> it's included in a released version.
>>
>> In our case we're not at production yet because of these problems with
>> getting fixes out quickly enough. So far we've always been able to
>> crash lustre 1.6 within days, and that's after waiting for 1.6 for
>> well over a year.
>>
>> So, I'd like to challenge CFS to get a version of lustre 1.6 (or 1.8,
>> whatever) out that proves stable on our small lustre test setup.
>> Without patches. In the year of 2007.
>>
>> Since the "internal QA only" approach obviously isn't working, I'd
>> suggest that you embrace "release early, release often" to get there.
>> That means one release per week as long as you have fixes pending to
>> get a decent churn on things.
>>
>>
>> /Nikke
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss