[Lustre-discuss] failure rates

Sun Apr 26 15:51:16 PDT 2009

Hi John:

From our experience (1 year), we simply love Lustre. A couple of lessons
I learned regarding stability:

1) Always have the latest e2fsprogs
2) I recommend not running the very latest available version
3) Run on good hardware. Test your hardware with iozone and bonnie for
72 hours straight so you know it's legit hardware.
4) Have a reliable network (naturally)
5) Compile your own kernel for OST/MDT and patch. Apply the patches
from bugzilla "release tickets"
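
For point 3, a burn-in loop might look roughly like the sketch below. It is
only an illustration: burn_in is a made-up helper name, the benchmark command
is whatever you pass in, and stopping at the first non-zero exit code is an
assumption about how you would want to treat a failed run.

```python
import subprocess
import sys
import time


def burn_in(cmd, duration_s):
    """Run cmd back to back until duration_s has elapsed or a run fails.

    Returns (completed_runs, ok). ok is False if a run exited non-zero.
    Note that a run already in progress is allowed to finish, so the total
    wall time can exceed duration_s by up to one run.
    """
    deadline = time.monotonic() + duration_s
    runs = 0
    while time.monotonic() < deadline:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"run {runs + 1} failed with exit code {result.returncode}",
                  file=sys.stderr)
            return runs, False
        runs += 1
    return runs, True


# Example (assumption: iozone is installed; -a is its automatic mode):
#   runs, ok = burn_in(["iozone", "-a"], 72 * 3600)
```

Run it from a directory on the storage you want to exercise, and keep an eye
on the kernel logs while it runs; a clean exit code alone does not prove the
hardware is healthy.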

Also, make sure you take the time to appreciate what Lustre is really doing :-)

Hope this helps!

On Fri, Apr 24, 2009 at 1:11 PM, John White <jwhite at lbl.gov> wrote:
>
> On Apr 24, 2009, at 9:59 AM, Brian J. Murrell wrote:
>
>> On Fri, 2009-04-24 at 09:48 -0700, John White wrote:
>>>
>>>      I wonder if anyone has any failure metrics on their specific
>>> installations.  We're quite new to the lustre space and wanted to get
>>> a feel for what we might be in for downtime-wise.  In particular, does
>>> anyone have numbers for the mean time between failure and mean time to
>>> repair?
>>
>> I think this is a very subjective question.  To a great deal it's going
>> to depend on how much you spend on your infrastructure.  If you buy
>> cheap(ly built) hardware, it will most likely fail more often than
>> better built hardware.
>
> Oh, naturally.  I suppose I was short on details.  The question is
> more geared at the software side of things.  Of course you can build
> in hardware redundancy on the back-end, set up failover on the server-
> end, etc.  Beyond those, I'm curious how often software unavoidably
> "flips-out" under lustre and how long these commonly take to recover
> from.  Say the lock manager tweaks, etc.
>
> I know this is a rather difficult metric to quantify, especially after
> experiences with.. other.. parallel filesystems.  Perhaps people have
> numbers for their specific configuration?
>
>>
>>
>> Additionally, given Lustre's HA abilities, uptime is something you can
>> throw money at (or not).  If you have a high amount of redundancy in
>> your architecture, including failover pairs and so on, then downtime is
>> reduced as your redundant hardware kicks in to provide uptime where it
>> would have not been had you not spent on and built that redundant
>> architecture.
>>
>> There are probably lots of places where the same kind of arguments can
>> be made, making the question all that more subjective.
>>
>> b.
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss