[Lustre-discuss] OSS and MDS resilience to power failures

Christopher J. Morrone morrone2 at llnl.gov
Fri Jan 20 15:29:36 PST 2012


We have never had battery backup on our OSS nodes and we have been 
successful in that mode.

Years ago, powering off an OSS or MDS uncleanly was very dangerous.  A 
lot of work went into fixing ext/ldiskfs, and we have been reasonably 
successful at surviving power outages in production for a few years now.

In all of our development and testing work, unclean MDS/OSS power-offs 
are our standard practice (partly because cleanly shutting down an 
active server has historically been next to impossible...).  So we very 
frequently validate that this is still reasonably safe.

However, we have seen so many ext/ldiskfs bugs over the years that we 
have decided to make "fsck.ldiskfs -p" (a quicker "preen" fsck) standard 
practice at boot time before starting an MDT or OST.  That at least 
provides us with a partial sanity check.  Honestly, we would probably 
prefer to do a full fsck every time, but the time to do that is not 
acceptable as standard practice.
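
As a rough illustration of that boot-time step, here is a minimal Python 
sketch of a wrapper that runs the preen check and only allows the target 
to be started if the check came back clean or fully repaired.  The 
function names and the device path are hypothetical, and the exit-status 
handling assumes fsck.ldiskfs follows the usual e2fsck convention 
(0 = clean, 1 = errors corrected, 4 and above = problems remain); this 
is not our actual init script.

```python
import subprocess

# e2fsck-style exit status values; assumed (not verified here) to apply
# to fsck.ldiskfs as well.
FSCK_CLEAN = 0
FSCK_ERRORS_CORRECTED = 1


def build_preen_command(device):
    """Return the argv for a preen-mode check of one MDT/OST block device."""
    return ["fsck.ldiskfs", "-p", device]


def safe_to_mount(exit_status):
    """True only if fsck reported a clean or fully repaired filesystem.

    Anything else (reboot requested, uncorrected errors, operational
    error) is treated as "do not start the target, look at it by hand".
    """
    return exit_status in (FSCK_CLEAN, FSCK_ERRORS_CORRECTED)


def preen_target(device, run=subprocess.run):
    """Run the preen check and report whether the target may be started.

    `run` is injectable so the logic can be exercised without real hardware.
    """
    result = run(build_preen_command(device))
    return safe_to_mount(result.returncode)
```

A boot script could call preen_target("/dev/sdX") for each target (the 
device name is a placeholder) and skip the mount, leaving the target for 
a manual full fsck, whenever it returns False.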

I would warn that if you are using large LUNs, there may be a regression 
that we have just opened LU-1015 about.

   http://jira.whamcloud.com/browse/LU-1015

But evaluating that is still in the early stages.

Chris

On 01/13/2012 03:13 AM, Wojciech Turek wrote:
> Make sure that you use RAID controllers with cache protected by battery backup and, if you use redundant controllers, that the cache mirroring feature is enabled. ldiskfs (ext4) should recover after power failures with no problems as long as the back-end storage recovers fine too.
>
> Best regards,
>
> Wojciech
>
> On 13 January 2012 11:01, Alexander Oltu <Alexander.Oltu at uni.no> wrote:
> Hi,
>
> We are going to have a Lustre setup with around 1000 clients. Due to
> power cable constraints we are not able to provide power to the OSS and
> MDS servers from a UPS.
>
> Therefore the question is how resilient is Lustre to OSS and/or MDS
> power failures?
>
> We have quite a few thunderstorms here during summer with short power
> interruptions. I understand that ext4 is a journaling filesystem and
> should be more or less resilient to power interruptions. But the real
> practice can be different. I believe that some of you have experience
> of running OSS and MDS without UPS and can shed some light on this
> topic.
>
> Thank you,
> Alex.
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>



