[Lustre-discuss] Can lustre be trusted to keep my data safe?

Andreas Dilger adilger at sun.com
Wed May 14 16:24:46 PDT 2008


On May 14, 2008  14:21 -0400, jrs wrote:
> I work for a small/medium company that does image processing.
> We have about 700TB of data presently and might be at 2PB within
> the next couple of years.  Owing to the amount of data we don't
> make backups for most of it and trust raid 6 on our hardware raid
> boxes (nexsan Satabeast) to fail more slowly than we can replace
> disks.  Over the last couple of years we've had great luck and,
> I believe, have never lost data owing to a failure with this
> hardware (software or human error is another matter ;-).
> However, the unbacked up data is "mission critical."  Though
> it can, probably, all be reconstructed or reacquired, as a practical
> matter losing a significant quantity of this data could be
> catastrophic for our business.
> 
> So, what do you think, can lustre be trusted to keep our
> data safe at our company?  Assume in answering that we have
> failover working properly.  We can also withstand some blocking
> of the filesystem while a failover event completes, i.e., not
> having the filesystem available for some amount of time is
> not a problem, but having directory important-data/ disappear
> is a HUGE problem.

You are confusing two separate ideas - availability and backup.

Having RAID1/5/6 and failover allows for data to be accessible in
the face of hardware failures without (much) interruption.

Having a second copy of your data allows for data to be accessible
(usually after a longer delay) in a much wider range of scenarios,
like multiple hardware failure, software errors, human errors,
site catastrophe, etc.

There have been a few customer incidents recently where a user (whether
malicious or uninformed) or a malformed script was deleting filesystem
data at a very high rate, and by the time someone noticed the problem
hundreds of TB of data had been deleted in each case.  That is not
something that RAID6 or failover will save you from.

Similarly, even with RAID6 it is possible to have multiple-drive
failures after events like power outages, because usually all of
the drives in a RAID set are from the same manufacturing batch
and are more likely to fail at around the same time.  Very large
sites that have annual power maintenance outages see enough of these
kinds of failures to advise users to back up their important files
before the outage.



So, I think the important point I'm making is that no matter how
reliable Lustre (or any storage) is, not having any proper backup
is asking for trouble in the long run.

In my opinion, if you have a large shared filesystem, a user-driven
backup system is the best model.  Users are the ones best informed
about which data is the most important to keep, and if the onus of
backup is communicated to them clearly they only have themselves to
blame if something they did not back up is lost.

If you use Lustre as a single data repository for some application,
and all of the files are equally important, then my only suggestion
is to go to some configuration with a full second copy of the data
that is updated on a regular (though not continuous) basis.  If it
is updated continuously then any "rm -r" kind of error will propagate
to the backup before anyone has a chance to notice it.
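
As a rough illustration of the "regular, not continuous" idea, here is
a minimal Python sketch of one sync pass that copies new or changed
files but holds back deletions when an unusually large share of the
backup would disappear.  The function name plan_sync, the dict-of-mtimes
representation, and the 5% threshold are all assumptions made up for
this example; they are not part of Lustre or of any particular backup
tool.

```python
def plan_sync(primary, backup, max_delete_fraction=0.05):
    """Plan one periodic sync pass from primary to backup.

    primary, backup: dicts mapping file path -> modification time
    (a hypothetical stand-in for a real filesystem scan).
    Returns (to_copy, to_delete, held).
    """
    # Copy anything that is new or whose mtime differs on the backup.
    to_copy = sorted(p for p, m in primary.items() if backup.get(p) != m)

    # Files on the backup that no longer exist on the primary.
    missing = sorted(p for p in backup if p not in primary)

    # If too large a share of the backup would be deleted, treat it as a
    # possible "rm -r" accident and hold the deletions for a human.
    held = bool(backup) and len(missing) / len(backup) > max_delete_fraction
    to_delete = [] if held else missing
    return to_copy, to_delete, held


if __name__ == "__main__":
    primary = {"a": 1, "b": 2}
    backup = {"a": 1, "b": 1, "c": 1}
    print(plan_sync(primary, backup))
```

Run once a day or once a week, a check like this leaves a window in
which a mass deletion on the primary can be noticed before the second
copy follows it.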

The backup system can be MUCH less performant than the primary copy,
and you can do things like oversubscribe the OSTs to single OSS nodes,
and have less RAM on the servers.  Considering that a low-performance
700TB filesystem can probably be built for a cost of around $200k
you have to weigh the costs of this against the potential business
cost of losing some or all of your data.
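
To make that weighing concrete, the arithmetic can be sketched in a few
lines of Python.  Only the $200k and 700TB figures come from the
estimate above; the $2M loss figure in the usage line is a purely
hypothetical placeholder, and the break-even formula is a simplified
expected-value comparison, not a full risk model.

```python
def cost_per_tb(total_cost_usd, capacity_tb):
    """Cost of the backup filesystem per TB of capacity."""
    return total_cost_usd / capacity_tb


def break_even_loss_probability(backup_cost_usd, loss_cost_usd):
    """Probability of a data-loss event above which the expected loss
    exceeds the cost of building the backup (simplified model)."""
    return backup_cost_usd / loss_cost_usd


if __name__ == "__main__":
    # Figures from the estimate: roughly $200k for a 700TB filesystem.
    print(round(cost_per_tb(200_000, 700), 2))
    # Hypothetical business cost of losing the data: $2M.
    print(break_even_loss_probability(200_000, 2_000_000))
```

Under these assumptions, the backup pays for itself if you judge the
chance of a major loss over its lifetime to be higher than the
break-even value.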


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



