[Lustre-discuss] Lustre as a reliable f/s for an archive data centre

Kevin Fox Kevin.Fox at pnl.gov
Wed Apr 8 13:26:09 PDT 2009


We currently use Lustre for an archive data cluster.

df -h
Filesystem            Size  Used Avail Use% Mounted on
n15:/nwfsv2-mds1/client
                      1.2P  306T  789T  28% /nwfs

To deal with some of the archive (non-HPC) issues, we run the cluster a
little differently than the norm. We have set the default stripe count
to 1. Our OSSes are white boxes, each with 24 1 TB drives hanging off a
couple of 3ware controllers set up as RAID 6 arrays. This is much
cheaper than the redundant Fibre Channel setups that you usually see in
HPC Lustre installations.

Because of the hardware listed above, Lustre backups/restores are a real
pain. A normal backup would take forever on a Lustre file system of this
size. We have implemented our own backup system to deal with it. It
involves a modified e2scan utility and a FUSE filesystem. If I can find
some time, I plan on trying to release this under the GPL some day.

One of our main requirements was to be able to restore an OSS/OST as
quickly as possible if there was a failure. We separate and colocate
each OST's data on tape to allow for quick restores. We have had a few
OSS failures in the years of running the system and have been able to
quickly restore just that OSS's data each time. Without this type of
system, the tape drives would have to read just about all of the data
stored on tape to get at the relevant bits. Since we have 208 OSTs,
restoring an OSS with this method gains us something like 52 times the
performance.
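
The 52x figure is consistent with simple arithmetic if each OSS hosts
four OSTs and data is spread roughly evenly across OSTs. Both of those
are assumptions (the 4 is inferred from 208 / 52; the post does not
state it), and the function name is made up for illustration. A rough
sketch:

```python
from fractions import Fraction


def restore_speedup(total_osts: int, osts_per_oss: int) -> Fraction:
    """Ratio of tape data read without colocation to data read with it.

    Assumes data is spread evenly across OSTs, so that without per-OST
    colocation a restore reads essentially everything on tape, while
    with colocation it reads only the failed OSS's share.
    """
    if total_osts <= 0 or not 0 < osts_per_oss <= total_osts:
        raise ValueError("need 0 < osts_per_oss <= total_osts")
    return Fraction(total_osts, osts_per_oss)


# 208 OSTs comes from the post; 4 OSTs per OSS is an assumption.
print(restore_speedup(208, 4))  # 52
```

Restoring a single OST rather than a whole OSS would, under the same
assumptions, read only 1/208 of the data on tape.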

To deal with an OSS loss, we configure that OSS out. The file system
continues as normal, with any access to files residing on an affected
OST throwing I/O errors. In the meantime we can restore the OST's data.
The stripe count was set to 1 so that a file does not cross OSSes. That
way, if an OSS/OST is lost, it doesn't affect as many files. A stripe
count of 1 is also assumed by the backup subsystem; it is what allows
the colocation I described above. My plan is to enhance the filesystem
to handle stripe counts > 1 some day, but I have not been able to free
up the time to do so yet.
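
To put a rough number on "doesn't affect as many files", here is a
small sketch. It assumes each file's stripes land on distinct OSTs
chosen uniformly at random, which is a simplification of how Lustre
actually allocates objects, and the function name is hypothetical:

```python
from fractions import Fraction
from math import comb


def fraction_of_files_hit(total_osts: int, stripe_count: int,
                          failed_osts: int = 1) -> Fraction:
    """Chance that a file has at least one stripe on a failed OST.

    Simplifying assumption: stripes are placed on distinct OSTs picked
    uniformly at random. Real allocation policies differ.
    """
    if not 1 <= stripe_count <= total_osts:
        raise ValueError("need 1 <= stripe_count <= total_osts")
    if not 0 <= failed_osts <= total_osts:
        raise ValueError("need 0 <= failed_osts <= total_osts")
    # Probability that every stripe avoids all of the failed OSTs.
    untouched = Fraction(comb(total_osts - failed_osts, stripe_count),
                         comb(total_osts, stripe_count))
    return 1 - untouched


print(fraction_of_files_hit(208, 1))  # 1/208
print(fraction_of_files_hit(208, 4))  # 1/52
```

Under this model, losing one OST out of 208 makes about 0.5% of files
unreadable at stripe count 1, and four times that at stripe count 4.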

As for the MDS, I have code to try to back that up, but haven't used it
in production or tested a restore of the data. What we usually do is
take a dump of the MDS during our downtimes, and we have used that and
the backups to restore the data in case of an MDS failure. We put a lot
more redundancy (RAID 1 over two RAID 6 arrays on separate controllers)
into our MDS than into the rest of the system, so we haven't had as many
problems with it as with the OSSes.

As far as data corruption goes, Lustre currently doesn't keep checksums,
so it's really left up to the I/O subsystem to handle that. Pick a good
one. We have had problems with our 3ware controllers corrupting data at
times but, so far, have been able to restore any affected data from
backups. We have also bumped into kernel/Lustre data-corruption bugs
related to 32-bit boxes and 2+ TB block devices a few times, but not in
a while. We were able to restore data from backups to handle this too,
but that story is a whole book unto itself.
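
One way an archive can catch this kind of corruption independently of
the storage stack is to keep its own checksum manifest and re-verify it
periodically. This is not something our setup is described as doing
above; it is only a minimal sketch, with hypothetical function names and
SHA-256 picked arbitrarily:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archive files need not fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {str(p.relative_to(root)): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def find_corrupt(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return manifest entries that are missing or no longer match."""
    bad = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file() or sha256_of(p) != expected:
            bad.append(rel)
    return sorted(bad)
```

The manifest would be built at ingest time and stored somewhere other
than the file system it describes; anything find_corrupt() reports is a
candidate for restore from backup.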

So, is Lustre as an archive file system doable? Yes. Is it recommended?
Depends on how much effort you want to put into it.

Kevin


On Tue, 2009-04-07 at 21:54 -0700, John Ouellette wrote:
> Hi -- I work for an astronomical data archive centre which stores about
> 300TB of data.  We are considering options for replacing our current
> data management system, and one proposed alternative uses Lustre as one
> component.  From what I have read about Lustre, it is largely targeted
> to HPC installations and not data centres, so I'm a bit worried about
> its applicability here.
> 
> Although we do have a requirement for high throughput (although not to
> really support 1000s of clients: more likely a few dozen nodes), our
> primary concerns are reliability and data integrity.  From reading the
> docs, it looks as though Lustre can be made to be very reliable, but how
> about data integrity?   Are there problems with data corruption?  If the
> MDS data is lost, is it possible to rebuild this, given only the
> file-system data?  How easy is it to back-up Lustre?  Do you back-up the
> MDS data and the OST data, or do you back-up through a Lustre client?

> Thanks in advance for any answers or pointers,
> John Ouellette
> 
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 
> 



