[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
John Ouellette
john.ouellette at nrc-cnrc.gc.ca
Wed Apr 8 13:54:49 PDT 2009
Hi Kevin -- your usage sounds similar to ours, and the challenges you've
faced are likely similar to what we're looking at. I'd be interested in
learning more about your architecture, and any recommendations that you
have (i.e., what would you do differently?).
One complication we have with back-ups is that (currently) our tape
back-ups are done to a local Grid computing site, and the hardware is
owned and maintained by them. We need to use TSM to do our back-ups:
I'm not sure that the 'infinite incremental' scheme of TSM would work
well with Lustre.
Our current home-grown data management system uses vanilla Linux boxes
as storage nodes, and a database (on Sybase) to manage files and file
metadata. To maintain file integrity, every file that is put into the
system (using our API) is checksummed, and the on-disk files are
compared to the metadata DB by a continually cycling background task.
Also, we pair up storage nodes so that each file automatically gets put
onto two nodes. With the files on two identical nodes, we can take one
down for maintenance while still having full access to the data, and can
recover one node from its mirror. This mirroring is in addition to the
off-site tape back-up.
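To give a flavour of the integrity check, it boils down to something
like the sketch below (much simplified: the table and column names here
are made up, and the real code talks to Sybase through our API rather
than SQLite):

    import hashlib
    import sqlite3  # stand-in for our Sybase connection, for illustration only

    def md5_of(path, chunk_size=1024 * 1024):
        """Checksum a file in chunks so large files don't blow out memory."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk_size), b""):
                h.update(block)
        return h.hexdigest()

    def verify_cycle(db_path):
        """One pass of the background task: compare on-disk files to the DB."""
        db = sqlite3.connect(db_path)
        for path, stored_md5 in db.execute("SELECT path, md5 FROM files"):
            try:
                ok = md5_of(path) == stored_md5
            except (IOError, OSError):
                ok = False  # a missing or unreadable file counts as a failure
            if not ok:
                # the real system flags the file for re-copy from its mirror node
                print("integrity error: %s" % path)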
This system is great in its simplicity (we can recover the entire file
management system from the contents of the storage nodes' file-systems,
although we've never had to), but it either needs to be largely
refactored or replaced (hence the interest in things like Lustre).
Lustre does not provide file-management capabilities, so we were
looking into using iRODS on top of Lustre.
I'm not sure what you mean by "we have set the default stripe to 1
wide". Does this affect how the blocks are written to disk? One
problem I foresee with backing up the OSTs is that each OST might only
hold a fraction of a file, and without the MDS data you don't know
which part of which file it is.
In your architecture, can you take OSSes offline without losing data
access? My suspicion is that we'd only get this if the OSTs of that
host were also connected to another OSS.
Thx,
J.
Kevin Fox wrote:
>
> We currently use Lustre for an archive data cluster.
>
> df -h
> Filesystem Size Used Avail Use% Mounted on
> n15:/nwfsv2-mds1/client
> 1.2P 306T 789T 28% /nwfs
>
> To deal with some of the archive issues (non-HPC), we run the cluster
> a little differently than the norm. We have set the default stripe to
> 1 wide. Our OSSes are white boxes: 24 1 TB drives hanging off a couple
> of 3ware controllers set up as RAID-6 arrays. This is much cheaper
> than the redundant Fibre Channel setups that you usually see in HPC
> Lustre installations.
>
> Because of the hardware listed above, Lustre backups/restores are a
> real pain. A normal backup would take forever on a Lustre file system
> of this size, so we have implemented our own backup system to deal
> with it. It involves a modified e2scan utility and a FUSE file
> system. If I can find some time, I plan on trying to release this
> under the GPL some day.
>
> One of our main requirements was to be able to restore an OSS/OST as
> quickly as possible if there was a failure. We separate and colocate
> each OST's data on tape to allow for quick restores. We have had a few
> OSS failures in the years of running the system and have been able to
> quickly restore just that OSS's data each time. Without this type of
> system, the tape drives would have to read just about all of the data
> stored on tape to get at the relevant bits. Since we have 208 OSTs,
> restoring an OSS with this method gains us something like 52 times the
> performance.
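> To give a feel for the colocation step, the idea boils down to
> something like the sketch below (this is not our actual tool, just an
> illustration in Python; the directory and helper names are invented):
>
>     import os
>
>     def colocate_by_ost(stripe_map, listdir="/var/backup/filelists"):
>         """Bucket files by the OST they live on; write one backup list per OST.
>
>         stripe_map is an iterable of (path, ost_index) pairs, with
>         ost_index an integer, e.g. parsed from 'lfs getstripe' output.
>         With a stripe count of 1, each file lands on exactly one OST.
>         """
>         if not os.path.isdir(listdir):
>             os.makedirs(listdir)
>         buckets = {}
>         for path, ost in stripe_map:
>             buckets.setdefault(ost, []).append(path)
>         for ost, paths in buckets.items():
>             with open(os.path.join(listdir, "ost%04d.list" % ost), "w") as out:
>                 out.write("\n".join(paths) + "\n")
>         # Each per-OST list then goes to tape as its own collection, so
>         # restoring one OSS only has to read the tapes for its own OSTs.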
>
> To deal with an OSS loss, we configure that OSS out. The file system
> continues as normal, with any access to files residing on an affected
> OST throwing I/O errors. In the meantime we can restore the OST's
> data. The stripe count was set to 1 so that a file does not cross
> OSSes; that way, if an OSS/OST is lost it doesn't affect as many
> files. Having the stripe count set to 1 is also assumed by the backup
> subsystem, since it allows for the colocation I described above. My
> plan is to enhance it to handle stripe counts greater than 1 some day,
> but I have not been able to free up the time to do so yet.
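> (For reference, a default stripe count of one is set on the filesystem
> root with something like "lfs setstripe -c 1 /nwfs" from a client,
> though the exact syntax varies between Lustre releases.)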
>
> As for the MDS, I have code to try to back that up, but I haven't
> used it in production or tested a restore of the data. What we
> usually do is take a dump of the MDS during our downtime windows, and
> we have used that together with the backups to restore the data in
> case of an MDS failure. We put a lot more redundancy (RAID-1 over two
> RAID-6 arrays on separate controllers) into our MDS than into the rest
> of the system, so we haven't had as many problems with it as with the
> OSSes.
>
> As far as data corruption goes, Lustre currently doesn't keep
> checksums of the data on disk, so it's really left up to the I/O
> subsystem to handle that -- pick a good one. We have had problems with
> our 3ware controllers corrupting data at times, but so far we have
> been able to restore any affected data from backups. We have also
> bumped into a few kernel/Lustre corruption bugs related to 32-bit
> boxes and 2+ TB block devices, but not in a while. We were able to
> restore data from backups to handle this too, but that story is a
> whole book unto itself.
>
> So, is Lustre doable as an archive file system? Yes. Is it
> recommended? That depends on how much effort you want to put into it.
>
> Kevin
>
>
> On Tue, 2009-04-07 at 21:54 -0700, John Ouellette wrote:
> > Hi -- I work for an astronomical data archive centre which stores
> > about 300 TB of data. We are considering options for replacing our
> > current data management system, and one proposed alternative uses
> > Lustre as one component. From what I have read about Lustre, it is
> > largely targeted at HPC installations rather than data centres, so
> > I'm a bit worried about its applicability here.
> >
> > Although we do have a requirement for high throughput (though not
> > really to support thousands of clients: more likely a few dozen
> > nodes), our primary concerns are reliability and data integrity.
> > From reading the docs, it looks as though Lustre can be made to be
> > very reliable, but how about data integrity? Are there problems with
> > data corruption? If the MDS data is lost, is it possible to rebuild
> > it, given only the file-system data? How easy is it to back up
> > Lustre? Do you back up the MDS data and the OST data, or do you back
> > up through a Lustre client?
> >
> > Thanks in advance for any answers or pointers,
> > John Ouellette
> >
> >
>