[Lustre-discuss] Lustre as a reliable f/s for an archive data centre

Kevin Fox Kevin.Fox at pnl.gov
Wed Apr 8 15:09:07 PDT 2009


On Wed, 2009-04-08 at 13:54 -0700, John Ouellette wrote:
> Hi Kevin -- your usage sounds similar to ours, and the challenges
> you've
> faced are likely similar to what we're looking at.  I'd be interested
> in
> learning more about your architecture, and any recommendations that
> you
> have (ie. what would you do differently).

Yup. This sounds very similar.
> 
> One complication we have with back-ups is that (currently) our tape
> back-ups are done to a local Grid computing site, and the hardware is
> owned and maintained by them.  We need to use TSM to do our back-ups:
> I'm not sure that the 'infinite incremental' scheme of TSM would work
> well with Lustre.

Actually, we are using TSM with incremental backups on top of the FUSE
file system. We basically provide a subdirectory in the root of the file
system per OST, and then spawn off a backup run using
--virtual-node-name for each OST on that subdirectory.

We have 4 Dell 1950s, each running the backup file system, and we run 16
TSM instances at a time on them until all OSTs are backed up. It
usually takes ~10-12 hours to complete a backup pass.
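Roughly, the driver looks something like this (a from-memory sketch rather
than the real script; the mount point and node naming are made up, and
double check the dsmc option spelling against your TSM client docs):

  #!/bin/bash
  # Back up each OST's subdirectory of the fuse backup file system as its
  # own TSM node, keeping up to 16 dsmc instances in flight on this box.
  BACKUP_ROOT=/lbfs                        # hypothetical fuse mount point
  for ostdir in "$BACKUP_ROOT"/OST*; do
      ost=$(basename "$ostdir")
      dsmc incremental -virtualnodename="nwfs-$ost" "$ostdir/" &
      while [ "$(jobs -r | wc -l)" -ge 16 ]; do
          sleep 30                         # crude throttle
      done
  done
  wait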

This architecture was built to accommodate our Lustre 1.4 system. Rumor
has it that the later 1.6 releases can have a Lustre client and an OSS on
the same box; using the lbfs, you could then do backups from each OSS
directly instead of through a set of backup nodes. I attempted this with
early 1.4 releases but it wasn't supported back then, and weird stuff
happened if you tried it. I've been meaning to try this again, since it
takes no code changes to the lbfs, but don't currently have a suitable
1.6 Lustre available.
> 
> Our current home-grown data management system uses vanilla Linux boxes
> as storage nodes, and a database (on Sybase) to manage files and file
> metadata.  To maintain file integrity, every file that is put into the
> system (using our API) is checksummed, and the on-disk files are
> compared to the metadata db by a continually cycling background task.
> Also, we pair up storage nodes so that each file automatically gets put
> onto two nodes.  With the files on two identical nodes, we can take one
> down for maintenance while still having full access to the data, and can
> recover one node from its mirror.  This mirroring is in addition to the
> off-site tape back-up.

Lustre doesn't currently support RAID 1 (mirrored) striping. That would
solve the problem of taking one OST down. I don't know where that is on
the roadmap.

RAID 1 mirroring like you're doing has the benefit of being able to take
an OST down. The drawback is space cost. We're using RAID 6 and haven't
had much data unavailability; we're using about 1/6 of our space for
redundancy (two parity disks per twelve-disk array), while you're using
1/2. I'm not sure, but I think it would probably be cheaper to make the
OSTs fibre channel attached and use RAID 6 with OSS failover pairs than
to RAID 1 everything.

Checksumming is on the roadmap I think.

If you stripe 1 wide and gather the metadata like I do with the e2scan
patch, you could checksum the data directly on the OSSes. I've been
meaning to write a system like this at some point (and actually had to
do it manually once, in a disaster) but haven't had the time yet.
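The manual version was basically this (an untested sketch; the catalog
path and OST mount point are made up, and it assumes the OST's backing
file system is mounted read-only on the OSS):

  #!/bin/bash
  # Verify OST objects in place on the OSS. This only works because the
  # stripe count is 1, so every object is a complete file.
  # Catalog lines look like: "<sha1>  <relative object path>"
  OST_MNT=/mnt/ost0012                    # hypothetical read-only OST mount
  CATALOG=/var/lib/ostsums/ost0012.sha1   # hypothetical checksum catalog
  cd "$OST_MNT" || exit 1
  sha1sum -c "$CATALOG" 2>/dev/null | grep -v ': OK$' > /tmp/ost0012.bad
  if [ -s /tmp/ost0012.bad ]; then
      mail -s "checksum mismatches on ost0012" root < /tmp/ost0012.bad
  fi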

As far as RAID 1 pairing of nodes goes, you might be able to hack
something together using DRBD and OSS failover. No clue if it's been
tried before.

> This system is great in its simplicity (we can recover the entire file
> management system from the contents of the storage nodes' file-systems,
> although we've never had to), but it either needs to be largely
> refactored or replaced (hence the interest in things like Lustre).
> Lustre does not give file-management capabilities, so we were looking
> into using iRODS on top of Lustre.

I've been meaning to look more at iRODS, but haven't had the time. :)
If you go down that route, please let me know how you like it.

> I'm not sure what you mean by "we have set the default stripe to 1
> wide".  Does this affect how the blocks are written to disk?

Indirectly. Blocks are striped across OSTs in a RAID 0 manner. If you set
the stripe count to 1, the whole file is written to only one OST. If you
mount the underlying OST's file system and look at one of the files, you
see exactly what you see when catting the file from a Lustre client.

This makes backups and reliability better, but at the cost of
performance.
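For the curious, on 1.6 the relevant bits look roughly like this (syntax
varies between Lustre versions, so check lfs help and the manual for
yours; the device and mount point are made up):

  # Make new files under /nwfs land on exactly one OST each:
  lfs setstripe -c 1 /nwfs

  # See which OST holds an existing file:
  lfs getstripe /nwfs/some/file

  # On the OSS that owns that OST, the backing file system can be
  # mounted read-only and the object read directly:
  mount -t ldiskfs -o ro /dev/sdb1 /mnt/ost0012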

> One problem I foresee with backing up the OSTs is that each OST (might)
> only hold a fraction of a file, and without the MDS data you don't know
> what part of what file.

Yup. This is why we stripe 1 wide.

> In your architecture, can you take OSSes offline without losing data
> access?  My suspicion is that we'd only get this if the OSTs of that
> host were also connected to another OSS.

Correct. We can't take an OSS down without its data being unavailable.
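If the OSTs sat on shared storage, a failover pair would cover that. With
1.6-style configuration it would look roughly like the following (we
haven't run this ourselves; the NIDs and device names are made up):

  # Format the OST with its primary and failover OSS NIDs:
  mkfs.lustre --ost --fsname=nwfs \
      --mgsnode=mds1@tcp \
      --failnode=oss2@tcp \
      /dev/mapper/ost0012

  # Normally mounted on oss1; if oss1 dies, oss2 mounts the same device
  # and clients retry against the failover NID:
  mount -t lustre /dev/mapper/ost0012 /srv/ost0012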

Kevin

> Thx,
> J.
> 
> Kevin Fox wrote:
> >
> > We currently use Lustre for an archive data cluster.
> >
> > df -h
> > Filesystem            Size  Used Avail Use% Mounted on
> > n15:/nwfsv2-mds1/client
> >                       1.2P  306T  789T  28% /nwfs
> >
> > To deal with some of the archive issues (non-HPC), we run the cluster a
> > little differently than the norm. We have set the default stripe to 1
> > wide. Our OSSes are white box, 24 1TB drives hanging off of a couple of
> > 3ware controllers set up in RAID 6s. This is much cheaper than the
> > redundant fibre channel setups that you usually see in HPC Lustres.
> >
> > Because of the hardware listed above, Lustre backups/restores are a real
> > pain. A normal backup would take forever dealing with a Lustre of this
> > size. We have implemented our own backup system to deal with it. It
> > involves a modified e2scan utility and a FUSE filesystem. If I can find
> > some time, I plan on trying to release this under the GPL some day.
> >
> > One of our main requirements was to be able to restore an OSS/OST as
> > quickly as possible if there was a failure. We separate and colocate
> > each OST's data on tape to allow for quick restores. We have had a few
> > OSS failures in the years of running the system and have been able to
> > quickly restore just that OSS's data each time. Without this type of
> > system, the tape drives would have to read just about all of the data
> > stored on tape to get at the relevant bits. Since we have 208 OSTs,
> > restoring an OSS with this method gains us something like 52 times the
> > performance.
> >
> > To deal with an OSS loss, we configure that OSS out. The file system
> > continues as normal, with any access to files residing on an affected OST
> > throwing IO errors. In the meantime we can restore the OST's data. The
> > stripe count was set to 1 so that a file does not cross OSSes. That way,
> > if an OSS/OST is lost it doesn't affect as many files. Having the stripe
> > count set to 1 is also assumed by the backup subsystem; it allows for the
> > colocation I described above. My plan is to enhance the filesystem to
> > handle stripe > 1 some day, but I have not been able to free up the time
> > to do so yet.
> >
> > As for the MDS, I have code to try to back that up, but haven't used it
> > in production or tested a restore of the data. What we usually do is
> > take a dump of the MDS during our down times, and we have used that and
> > the backups to restore the data in case of an MDS failure. We put a lot
> > more redundancy (RAID 1 over two RAID 6s on separate controllers) into
> > our MDS than the rest of the system, so we haven't had as many problems
> > with it as with the OSSes.
> >
> > As far as data corruption goes, Lustre currently doesn't keep checksums,
> > so it's really left up to the IO subsystem to handle that. Pick a good
> > one. We have had problems with our 3ware controllers corrupting data at
> > times, but so far we have been able to restore any affected data from
> > backups. We have bumped into a few kernel/Lustre corruption bugs related
> > to 32-bit boxes and 2+TB block devices a few times, but not in a while.
> > We were able to restore data from backups to handle this too, but that
> > story is a whole book unto itself.
> >
> > So, is Lustre as an archive file system doable? Yes. Is it recommended?
> > Depends on how much effort you want to put into it.
> >
> > Kevin
> >
> >
> > On Tue, 2009-04-07 at 21:54 -0700, John Ouellette wrote:
> > > Hi -- I work for an astronomical data archive centre which stores about
> > > 300TB of data.  We are considering options for replacing our current
> > > data management system, and one proposed alternative uses Lustre as one
> > > component.  From what I have read about Lustre, it is largely targeted
> > > to HPC installations and not data centres, so I'm a bit worried about
> > > its applicability here.
> > >
> > > Although we do have a requirement for high throughput (although not to
> > > really support 1000s of clients: more likely a few dozen nodes), our
> > > primary concerns are reliability and data integrity.  From reading the
> > > docs, it looks as though Lustre can be made to be very reliable, but
> > > how about data integrity?  Are there problems with data corruption?
> > > If the MDS data is lost, is it possible to rebuild this, given only
> > > the file-system data?  How easy is it to back up Lustre?  Do you back
> > > up the MDS data and the OST data, or do you back up through a Lustre
> > > client?
> > >
> > > Thanks in advance for any answers or pointers,
> > > John Ouellette
> 
> 



