[Lustre-discuss] Lustre as a reliable f/s for an archive data centre

Peter Grandi pg_lus at lus.for.sabi.co.UK
Sat Apr 11 07:10:40 PDT 2009


> Hi -- I work for an astronomical data archive centre which
> stores about 300TB of data.[ ... ] Although we do have a
> requirement for high throughput (although not to really
> support 1000s of clients: more likely a few dozen nodes),

That "few dozen nodes" determines a bit what kind of performance
you need to achieve, and how, because Lustre has indeed huge
performance but *in the aggregate*: that is it can achieve
100GB/s of *aggregate* throughput on a 1Gb/s network by having
1,000 clients each transferring at 100MB/s to/from 1,000
servers.

> our primary concerns are reliability and data integrity.

Reliability and data integrity on a 300TB archive are an unsolved
research problem, as long as you really mean them. There are a
number of snake-oil salesmen who will promise them to you, though.

> From reading the docs, it looks as though Lustre can be made
> to be very reliable,

Reading the docs? They are *very clear* that there is no way to
make Lustre *as such* very reliable (until replication is
implemented by Lustre itself).

Lustre currently has roughly 1/Nth of the reliability of the
underlying storage system (where N is the stripe count), with the
(un)reliability of its own software layer reducing that further.
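
To see why, a back-of-envelope sketch in Python (the 0.99 per-OST
survival probability is an assumed round number, not a Lustre
figure):

  # A file striped over N OSTs is lost if *any one* of them is
  # lost, so its survival probability is p**N for per-OST
  # survival probability p.
  def file_survival(p_ost, n_stripes):
      return p_ost ** n_stripes

  p = 0.99  # assumed probability one OST survives the period
  for n in (1, 4, 16, 64):
      print(f"stripes={n:3d}  survival={file_survival(p, n):.3f}")
  # stripes=  1  survival=0.990
  # stripes=  4  survival=0.961
  # stripes= 16  survival=0.851
  # stripes= 64  survival=0.526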

> but how about data integrity?  Are there problems with data
> corruption?

Worrying about filesystem-level handling of data corruption in an
"astronomical data archive" may be the wrong approach.

For data curation, *filesystem* (and even more so *storage*)
integrity is a convenience and performance issue, not a data
integrity issue. Data integrity must be end-to-end. That is,
you cannot use a file system of any sort *as* a data archive:
a data archive is a rather different thing from a file system,
even if it resides in one.

> If the MDS data is lost, is it possible to rebuild this, given
> only the file-system data?

No. Symbolic names are only on the MDS.

> How easy is it to back-up Lustre?

It is extremely easy. Backing up 300TB of data is hard.

> Do you back-up the MDS data and the OST data, or do you
> back-up through a Lustre client?

Either way, but again backing up 300TB of data is hard.
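
To put a number on "hard", a back-of-envelope sketch (the
sustained rates are assumed round figures):

  # One full pass over a 300TB archive at various sustained rates.
  TB = 10**12
  archive_bytes = 300 * TB
  for rate_mb_s in (100, 1000, 10000):  # assumed rates, in MB/s
      days = archive_bytes / (rate_mb_s * 10**6) / 86400
      print(f"{rate_mb_s:6d} MB/s -> {days:5.1f} days")
  #    100 MB/s ->  34.7 days
  #   1000 MB/s ->   3.5 days
  #  10000 MB/s ->   0.3 days

And that is a single pass, before any verification or restore
testing.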

> Thanks in advance for any answers or pointers,

The above are the answers to the questions asked, but those
questions seem quite misguided, because what Lustre is and does
is evidently not quite clear, and some of the previous followups
to your questions are a bit misguided or confused too.

I'll try first to explain clearly what Lustre is, and then to
reply to some similar but perhaps more appropriate questions.

Lustre is a data-parallel, directory-based, chunked,
single-namespace, single-storage-pool network metafilesystem:

* Metafilesystem: it uses other file systems as storage devices,
  instead of using block devices.

* Network: Lustre is in essence a set of network protocols. There
  is also some data representation, but one can only use Lustre
  over a network. The network part of the Lustre implementation
  (LNET) is probably more important than the rest.

* Directory-based: which metafilesystem file name corresponds to
  which base filesystem file(s) is kept in a separate directory.

* Chunked: the metafilesystem can be an aggregation of many
  independent but related base filesystems, the "chunks".

* Single namespace: the directory service maintains a single
  metafilesystem namespace over all the base filesystems.

* Single storage pool: the directory service maintains a single
  pool of available space over all the base filesystems. If a file
  is striped, it can be larger than any single base filesystem.

* Data-parallel: once a client has obtained from the directory
  server the list of base filesystem files for a metafilesystem
  name, and the base filesystem files have been "mounted", they
  can be accessed in parallel. The directory server itself cannot
  (yet) be accessed in parallel.

In a simplified way, consider how two files 'a' and 'b' in
directory 'd' are implemented ('a' is not striped, 'b' is striped
across two data servers):

  * lustre://dirserv/d/a with inum I1
    - lustrei://dataserv1/I1

  * lustre://dirserv/d/b with inum I2
    - lustrei://dataserv1/I2-1
    - lustrei://dataserv2/I2-2
    - lustrei://dataserv1/I2-3
    - lustrei://dataserv2/I2-4
    ....

That's basically all... :-)
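
A toy model of that directory service, as a hedged Python sketch
(the names 'dataserv1' etc. are just the placeholders from the
example above, not real Lustre identifiers):

  # Toy model of the directory ("metadata") service: it maps a
  # pathname to the base-filesystem objects holding its data.
  layout = {
      "/d/a": [("dataserv1", "I1")],                     # not striped
      "/d/b": [("dataserv1", "I2-1"), ("dataserv2", "I2-2"),
               ("dataserv1", "I2-3"), ("dataserv2", "I2-4")],
  }

  def resolve(path):
      """What a client asks the directory server: where is the data?"""
      return layout[path]

The client then talks to the listed data servers directly, which
is where the aggregate parallelism comes from.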

The Lustre client software is in effect an extended 'autofs'/'amd':
it treats the Lustre directory server as a kind of LDAP server
containing a list of automount pairs, and creates a top-level
mount point for each Lustre namespace, noting for each of those
which directory server it corresponds to.

Following the example above, the Lustre client creates a top-level
mount point such as '/mnt/l' with reference to 'lustre://dirserv/';
as processes access paths under the mount point it auto-"mount"s
each of the underlying files, so that if a process accesses
'/mnt/l/d/a' what happens is that 'lustrei://dataserv1/I1' is
auto-"mounted" as '/mnt/l/d/a', and 'lustrei://dataserv[12]/I2-*'
as '/mnt/l/d/b/I2-*' (and in the latter case it provides the
illusion to processes that '/mnt/l/d/b' is a single file).
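
That single-file illusion for the striped 'b' amounts to stitching
reads together across the stripe objects. Continuing the toy
sketch above (the 1MB stripe size and the 'read_object' callback
are made up for illustration):

  STRIPE = 1 << 20  # assumed 1MB stripe size, illustration only

  def read_logical(objects, read_object, offset, length):
      """Serve a read on the logical file by combining reads from
      the per-stripe objects listed by the directory server."""
      data = b""
      end = offset + length
      while offset < end:
          i = offset // STRIPE   # index of the stripe object
          base = i * STRIPE      # its start in the logical file
          lo = offset - base     # offset within that object
          hi = min(STRIPE, end - base)
          data += read_object(objects[i], lo, hi - lo)
          offset = base + hi
      return data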

The good questions to ask here are:

* For a 300TB data archive, do we need a single name space and a
  single storage pool?

  Well, it really depends. Probably not, but it is convenient.

* Is there anything better or more cost effective than Lustre for a
  300TB data archive?

  Likely not, unless you don't care about a single namespace or a
  single storage pool.

* How to backup a 300TB archive?

  Well, that's a research question. I personally think that the
  only practical way is another 300TB archive. Other people think
  tape libraries can be used. Perhaps...

* What do you mean by "reliable"?

  It can be about loss of service or loss of data.

  Lustre service can be made fairly reliable by redundancy in the
  network and in the directory and data servers, within limits.

  Lustre data can be made quite reliable with redundancy in the
  storage subsystems of the directory and storage servers.

  There is unavoidable common-mode failure in the use of the same
  metafilesystem and filesystem code.

  In practice 'ext3'/'ext4' are pretty reliable and Lustre itself
  is pretty good too.

* What do you mean by "integrity"?

  It can be about detecting the loss of integrity or recovering
  from a loss of integrity.

  Detecting loss of integrity in the data must be done *at least*
  end-to-end. It can be done at lower levels too, but that is not
  sufficient.
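
  For example, end-to-end detection can be as simple as a checksum
  manifest kept by the archive application itself; a hedged sketch
  (the manifest format here is made up):

    import hashlib

    def sha256_of(path, bufsize=1 << 20):
        # A digest computed by the *application* covers every
        # layer below it: filesystem, network, and storage.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(bufsize), b""):
                h.update(block)
        return h.hexdigest()

    def audit(manifest):
        # manifest: {path: digest recorded at ingest time}
        # A mismatch means *some* layer corrupted the data.
        return [p for p, d in manifest.items() if sha256_of(p) != d]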

  Restoring integrity can be done by detecting loss of integrity in
  the representation of the data (disk blocks, links, metadata,
  ...) and using redundancy *as a matter of convenience*.

  Lustre does a bit of detection of loss of integrity in the
  metadata, and for now no integrity restoration. Overall it has
  the same integrity properties as the underlying filesystem and
  storage systems, further reduced by the integrity issues of the
  directory system. In practice it is pretty good.

My impression is that even though the Lustre design is definitely
targeted at coarse-grained data-parallel computation, not archival,
it is handy to use it as a kind of single namespace and single
storage pool anyhow, as there are very few practical/low-cost
alternatives, and it is fairly easy to set up.

But the really critical factors are the design of the data archive
(the level above) and the storage/network system (the level below).


