[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?

Peter Grandi pg_lus at lus.for.sabi.co.UK
Sat Jul 3 13:03:58 PDT 2010


>> I wrote a blog post that pertains to Lustre scalability and
>> data integrity. You can find it here:
>> http://braamstorage.blogspot.com

Ah amusing, but a bit late to the party. The DBMS community have
been dealing with these issues for a very long time; consider the
canonical definitions of "database" and "very large database":

* "database": a mass of data whose working set cannot be held
  in memory; a mass of data where every access involves at least
  one physical IO.

* "very large database": a mass of data that cannot be
  realistically taken offline for maintenance; a mass of data
  that takes "too long" to backup or check.

But I am very pleased that the "fsck wall" is getting wider
exposure; I have been pointing it out in my little corner for
years.

> [ ... ] like Veritas already solved this by

> 1. Integrating the Volume management and File system. The file
>    system can be spread across many volumes.

That's both crazy and nearly pointless. It is at best a dubious
convenience.

> 2. Dividing the file system into a group of file sets(like
>    data, metadata, checkpoints) , and allowing the policies to
>    keep different filesets on different volumes.

That's also crazy and nearly pointless, as described.

> 3. Creating the checkpoints (they are sort of like volume
>    snapshots, but they are created inside the file system
>    itself). [ ... ]

These are an ancient feature of many fs designs, and for various
reasons versioned filesystems have never been that popular: in
part because of performance, in part because they are not that
useful, in part because they sit at the wrong abstraction level.

> 4. Parallel fsck - if the filesystem consists of the
>    allocation units - a sort of the sub- file systems, or
>    cylinder groups, then the fsck can be started in parallel
>    on those units.

This is either pointless or not that useful. It can be done
fairly trivially by using many filesystems and creating a single
namespace by "mounting" them together; of course then one does
not have a single free storage pool, even if the namespace is
stitched together.
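
As a rough illustration of the "many filesystems" approach, the
driver for a parallel check need be nothing more than the sketch
below (the device names and layout are invented for the example,
not taken from any real setup):

    # Hedged sketch: one fsck per independent small filesystem,
    # run in parallel. Device names are invented for illustration;
    # each filesystem would be "mounted" into the single namespace
    # (e.g. /data/vol0, /data/vol1, ...).
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    DEVICES = ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1", "/dev/sde1"]

    def check(dev):
        # Preen mode; each fsck touches only its own filesystem,
        # so the checks are embarrassingly parallel.
        return dev, subprocess.run(["fsck", "-p", dev]).returncode

    with ThreadPoolExecutor(max_workers=len(DEVICES)) as pool:
        for dev, rc in pool.map(check, DEVICES):
            print(dev, "clean" if rc == 0 else "fsck returned %d" % rc)

The parallelism comes for free only because no object ever spans
two of the devices, which is exactly the limitation discussed
next.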

But it is exceptionally difficult to have a single storage pool
*and* chunking (as soon as object contents are spread across
multiple chunks 'fsck' becomes hard, and if object contents are
not spread across multiple chunks, you don't really have a single
storage pool).

The fundamental problem with 'fsck' is that:

* Data access scales up by using RAID, as N disks, with suitable
  access patterns, give a speedup of up to N (either in bandwidth
  or IOPS), so it is feasible to create very large storage
  systems by driving parallelism up at the data level.

* Unfortunately, while data performance *can* scale with the
  number of disks, metadata access cannot, because it is driven
  by wholly different access patterns, usually more graph-like
  than stream-like. In essence 'fsck' is a garbage collector, and
  thus it is both unavoidable and exceptionally hard to
  parallelize (a sketch follows below).
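
To make the "garbage collector" point concrete, here is a toy
sketch (invented structures, not any real fsck): the metadata
pass is essentially a mark phase walking the directory graph,
where each read depends on the result of the previous one,
followed by a sweep over the allocation records:

    # Toy model: read_inode(ino) returns the inode numbers
    # referenced by inode 'ino' (one small random IO per call
    # in a real checker).
    def mark_reachable(read_inode, root_ino):
        reachable, stack = set(), [root_ino]
        while stack:
            ino = stack.pop()
            if ino in reachable:
                continue
            reachable.add(ino)
            # The next IOs depend on this one: the access pattern
            # is graph traversal, not streaming.
            stack.extend(read_inode(ino))
        return reachable

    def find_orphans(allocated, reachable):
        # Sweep phase: allocated but unreachable objects are the
        # garbage (orphaned inodes, lost clusters) fsck must fix.
        return set(allocated) - reachable

The mark phase is the part that refuses to scale with the number
of spindles: the order of the reads is data-dependent, so
RAID-style parallelism buys very little.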

Note also that the "IOPS wall" (similar to the "memory wall"),
where storage device capacity and bandwidth grow faster than IOPS,
eventually calls into question even data scalability, and in some
applications (like the Lustre MDS) that is already quite apparent.
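
A back-of-envelope calculation (all figures below are
illustrative assumptions, not measurements) shows the shape of
the problem:

    # Illustrative, assumed figures only.
    capacity   = 10e12   # bytes: a 10TB drive
    bandwidth  = 200e6   # bytes/s sequential: grows slowly
    iops       = 150     # random IOs/s: has barely moved in years
    avg_object = 100e3   # bytes: ~100KB per file, 1 metadata IO each

    seq_hours  = capacity / bandwidth / 3600            # ~13.9 hours
    meta_hours = (capacity / avg_object) / iops / 3600  # ~185 hours

    print("full sequential scan: %.1f hours" % seq_hours)
    print("one IO per object:    %.1f hours" % meta_hours)

The streaming data scan stays (barely) tolerable; the IOPS-bound
metadata-style pass is already the better part of a week, and the
ratio only worsens as capacities grow.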

> Well, the ZFS does solve many of these issues, but in a
> different way, too.

ZFS is hardly the solution to any problem, except perhaps
sysadmin convenience.

The UNIX lesson is that the main job of a file system is to
provide a simple, trivial "dataspace" abstraction layer, and that
trying to have it address storage-level concerns (for example
checksumming) or application-level concerns (for example indices)
is poor design. It does seem quite convenient though (to the sort
of people who want to do triple-parity RAID and 46+2 RAID6
arrays, or build large filesystems as LVM2 concats [VGs] spanning
several disks).

> So, my point is that this probably has to be solved on the
> backend side of the Lustre, rather than inside the Lustre.

Lustre embodies a very specific set of tradeoffs aimed at a
specific "sweet spot", as described by PeterB in his blog post.
Violating design integrity is usually very painful. A wholly new
design is probably needed.

As to scalability, there is an existence proof for extremely
scalable file system designs, and that is GoogleFS, which embodies
pretty extreme tradeoffs (far more extreme than Lustre's) in
pursuit of scalability.

If GoogleFS is the state of the art, then I suspect that very
scalable, fine-grained, and highly efficient are incompatible
goals (and the combination is very, very rarely a requirement
anyway).

BTW I am occasionally reminded of two ancient MIT TRs, one by
Peter Bishop about distributed persistent garbage collection, and
one by Svobodova on object histories in the SWALLOW repository.


