[Lustre-devel] RAID-1 SNS / migration discussion summary

'Andreas Dilger' adilger at sun.com
Mon Jan 12 16:26:40 PST 2009

Further discussion on migration using RAID-1 uncovered a number of issues
that need careful attention.  I don't think I've captured all of them
here, so I'd welcome some review of this document.

While full RAID-1 coherency with concurrent access to the file is much
nicer technically, it would be significantly more complex to implement
(more time, more bugs), and we could have basic functionality earlier
with the "simple space balance migration".  That is similar to the
proposal to have a "basic HSM" (which blocks IO during copy-in) and a
"complex HSM" (which allows file IO during copy-in as soon as the data
is available).

As with "basic HSM", under "simple space balance migration" clients
would be blocked from accessing a file while it is being migrated, with
the option of killing the migration if it is estimated to take too long.
The clients would also be blocked on the MDS layout lock during migration,
as with HSM.

Below is the description of migration using RAID-1.  Most of the
mechanism is in the RAID-1 functionality; very little of it relates to
migration itself.  It would also be desirable if the implementation of
RAID-1 was agnostic to the number of data copies, because if we need
to migrate a RAID-1 object this might need 3 copies of the data at one
time, and some environments may want to have multiple copies of the data
(e.g. remote caches, many replicas of binaries).


A client initiates migration by requesting that the MDT change the
file's LOV EA layout to instantiate a second mirror copy of the file.
The MDS handles such a request by revoking the file layout lock from
all clients and adding a new RAID-1 mirror to the layout.

We didn't discuss specifics on how this mirror file should be created,
but in light of the later discussion about HSM copy-in I'll suggest that
the client create the file using normal file striping parameters, and
then request that the MDT "attach" the new file as an additional
mirror copy.

The new objects of the RAID-1 mirror would be marked "stale" in some
manner (in the MDS layout, or on the objects themselves as is proposed
for HSM).  Eric and I also discussed a "stale map" for each object that
is persistent on disk, so that an object can be partially updated and
reads can be satisfied from the valid parts of the disk even in the case
of multiple OST failures.
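The "stale map" we discussed could be as simple as a persistent bitmap
with one bit per chunk of the object.  The sketch below is purely
illustrative (the class and method names are invented, not actual Lustre
structures):

```python
class StaleMap:
    """One stale bit per chunk of an object, so reads can be served
    from the valid chunks even when parts of the object are stale."""

    def __init__(self, object_size, chunk_size):
        self.chunk_size = chunk_size
        nchunks = (object_size + chunk_size - 1) // chunk_size
        self.stale = [False] * nchunks

    def _chunks(self, offset, length):
        first = offset // self.chunk_size
        last = (offset + length - 1) // self.chunk_size
        return range(first, last + 1)

    def mark_stale(self, offset, length):
        for c in self._chunks(offset, length):
            self.stale[c] = True

    def mark_valid(self, offset, length):
        for c in self._chunks(offset, length):
            self.stale[c] = False

    def can_read(self, offset, length):
        # a read is satisfiable locally only if no chunk it touches is stale
        return not any(self.stale[c] for c in self._chunks(offset, length))
```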

A simplifying assumption was to keep the stripe size the same on both
copies of the file, so that a chunk on one OST maps directly to a whole
chunk on the mirror OST, instead of possibly being split in the middle.
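Assuming plain RAID-0 striping within each copy, the mapping from a file
offset to an object can be sketched as below; because both copies share
the same stripe size and count, the same (stripe index, object offset)
pair is valid on both mirrors (the function name is hypothetical):

```python
def file_to_object(offset, stripe_size, stripe_count):
    """Map a file offset to (stripe index, offset within the object)
    under plain RAID-0 striping."""
    chunk = offset // stripe_size          # which file chunk this byte is in
    stripe = chunk % stripe_count          # round-robin over the stripes
    obj_chunk = chunk // stripe_count      # which chunk of that object
    return stripe, obj_chunk * stripe_size + offset % stripe_size
```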

All reads from the file will only be handled by the valid mirror(s).
It should be possible to do reads from either copy of the file by only
getting a single lock on that object+extent.  Writes need to have a
write lock over the same extents on all copies while writing.  This will
allow the file to continue being used while it is being resynced.
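The locking rules above can be sketched as follows (a toy model; "PR"
and "PW" are meant in the spirit of the read and write extent lock
modes, and the mirror representation is invented):

```python
def locks_for_read(extent, mirrors):
    """A read needs a single lock, on any one non-stale mirror object."""
    valid = [m for m in mirrors if not m["stale"]]
    return [(valid[0]["obj"], extent, "PR")]

def locks_for_write(extent, mirrors):
    """A write needs a write lock over the same extent on every copy."""
    return [(m["obj"], extent, "PW") for m in mirrors]
```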

The writes in the filesystem are done via COW (as in ldiskfs hardening)
and the llog records are atomically committed with the object's metadata
describing the newly-allocated extent, to ensure that the old file data
is not overwritten if the OST crashes.  This implies that
non-COW backing filesystems cannot participate in RAID-1.

Writes to each stripe will cause the local OST to generate an llog record
that describes what part of the object was modified, and the llog cookie
will be sent back to the client in the reply.  These llog records will in
essence be "stale data map" updates for the _remote_ objects.
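As a rough sketch of that flow (the record format and names are
invented, not the actual llog API): each local write appends a record
describing the modified extent of the remote object, and the cookie
identifying the record goes back to the client in the write reply:

```python
import itertools

class LlogCatalog:
    """Toy model of the per-OST llog used as a stale-data map for the
    remote mirror object."""

    def __init__(self):
        self.records = {}                    # cookie -> (remote obj, off, len)
        self._next_cookie = itertools.count(1)

    def on_write(self, remote_obj, offset, length):
        # local write committed: log what is now stale on the remote copy
        cookie = next(self._next_cookie)
        self.records[cookie] = (remote_obj, offset, length)
        return cookie                        # returned to the client in reply

    def cancel(self, cookie):
        # client cancels once both copies of the extent are up to date
        self.records.pop(cookie, None)
```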

[Q] We discussed having "tags" that are sent with each write, so that the
    secondary copy knows which llog cookies are cancelled with each write.
    We would need to have a way for tags to be (relatively) unique and
    generated by the clients, because false collisions could result in
    missed data updates on the backup objects.
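One simple way to make client-generated tags (relatively) unique would
be to pack a unique client identifier into the high bits and a
per-client sequence number into the low bits, so tags from different
clients can never collide; this is only a sketch of the idea:

```python
import itertools

def make_tag_generator(client_id):
    """Return a generator of write tags: client id in the high 32 bits,
    a per-client sequence number in the low 32 bits (sizes illustrative)."""
    seq = itertools.count()
    return lambda: (client_id << 32) | (next(seq) & 0xFFFFFFFF)
```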

[Q] For lockless IO the "tags" on the writes are critical because there
    is no coherent locking of the object between OSTs, unless the OST
    itself is doing the mirroring while locking the remote object.  How
    would we detect racing, overlapping IO to different copies of the file?

We said during our discussion that when the write is complete on the
mirror the client will cancel the llog cookie to indicate that both
sides of the write are up-to-date.

[Q] What happens on a write-caching OST?  The initial writes will
    generate an llog cookie on one side, and the cookie will be
    cancelled by the client.  Instead it seems that the client needs
    to pass the cookie on to the mirror OST and they are only cancelled
    when the data is persistent on disk (one transaction later).
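The deferred cancellation could look roughly like this toy model (not
the actual OST write path): the mirror OST holds the cookie alongside
the cached write and releases it for cancellation only when the
transaction that makes the data persistent commits:

```python
class WriteCachingMirrorOST:
    """Toy model: llog cookies ride along with cached writes and are
    cancelled one transaction later, when the data is on disk."""

    def __init__(self):
        self.cached_writes = []     # acknowledged but not yet persistent
        self.pending_cookies = []   # cookies held until commit

    def write(self, data, cookie):
        # cache the write; hold the cookie instead of cancelling it now
        self.cached_writes.append(data)
        self.pending_cookies.append(cookie)

    def commit(self):
        # transaction commit: data is persistent, cookies may be cancelled
        cancelled, self.pending_cookies = self.pending_cookies, []
        self.cached_writes = []
        return cancelled
```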

The resync of the new copy proceeds by the client/agent reading the
file data (from the non-stale copy only) and writing data to the
mirror copy.
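In outline, the resync only needs to copy the chunks recorded as stale
(a minimal sketch, with the stale map represented as a set of chunk
indices and the objects as byte buffers):

```python
def resync(source, mirror, stale_chunks, chunk_size):
    """Copy each stale chunk from the valid copy to the mirror and
    clear the stale map as we go."""
    for chunk in sorted(stale_chunks):
        off = chunk * chunk_size
        mirror[off:off + chunk_size] = source[off:off + chunk_size]
    stale_chunks.clear()
```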

If a client gets a timeout when writing to one stripe after having
written to the partner stripe, then it is up to the OSTs to do recovery
of the stale parts of the file. <begin hand waving> The object on the
updated OST needs to be able to detect, independently of the client
(presumably via a timeout), that the other copy was not updated, and
then cause the other OST to replay its llog records.

A similar mechanism will be needed in case an up-to-date mirror's OST
goes offline and writes are not being sent there.
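Conceptually, both recovery cases reduce to turning the uncancelled llog
records back into a stale map for the out-of-date copy, which a resync
can then repair (the record format here is invented):

```python
def stale_chunks_from_llog(records, chunk_size):
    """Return the set of stale chunk indices implied by llog records
    that were never cancelled (hypothetical record format)."""
    stale = set()
    for rec in records:
        if rec.get("cancelled"):
            continue                      # both copies already up to date
        first = rec["offset"] // chunk_size
        last = (rec["offset"] + rec["length"] - 1) // chunk_size
        stale.update(range(first, last + 1))
    return stale
```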

We have currently mandated a restriction that the stripe size of both
copies be the same, in order to facilitate logging of updates.  If the
stripe size is the same then a write to one chunk of an object will map
to a whole chunk on the mirror copy.  Nikita has suggested that the OSTs
keep a copy of the LOV EA locally so that each OST can generate appropriate
llog update records.

Cheers, Andreas
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

----- End forwarded message -----
