[Lustre-devel] WBC HLD outline

Alexander Zarochentsev Alexander.Zarochentsev at Sun.COM
Wed Mar 25 01:17:48 PDT 2009

On 24 March 2009 02:17:33 Robert Read wrote:
> Hi Zam,

> > MD update: the part of an MD operation to be executed on one
> > server; it contains one or more MDS/RAW operations.
> Why does the client need to be more granular than an update?  It
> seems MDS/Raw and update should be the same.

Well, it is better to say that an update is an MDS op if the operation
touches only one MD server, and an MDS/Raw op in the case of a
distributed operation.

> > MD batch: a collection of per-server MD updates.
> >
> > MDTR: MD translator: translates MD operations into MD/Raw ones.
> Isn't this essentially what the cmm is doing today? (Breaking down
> distributed operations into per-node updates?)  Are you expanding on
> Alex's idea of creating a new generic MD server stack?

I just doubt that cmm code reuse is worth MD stack relayering. Can it be 
done as a subtask later?
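
To make the terms above concrete, here is a minimal sketch of an MDTR
step that splits MD operations into per-server updates and groups them
into per-server batches. Everything in it is an assumption made for
illustration: the `Update` record, the `translate` and `build_batches`
helpers, the action strings and the server names are hypothetical and
are not taken from the Lustre code or from the HLD.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    server: str   # MD server that executes this update
    op_id: int    # MD operation this update belongs to
    action: str   # hypothetical name of the work done on that server


def translate(op_id, name, parent_server, child_server):
    """Hypothetical MDTR step: split one 'mkdir' into per-server updates."""
    if parent_server == child_server:
        # Local operation: a single update on a single server.
        return [Update(parent_server, op_id, f"mkdir {name}")]
    # Distributed operation: one update for each participating server.
    return [
        Update(parent_server, op_id, f"insert-name {name}"),
        Update(child_server, op_id, f"create-object {name}"),
    ]


def build_batches(updates):
    """Group updates into per-server batches, keeping their order."""
    batches = defaultdict(list)
    for update in updates:
        batches[update.server].append(update)
    return dict(batches)


updates = translate(1, "a", "mds0", "mds0") + translate(2, "b", "mds0", "mds1")
batches = build_batches(updates)
```

In this sketch `batches["mds0"]` holds two updates (one from each
operation) and `batches["mds1"]` holds one.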

> > *** WBC protocol
> >
> > WBC request contains a set of MD/RAW operations, tagged with one
> > epoch number.  Bulk transfers are used.
> All the updates in a single operation must have the same epoch, but I
> don't think we can guarantee that all the operations in a batch will
> be in the same epoch, unless we stop exchanging messages with all the
> MD servers. I don't see a need for them to be in the same epoch,
> either.

You are right.
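
The constraint can be sketched as a simple check: updates that belong to
the same operation must carry the same epoch, while a batch as a whole
may mix epochs. The `(op_id, epoch)` pair layout and the
`epochs_consistent` helper are hypothetical and only illustrate the rule.

```python
def epochs_consistent(updates):
    """Return True if every operation's updates share a single epoch.

    `updates` is an iterable of (op_id, epoch) pairs (assumed layout).
    """
    epoch_of_op = {}
    for op_id, epoch in updates:
        # Remember the first epoch seen for the operation, compare the rest.
        if epoch_of_op.setdefault(op_id, epoch) != epoch:
            return False
    return True


# One batch, two operations, two different epochs: allowed.
mixed_batch = [(1, 7), (1, 7), (2, 8)]
# One operation whose updates disagree on the epoch: not allowed.
broken_op = [(1, 7), (1, 8)]
```

Here `epochs_consistent(mixed_batch)` holds, while
`epochs_consistent(broken_op)` does not.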

> > *** File data
> > Flushing file data to the OST servers is delayed until file
> > creation is re-integrated.
> >
> > *** Recovery
> >
> > The redo-log is preserved until it is no longer needed for
> > recovery (i.e. until the epoch becomes stable).
> >
> > The client replays the log and re-executes all operations from
> > it, repeating MDTR processing (dispatching the operation between
> > MD servers).
> Since the MD servers all roll back before recovery, recovery will be
> very similar to the original reintegration, with the exception of
> using versions.  So we should try to keep the recovery (replay) code
> as similar to the normal code as possible, and move recovery higher
> into the stack.
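
A minimal sketch of the redo-log lifecycle described above, assuming
that a record can be dropped once its epoch is at or below the last
globally stable epoch, and that replay drives the same `execute` path
as normal re-integration. The `RedoLog` class and its methods are
hypothetical; version checking during replay is not modelled.

```python
class RedoLog:
    """Client-side log of operations, kept until their epoch is stable."""

    def __init__(self):
        self.records = []  # (epoch, operation) pairs in execution order

    def append(self, epoch, operation):
        self.records.append((epoch, operation))

    def prune(self, stable_epoch):
        """Drop records that recovery can no longer need (assumed rule)."""
        self.records = [(e, op) for e, op in self.records if e > stable_epoch]

    def replay(self, execute):
        """Re-execute the remaining operations through the normal path."""
        for epoch, operation in self.records:
            execute(epoch, operation)


log = RedoLog()
log.append(1, "mkdir a")
log.append(2, "mkdir b")
log.append(3, "create b/f")
log.prune(stable_epoch=1)   # epoch 1 became stable, its record is dropped

replayed = []
log.replay(lambda epoch, operation: replayed.append((epoch, operation)))
```

After pruning, only the operations from epochs 2 and 3 are replayed,
in their original order.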


> > **** WBC client eviction, uncompleted updates
> >
> > If the client dies before re-integration is completed, there are
> > three choices:
> >
> > a) Cluster-wide rollback: all servers roll back to the last
> > globally stable epoch, then clients replay their redo-logs.
> >
> > This scenario should be avoided because a single client failure
> > may stop the whole cluster for recovery.
> >
> > b) All servers participating in re-integration coordinate to undo
> > uncompleted updates.
> >
> > c) The servers have all the information needed to complete
> > re-integration without the client.
> You mean by keeping the original operation info in the undo logs?

I meant that the servers receive whole operations rather than updates.
If the client fails before sending an update to some of the servers, the
operation can still be completed without the client. It is an
alternative to undoing partial updates.
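
A rough sketch of option c), assuming that every server which heard
from the client holds the whole operation and therefore knows the full
list of participating servers. The helper names and the server names
are hypothetical.

```python
def missing_updates(participants, received_by):
    """Servers that take part in the operation but never got their update."""
    return sorted(set(participants) - set(received_by))


def can_complete_without_client(participants, received_by):
    """True if at least one participating server holds the whole operation."""
    return bool(set(participants) & set(received_by))


# The operation spans mds0 and mds1; the client reached only mds0 and died.
participants = ["mds0", "mds1"]
received_by = ["mds0"]

todo = missing_updates(participants, received_by)
completable = can_complete_without_client(participants, received_by)
```

In this case `todo` is `["mds1"]` and `completable` is true: mds0 knows
the whole operation and can get the missing update to mds1. If no server
received the operation there is nothing to complete and nothing to undo.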

Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
