[Lustre-devel] global epochs [an alternative proposal, long and dry].

Tue Dec 23 03:06:50 PST 2008

Alex Zhuravlev writes:
 > Nikita Danilov wrote:
 > > If we have no more than 1 reintegration in a given epoch on a given
 > > client, then the server that received an OP = (U(0), ..., U(N)) in epoch
 > > E from a client, can send to SC a message telling it that this client
 > > contains N volatile updates in epoch E, and whenever some server commits
 > > one of U's it sends to SC a message asking it to decrease a counter for
 > > this client. The most obvious implementation will batch these notifications,
 > > i.e., when a server commits a transaction group it notifies SC about all
 > > changes in one message. I personally don't think that is the best
 > > approach.
 > 
 > essentially this is very similar to dependency-based recovery, but with
 > none of its advantages and with SC tracking all states and being a single point
 > of failure. I think we need a more scalable solution.

We are talking about a few megabytes of data on the network or in memory.
It's easy to replicate this state.
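
A minimal sketch of the counter bookkeeping described in the quoted
paragraph above, assuming at most one reintegration per client per
epoch. The class name `StabilityCoordinator` and its methods are
hypothetical, chosen for illustration; they are not Lustre interfaces.

```python
from collections import defaultdict


class StabilityCoordinator:
    """Hypothetical SC state: volatile-update counters keyed by (client, epoch)."""

    def __init__(self):
        self.volatile = defaultdict(int)

    def reintegrated(self, client, epoch, n_updates):
        # A server received an OP from `client` in `epoch` and reports
        # how many volatile updates it contains.
        self.volatile[(client, epoch)] += n_updates

    def committed(self, batch):
        # One batched message per committed transaction group:
        # `batch` maps (client, epoch) -> number of updates just committed.
        for key, n in batch.items():
            self.volatile[key] -= n
            if self.volatile[key] == 0:
                del self.volatile[key]

    def oldest_volatile_epoch(self):
        # Smallest epoch that still has uncommitted updates, or None.
        return min((epoch for _, epoch in self.volatile), default=None)
```

In this sketch the state is one integer per (client, epoch) pair that
still has outstanding updates, and an entry disappears as soon as its
counter reaches zero.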

 > 
 > > Yes, and this mechanism (if it is correct at all) will guarantee that an
 > > epoch cannot depend on a future epoch.
 > 
 > again, it's not about dependency, it's about network overhead of global epochs.

Again, global epochs do not depend on the DLM to propagate epochs. E.g.,
lockless IO can be implemented without any additional RPCs.

 > 
 > >  > just to list my observations about global epochs:
 > >  >   * it's a problem to implement synchronous operations
 > >  >   * network overhead even with local-only changes depending on workload
 > >  >   * disk overhead even with local-only changes
 > >  >   * SC is a single point of failure with any topology as it's the only place to
 > >  >     find final minimum
 > >  >   * tree reduction isn't an obvious thing because a client can't report its minimum
 > >  >     to any node, instead the tree is a rather static thing and any change should be
 > >  >     done very carefully. otherwise it's very easy to lose the minimum
 > > 
 > > Unfortunately, as far as I know, no other solution was described with a
 > > level of detail sufficient to compare. :-)
 > 
 > I could say the same about tree reduction, for example ;)

Tree reduction is but an optimization. I am pretty convinced that the core
algorithm works, because this can be proved.

 > 
 > dependency-based recovery was discussed with many details I think.

Let's see...

>   * when a client issues a transaction, it labels it with a unique id
>   * the server executing the operation atomically writes an undo record with:
>     * VBR versions so that we can build chains of truly dependent operations
>     * the unique transaction id generated by the client
>     * the number of servers involved in the transaction
>   * periodically servers exchange their committed unique transaction ids
>     (only distributed transactions are involved in this)
>   * once some distributed transaction is committed on all involved servers, we can prune
>     it and all its local successors

Either I am misunderstanding this, or this is not correct, because not
only a given operation, but also all operations it depends on have to be
committed, and it is not clear how this is determined.
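
To make the objection concrete, here is a small sketch of the condition
being asked for: a transaction may be pruned only if it and everything
it transitively depends on are committed on all involved servers. The
data layout (`deps`, `committed_everywhere`) is an assumption made for
illustration; the quoted proposal does not say how this information
would be gathered.

```python
def prunable(txn, deps, committed_everywhere):
    """Return True if txn and all its transitive dependencies are committed.

    deps: dict mapping a transaction id to the ids it directly depends on.
    committed_everywhere: set of ids committed on every involved server.
    """
    seen = set()
    stack = [txn]
    while stack:
        t = stack.pop()
        if t in seen:
            continue
        seen.add(t)
        if t not in committed_everywhere:
            # Either txn itself or one of its ancestors is still volatile.
            return False
        stack.extend(deps.get(t, ()))
    return True
```

For example, with `deps = {"T3": ["T2"], "T2": ["T1"]}`, T3 is not
prunable while T1 is still uncommitted somewhere, even if T3 and T2 are
themselves committed on all of their servers.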

One reason I wrote so lengthy a text was that I wanted to spell out
everything explicitly and unambiguously (and obviously failed in the
latter, as the ensuing discussion has shown).

Nikita.