[Lustre-devel] global epochs [an alternative proposal, long and dry].

Nikita Danilov Nikita.Danilov at Sun.COM
Mon Dec 22 06:21:26 PST 2008

Alexander Zarochentsev writes:
 > On 22 December 2008 15:45:51 Nikita Danilov wrote:
 > > Alex Zhuravlev writes:
 > >  > Hello,
 > >
 > >  > I'm not sure it scales well, as any failed node may cause a
 > >  > global stall in undo/redo pruning.
 > >
 > > Only until this node is evicted, and I think that no matter what
 > > the pattern of failures is, a single level of `tree reduction' can
 > > be delayed by no more than a single eviction timeout.
 > It introduces an unneeded dependency between nodes: a node cannot 
 > prune its own undo logs until all nodes agree that the epoch can be 
 > pruned. IMO that is what a scalable system should avoid. 

This is the price paid for the cheap introduction of new epochs. If the
scope of an epoch is limited to a known group of nodes, then retiring
such an epoch requires consensus only among the nodes of that group
(cheaper than a global consensus), but introducing new epochs requires
coordination between groups. In the various designs we considered where
epochs are per-client, this manifests itself as the absence of a total
ordering between epochs, which requires translation between client
epochs and server transaction identifiers.
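To make the dependency concrete, here is a minimal sketch (names and
structure are illustrative, not taken from the Lustre code base) of why
retiring a global epoch needs agreement from every node: an epoch's undo
logs can be pruned only once every node has committed past it, so the
globally stable epoch is the minimum over all nodes, and a single
lagging node gates pruning cluster-wide.

```python
def globally_stable_epoch(last_committed):
    """last_committed: dict mapping node id -> highest epoch that node
    has durably committed. The stable (prunable) epoch is the minimum,
    so one straggler delays undo-log pruning for everyone."""
    return min(last_committed.values())

# client-a lags behind; it alone determines the stable epoch.
reports = {"client-a": 17, "client-b": 19, "oss-0": 18}
print(globally_stable_epoch(reports))  # 17
```

This is the "unneeded dependency" objection in miniature: the minimum is
a global function of all nodes, which is exactly what group-scoped
epochs trade away in exchange for cross-group coordination.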

All in all, I have a feeling that _all_ such algorithms have similar
communication overhead for the `usual' workload.

 > If there were a disaster in a part of the cluster, client nodes would 
 > disconnect and reconnect often, the undo logs would overflow, and 
 > the cluster would stop, no?

Well, it won't stop, because a node either manages to reconnect in time
(in which case it communicates its state to its superior), or it is
evicted on a timeout. In either case, the stabilization algorithm makes
progress.
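The argument above can be sketched as follows (a hedged illustration;
the function and its parameters are hypothetical, not Lustre interfaces):
in each stabilization round, a node either reports its committed epoch
within the eviction timeout or is evicted, so the round is delayed by at
most one timeout and the stable epoch is always eventually computed.

```python
def stabilization_round(nodes, reports, evict_timeout_expired):
    """nodes: set of live node ids. reports: dict node id -> committed
    epoch; nodes absent from reports have not yet responded.
    evict_timeout_expired: True once the eviction timeout has elapsed.
    Returns (surviving nodes, stable epoch), with stable epoch None
    while the round is still waiting on stragglers."""
    missing = nodes - reports.keys()
    if missing and not evict_timeout_expired:
        return nodes, None              # still within the timeout window
    survivors = nodes - missing         # silent nodes are evicted
    stable = min(reports[n] for n in survivors)
    return survivors, stable

# c3 never reports; once the timeout expires it is evicted and the
# remaining nodes determine the stable epoch.
survivors, stable = stabilization_round(
    {"c1", "c2", "c3"}, {"c1": 9, "c2": 11}, True)
print(sorted(survivors), stable)  # ['c1', 'c2'] 9
```

Either branch makes progress: before the timeout the round simply waits,
and after it the gating set shrinks, so no failure pattern can stall
stabilization for more than one eviction timeout per round.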

Also, I think that even the simplest global-epoch-based recovery is very
challenging to implement.

 > Thanks,
 > -- 
 > Alexander "Zam" Zarochentsev
