[Lustre-devel] global epochs [an alternative proposal, long and dry].

Alex Zhuravlev Alex.Zhuravlev at Sun.COM
Fri Dec 26 01:01:43 PST 2008

Nikita Danilov wrote:
> ... and all updates of the operation itself are committed on the
> respective servers.


>  > (2) if update U1 executed before update U2 and U2 is committed, then U1 must be committed
> I think this is only valid when U1 and U2 are on the same server. And
> even in this case this is probably required only when U1 and U2 are
> conflicting.

agree about same server. i think this is model used by ext3 and DMU.

>  > (3) requirement: if operation O2 depends on operation O1, then O1 has conflicting
>  >      update on same server with O2
> Agree, provided that `depends' means `directly depends', i.e., not
> through some intermediate operation.


>  > (4) operation is globally committed if all updates this operation consists of are
>  >      committed and everything it depends on is committed as well
> I think this is wrong. Everything it depends on must be _globally_
> (recursively) committed as well. Otherwise in the following scenario

additional clarification: "all updates this operation consists of are globally committed"

> As a note, I tried very hard to avoid confusion by using different
> terms: operations (a distributed state update) vs. transaction (a group
> of updates on a given server that reaches persistent storage
> atomically), and `stabilizes' vs. `commits' respectively.

I like the terms.

> Err.. what if U3 and U4 are committed on S1 and S2, but S0 hasn't
> received U1 at all (e.g., U1 is an inode creation, that was executed
> without a lock and client failed), or U1 was executed, but not committed
> and S0 failed? It seems that OP0 will have to be rolled back, and hence
> OP1 and OP2 cannot be considered globally committed^W^Weverywhere
> stable?

with fixed definition i think it's correct.

> I was more interested in how batching is implemented and, specifically,
> at what moment server can actually remove at entry from an undo log
> (i.e., before or after it sends a batch, etc.), because it looks to me
> that server agreement on what operations are everywhere stable requires,
> in a general case, a two phase commit, or some other atomic commitment
> protocol.

then a bit more words.

I think the following statement is still true: when any operation is being
executed (updates are being executed on target servers), all updates it
depends on are already executed. let's fix server's state at time our updates
begin to execute: S1 is a state on server 1, S2 is a state on server 2,,,,
Sn is a state on server N. due to (2) once all states S1..Sn are committed,
all dependency our updates might have are resolved and they can't be aborted
due to abort of some previous operation.

in practice this mean that having series of updates on some server:
U1, U2, U3, U4,,,,, Un, Un+1 we can choose some N, ask all servers for their
last generated transno (not last committed transno) and assign set of transno
to point N. once all servers have reported corresponded transno committed,
we know that all dependency updates U1..Un might have are resolved and U1..Un
can't be aborted.  (5)

of course, this is true only for operations with number of updates = 1 (iirc,
we call them local operations in contrast with distributed where number of
updates > 1). for distributed operations we also need to make sure all updates
are committed.

when some server commits update and corresponded operation has 2 or more updates,
then server reports this to other servers involved in the operation. in practice,
server doesn't report immediately, instead it put transaction id into some batch
(batches) which will be fired later.  (6)

now back to series updates on server: U1, U2, U3, U4,,,, Un, Un+1. in general,
each update has own undo record in the log. record for any local update at the
beginning of the series can be cancelled once corresponded update is locally
committed. record for any distributed operation's update can be removed from the
series so that it doesn't hold remaining records, but not cancelled.

In order to cancel undo record for a distributed operation we need to make sure
that during recovery none of undo record of this operation can be used, otherwise
recovery can be confused finding record on one server, but not on another one.

this can be done with llog-like protocol: for any distributed operation, server
with minimal id cancel own undo record and generates another record M marking
operation globally committed. then server notifies other servers involved in the
operation, their cancel own undo records, once cancels are committed, record M
can be cancelled.  (7)

now let's consider that example:

      S0   S1   S2   S3
OP0  U1   U2
OP1       U3   U4
OP2            U5   U6

le's redraw it a bit ....

undo series of S0:  U1
undo series of S1:  U2  U3
undo series of S2:      U4  U5
undo series of S3:          U6

S0 reports committness of U1 in transno T01 (OP1) to S1, now S1 knows U2 depends on S0/T01
S1 reports committness of U2 in transno T11 (OP1) to S0, now S0 knows U1 depends on S1/T11
S1 reports committness of U3 in transno T12 (OP2) to S2, now S2 knows U4 depends on S1/T12
S2 reports committness of U4 in transno T21 (OP2) to S1, now S1 knows U3 depends on S2/T21
S2 reports committness of U5 in transno T22 (OP3) to S3, now S3 knows U6 depends on S2/T22
S3 reports committness of U6 in transno T31 (OP3) to S2, now S2 knows U5 depends on S3/T31

now each server knows direct dependency. then all them have to resolve global dependency:
S0 requests current state from S1,S2,S3 - they return last generated transno
S1 requests current state from S0,S2,S3 --//--
S2 requests current state from S0,S1,S3 --//--
S3 requests current state from S0,S1,S2 --//--

at some point all servers report collected transno committed. given all updates belong to
distributed transactions, servers can remove them from series so that they don't hold
dependency for anyone, but not cancel. as noted in (7) we can use llog-like protocol to
cancel undo records for distributed operations. as they don't block any operation we can
postpone cancel for very long to improve bandwidth usage.

I think this *oversimplified* approach demonstrates that we can do "stabilization" with
anywhere-generated-id operations.

messages reporting committness can be batched. we can even use bi-directional protocol
when S0 reporting committness of U1 to S1 gets a reply claiming committness of U2 back.

any message can carry "last generated transno" along with "last committed", making
"request current state" not needed.

One of important advantages such approach has is ability to implement fsync(2)
more optimal way, without involving whole cluster.

The simplest optimization could be to omit requests for other server's state
(see (5)) and undo records, for all local operations if there is undo log is
empty. so, as long as server doesn't execute global operations all local
operations are executed with zero "epoch overhead", like today.

More advanced approach could include propagation of involved servers when they
exchange committness of distributed operations (see (6)). Say, if server S1
has no other distributed operations (thus doesn't depend on other servers), then
reporting commit of update U1 (part of operation O1) to server S2 it tells
dependency of itself on S1,S2. when S2 reports committness of some other operation
O2 to server S3, it tells dependency on S1,S2. now, when S3 resolves global
dependency (see (5)), it doesn't requests state from all the servers, but only
from S1 and S2. We can go further even and include last generated transno along
with server into report. Then other servers don't need to request states even,
just wait till servers have those transno committed.

even more advanced approach could be to track precise dependency for any operation.
this is not very useful for ldiskfs as fsync(2) flushes all pending updates, but
with DMU we could use zlog and flush only really required bits.

thanks, Alex

More information about the lustre-devel mailing list