[Lustre-devel] Recovering opens by reconstruction
Mikhail Pershin
Mikhail.Pershin at Sun.COM
Tue Jul 7 06:56:36 PDT 2009
On Mon, 06 Jul 2009 21:34:41 +0400, Nicolas Williams
<Nicolas.Williams at sun.com> wrote:
>
> In my proposal what would happen is that opens would only be recovered
> by _replay_ when the transaction had not yet been committed, otherwise
> the opens will be recovered by making a _new_ (non-replay) open RPC.
>
Yes, I understood that, and I agree it looks like a cleaner
implementation, but I see the following problems so far:
- two kinds of client, new and old, that have to be handled somehow
- the client code has to change a lot
- the server needs to understand and handle this too
What will we get for this? Sorry for being annoying, but it looks to me
like this can be solved in simpler ways. E.g. you can add an
MGS_OPEN_REPLAY flag to such requests, so they will also differ on the
wire from transaction replays. Or we could re-use the lock replay
functionality somehow. Locks are not kept as saved RPCs either; they are
enqueued as new requests. Open is very close to this, and I agree with the
idea that the open handle has all the needed info, so there is no need to
keep the original RPC in this case.
I mean that the proposed solution looks overcomplicated if the goal is
just to solve the signature problem, though it makes sense in general. If
we are going to re-organize open recovery and have time for this, it would
be better to move it out of the context of replay signatures into a
separate task, as it is quite complex.
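
To make the two options concrete, here is a rough sketch of how a client
could pick between replaying the saved RPC and building a new open request
from the open handle. It is an illustration only, not Lustre code: the
OpenHandle fields, the last_committed argument and the dictionary layout
are assumptions, and open_replay_flag merely stands in for the
MGS_OPEN_REPLAY flag suggested above.

```python
from dataclasses import dataclass


@dataclass
class OpenHandle:
    # Hypothetical client-side record of an open file (names are illustrative).
    fid: str
    open_flags: int
    transno: int  # transaction number the server assigned to the original open


def build_recovery_request(handle: OpenHandle, last_committed: int) -> dict:
    """Choose how to recover one open after a server restart.

    Uncommitted opens are replayed as transactions; committed opens are
    rebuilt from the handle and marked so the server can tell them apart.
    """
    if handle.transno > last_committed:
        # The open never reached disk: replay it like any other transaction.
        return {"kind": "replay", "fid": handle.fid, "transno": handle.transno}
    # The open was committed: send a new request built from the handle alone.
    return {
        "kind": "new_open",
        "fid": handle.fid,
        "open_flags": handle.open_flags,
        "open_replay_flag": True,  # stand-in for the suggested wire flag
    }
```

With this shape the original open RPC only has to be kept until its
transaction is committed; after that the handle is enough.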
>
> I'm not sure why a new stage would necessarily slow recovery in a
> significant way. The new stage would not involve any writes to disk
> (though it would involve reads, reads which could then be cached and
> benefit the transaction recovery phase).
Not necessarily, but it can. It is not only about the open stage; it is
about the whole approach of doing recovery in stages, where all clients
must wait for each other at every stage before they can continue recovery.
We already have this in HEAD and it extends the recovery window. Lustre 1.8
had only a single timer for recovery; Lustre 2.0 has 3 stages and the timer
has to be reset after each one. If all clients are alive then the recovery
time will be mostly the same, but if clients can disappear during recovery
then Lustre 2.0 recovery time can already be three times longer. Just
imagine that at each stage one client is gone; then at each stage all
clients will wait until the timer expires. And the bigger the cluster, the
more clients can be lost during recovery, so recovery time may differ
significantly.
Also this means that server load is not well distributed over the recovery
time. The server waits, then starts processing all requests at once, then
waits again at the next stage, etc.
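
The arithmetic behind "three times longer" can be written down as a small
model. This is a simplification under stated assumptions: every stage has
the same timeout, the replay work is split evenly across stages, and a
stage with a missing client ends only when its timer expires. The function
name and the numbers in the usage lines are made up for illustration.

```python
def recovery_window(stages: int, timeout: float, total_work: float,
                    stages_with_lost_client: int) -> float:
    """Length of the recovery window when clients advance stage by stage.

    A stage where some client never shows up lasts the full timeout;
    a stage where everyone is present lasts only as long as its work.
    """
    if not 0 <= stages_with_lost_client <= stages:
        raise ValueError("stages_with_lost_client must be within 0..stages")
    work_per_stage = total_work / stages
    healthy_stages = stages - stages_with_lost_client
    return stages_with_lost_client * timeout + healthy_stages * work_per_stage


# Illustrative numbers only: 300 s timer, 60 s of actual replay work.
single_timer = recovery_window(1, 300, 60, 1)   # one timer, one lost client
three_stages = recovery_window(3, 300, 60, 3)   # a client lost at each stage
all_alive = recovery_window(3, 300, 60, 0)      # nobody lost
```

In this model the three-stage case with a client lost at each stage takes
three full timeouts, while the case with all clients alive costs the same
as before, which matches the argument above.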
Another point here is the possible use of version-based recovery instead
of transaction-based recovery. This makes recovery depend on object
versions, and then it makes no sense to wait for all clients at each
recovery stage, because all dependencies should be clear from the versions
and clients may finish recovery independently. Currently requests can be
recovered by versions, and there is work on lock replays using versions
too.
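
A minimal sketch of why versions remove the need to wait, assuming each
replayed request carries the object version it was originally applied
against. The ReplayRequest type, its fields and the in-memory versions
table are hypothetical and only illustrate the idea, not the actual server
code.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ReplayRequest:
    client: str
    obj: str
    pre_version: int   # object version the operation was originally applied to
    post_version: int  # object version the operation produced


def replay_by_version(versions: Dict[str, int], req: ReplayRequest) -> bool:
    """Apply a replay only if the object is still at the expected version."""
    if versions.get(req.obj, 0) != req.pre_version:
        # Some earlier change is missing or a conflicting one was applied,
        # so this replay cannot be applied safely.
        return False
    versions[req.obj] = req.post_version
    return True


# Two clients touching different objects can replay in any order.
versions = {"a": 1, "b": 7}
ok_b = replay_by_version(versions, ReplayRequest("client2", "b", 7, 8))
ok_a = replay_by_version(versions, ReplayRequest("client1", "a", 1, 2))
```

Each request is checked only against the objects it touches, so a client
whose objects are unaffected by a missing client does not have to wait for
it.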
>
> Suppose we recovered opens after transactions: we'd still have
> additional costs for last unlinks since we'd have to put the object on
> an on-disk queue of orphans until all open state is recovered. See
> above.
There is no additional cost for an open-unlink pair because the orphan is
needed anyway after the unlink. The only exception is replay of a pure
unlink. But we need to keep orphans after unlinks for other cases anyway,
e.g. delayed recovery, and such overhead is nothing compared with the time
that can be lost waiting for everyone, as described above.
In fact this is already slightly outside the scope of the original idea
about open replay organization. It is more related to server recovery
handling, version recovery and delayed recovery, and can be discussed
later, once the open replay changes on the client are settled; it will be
clearer by then.
--
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.