[Lustre-devel] SOM Recovery of open files
Vitaly.Fertman at Sun.COM
Fri Mar 13 08:32:00 PDT 2009
This is a summary of my discussion with Andreas from Feb 24. It lists the problem use cases of SOM recovery and interoperability, describes the problems in each use case, and suggests possible solutions.
UseCase1. Client is evicted from MDS and re-connects.
UseCase2. MDS failover.
UseCase3. Client eviction and following MDS failover.
Problem1. File opened for write is not re-opened;
Problem2. File opened for truncate (precisely, with an IOEpoch opened) is not re-opened;
Problem3. Client is able to write (new syscall) to a not re-opened file (MDS has no control over the IO happening in the cluster);
Problem4. Client is able to flush dirty data to OST for a not re-opened file;
Problem5. Client is able to re-send a write RPC to OST for a not re-opened file.
Solution1: New OPEN rpc on recovery.
Problem1.1: does not work for client eviction so far, when no MDS recovery is involved.
Problem1.2: even if recovery is involved but the client is already evicted, it does not work.
Problem1.3: does not work for truncate. The situation is pretty rare, as the client does not cache punches; but what if, at the time of the client eviction from MDS, the connection between this client and an OST is unstable, so that punches hang in the re-send list long enough for another client to modify the file? MDS gets a new SOM cache, and the later punch will modify the file.
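Solution1 and its problems can be sketched as a small simulation: on MDS failover, surviving clients replay OPEN RPCs from their open-handle lists, while an evicted client never does, which is exactly Problems 1.1/1.2. All class and method names below are hypothetical, not the real Lustre API.

```python
# Hypothetical sketch of Solution1 (new OPEN rpc on recovery); the names
# here are illustrative and do not match the real Lustre code.

class OpenHandle:
    def __init__(self, fid, flags):
        self.fid = fid        # file identifier
        self.flags = flags    # e.g. "w" for write

class MDS:
    def __init__(self):
        self.ioepochs = {}    # fid -> open IOEpoch state (volatile)

    def reopen(self, fid, flags):
        self.ioepochs[fid] = flags

    def failover(self):
        self.ioepochs.clear()  # open state is lost on failover

class Client:
    def __init__(self, evicted=False):
        self.open_handles = []
        self.evicted = evicted

    def replay_opens(self, mds):
        # Problem1.1/1.2: an evicted client takes no part in replay,
        # so its files are never re-opened on the recovered MDS.
        if self.evicted:
            return False
        for h in self.open_handles:
            mds.reopen(h.fid, h.flags)
        return True
```

A surviving client restores its IOEpochs via replay; an evicted one leaves stale, never re-opened state behind, which is why Solution1 alone is not sufficient.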
Solution2: LOV EA lock; client blocks new IO if the lock is absent.
Problem2.1: the LOV EA lock works for new syscalls only, not for IO already issued (e.g. RPCs sitting in the re-send list).
Solution3: SOM cache is removed upon client eviction for all the files the client had open.
Problem3.1: works only until the SOM cache is re-validated before some delayed IO from a lost client arrives.
Solution4: client's dirty cache is controlled from OST through extent locks. MDS removes the SOM cache for the inode on client eviction; the next file writer sees there is no cache on MDS upon file open, thus the cache is re-obtained under a [0;EOF] extent lock, which flushes all the data on the OST.
Problem4.1: lockless IO (write, truncate) is not handled this way. An rpc may be sitting in the re-send list long enough for another client to modify the file; the SOM cache is re-obtained by MDS, and the delayed write/punch makes it invalid.
Problem4.2: locked truncate is not handled this way. An enqueue may be sitting in the re-send list, similar to 4.1.
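The Solution4 flow can be illustrated with a minimal simulation, assuming hypothetical names: the MDS drops its SOM cache on eviction, and the next writer re-obtains the size under a [0;EOF] extent lock, which first flushes all dirty data on the OST.

```python
# Hypothetical sketch of Solution4; illustrative names, not the Lustre API.

class OST:
    def __init__(self):
        self.size = 0
        self.dirty = {}              # client id -> cached write end offset

    def write_cached(self, client, end):
        self.dirty[client] = end     # data still cached under an extent lock

    def flush_all(self):
        # A [0;EOF] extent lock conflicts with every client extent lock,
        # forcing all dirty data to be flushed before the size is read.
        for end in self.dirty.values():
            self.size = max(self.size, end)
        self.dirty.clear()
        return self.size

class MDS:
    def __init__(self):
        self.som_cache = {}          # fid -> cached file size

    def evict_client(self, fid):
        self.som_cache.pop(fid, None)   # remove SOM cache for the inode

    def open_for_write(self, fid, ost):
        if fid not in self.som_cache:
            # no cache on MDS: re-obtain it under the [0;EOF] lock
            self.som_cache[fid] = ost.flush_all()
        return self.som_cache[fid]
```

Problem4.1 then corresponds to a write that bypasses `write_cached` entirely (lockless, or resent later) and lands on the OST after `open_for_write` has already repopulated `som_cache`, leaving the cached size stale.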
Solution5: Cluster flush on MDS-OST synchronization.
SOM is disabled for a file until all the OSTs from its stripe are synchronized with MDS. Synchronization includes: all the clients flush their dirty cache to OST; the llog cookie is sent to MDS; MDS removes the SOM cache for the files involved. It means a new IOEpoch may be opened, but the cache is not re-validated at its end; getattr does not obtain the SOM attributes until synchronization completes.
Problem5.1. Lockless IO (write, truncate) is not handled this way.
Problem5.2. Locked truncate is not handled this way (see Problem4.2).
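The gating condition of Solution5 is simple to state in code: SOM stays disabled for a file until every OST in its stripe has finished MDS-OST synchronization. A minimal sketch, with hypothetical names (the llog cookie arrival is used as the per-OST "synced" signal, as described above):

```python
# Hypothetical sketch of the Solution5 gating logic; illustrative only.

class StripeSomState:
    def __init__(self, ost_indices):
        # ost index -> has this OST finished MDS-OST synchronization?
        self.synced = {idx: False for idx in ost_indices}

    def on_llog_cookie(self, idx):
        # the llog cookie arriving at the MDS marks this OST as synced
        self.synced[idx] = True

    def som_enabled(self):
        # SOM is usable only once ALL stripe OSTs are synchronized
        return all(self.synced.values())
```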
UseCase4. Upgrade to SOM-enabled Lustre.
All the above problems exist.
Problem6. No IOEpoch has been opened (however, the SOM cache is invalidated on open); truncate does not close the "opened" file at all, i.e. MDS has no control over the IO happening in the cluster, and a later punch may destroy the SOM cache on MDS.
Solution6. Send done_writing even if SOM is disabled.
UseCase5. MDS fails over; a client has dirty cache but does not participate in the recovery.
Solution7. Invalidate the SOM cache on MDS on close, i.e. instead of blocking IO on the client, remove the SOM cache in advance.
Problem7.1. The solution does not work: we cannot depend on the client, as it may be evicted (UseCase5); if close is not committed yet, lockless IO may still happen with a delay.
Solution8. Evict the client from OST once the client is evicted from MDS (via the MDS->OSS connection and set_info(KEY_EVICT_BY_NID)), or cancel this client's extent locks only. Thereby prevent any IO from happening since then.
Problem8.1. If 2 MDS failovers happen right one after another, MDS is already not able to tell which clients were lost over the failover: after the first failover it lets the clients re-connect and overwrites the previous info about connected clients, but does not succeed in telling the OST to evict the client -- and here the 2nd MDS failure happens.
This solution could probably be done in a different way: (a) the client itself informs the OST that it is evicted; (b) MDS provides the full list of connected clients to the OSTs on boot and then informs the OSTs about client evictions.
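The Solution8 propagation path can be sketched as follows. KEY_EVICT_BY_NID is the set_info key named in the text; the classes and methods are hypothetical stand-ins for the MDS and OST export handling:

```python
# Hypothetical sketch of Solution8 (MDS propagates client eviction to OSTs);
# illustrative names, not the real Lustre interfaces.

class OST:
    def __init__(self):
        self.exports = set()            # connected client NIDs

    def connect(self, nid):
        self.exports.add(nid)

    def evict_by_nid(self, nid):
        # handler for set_info(KEY_EVICT_BY_NID)
        self.exports.discard(nid)

    def accepts_io(self, nid):
        return nid in self.exports      # IO from evicted clients is rejected

class MDS:
    def __init__(self, osts):
        self.osts = osts
        self.clients = set()

    def connect(self, nid):
        self.clients.add(nid)

    def evict(self, nid):
        self.clients.discard(nid)
        for ost in self.osts:           # propagate the eviction cluster-wide
            ost.evict_by_nid(nid)
```

Problem8.1 corresponds to the MDS crashing between dropping the client from its own list and finishing the loop over the OSTs: after a second failover, the rebuilt client list no longer remembers who should have been evicted.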
Solution9. Invalidate the SOM cache on open for write.
Problem9.1. Unless written synchronously, MDS may fail before the open gets committed.
Problem9.2. Even if committed, a new IOEpoch may re-validate the SOM cache, which could become wrong due to a later lockless IO reaching OST or such.
Still missing solutions:
(*) block truncate and lockless IO (whether it is a new syscall, an enqueue rpc in the re-send list, or a lockless IO (write, truncate) in the re-send list) until the connection to MDS is restored.
Problem: a possible race between the time MDS restarts and the time the client detects it is evicted. I.e. the client may continue to send IO to OST; instead, it must detect the moment when MDS is up and MDS-OST synchronization is completed, and block its IO until the file is re-opened.
Solution10. Have the OST get an ioepoch for the file and invalidate the SOM cache on the MDS by itself BEFORE allowing the lockless operation to complete.
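The ordering that Solution10 enforces is the key point: the SOM cache on the MDS is invalidated before the lockless operation is allowed to complete, so a delayed lockless write can never leave a stale cached size behind. A minimal sketch with hypothetical names:

```python
# Hypothetical sketch of Solution10; illustrative names only.

class MDS:
    def __init__(self):
        self.som_cache = {}             # fid -> cached size

    def invalidate_som(self, fid):
        self.som_cache.pop(fid, None)

class OST:
    def __init__(self, mds):
        self.mds = mds
        self.size = {}                  # fid -> on-disk size

    def lockless_write(self, fid, end):
        # invalidate the MDS SOM cache BEFORE the write completes
        self.mds.invalidate_som(fid)
        self.size[fid] = max(self.size.get(fid, 0), end)
```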
Solution11. The client could wait for an open+SOM invalidate to commit before sending the lockless IO.
Solution12. OST may send a new RPC to all the clients once MDS-OST synchronization is completed. Clients re-validate their connections to MDS and re-open files on MDS; until this is done, a client blocks its IO to OST. If a client re-connects to OST, it must re-validate its MDS connection as well right before that.