[Lustre-devel] SOM Recovery of open files

Fri Mar 13 08:32:00 PDT 2009

Hi All,

This is a summary of our follow-up discussion with Andreas from Feb 24.
It lists the problem use cases of SOM recovery and interoperability,
describes the problems of each use case, and suggests possible solutions.

UseCase1. Client is evicted from MDS and re-connects.
UseCase2. MDS failover.
UseCase3. Client eviction followed by MDS failover.

Problem1. A file opened for write is not re-opened.
Problem2. A file opened for truncate (more precisely, with an IOEpoch
opened) is not re-opened.
Problem3. The client is able to write (new syscall) to a file that was
not re-opened (the MDS has no control over the IO happening in the
cluster).
Problem4. The client is able to flush dirty data to the OST for a file
that was not re-opened.
Problem5. The client is able to re-send a write RPC to the OST for a
file that was not re-opened.

Solution1: New OPEN RPC on recovery.
Problem1.1: does not work for client eviction so far, when no MDS
recovery is involved.
Problem1.2: even if recovery is involved, it does not work if the
client is already evicted.
Problem1.3: does not work for truncate. The situation is pretty rare,
as the client does not cache punches. But suppose that at the time of
the client's eviction from the MDS, the connection between this client
and an OST is unstable, so that punches hang in the re-send list for a
while, long enough for another client to modify the file. The MDS then
gets a new SOM cache, and the later punch modifies the file, making
that cache invalid.
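
The race in Problem1.3 can be sketched with a toy model. This is only an
illustration under simplifying assumptions: ToyOST, ToyMDS and their
methods are hypothetical names, not Lustre structures, and "SOM cache"
is reduced to a single cached size.

```python
# Toy model only: these classes are hypothetical, not Lustre code.

class ToyOST:
    """Holds the authoritative object size."""
    def __init__(self, size):
        self.size = size

    def punch(self, new_size):
        # Truncate (punch) the object to new_size.
        self.size = new_size

    def write(self, offset, length):
        self.size = max(self.size, offset + length)


class ToyMDS:
    """Caches the file size; None means there is no valid SOM cache."""
    def __init__(self):
        self.som_size = None

    def evict_client(self):
        # On eviction the cache is dropped for the opened IOEpoch.
        self.som_size = None

    def close_ioepoch(self, ost):
        # A writer closes its IOEpoch and the size is cached again.
        self.som_size = ost.size


ost = ToyOST(size=4096)
mds = ToyMDS()

# Client A queued a punch to 0 and was then evicted from the MDS; the
# punch is stuck in A's re-send list (unstable OST connection).
delayed_punch_size = 0
mds.evict_client()

# Client B writes and closes, so the MDS gets a fresh SOM cache.
ost.write(offset=4096, length=4096)
mds.close_ioepoch(ost)

# A's punch finally reaches the OST without the MDS noticing.
ost.punch(delayed_punch_size)

print("MDS cached size:", mds.som_size, "OST size:", ost.size)
```

At the end the MDS still reports the size cached for client B while the
OST object has been truncated.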

Solution2: LOV EA lock; the client blocks new IO if the lock is absent.
Problem2.1: the LOV EA lock works for new syscalls only, not for
Problem4 and Problem5.

Solution3: SOM cache is removed upon client eviction for all the
opened IOEpochs.
Problem3.1: works only as long as the SOM cache is not re-validated
before some later IO from the lost client happens.

Solution4: the client's dirty cache is controlled from the OST through
extent locks.
The MDS removes the SOM cache for the inode on client eviction. The
next file writer sees upon file open that there is no cache on the MDS,
so the cache is re-obtained under a [0;EOF] extent lock, which flushes
all the data to the OST.
Problem4.1: Lockless IO (write, truncate) is not handled this way. The
RPC may sit in the re-send list long enough for another client to
modify the file; the SOM cache is re-obtained on the MDS, and the
delayed write/punch makes it invalid.
Problem4.2: Locked truncate is not handled this way. The enqueue may
sit in the re-send list, similar to Problem4.1.
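
A minimal sketch of the idea behind Solution4, assuming a much
simplified lock model: every cached write is covered by an extent lock,
and taking a [0;EOF] lock forces all holders to flush. All names below
(ToyClient, ToyOST, reobtain_som_size) are hypothetical and exist only
for this illustration.

```python
# Toy model: names and structure are hypothetical, not Lustre APIs.

class ToyClient:
    def __init__(self, name):
        self.name = name
        self.dirty = set()          # extents (start, end) not yet flushed


class ToyOST:
    def __init__(self):
        self.size = 0
        self.extent_locks = []      # (client, start, end), end exclusive

    def cached_write(self, client, start, end):
        # The client takes an extent lock and keeps the data dirty under it.
        self.extent_locks.append((client, start, end))
        client.dirty.add((start, end))

    def enqueue_whole_file(self):
        # A [0;EOF] lock conflicts with every granted extent lock, so each
        # holder has to flush its dirty data before its lock is cancelled.
        for client, start, end in self.extent_locks:
            if (start, end) in client.dirty:
                client.dirty.discard((start, end))
                self.size = max(self.size, end)
        self.extent_locks.clear()
        return self.size


def reobtain_som_size(ost):
    # The size read under the [0;EOF] lock is what the MDS would cache.
    return ost.enqueue_whole_file()


ost = ToyOST()
lost_client = ToyClient("lost")
ost.cached_write(lost_client, 0, 4096)
ost.cached_write(lost_client, 8192, 12288)

som_size = reobtain_som_size(ost)
print(som_size, lost_client.dirty)
```

In this model a lockless write or a queued enqueue never shows up in
extent_locks, so the [0;EOF] lock cannot force it out. That is the gap
described in Problem4.1 and Problem4.2.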

Solution5: Cluster flush on MDS-OST synchronization.
SOM is disabled for a file until all the OSTs from its stripe are
synchronized with the MDS. Synchronization includes: all the clients
flush their dirty cache to the OST, the llog cookie is sent to the MDS,
and the MDS removes the SOM cache for the files involved. This means
that a new IOEpoch opens but the cache is not re-validated at its end;
getattr does not obtain the SOM cache.
Problem5.1: Lockless IO (write, truncate) is not handled this way.
Problem5.2: Locked truncate is not handled this way (see Problem4.2).
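
The "SOM is disabled until every OST of the stripe is synchronized" rule
could be expressed roughly as below. This is a sketch under assumptions:
ToySomGate, the file ids and the OST indices are made up for the
example, and the flush / llog cookie exchange is collapsed into a single
ost_synchronized() call.

```python
# Toy model: hypothetical names, not Lustre code.

class ToySomGate:
    def __init__(self, stripes):
        # stripes maps a file id to the set of OST indices in its stripe.
        self.stripes = stripes
        self.synced_osts = set()

    def ost_synchronized(self, ost_index):
        # Called once the MDS-OST synchronization with this OST is done.
        self.synced_osts.add(ost_index)

    def som_enabled(self, fid):
        # SOM may be used only if every OST of the stripe is synchronized.
        return self.stripes[fid] <= self.synced_osts


gate = ToySomGate({"fileA": {0, 1}, "fileB": {1}})

gate.ost_synchronized(1)
state_after_ost1 = (gate.som_enabled("fileA"), gate.som_enabled("fileB"))

gate.ost_synchronized(0)
state_after_ost0 = (gate.som_enabled("fileA"), gate.som_enabled("fileB"))

print(state_after_ost1, state_after_ost0)
```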

UseCase4. Upgrade to SOM-enabled Lustre.
All the above problems exist.

Problem6. No IOEpoch has been opened (however, the SOM cache is removed
synchronously on open); truncate does not close the "opened" file at
all. I.e. the MDS has no control over the IO happening in the cluster,
and a later punch may destroy the SOM cache on the MDS.

Solution6. Send done_writing even if SOM is disabled.

UseCase5. MDS fails over; a client has dirty cache but does not
participate in the recovery.

Solution7. Invalidate SOM cache on MDS on close.
I.e. instead of blocking IO on the client, remove SOM cache in advance.
Problem7.1. the solution does not work: we cannot depend on the client,
as it may be evicted (UseCase5) before the close is committed; also,
lockless IO may still happen with a delay.

Solution8. Evict the client from the OST once it is evicted from the
MDS (via the MDS->OSS connection and set_info(KEY_EVICT_BY_NID)), or
cancel only this client's extent locks. This prevents any IO from
happening after that point.
Problem8.1. if 2 MDS failovers happen right one after another, it seems
the MDS is no longer able to tell which clients were lost over the
failover. After the first failover it lets the clients re-connect and
overwrites the previous info about connected clients, but does not
manage to tell the OST to evict the client before the 2nd MDS failure
happens.
This solution could probably be done in some different way: (a) the
client itself informs the OST that it is evicted; (b) the MDS provides
the full list of connected clients to the OST on boot and then informs
the OSTs about client evictions.
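
A toy sketch of the basic Solution8 flow, assuming the eviction is
propagated by NID and the OST simply refuses IO from evicted NIDs. The
classes, method names and the NID value are hypothetical; only the
KEY_EVICT_BY_NID idea comes from the text above.

```python
# Toy model: hypothetical names, not the real set_info() path.

class ToyOST:
    def __init__(self):
        self.evicted_nids = set()
        self.size = 0

    def evict_by_nid(self, nid):
        # Stands in for the effect of set_info(KEY_EVICT_BY_NID).
        self.evicted_nids.add(nid)

    def write(self, nid, offset, length):
        # IO from an evicted client is refused.
        if nid in self.evicted_nids:
            return False
        self.size = max(self.size, offset + length)
        return True


class ToyMDS:
    def __init__(self, osts):
        self.osts = osts
        self.som_size = None

    def evict_client(self, nid):
        # Drop the SOM cache, then tell every OST to evict the client too.
        self.som_size = None
        for ost in self.osts:
            ost.evict_by_nid(nid)


ost = ToyOST()
mds = ToyMDS([ost])
nid = "10.0.0.1@tcp"

first_write_ok = ost.write(nid, 0, 4096)
mds.evict_client(nid)
late_write_ok = ost.write(nid, 4096, 4096)

print(first_write_ok, late_write_ok, ost.size)
```

In this model Problem8.1 corresponds to the MDS failing inside
evict_client() before the loop over the OSTs has run, with the list of
lost clients already overwritten.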

Solution9. Invalidate SOM cache on open for write.
Problem9.1. Unless written synchronously, the open may not be committed
before the MDS fails.
Problem9.2. Even if committed, a new IOEpoch may re-validate the SOM
cache, which could then become wrong due to a later lockless IO
reaching the OST or the like.

Still missing solutions:
(*) block truncate and lockless IO (whether it is a new syscall, an
enqueue RPC in the re-send list, or a lockless IO (write, truncate) in
the re-send list) until the connection to the MDS is restored.
Problem: a possible race between the time the MDS restarts and the time
the client detects it is evicted. I.e. the client may continue to send
IO to the OST, but it must detect the moment when the MDS is up and the
MDS-OST synchronization is completed, and block its IO until the file
is re-opened.
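
The client-side blocking described in (*) might look roughly like the
sketch below. It assumes the client already knows it was evicted, which
is exactly what the race above puts in question. ToyClientIO and its
methods are hypothetical names used only for illustration.

```python
# Toy model: hypothetical names, not Lustre client code.

class ToyClientIO:
    def __init__(self):
        self.needs_reopen = set()   # file ids whose open state was lost
        self.blocked = []           # (fid, op) held back until re-open
        self.sent = []              # (fid, op) actually sent to the OST

    def on_mds_eviction(self, open_fids):
        # Every file that was open at eviction time must be re-opened.
        self.needs_reopen.update(open_fids)

    def submit(self, fid, op):
        # New syscalls and re-sent RPCs both go through this gate.
        if fid in self.needs_reopen:
            self.blocked.append((fid, op))
        else:
            self.sent.append((fid, op))

    def on_reopened(self, fid):
        # Release the IO that was waiting for this file, in order.
        self.needs_reopen.discard(fid)
        still_blocked = []
        for item in self.blocked:
            if item[0] == fid:
                self.sent.append(item)
            else:
                still_blocked.append(item)
        self.blocked = still_blocked


io = ToyClientIO()
io.on_mds_eviction({"f1"})
io.submit("f1", "punch")
io.submit("f2", "write")
sent_before_reopen = list(io.sent)

io.on_reopened("f1")
print(sent_before_reopen, io.sent, io.blocked)
```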

Solution10. Cause the OST to get an IOEpoch for the file and invalidate
the SOM cache on the MDS by itself BEFORE allowing the lockless
operation to complete.
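
The ordering in Solution10 (invalidate first, modify second) can be
shown with a very small toy model. ToyMDS, ToyOST and lockless_punch()
are hypothetical names, and the IOEpoch handling is reduced to a single
invalidate call.

```python
# Toy model: hypothetical names, not Lustre code.

class ToyMDS:
    def __init__(self, som_size):
        self.som_size = som_size    # None means "no valid SOM cache"

    def invalidate_som(self):
        self.som_size = None


class ToyOST:
    def __init__(self, mds, size):
        self.mds = mds
        self.size = size

    def lockless_punch(self, new_size):
        # Step 1: invalidate the SOM cache on the MDS.
        self.mds.invalidate_som()
        # Step 2: only then let the lockless operation complete.
        self.size = new_size


mds = ToyMDS(som_size=8192)
ost = ToyOST(mds, size=8192)

ost.lockless_punch(0)
print(mds.som_size, ost.size)
```

With this ordering a delayed punch leaves the MDS with no cache instead
of a wrong one, so a later getattr has to obtain the size from the OSTs.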

Solution11. The client could wait for an open+SOM invalidate to commit
before sending the lockless operation.

Solution12. The OST may send a new RPC to all the clients once MDS-OST
synchronization starts. Each client re-validates its connection to the
MDS and, if needed, re-opens its files on the MDS; the client blocks
its IO to the OST until done. If a client re-connects to the OST, it
must re-validate its MDS connection as well right before that.

--
Vitaly