[Lustre-devel] SOM re-design, overview

Wed Feb 3 08:46:42 PST 2010

I. Introduction.

SOM is split into several simple mechanisms: invalidation, revalidation,
and llog cleanup.

1. Invalidation.

Once a file is opened for modification, the SOM cache is invalidated. This
is a modification of the inode EA on MDS and it is not synchronous, so we
need to protect this change against MDS failure -- IO to OST will create
a llog record.

2. Revalidation.

The main idea is to re-use the basic getattr mechanism: the client already
gathers attributes from both MDS and OST, and the only missing bit is to
send the attributes back to MDS. However, this is done only if needed --
a client asks for file attributes on its own, MDS decides it is time to
rebuild the cache and tells the client in the reply to do it.

MDS has no in-core state during SOM revalidation; if the client is evicted
or so, there is no problem as the inode is not pinned in memory. However,
we still want to mark this inode as "SOM rebuild is in progress" because
we do not want to have many clients performing SOM rebuild in parallel.
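
A minimal Python sketch of the "SOM rebuild is in progress" marker
described above. The names MdsInode, want_rebuild and som_update are
hypothetical, and where the marker is actually stored is an assumption;
the design does not specify it.

```python
from dataclasses import dataclass


@dataclass
class MdsInode:
    """Hypothetical per-inode SOM state (not a real Lustre structure)."""
    som_valid: bool = False            # SOM cache is usable
    rebuild_in_progress: bool = False  # some client was asked to rebuild


def want_rebuild(inode: MdsInode, timeout_elapsed: bool) -> bool:
    """Decide whether a getattr reply should ask the client to rebuild SOM."""
    if inode.som_valid or not timeout_elapsed:
        return False
    if inode.rebuild_in_progress:
        return False  # another client is already rebuilding
    inode.rebuild_in_progress = True
    return True


def som_update(inode: MdsInode) -> None:
    """Client sent the gathered attributes back: the cache is valid again."""
    inode.som_valid = True
    inode.rebuild_in_progress = False
```

Only the first getattr after the timeout gets the rebuild request; later
ones are served as usual until the rebuild completes.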

If MDS decides to rebuild the cache too early, OST must be able to detect
it! Once this is detected and reported to the client in the glimpse reply,
the client just interrupts the revalidation.

2.1. Revalidation time.

MDS is not notified when clients complete their writes (there is no
DONE_WRITING RPC anymore), and the IOEpoch closes on file close. By that
time clients may still have dirty cache, and we do not want to force a
client to flush it immediately, so MDS just waits for SOM_TIMEOUT before
starting the revalidation.

SOM_TIMEOUT covers the time the client will keep the data in cache, plus
the flush and commit on OST:
	SOM_TIMEOUT = CLIENT_FLUSH_TIMEOUT + OST_COMMIT_TIMEOUT

Revalidation is done when the file is closed, so OST is able to detect that
the revalidation is too early by an existing extent lock or by uncommitted
data.
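
The "too early" check can be sketched in Python as follows. This is only
an illustration under assumptions: OstObject, glimpse and
client_revalidate are hypothetical names, and summing per-object block
counts stands in for the real attribute merge done by the client.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OstObject:
    """Hypothetical view of one OST object of a striped file."""
    blocks: int
    extent_locks: int = 0       # extent locks still held by clients
    uncommitted_bytes: int = 0  # data written but not yet committed


def glimpse(obj: OstObject) -> Optional[int]:
    """Return the block count, or None if revalidation is too early."""
    if obj.extent_locks > 0 or obj.uncommitted_bytes > 0:
        return None
    return obj.blocks


def client_revalidate(objects: List[OstObject]) -> Optional[int]:
    """Gather attributes from all objects; interrupt on the first refusal."""
    total = 0
    for obj in objects:
        blocks = glimpse(obj)
        if blocks is None:
            return None  # nothing is sent back to MDS
        total += blocks
    return total
```

An interrupted revalidation costs nothing extra: the glimpse was needed
for the getattr anyway, and no further RPC is sent.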

SOM_TIMEOUT may also reflect the MDS idea that we do not need the clients'
cache anymore, and OST will initiate a lock cancel when it gets such a
notification on glimpse.

2.2. Advantages

- no overhead for too early revalidation, no extra RPC;
- no overhead for revalidation, except the final md_setattr, which can be
  batched;
- no extra in-core state on MDS;
- no extra activity on the client or on MDS;

2.3. Disadvantages

- first client has no SOM cache benefits;

3. LLOG cleanup.

The main idea is not to rely on clients to propagate llog cookies from OST
to MDS, but to send them directly from OST to MDS. Indeed:
- the client does not inform MDS about write completion and does not have
  llog cookies by the time of close (although DONE_WRITING could do it);
- the client may be evicted from MDS and we get llog record leakage --
  nobody will take care of it until the next modification;
- SOM cache revalidation is done on demand, so it could be done much later;

OST sends llog records to MDS immediately once the transaction with the
llog record is committed, batched to save on the number of RPCs. Once the
SOM invalidation is committed, MDS sends a llog cancel back, batched again.

IO comes to OST with an IOEpoch# assigned. OST creates a llog record for
this IOEpoch and updates IOEpoch# in the inode EA. IOEpoch# in the EA tells
the next IO that the llog record for this IOEpoch is already created. The
llog record indicates that the SOM cache on MDS for this IOEpoch needs to
be invalidated -- once the invalidation commits, the llog record is not
needed. This way OST has no in-core state.
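
The per-IOEpoch bookkeeping on OST could look roughly like this Python
sketch. OstInode, Ost, handle_io and cancel are hypothetical names, and
the comparison assumes that IOEpoch numbers only grow, which the design
above does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class OstInode:
    """Hypothetical on-disk state of one OST object."""
    ea_ioepoch: int = 0  # last IOEpoch# recorded in the inode EA


@dataclass
class Ost:
    """Hypothetical OST; the llog stands in for the on-disk log."""
    llog: List[Tuple[int, int]] = field(default_factory=list)

    def handle_io(self, oid: int, inode: OstInode, ioepoch: int) -> bool:
        """Process one IO; return True if a new llog record was created."""
        if ioepoch <= inode.ea_ioepoch:
            return False  # record for this IOEpoch already exists
        self.llog.append((oid, ioepoch))  # same transaction as the EA update
        inode.ea_ioepoch = ioepoch
        return True

    def cancel(self, oid: int, ioepoch: int) -> None:
        """MDS committed the SOM invalidation: the record is not needed."""
        self.llog.remove((oid, ioepoch))
```

Everything the OST needs to decide is in the EA and in the llog, so
nothing has to be kept in memory between IOs.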

Advantages:
- minimum RPC overhead, everything is batched;
- quick LLOG cancel;
- no in-core states on OST.

II. Recovery.

1. Client eviction.

After eviction, a client can proceed with its IO to files that MDS has
closed on eviction. This IO must be blocked right on the client until the
files are re-opened, otherwise we will not be able to rebuild the SOM
cache -- new IO will destroy it.

However, the client does not learn about its eviction immediately, and
even after that it may already have some RPCs in flight (this concerns
lockless IO or extent lock enqueue only; cached IO under already existing
locks is allowed, as it can be controlled by OST through these extent
locks).

One of the possible solutions here is timeouts. MDS closes all the files
opened by the evicted client and therefore the opened IOEpochs. After the
close, the SOM cache cannot be rebuilt for SOM_RECOVERY_TIMEOUT, which must
ensure that:
- the client knows it is evicted;
- the last client RPC in flight, sent before the client learned that it is
  evicted from MDS, either completes or is blocked;

At the same time the client may have dirty data, so MDS needs to let it
flush and commit on OST, thus MDS waits for
	SOM_TIMEOUT = max(SOM_RECOVERY_TIMEOUT, CLIENT_FLUSH_TIMEOUT)
	              + OST_COMMIT_TIMEOUT
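
As a quick check of the two formulas, here is a small Python sketch. The
function name som_timeout is hypothetical and the numbers in the usage
note are made up for illustration; they are not proposed values.

```python
def som_timeout(client_flush: float, ost_commit: float,
                som_recovery: float = 0.0) -> float:
    """Time MDS waits before SOM revalidation may start.

    With som_recovery left at 0 this is the normal-case formula from 2.1;
    after a client eviction pass SOM_RECOVERY_TIMEOUT as som_recovery.
    """
    return max(som_recovery, client_flush) + ost_commit
```

For example, som_timeout(30, 5) gives 35 in the normal case, and
som_timeout(30, 5, som_recovery=100) gives 105 after an eviction.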

If an RPC in flight arrives at OST too late, when the client has already
reconnected, OST skips RPCs from previous connections, but the client
resends them. This resend can be blocked on the client (if the client
detects in the meantime that it is evicted from MDS).

2. MDS failover.

MDS waits for SOM_TIMEOUT after bootup before starting to rebuild the SOM
cache globally. The reasons are the same as for client eviction, as some
clients may be evicted over the failover as well; it is global because MDS
does not have a list of the files involved.

Of course, SOM is disabled for a file if some OST in its stripeset has not
synchronised its LLOGs with MDS yet.
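
The stripeset rule amounts to a simple check, sketched here in Python.
The function som_enabled and the idea of tracking synchronised OSTs as a
set of indices are assumptions made for the example.

```python
from typing import Iterable, Set


def som_enabled(stripe_osts: Iterable[int], synced_osts: Set[int]) -> bool:
    """SOM is usable only if every OST of the file has synced its LLOGs."""
    return all(ost in synced_osts for ost in stripe_osts)
```

A file striped over OSTs 0 and 2 gets SOM back as soon as both of them
have synchronised, regardless of the state of other OSTs.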

--
Vitaly