[Lustre-devel] HSM cache-miss locking

Nathaniel Rutman Nathan.Rutman at Sun.COM
Thu Oct 9 12:11:10 PDT 2008

Andreas Dilger wrote:
> Nathan,
> Eric and I had a lengthy discussion today about HSM and the copy-in
> process.  This was largely driven by Braam's assertion that having a
> copy-in process that blocks all access to the file data is not sufficient
> to meet customer demands.  Some customers require processes be able to
> access the file data as soon as it is present in the objects.
> Eric and I both agreed that we want to start with as simple an HSM solution
> as possible and incrementally provide improvements, so long as the early
> implementation is not a "throw-away" that consumes significant developer
> resources but doesn't provide long term benefits.  In both the "simple"
> and the "complex" copy-in the client has no knowledge/participation
> of the process being done by the HSM/coordinator.
> We both agreed that the simplest copy-in process is a reasonable starting
> point and can be used by many customers.  To review the simple case
> (I hope this also matches your recollection):
> 1) client tries to access a file that has been purged
>   a) if client is only doing getattr, attributes can be returned from MDS
>     - MDS holds file size[*]
>     - client may get MDS attribute read locks, but not layout lock
>     -> DONE
>   b) if client is trying to do an open (read or write)
>     - layout lock is required by client to do any read/write of the file
>     - client enqueues layout lock request
>     - MDS notices that file is purged, does upcall to coordinator to
>       start copy-in on FID N
s/does upcall/asks/.  We expect the coordinator to be in-kernel for LNET 
comms to the agents.
> 2) client is blocked waiting for layout lock
>   - if MDS crashes at this point, client will resend open to MDS, goto 1b
>   - MDS should send early replies indicating lock will be slow to grant
The reply to the layout lock request includes a "wait forever" flag 
(this is the one client code change required for HSM at this point).  
There are no early replies for lock enqueue requests.  Maybe indefinite 
ongoing early replies for lock enqueues are a requirement for HSM copy-in?
>   ? need to have a mechanism to ensure copy-in hasn't failed?
The coordinator needs to decide if copy-in has failed, and redistribute 
the request to a new agent.  (Needs detail: a timer?  progress messages 
from the agent?)  There's nothing the client or MDT can do at this point 
(except fail the request), so we may as well just wait.
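One possible shape for the "timer + progress messages" idea: the coordinator tracks a deadline per copy-in request, extends it on each progress message from the agent, and hands the request to a different agent when it expires. A toy Python sketch, where all names and the grace period are assumptions, not anything in Lustre:

```python
import time

# Assumed grace period: how long an agent may go silent before we give up on it.
PROGRESS_GRACE = 300.0

class CopyinRequest:
    def __init__(self, fid, agent):
        self.fid = fid
        self.agent = agent
        self.deadline = time.monotonic() + PROGRESS_GRACE

    def progress(self):
        # Agent reported progress: push the deadline back.
        self.deadline = time.monotonic() + PROGRESS_GRACE

class Coordinator:
    def __init__(self, agents):
        self.agents = list(agents)
        self.active = {}  # fid -> CopyinRequest

    def start(self, fid):
        agent = self.agents[0]
        self.active[fid] = CopyinRequest(fid, agent)
        return agent

    def check_timeouts(self, now=None):
        """Reassign any request whose agent has gone silent past the grace."""
        now = time.monotonic() if now is None else now
        reassigned = []
        for req in self.active.values():
            if now > req.deadline:
                others = [a for a in self.agents if a != req.agent]
                if others:
                    # Restart the copy-in on a different agent.
                    req.agent = others[0]
                    req.deadline = now + PROGRESS_GRACE
                    reassigned.append(req.fid)
        return reassigned
```

The client and MDT stay out of this entirely, consistent with "we may as well just wait".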
> 3) coordinator contacts agent(s) to retrieve FID N from HSM
>   - agent(s) create temp file X (new or backed-up layout parameters) [!]
Backed up in an EA with the original copy-out request.  We should try to 
respect specific layout settings (pool, stripecount, stripesize), but be 
flexible if e.g. the pool doesn't exist anymore.  Maybe we want to ignore 
the offset and/or specific OST allocations in order to rebalance.
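As a toy sketch of that fallback policy (the field names are illustrative, not the real lov EA format): keep the saved stripe parameters, fall back when the pool is gone, and deliberately drop specific OST allocations so the restore can rebalance:

```python
def restore_layout(saved_ea, existing_pools, default_pool=None):
    """Build a restore layout from the EA saved at copy-out time.

    saved_ea: dict of layout settings backed up with the copy-out request
    existing_pools: pools that still exist on this filesystem
    """
    layout = {
        "stripe_count": saved_ea.get("stripe_count", 1),
        "stripe_size": saved_ea.get("stripe_size", 1 << 20),
    }
    pool = saved_ea.get("pool")
    if pool in existing_pools:
        layout["pool"] = pool
    else:
        # Pool was removed since copy-out: fall back rather than fail.
        layout["pool"] = default_pool
    # Saved OST indices/offset are intentionally ignored so the new
    # allocation can rebalance across current OSTs.
    return layout
```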
>   - agent(s) restore data into temp file
>   - agent or coordinator do ioctl on file to move file X objects to
>     file N, old objects are destroyed on file close, or 
>   - agent or coordinator do ioctl on file to notify MDS copy-in is done
I was thinking the latter, and the MDT moves the layout from X to N.
> 4) MDS handles ioctl, drops layout lock
> 5) client(s) waiting on layout lock are granted the layout lock by MDS
>   - client(s) get OST extent locks
>   - client(s) read/write file data
>   -> DONE
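The whole simple-case flow above (client blocks on the layout lock until copy-in completes and the MDT swaps the layout from the temp file to the real file) can be modeled as a small state machine. This Python sketch is purely illustrative; the states and method names are assumptions:

```python
PURGED, COPYIN, RESTORED = "purged", "copyin", "restored"

class MDSFile:
    """Toy model of the MDT side of the simple copy-in."""

    def __init__(self):
        self.state = PURGED
        self.layout = None
        self.waiters = []   # clients blocked on the layout lock

    def open(self, client):
        if self.state == RESTORED:
            return "granted"            # normal path, layout lock granted
        if self.state == PURGED:
            self.state = COPYIN         # upcall: ask coordinator to copy in
        self.waiters.append(client)     # reply carries the wait-forever flag
        return "blocked"

    def copyin_done(self, temp_layout):
        # Agent/coordinator ioctl: move temp file X's objects to file N,
        # then grant the layout lock to everyone who was waiting.
        self.layout = temp_layout
        self.state = RESTORED
        granted, self.waiters = self.waiters, []
        return granted
```

Note that a second opener while copy-in is in flight just joins the wait queue; only the first open triggers the upcall.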
> [*] The MDS will already store the file size today, even without SOM, if
>     the file does not have any objects/striping.  If SOM is not implemented
>     then the "purged" state and object removal (with destroy llog entries)
>     would need to be a synchronous operation BEFORE the objects are actually
>     destroyed.  Otherwise, SOM-like recovery of the object purge state is
>     needed.  Avoiding the sync is desirable, but making HSM dependent upon
>     SOM is undesirable.
All we really have to do is ensure that the destroy llog entry is 
committed, right?  Then the OSTs should eventually purge the objects 
during orphan recovery, yes?
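That ordering can be modeled in a few lines. This toy Python sketch only illustrates why a committed destroy record is enough, even if the OST crashes before the objects are actually destroyed; it is not real OST code:

```python
class OST:
    """Toy model: destroy llog record commits before objects are destroyed."""

    def __init__(self):
        self.objects = set()
        self.destroy_llog = []   # committed (durable) destroy records

    def purge(self, objs, crash_before_destroy=False):
        self.destroy_llog.append(set(objs))   # synchronous commit first
        if crash_before_destroy:
            return                            # simulate crashing in the window
        self.objects -= set(objs)

    def orphan_recovery(self):
        # Replay every committed destroy record; re-destroying is a no-op.
        for rec in self.destroy_llog:
            self.objects -= rec
```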
> [!] If MDS kept layout then it could pre-create the temp file and pass the
>     restore-to FID to the coordinator/agent, to keep agent more similar to
>     "complex" case where it is restoring directly into real file.  The only
>     reason the agent is restoring into the temp file is to avoid needing
>     to open the file while the MDS is blocking layout lock access, but maybe
>     that isn't a big obstacle (e.g. open flag).
You mean an open flag like O_IGNORE_LAYOUT_LOCK?  The one problem I see 
with this is the case of a stuck agent - if we want to start another 
agent doing copy-in we have to ensure that the first agent doesn't try 
to write anything else.  Or we give them two separate temp files, but 
this remains a problem in the direct-restore-into-real-file case.  
Although I suppose this is already handled by write extent locks and 
eviction...
> In the "complex" case, the clients should be able to read/write the file
> data as soon as possible and the OSTs need to prevent access to the parts
> of the file which have not yet been restored.
> 1) client tries to access a file that has been purged
>   a) if client is only doing getattr, attributes can be returned from MDS
>     - MDS holds file size[*]
>     - client may get MDS attribute read locks, but not layout lock
>     -> DONE
>   b) if client is trying to do an open (read or write)
>     - layout lock is required by client to do any read/write of the file
>     - client enqueues layout lock request
        - MDT generates new layout based on old lov EA, assigning newly 
          created OST objects.
>     - MDS grants layout lock to client
> 2) client enqueues extent lock on OST
>     - object was previously marked fully/partly invalid during purge
>     - object may have persistent invalid map of extent(s) that indicate
>       which parts of object require copy-in
I'll read this as proposing your (2) and (3) (call it "per-object 
invalid ranges held on OSTs") as a new method to do the copy-in 
in-place.  This is not the original in-place idea proposed in Menlo Park 
(see below), so I'll comment with an eye toward the differences.
I think we can't assume we're restoring back to the original OSTs. 
Therefore the MDT must create new empty objects on the OSTs and have the 
OSTs mark them purged before the layout lock can be granted to the clients.
>     - access to invalid parts of object trigger copy-in upcall to coordinator
Now we need to figure out how to map the object back to a particular 
extent range of a particular file (are we storing this in an EA with 
each object now?).  We also need to initiate OST->coordinator 
communication, so either the coordinator becomes a distributed function 
on the OSTs or we need new services going in the reverse of the normal 
mdt->ost direction.  Maybe coordinator-as-a-distributed-function works 
- the coordinators must all choose the same agent for objects belonging 
to the same file, yet distribute load among agents: I think the 
coordinator just got a lot more complicated.
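A deterministic FID-to-agent mapping would let every OST-side coordinator pick the same agent without any extra communication, while still spreading different files across agents. A toy Python sketch; the hashing choice and names are assumptions, not anything in Lustre:

```python
import zlib

def pick_agent(fid, agents):
    """Every coordinator calling this with the same fid and agent list
    chooses the same agent, so all objects of one file's copy-in land on
    one agent, while distinct FIDs spread load across agents."""
    agents = sorted(agents)                    # agree on ordering everywhere
    idx = zlib.crc32(fid.encode()) % len(agents)
    return agents[idx]
```

This sidesteps coordinator-to-coordinator coordination, but of course does nothing for the agent-liveness problem, which still needs a separate detection path.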
>     ? group locks on invalid part of file block writes to missing data
The issue here is that we can't allow any client to write and then have 
the agent overwrite the new data with the old data being restored.  So 
we could have the OST give a group lock to the agent via the 
coordinator, preventing all other writes.  But it seems that we can 
check the special "clear invalid" flag used by the agent (see (3) 
below), and silently drop agent writes into areas not in the "invalid 
extents" list.  Any client write to any extent will clear the invalid 
flag for those extents, and then we only ever need to block on reading.
What about reads of missing data?  The OST refuses to grant read locks 
on invalid extents, and needs clients to wait forever.
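The scheme above - client writes clear invalid extents, agent writes are silently trimmed to the still-invalid ranges, reads block while any overlap remains - can be sketched as a toy Python model. The class and method names are illustrative only:

```python
class InvalidMap:
    """Toy per-object invalid-extent map, as an OST might keep it."""

    def __init__(self, size):
        self.invalid = [(0, size)]   # list of [start, end) invalid ranges

    def _subtract(self, start, end):
        out = []
        for s, e in self.invalid:
            if e <= start or s >= end:
                out.append((s, e))       # no overlap, keep as-is
                continue
            if s < start:
                out.append((s, start))   # keep the part before the write
            if e > end:
                out.append((end, e))     # keep the part after the write
        self.invalid = out

    def client_write(self, start, end):
        # Client data is newest: those extents are no longer invalid.
        self._subtract(start, end)

    def agent_write(self, start, end):
        """Trim an agent ("clear invalid" flag) write to the sub-ranges
        that are still invalid; the rest is silently dropped so restored
        data can never overwrite newer client data."""
        allowed = [(max(s, start), min(e, end))
                   for s, e in self.invalid
                   if max(s, start) < min(e, end)]
        for s, e in allowed:
            self._subtract(s, e)
        return allowed

    def read_ok(self, start, end):
        # Reads must keep blocking while any overlap with invalid ranges remains.
        return all(e <= start or s >= end for s, e in self.invalid)
```

Persisting `self.invalid` transactionally with each data write is exactly the on-disk requirement discussed below.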
>     - clients block waiting on extent locks for invalid parts of objects
We'll have to set this extent lock enqueue timeout to wait forever.
>     - OST crash at this time restarts enqueue process
An agent crash will still have to be detected and restarted by the 
coordinator.

> 3) coordinator contacts agent(s) to retrieve FID N from HSM
>     - agents write to actual object to be restored with "clear invalid" flag
>     - writes by agent shrink invalid extent, periodically update on-disk
>       invalid extent and release locks on that part of file (on commit?)
The OST should keep track of all invalid extents.  Invalid extents list 
changes should be stored on disk, transactionally with the data write.
>     - client or agent crash doesn't want to access parts of multi-part
>       archive it will

The invalid extents list will be accurate regardless of a client, agent, 
or OST crash.  I hope.  Subsequent requests for missing data will result 
in new OST requests to the coordinator.
> 4) client is granted extent lock when that part of file is copied in

So that actually doesn't sound too bad.  I think the original idea of 
keeping the locking (and the coordinator) on the MDT (below) is still 
simpler, but I think it's going to be the recovery issues that decide 
this one way or the other.

Original in-place copyin idea:
When MDT generates new layout, it takes PW write locks on all extents of 
every stripe on behalf of the agent, and then somehow transfers these 
locks to the agent (this transferability was the point of using the 
group lock).  The agent then releases extent locks as it copies in data.
This was the first design we discussed in Menlo Park:

    (older idea, for posterity)
    Open intent enqueues layout lock.  MDT checks "purged" bit; if purged,
    MDT selects new layout and populates MD.  MDT takes group extent
    locks on all objects, then grants layout read lock to client,
    allowing open to finish successfully, quickly.  (Client reads/writes
    will block forever on extents enqueues until group lock has been
    dropped.)  MDT then sends request to coordinator requesting copyin
    FID XXXX with group lock id YYYY (and extents 0-end).  Coordinator
    distributes that request to an appropriate agent.  Agent retrieves
    file from HSM and writes into /.lustre/fid/XXXX:XXXX using group
    lock YYYY.  Agent takes group lock, MDT still holds group lock. 
    When finished, the agent clears "purged" bit from EA, and drops the
    group lock.  Clearing purged bit causes MDT to drop group lock as
    well, allowing the client to read/write.

It gets fuzzy at the end there, about exactly when the MDT drops the 
group lock in order to handle the dead agent case.  It seems the safe 
thing to do is for the MDT to keep it until the agent is done, but then 
this blocks access to completed extents.  If the MDT drops the group 
lock as soon as the agent takes it, then somehow the agent converts the 
group lock to regular write lock, then other clients can get read/write 
locks on released extents.  But if the agent dies, the extent locks will 
be freed at eviction, and other clients are free to start reading 
(missing) data.
