[Lustre-devel] "Simple" HSM straw man
Nathan.Rutman at Sun.COM
Thu Oct 9 15:37:21 PDT 2008
This is intended to be a starting point for discussion; the concepts
here have been hashed through a few times and hopefully represent the
best current thinking.
1. all single-file coherency issues are in kernel space (file locking,
2. all policy decisions are in user space (using changelogs, df, etc)
3. coordinator/mover communication will use LNET
4. "simple" refers to
a. integration with HPSS only
b. depends on changelog for policy decisions
c. restore on file open, not data read/write
5. HSM tracks entire files, not stripe objects
6. HSM namespace is flat, all files are addressed by FID only
7. Desired: coordinator and movers can be reused by (non-HSM) replication
a. combined kernel (LNET comms) and userspace processes
b. userspace processes will use Lustre clients for data i/o
c. will use special fid directory for file access (.lustre/fid/XXXX)
d. interfaces with hardware-specific copy tool to access HSM files
e. kernel process encompasses service threads listening for
coordinator requests, passes these up to userspace process via upcall.
No interaction with the client is needed; this is a simple message
a. decides and dispatches copyin and copyout requests to movers
b. consolidates repeat requests
c. re-queues requests to a new agent if an agent becomes unresponsive
d. kernel space, associated with the MDT for cache-miss
e. ioctl interface for copyout, purge requests from policy engine
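
The coordinator duties listed above (dispatching requests, consolidating repeats, re-queueing when an agent stops responding) can be sketched as a small in-memory model. This is an illustration only: the class, its method names, the round-robin choice of agent, and the (fid, action) request key are all hypothetical and not part of the proposal. The real coordinator would live in kernel space and reach movers over LNET.

```python
from collections import deque


class Coordinator:
    """Toy model of coordinator request handling (hypothetical API)."""

    def __init__(self, agents):
        self.agents = list(agents)   # agents currently believed responsive
        self.pending = deque()       # requests not yet dispatched
        self.in_flight = {}          # (fid, action) -> agent handling it
        self._next = 0               # round-robin cursor (an assumption)

    def submit(self, fid, action):
        """Queue a request; return False if an identical one is already known."""
        key = (fid, action)
        if key in self.in_flight or key in self.pending:
            return False             # consolidate repeat requests
        self.pending.append(key)
        return True

    def dispatch(self):
        """Hand every pending request to an agent."""
        while self.pending and self.agents:
            key = self.pending.popleft()
            agent = self.agents[self._next % len(self.agents)]
            self._next += 1
            self.in_flight[key] = agent

    def agent_failed(self, agent):
        """Drop an unresponsive agent and re-queue whatever it was handling."""
        self.agents.remove(agent)
        for key, owner in list(self.in_flight.items()):
            if owner == agent:
                del self.in_flight[key]
                self.pending.append(key)
        self.dispatch()
```

For example, submitting the same copyout twice yields a single in-flight request, and calling agent_failed() on the agent that owns it moves the request to a surviving agent.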
3. Policy engine (aka Space Manager)
a. makes policy decisions for copyout, purge
b. normally uses changelogs and 'df' for input; rarely is allowed to
c. userspace process, requests copyout and purge via ioctl to
4. MDT changes
a. Per-file layout lock
A new layout lock is created for every file. Private writer lock is
taken by the MDT when allocating/changing file layout (LOV EA).
Shared reader locks are taken by anyone reading the layout (client
opens, lfs getstripe). Anyone taking a new extent lock anywhere in
the file must first hold the layout lock.
Problem: Layout lock can't be held by liblustre during i/o?
b. lov EA changes
i. flags: file_is_purged "purged", copyout_begin,
file_in_HSM_is_out_of_date "hsm_dirty", copyout_complete. The purged
flag is always manipulated under a write layout lock, the other flags
ii. "window" EA range of non-purged data (rev2)
c. new file ioctls: HSMCopyOut, HSMPurge, HSMCopyinDone
a. Policy engine decides to copy a file to HSM, executes HSMCopyOut
ioctl on file
b. ioctl handled by MDT, which passes request to Coordinator
c. coordinator dispatches request to mover. request should include
file extents (for future purposes)
d. normal extents read lock is taken by mover running on client
e. mover sets "copyout_begin" bit and clears "hsm_dirty" bit in EA.
f. any writes to the file set the "hsm_dirty" bit (may be
lazy/delayed with mtime or filesize change updates on MDT). Note that
file writes need not cancel copyout; for a fs with a single big file, we
don't want to keep interrupting copyout or it will never finish.
g. when done, mover checks hsm_dirty bit. If set, clears
copyout_begin, indicating current file is not in HSM. If not set,
mover sets "copyout_complete" bit. File layout write lock is not taken
during mover flag manipulation. (Note: file modifications after copyout
is complete will have both copyout_complete and hsm_dirty bits set.)
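
The flag handling in steps e through g can be sketched as pure functions over a flag word. This is a minimal model, not an implementation: the bit values and function names are placeholders chosen for illustration, since the proposal names the flags but assigns no values.

```python
from enum import IntFlag


class HsmFlags(IntFlag):
    # Bit values are placeholders; the proposal does not assign any.
    PURGED = 1
    COPYOUT_BEGIN = 2
    HSM_DIRTY = 4
    COPYOUT_COMPLETE = 8


def copyout_start(flags):
    """Step e: mover sets copyout_begin and clears hsm_dirty."""
    return (flags | HsmFlags.COPYOUT_BEGIN) & ~HsmFlags.HSM_DIRTY


def file_written(flags):
    """Step f: any write to the file sets hsm_dirty."""
    return flags | HsmFlags.HSM_DIRTY


def copyout_finish(flags):
    """Step g: the mover checks hsm_dirty when the copy is done."""
    if flags & HsmFlags.HSM_DIRTY:
        # File changed during the copy: the current file is not in HSM.
        return flags & ~HsmFlags.COPYOUT_BEGIN
    return flags | HsmFlags.COPYOUT_COMPLETE
```

A write that lands between copyout_start() and copyout_finish() leaves only hsm_dirty set, so the policy engine can simply retry later. A write after a clean finish leaves both copyout_complete and hsm_dirty set, matching the note in step g.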
2. purge (aka punch)
a. Policy engine decides to purge a file, executes HSMPurge ioctl on
b. ioctl handled by MDT
c. MDT takes a write lock on the file layout lock
d. MDT enqueues write locks on all extents of the file. Once these
are granted, no client has any dirty cache and no client can take
new extent locks until the layout lock is released. MDT drops all extent locks.
e. MDT verifies that hsm_dirty bit is clear and copyout_complete bit
f. MDT marks the LOV EA as "purged"
g. MDT sends destroys for the OST objects, using destroy llog entries to
guard against object leakage during OST failover
h. MDT drops layout lock.
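
The ordering of purge steps c through h can be sketched as follows. This is a toy model under stated assumptions: the file is a plain dict, locks are represented only as log entries, and the bit values are placeholders. The check on copyout_complete assumes that bit must be set before a purge; the flag check is the only place the sketch can refuse, and the layout lock is dropped on every path.

```python
# Placeholder bit values; the proposal names the flags but assigns no values.
PURGED, COPYOUT_BEGIN, HSM_DIRTY, COPYOUT_COMPLETE = 1, 2, 4, 8


def purge(file, log):
    """Walk purge steps c-h on a dict with 'flags' and 'objects' keys.

    Returns True if the file was purged, False if the flag check refused.
    """
    log.append("take layout write lock")                   # step c
    try:
        log.append("enqueue, then drop, extent write locks")  # step d
        clean = not (file["flags"] & HSM_DIRTY)
        # Assumption: copyout_complete must be set for the purge to proceed.
        archived = bool(file["flags"] & COPYOUT_COMPLETE)
        if not (clean and archived):                       # step e
            return False
        file["flags"] |= PURGED                            # step f
        log.append("destroy OST objects (llog-protected)") # step g
        file["objects"] = []
        return True
    finally:
        log.append("drop layout lock")                     # step h
```

Because the purged flag is only set while the layout write lock is held, a client that later obtains a layout read lock sees either the old objects or the purged flag, never a half-purged file.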
3. restore (aka copyin aka cache miss)
a. Client open intent enqueues layout read lock.
b. MDT checks "purged" bit; if purged, lock request response
includes "wait forever" flag, causing client to block the open.
c. MDT creates a new layout with a stripe pattern similar to the
original, allocating new objects on new OSTs. (We should try to respect
specific layout settings (pool, stripecount, stripesize), but be
flexible if e.g. pool doesn't exist anymore. Maybe we want to ignore
offset and/or specific ost allocations in order to rebalance.)
d. MDT sends request to coordinator requesting copyin of the file to
.lustre/fid/XXXX with extents 0-EOF. Extents may be used in the future
to (a) copy in part of a file, in low-disk-space situations; (b) copy in
individual stripes simultaneously on multiple OSTs.
e. Coordinator distributes that request to an appropriate mover.
f. Writes into .lustre/fid/* are not required to hold layout read
lock (or special flag is passed to open, or group write lock on layout
is passed to mover)
g. Mover copies data from HSM
h. When finished, mover calls ioctl HSM_COPYIN_DONE on the file
i. MDT clears "purged" bit from LOV EA
j. MDT releases the layout lock
k. This sends a completion AST to the original client, which then
completes its open.
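
The blocking behaviour of the restore path (an open on a purged file waits until copyin is reported done) can be sketched with a per-file event. Everything here is hypothetical scaffolding: the Mdt class, its fields, and the timeout parameter exist only for the illustration. The default of no timeout stands in for the "wait forever" flag, and the event stands in for the layout lock plus completion AST.

```python
import threading

PURGED = 1  # placeholder bit value; the proposal does not assign one


class Mdt:
    """Toy model of the MDT side of restore (hypothetical API)."""

    def __init__(self):
        self.flags = {}            # fid -> flag bits
        self.restoring = {}        # fid -> Event, set when copyin is done
        self.copyin_requests = []  # what would be sent to the coordinator
        self._lock = threading.Lock()

    def open(self, fid, timeout=None):
        """Return True once the file is usable; block while it is purged."""
        with self._lock:
            if not (self.flags.get(fid, 0) & PURGED):
                return True
            event = self.restoring.get(fid)
            if event is None:
                # First opener triggers the copyin; later openers just wait.
                event = self.restoring[fid] = threading.Event()
                self.copyin_requests.append(fid)
        return event.wait(timeout)

    def copyin_done(self, fid):
        """Called on behalf of the mover when the data is back."""
        with self._lock:
            self.flags[fid] &= ~PURGED
            event = self.restoring.pop(fid, None)
        if event is not None:
            event.set()            # stands in for the completion AST
```

In this model a second open of the same purged file does not issue a second copyin request; it waits on the restore already in progress.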
TBD - I think there's enough in here to chew on for a while.
Things requiring a more detailed look
1. configuration of HSM/movers
2. policy engine
3. "complex" HSM roadmap
a. partial access to files during restore
b. partial purging for file type identification, image thumbnails, ??
c. integration with other HSM backends (ADM, ??)
4. layout locks and liblustre