[Lustre-devel] Lustre client disk cache (fscache)
John Groves
John at systemfabricworks.com
Mon Jan 5 10:32:23 PST 2009
Andreas, thanks for the thoughtful reply, and sorry for being so slow to
acknowledge and respond to it. Responses are below.
On Fri, Nov 14, 2008 at 6:00 PM, Andreas Dilger <adilger at sun.com> wrote:
> On Nov 11, 2008 13:23 -0600, John Groves wrote:
> > This work is primarily motivated by the need to improve the performance
> > of Lustre clients as SMB servers to Windows nodes. As I understand it,
> > this need is primarily for file readers.
> >
> > Requirements
> >
> > 1. Enabling fscache should be a mount option, and there should be ioctl
> > support for enabling, disabling and querying a file's fscache usage.
>
> For Lustre there should also be the ability to do this via /proc/fs/lustre
> tunables/stats.
Makes sense, thanks.
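For concreteness, here's the sort of lprocfs tunable I have in mind - a
minimal sketch against the 1.6.5-era llite code, where the entry name and
the ll_fscache_enabled field are placeholders; only lprocfs_write_helper()
is existing code:

    static int ll_rd_fscache(char *page, char **start, off_t off,
                             int count, int *eof, void *data)
    {
            struct ll_sb_info *sbi = data;

            *eof = 1;
            return snprintf(page, count, "%u\n", sbi->ll_fscache_enabled);
    }

    static int ll_wr_fscache(struct file *file, const char *buffer,
                             unsigned long count, void *data)
    {
            struct ll_sb_info *sbi = data;
            int val, rc;

            rc = lprocfs_write_helper(buffer, count, &val);
            if (rc)
                    return rc;
            /* disabling at runtime would presumably also want to retire
             * any cookies already in use */
            sbi->ll_fscache_enabled = !!val;
            return count;
    }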
>
>
> > 2. Data read into the page cache will be asynchronously copied to the
> > disk-based fscache upon arrival.
> > 3. If requested data is not present in the page cache, it will be
> > retrieved
> > preferentially from the fscache. If not present in the fscache, data
> > will be read via RPC.
> > 4. When pages are reclaimed due to memory pressure, they should remain in
> > the fscache.
> > 5. When a user writes a page (if we support fscache for non-read-only
> > opens),
> > the corresponding fscache page must either be invalidated or
> > (more likely) rewritten.
> > 6. When a DLM lock is revoked, the entire extent of the lock must be
> > dropped from the fscache (in addition to dropping any page cache
> > resident pages) - regardless of whether any pages are currently
> > resident
> > in the page cache.
> > 7. As sort-of a corollary to #6, DLM locks must not be canceled by the
> > owner
> > as long as pages are resident in the fscache, even if memory pressure
> > reclamation has emptied the page cache for a given file.
> > 8. Utilities and test programs will be needed, of course.
> > 9. The fscache must be cleared upon mount or dismount.
>
> > High Level Design Points
> >
> > The following is written based primarily on review of the 1.6.5.1 code.
> > I'm aware that this is not the place for new development, but it was
> > deemed a stable place for initial experimentation.
>
> Note that the client IO code was substantially re-written for the 2.0
> release. The client IO code from 1.6.5 is still present through the
> 1.8.x releases.
Understood.
> > Req. Notes
> >
> > 1. In current Redhat distributions, fscache is included and
> > NFS includes fscache support, enabled by a mount option.
> > We don't see any problems with doing something similar.
> > A per-file ioctl to enable/disable fscache usage is also seen
> > as straightforward.
> >
> > 2. When an RPC read (into the page cache) completes, in the
> > ll_ap_completion() function, an asynchronous read to the
> > same offset in the file's fscache object will be initiated.
> > This should not materially impact access time (think dirty page
> > to fscache filesystem).
>
>
> Do you mean an "asynchronous write to the ... fscache object"?
Yes - write it is.
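To make points 2 and 3 concrete, here is roughly the shape of the
read-path hooks we've been experimenting with, following the NFS example.
Only the fscache_* calls are existing API; ll_i2fscookie() (a per-inode
cookie accessor) and the end_io callback are placeholders:

    /* store a page that just arrived via the read RPC (point 2) */
    static void ll_fscache_store(struct inode *inode, struct page *page)
    {
            struct fscache_cookie *cookie = ll_i2fscookie(inode);
            int rc;

            if (cookie == NULL)
                    return;

            rc = fscache_write_page(cookie, page, GFP_KERNEL);
            if (rc != 0)
                    /* don't leave a half-stored page behind */
                    fscache_uncache_page(cookie, page);
    }

    /* try the disk cache before falling back to the read RPC (point 3) */
    static int ll_readpage_from_fscache(struct inode *inode,
                                        struct page *page)
    {
            struct fscache_cookie *cookie = ll_i2fscookie(inode);

            if (cookie == NULL)
                    return -ENOBUFS;

            /* 0: read dispatched, end_io unlocks the page on arrival;
             * -ENODATA/-ENOBUFS: not cached, caller issues the RPC */
            return fscache_read_or_alloc_page(cookie, page,
                                              ll_fscache_read_complete,
                                              NULL, GFP_KERNEL);
    }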
> > 3. When the readpage method is invoked because a page is not
> > already resident in the page cache, the page will be read
> > first from the fscache. This is non-blocking and (presumably)
> > fast for the non-resident case. If available, the fscache
> > read will proceed asynchronously, after which the page will be
> > valid in the page cache. If not available in the fscache,
> > the RPC read will proceed normally.
> >
> > 4. Page removal due to memory pressure is triggered by a call to
> > the llap_shrink_cache function. This function should not require
> > any material change, since pages can be removed from the page
> > cache without removal from the fscache in this case. In fact,
> > if this doesn't happen, the fscache will never be read.
> > (note: test coverage will be important here)
> >
> > 5. It may be reasonable in early code to enable fscache only
> > for read-only opens. However, we don't see any inherent problems
> > with running an asynchronous write to the fscache concurrently
> > with a Lustre RPC write. Note that this approach would *never*
> > have dirty pages exist only in the fscache; if it's dirty it
> > stays in the page cache until it's written via RPC (or RPC
> > AND fscache if we're writing to both places).
>
>
> This is dangerous from the point of view that the write to the fscache
> may succeed, but the RPC may fail for a number of reasons (e.g. client
> eviction) so it would seem that the write to the fscache cannot start
> until the RPC completes successfully.
Good catch, thanks.
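So for writes, the fscache copy would only be stored from the write-RPC
completion path, and only on success - something like the following sketch,
where everything except fscache_uncache_page() is a placeholder:

    static void ll_write_rpc_complete(struct inode *inode,
                                      struct page *page, int rc)
    {
            if (rc == 0) {
                    /* the OST has the data; now it's safe to mirror it */
                    ll_fscache_store(inode, page);
            } else {
                    struct fscache_cookie *cookie = ll_i2fscookie(inode);

                    /* never let the disk cache get ahead of the servers */
                    if (cookie != NULL)
                            fscache_uncache_page(cookie, page);
            }
    }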
> > 6 & 7 This is where it gets a little more tedious. Let me revert to
> > paragraph form to address these cases below.
> >
> > 8 Testing will require the following:
> > * ability to query and punch holes in the page cache (already done).
> > * ability to query and punch holes in the fscache (nearly done).
> >
> > 9 I presume that all locks are canceled when a client dismounts
> > a filesystem, in which case it would never be safe to use data
> > in the fscache from a prior mount.
>
>
> A potential future improvement in the second generation of this feature
> might be the ability to revalidate the files in the local disk cache by
> the MDT and OST object versions, if those are also stored in fscache.
Cool idea.
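FS-Cache's auxiliary-data hook looks like a natural fit for that. A sketch
of what revalidation might look like - the aux struct, the lli_version
field, and where the version comes from are all open questions; only the
check_aux callback signature and the FSCACHE_CHECKAUX_* values are real:

    struct ll_fscache_aux {
            __u64 lfa_version;      /* e.g. an MDT/OST object version */
    };

    static enum fscache_checkaux ll_fscache_check_aux(void *cookie_netfs_data,
                                                      const void *data,
                                                      uint16_t datalen)
    {
            struct ll_inode_info *lli = cookie_netfs_data;
            const struct ll_fscache_aux *aux = data;

            if (datalen != sizeof(*aux))
                    return FSCACHE_CHECKAUX_OBSOLETE;
            /* lli_version is a placeholder for wherever the object
             * version would be kept on the client */
            if (aux->lfa_version != lli->lli_version)
                    return FSCACHE_CHECKAUX_OBSOLETE;
            return FSCACHE_CHECKAUX_OKAY;
    }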
> > Lock Revocation
> >
> > Please apply that "it looks to me like this is how things work" filter
> > here;
> > I am still pretty new to Lustre (thanks). My questions are summarized
> > after the text of this section.
> >
> > As of 1.6.5.1, DLM locks keep a list of page-cached pages
> > (lock->l_extents_list contains osc_async_page structs for all currently
> > cached pages - and I think the word extent is used both for each page
> > cached
> > under a lock, and to describe a locked region...is this right?). If a
> > lock
> > is revoked, that list is torn down and the pages are freed. Pages are
> > also
> > removed from that list when they are freed due to memory pressure, making
> > that list sparse with regard to the actual region of the lock.
> >
> > Adding fscache, there will be zero or more page-cache pages in the extent
> > list, as well as zero or more pages in the file object in the fscache.
> > The primary question, then, is whether a lock will remain valid (i.e. not
> > be
> > voluntarily released) if all of the page-cache pages are freed for
> > non-lock-related reasons (see question 3 below).
>
>
> Yes, the lock can remain valid on the client even when no pages are
> protected by the lock. However, locks with few pages are more likely
> to be cancelled by the DLM LRU because the cost of re-fetching those
> locks is much smaller compared to locks covering lots of data. The
> lock "weight" function would need to be enhanced to include pages that
> are in fscache instead of just those in memory.
Got it, thanks. That would have eluded me...
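Presumably that weighting would look something like this - entirely
hypothetical names, and the in-memory/on-disk ratio is just a guess:

    static unsigned long ll_lock_weigh(struct ldlm_lock *lock)
    {
            unsigned long mem_pages = ll_lock_page_count(lock);
            unsigned long fsc_pages = ll_lock_fscache_page_count(lock);

            /* fscache pages are cheaper to re-fetch than pages that must
             * come back over the wire, but they should still pin the
             * lock in the LRU to some degree */
            return mem_pages + fsc_pages / 4;
    }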
> > The way I foresee cleaning up the fscache is by looking at the overall
> > extent of the lock (at release or revocation time), and punching a
> > lock-extent-sized hole in the fscache object prior to looping through
> > the page list (possibly in cache_remove_lock() prior to calling
> > cache_remove_extents_from_lock()).
>
FYI it turns out that fscache doesn't have the ability to punch a hole. The
whole file has to be dropped at present.
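So revocation-time cleanup currently reduces to retiring the whole cookie.
A sketch, where lli_fscookie is a placeholder field and only
fscache_relinquish_cookie() and ll_i2info() are existing API:

    static void ll_fscache_invalidate(struct inode *inode)
    {
            struct ll_inode_info *lli = ll_i2info(inode);

            if (lli->lli_fscookie == NULL)
                    return;

            /* retire == 1 discards all data stored under the cookie */
            fscache_relinquish_cookie(lli->lli_fscookie, 1);
            lli->lli_fscookie = NULL;
            /* a later cacheable read can acquire a fresh cookie */
    }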
>
> >
> > However, that would require finding the inode, which (AFAICS) is not
> > available in that context (ironically, unless the l_extents_list is non-
> > empty, in which case the inode can be found via any of the page structs
> > in
> > the list). I have put in a hack to solve this, but see question 6 below.
>
>
> Actually, each lock has a back-pointer to the inode that is referencing
> it, in l_ast_data, so that lock_cancel->mapping->page_removal can work.
> Use ll_inode_from_lock() for this.
That's much nicer than my hack...thanks.
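With that back-pointer, the hook in cache_remove_lock() becomes something
like this sketch - cache_remove_lock() and ll_inode_from_lock() are the
1.6.5 code, ll_fscache_invalidate() is from the sketch above:

    static void cache_remove_lock_fscache(struct ldlm_lock *lock)
    {
            struct inode *inode = ll_inode_from_lock(lock);

            if (inode == NULL)
                    return;

            ll_fscache_invalidate(inode);
            /* ll_inode_from_lock() took a reference via igrab() */
            iput(inode);
    }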
> > Summarized questions:
> > Q1: Where can I read up on the unit testing infrastructure for Lustre?
>
>
> There is an internal wiki page with some information on this, it should
> probably be moved to the public wiki.
If there's a way to let me know when that happens, I'd appreciate it. I'm
not a full-time lustre-devel reader (at least currently).
> > Q2: Is stale cache already covered by existing unit tests?
>
>
> I'm not sure what you mean. There is no such thing as stale cache in
> Lustre.
What I was driving at is a test to verify that any page cache data was
discarded when a lock was revoked. The same test would catch failure to
discard fscache data, that being a potentially stale place to reload the
page cache from. Perhaps that's implicitly covered somehow.
> > Q3: Will a DLM lock remain valid (i.e. not be canceled) even if its page
> > list is empty (i.e. all pages have been freed due to memory
> > pressure)?
>
>
> Yes, though the reverse is impossible.
>
> > Q4: Will there *always* be a call to cache_remove_lock() when a lock is
> > canceled or revoked? (i.e. is this the place to punch a hole in the
> > fscache object?)
> > Q5: for the purpose of punching a hole in a cache object upon lock
> > revocation, can I rely on the lock->l_req_extent structure as the
> > actual extent of the lock?
>
> No, there are two different extent ranges on each lock. The requested
> extent, and the granted extent. The requested extent is the minimum
> extent size that the server could possibly grant to the client to finish
> the operation (e.g. large enough to handle a single read or write syscall).
> The server may decide to grant a larger lock if the resource (object) is
> not contended.
>
> In the current implementation, the DLM will always grant a full-file lock
> to the first client that requests it, because the most common application
> case is that only a single client is accessing the file. This avoids any
> future lock requests for this file in the majority of cases.
Thanks. Given that fscache invalidation turns out to be full-file anyway,
this becomes moot for the time being.
> > Q6: a) is there a way to find the inode that I've missed?, and
> > b) if not what is the preferred way of giving that function a way to
> > find the inode?
>
>
> See above.
>
> > FYI we have done some experimenting and we have the read path in a
> > demonstrable state, including crude code to effect lock revocation on the
> > fscache contents. The NFS code modularized the fscache hooks pretty
> > nicely,
> > and we have followed that example.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
>
Thanks again!
John