Andreas, thanks for the thoughtful reply, and sorry for being so slow to respond. My responses are inline below.<br><br><div class="gmail_quote">On Fri, Nov 14, 2008 at 6:00 PM, Andreas Dilger <span dir="ltr">&lt;<a href="mailto:adilger@sun.com">adilger@sun.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d">On Nov 11, 2008  13:23 -0600, John Groves wrote:<br>

> This work is primarily motivated by the need to improve the performance<br>

> of lustre clients as SMB servers to windows nodes.  As I understand it,<br>

> this need is primarily for file readers.<br>

><br>

> Requirements<br>

><br>

> 1. Enabling fscache should be a mount option, and there should be ioctl<br>

>    support for enabling, disabling and querying a file's fscache usage.<br>

<br>

</div>For Lustre there should also be the ability to do this via /proc/fs/lustre<br>

tunables/stats.</blockquote><div><br>Makes sense, thanks.<br> <br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>

<div class="Ih2E3d"><br>

> 2. Data read into the page cache will be asynchronously copied to the<br>

>    disk-based fscache upon arrival.<br>

> 3. If requested data is not present in the page cache, it will be retrieved<br>

>    preferentially from the fscache.  If not present in the fscache, data<br>

>    will be read via RPC.<br>

> 4. When pages are reclaimed due to memory pressure, they should remain in<br>

>    the fscache.<br>

> 5. When a user writes a page (if we support fscache for non-read-only<br>

>    opens), the corresponding fscache page must either be invalidated or<br>

>    (more likely) rewritten.<br>

> 6. When a DLM lock is revoked, the entire extent of the lock must be<br>

>    dropped from the fscache (in addition to dropping any page cache<br>

>    resident pages) - regardless of whether any pages are currently resident<br>

>    in the page cache.<br>

> 7. As sort-of a corollary to #6, DLM locks must not be canceled by the owner<br>

>    as long as pages are resident in the fscache, even if memory pressure<br>

>    reclamation has emptied the page cache for a given file.<br>

> 8. Utilities and test programs will be needed, of course.<br>

> 9. The fscache must be cleared upon mount or dismount.<br>

<br>

> High Level Design Points<br>

><br>

> The following is written based primarily on review of the 1.6.5.1 code.<br>

> I'm aware that this is not the place for new development, but it was<br>

> deemed a stable place for initial experimentation.<br>

<br>

</div>Note that the client IO code was substantially re-written for the 2.0<br>

release.  The client IO code from 1.6.5 is still present through the<br>

1.8.x releases.</blockquote><div><br>Understood.<br> <br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d">

> Req.    Notes<br>

><br>

>  1.    In current Redhat distributions, fscache is included and<br>

>     NFS includes fscache support, enabled by a mount option.<br>

>     We don't see any problems with doing something similar.<br>

>     A per-file ioctl to enable/disable fscache usage is also seen<br>

>     as straightforward.<br>

><br>

>  2.     When an RPC read (into the page cache) completes, in the<br>

>     ll_ap_completion() function, an asynchronous read to the<br>

>     same offset in the file's fscache object will be initiated.<br>

>     This should not materially impact access time (think dirty page<br>

>     to fscache filesystem).<br>

<br>

</div><br>Do you mean an "asynchronous write to the ... fscache object"?</blockquote><div><br>Yes - write it is.<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="Ih2E3d">

>  3.     When the readpage method is invoked because a page is not<br>

>     already resident in the page cache, the page will be read<br>

>     first from the fscache.  This is non-blocking and (presumably)<br>

>     fast for the non-resident case.  If available, the fscache<br>

>     read will proceed asynchronously, after which the page will be<br>

>     valid in the page cache.  If not available in the fscache,<br>

>     the RPC read will proceed normally.<br>

><br>

>  4.     Page removal due to memory pressure is triggered by a call to<br>

>     the llap_shrink_cache function.  This function should not require<br>

>     any material change, since pages can be removed from the page<br>

>     cache without removal from the fscache in this case.  In fact,<br>

>     if this doesn't happen, the fscache will never be read.<br>

>     (note: test coverage will be important here)<br>

><br>

>  5.    It may be reasonable in early code to enable fscache only<br>

>     for read-only opens.  However, we don't see any inherent problems<br>

>     with running an asynchronous write to the fscache concurrently<br>

>     with a Lustre RPC write.  Note that this approach would *never*<br>

>     have dirty pages exist only in the fscache; if it's dirty it<br>

>     stays in the page cache until it's written via RPC (or RPC<br>

>     AND fscache if we're writing to both places).<br>

<br>

</div><br>This is dangerous from the point of view that the write to the fscache<br>

may succeed, but the RPC may fail for a number of reasons (e.g. client<br>

eviction) so it would seem that the write to the fscache cannot start<br>

until the RPC completes successfully.</blockquote><div><br>Good catch, thanks.<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d">


>  6 & 7    This is where it gets a little more tedious.  Let me revert to<br>

>     paragraph form to address these cases below.<br>

><br>

>  8    Testing will require the following:<br>

>     * ability to query and punch holes in the page cache (already done).<br>

>     * ability to query and punch holes in the fscache (nearly done).<br>

><br>

>  9  I presume that all locks are canceled when a client dismounts<br>

>     a filesystem, in which case it would never be safe to use data<br>

>     in the fscache from a prior mount.<br>

<br>

</div><br>A potential future improvement in the second generation of this feature<br>

might be the ability to revalidate the files in the local disk cache by<br>

the MDT and OST object versions, if those are also stored in fscache.</blockquote><div><br>Cool idea.<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="Ih2E3d">

> Lock Revocation<br>

><br>

> Please apply that "it looks to me like this is how things work" filter here;<br>

> I am still pretty new to Lustre (thanks).  My questions are summarized<br>

> after the text of this section.<br>

><br>

> As of 1.6.5.1, DLM locks keep a list of page-cached pages<br>

> (lock->l_extents_list contains osc_async_page structs for all currently<br>

> cached pages - and I think the word extent is used both for each page cached<br>

> under a lock, and to describe a locked region...is this right?).  If a lock<br>

> is revoked, that list is torn down and the pages are freed.  Pages are also<br>

> removed from that list when they are freed due to memory pressure, making<br>

> that list sparse with regard to the actual region of the lock.<br>

><br>

> Adding fscache, there will be zero or more page-cache pages in the extent<br>

> list, as well as zero or more pages in the file object in the fscache.<br>

> The primary question, then, is whether a lock will remain valid (i.e. not be<br>

> voluntarily released) if all of the page-cache pages are freed for<br>

> non-lock-related reasons (see question 3 below).<br>

<br>

</div><br>Yes, the lock can remain valid on the client even when no pages are<br>

protected by the lock.  However, locks with few pages are more likely<br>

to be cancelled by the DLM LRU because the cost of re-fetching those<br>

locks is much smaller compared to locks covering lots of data.  The<br>

lock "weight" function would need to be enhanced to include pages that<br>

are in fscache instead of just those in memory.</blockquote><div><br>Got it, thanks.  That would have eluded me...<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
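<br>A minimal sketch of the weight idea above, written in Python rather than Lustre C. The function names and the sample page counts are hypothetical and do not reflect the actual DLM LRU code; the sketch only assumes that the lightest lock is cancelled first.<br>

```python
def lock_weight(pages_in_memory, pages_in_fscache=0):
    """Hypothetical weight: count every page a lock protects, wherever it is cached."""
    return pages_in_memory + pages_in_fscache


def pick_lock_to_cancel(locks):
    """Return the name of the lightest lock; locks maps name to (memory, fscache) page counts."""
    return min(locks, key=lambda name: lock_weight(*locks[name]))


locks = {
    "A": (0, 5000),   # page cache reclaimed, but much data still in fscache
    "B": (16, 0),     # a few resident pages, nothing in fscache
}

# Counting only in-memory pages makes lock A look worthless.
memory_only_choice = min(locks, key=lambda name: lock_weight(locks[name][0]))
# Counting fscache pages as well protects lock A and picks B instead.
combined_choice = pick_lock_to_cancel(locks)
```

<br>With a memory-only weight, the lock that guards the most fscache data is the first to go, which would throw away the cached file.<br>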

<div class="Ih2E3d">

> The way I foresee cleaning up the fscache is by looking at the overall<br>

> extent of the lock (at release or revocation time), and punching a<br>

> lock-extent-sized hole in the fscache object prior to looping through<br>

> the page list (possibly in cache_remove_lock() prior to calling<br>

> cache_remove_extents_from_lock()).</div></blockquote><div><br>FYI, it turns out that fscache cannot punch a hole in a cache object. At present the whole file has to be dropped from the cache.<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="Ih2E3d"><br>

><br>

> However, that would require finding the inode, which (AFAICS) is not<br>

> available in that context (ironically, unless the l_extents_list is<br>

> non-empty, in which case the inode can be found via any of the page structs in<br>

> the list).  I have put in a hack to solve this, but see question 6 below.<br>

<br>

</div><br>Actually, each lock has a back-pointer to the inode that is referencing<br>

it, in l_ast_data, so that lock_cancel->mapping->page_removal can work.<br>

Use ll_inode_from_lock() for this.</blockquote><div><br>That's much nicer than my hack...thanks.<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="Ih2E3d">

> Summarized questions:<br>

> Q1: Where can I read up on the unit testing infrastructure for Lustre?<br>

<br>

</div><br>There is an internal wiki page with some information on this, it should<br>

probably be moved to the public wiki.</blockquote><div><br>If there's a way to let me know when that happens, I'd appreciate it. I'm not a full-time lustre-devel reader (at least currently).<br><br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="Ih2E3d">

> Q2: Is stale cache already covered by existing unit tests?<br>

<br>

</div><br>I'm not sure what you mean.  There is no such thing as stale cache in<br>

Lustre.</blockquote><div><br>What I was driving at is a test that verifies all page cache data is discarded when a lock is revoked. The same test would catch a failure to discard fscache data, since the fscache would otherwise be a potentially stale source from which to reload the page cache. Perhaps that's already covered implicitly somehow.<br>
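<br>To make that concrete, here is a rough sketch of the check as a toy Python model, not Lustre code. The class, its methods, and the dict standing in for the server are all made up for illustration. It assumes the read order from requirement 3 and whole-file fscache invalidation on revocation.<br>

```python
class CachingClient:
    """Toy model of one client's caches for a single file (not Lustre code)."""

    def __init__(self, server):
        self.server = server      # dict: page index to bytes, stands in for the OST
        self.page_cache = {}      # in-memory pages
        self.fscache = {}         # disk-backed cache pages
        self.lock_held = False
        self.rpc_reads = 0

    def read_page(self, index):
        # Lookup order from requirement 3: page cache, then fscache, then RPC.
        if not self.lock_held:
            self.lock_held = True            # re-acquire the lock (simplified)
        if index in self.page_cache:
            return self.page_cache[index]
        if index in self.fscache:
            data = self.fscache[index]
        else:
            data = self.server[index]        # stands in for the RPC read
            self.rpc_reads += 1
            self.fscache[index] = data       # copy to fscache once the read completes
        self.page_cache[index] = data
        return data

    def reclaim_memory(self):
        # Requirement 4: memory pressure empties the page cache only.
        self.page_cache.clear()

    def revoke_lock(self):
        # Requirement 6, with whole-file invalidation because fscache
        # cannot punch a hole: drop everything cached for the file.
        self.page_cache.clear()
        self.fscache.clear()
        self.lock_held = False


server = {0: b"old"}
client = CachingClient(server)

first = client.read_page(0)       # RPC read, populates both caches
client.reclaim_memory()
second = client.read_page(0)      # served from fscache, no new RPC
rpcs_before_revoke = client.rpc_reads

client.revoke_lock()              # another client is about to write
server[0] = b"new"
third = client.read_page(0)       # must go back to the server
```

<br>The last read is the one that matters: after revocation it has to come from the server, so a failure to drop the fscache copy would show up as stale data.<br>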

 </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d">

> Q3: Will a DLM lock remain valid (i.e. not be canceled) even if its page<br>

>     list is empty (i.e. all pages have been freed due to memory pressure)?<br>

<br>

</div><br>Yes, though the reverse is impossible.<br>

<div class="Ih2E3d"><br>

> Q4: Will there *always* be a call to cache_remove_lock() when a lock is<br>

>     canceled or revoked?  (i.e. is this the place to punch a hole in the<br>

>     fscache object?)<br>

> Q5: for the purpose of punching a hole in a cache object upon lock<br>

>     revocation, can I rely on the lock->l_req_extent structure as the<br>

>     actual extent of the lock?<br>

<br>

</div>No, there are two different extent ranges on each lock.  The requested<br>

extent, and the granted extent.  The requested extent is the minimum<br>

extent size that the server could possibly grant to the client to finish<br>

the operation (e.g. large enough to handle a single read or write syscall).<br>

The server may decide to grant a larger lock if the resource (object) is<br>

not contended.<br>

<br>

In the current implementation, the DLM will always grant a full-file lock<br>

to the first client that requests it, because the most common application<br>

case is that only a single client is accessing the file.  This avoids any<br>

future lock requests for this file in the majority of cases.</blockquote><div><br>Thanks.  Given that fscache invalidation turns out to be full-file anyway, this becomes moot for the time being.<br> <br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="Ih2E3d">

> Q6: a) is there a way to find the inode that I've missed?, and<br>

>     b) if not what is the preferred way of giving that function a way to<br>

>     find the inode?<br>

<br>

</div><br>See above.<br>

<div class="Ih2E3d"><br>

> FYI we have done some experimenting and we have the read path in a<br>

> demonstrable state, including crude code to effect lock revocation on the<br>

> fscache contents.  The NFS code modularized the fscache hooks pretty nicely,<br>

> and we have followed that example.<br>

<br>

</div>Cheers, Andreas<br>

<font color="#888888">--<br>

Andreas Dilger<br>

Sr. Staff Engineer, Lustre Group<br>

Sun Microsystems of Canada, Inc.<br>

<br>

<br>

</font></blockquote></div><br><br>Thanks again!<br>John<br>