[Lustre-devel] Sub Tree lock ideas.
Oleg.Drokin at Sun.COM
Tue Jan 27 20:39:43 PST 2009
On Jan 26, 2009, at 5:08 AM, Andreas Dilger wrote:
> A few comments that I have from the later discussions:
> - you previously mentioned that only a single client would be able to
> hold a subtree lock. I think it is critical that multiple clients be
> able to get read subtree locks on the same directory. This would be
> very important for uses like many clients and a shared read-mostly
> directory like /usr/bin or /usr/lib.
In fact I see zero benefit for a read-only subtree lock except memory
conservation, which should not be such a big issue. Much more important
is to reduce the number of RPCs, especially synchronous ones.
> - Alex (I think) suggested that the STL locks would only be on a
> directory and its contents, instead of being on an arbitrary depth
> sub-tree. While it seems somewhat appealing to have a single lock
> that covers an entire subtree, the complexity of having to locate
> and manage arbitrary-depth locks on the MDS might be too high.
> Having only a single-level of subtree lock would avoid the need to
> pass cookies to the MDS for anything other than the directory in
> which names are being looked up.
I had a lengthy call with Eric today and at the end we came to the
conclusion that perhaps STL at the moment is total overkill.
What we need is the ability to reduce metadata RPC traffic.
We can start by implementing just WRITE locks on a directory that would
be responsible only for that directory and its contents (HELPS: by
allowing creates to be aggregated into batches before sending), plus a
special "entire file lock" (perhaps implemented as just a WRITE lock on
a file): a metadata lock that would guard all file data without
obtaining any locks from OSTs (it would be revoked by an open from
another client, and would perhaps need to support glimpses too).
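To illustrate the batching that such a directory WRITE lock would enable, here is a toy model (names and structure are hypothetical, not Lustre code): a client holding the lock queues creates locally and flushes them to the MDS as one batched RPC instead of one RPC per create.

```python
# Toy model, NOT Lustre code: a client holding a WRITE lock on a
# directory buffers creates locally and flushes them in one batch.

class MDSStub:
    """Stands in for the metadata server; counts RPCs received."""
    def __init__(self):
        self.rpcs = 0
        self.entries = set()

    def batched_create(self, names):
        self.rpcs += 1
        self.entries.update(names)

class DirWriteLockClient:
    def __init__(self, mds):
        self.mds = mds
        self.pending = []          # creates buffered under the WRITE lock

    def create(self, name):
        self.pending.append(name)  # no RPC yet: the lock guarantees
                                   # no other client can touch the dir

    def flush(self):
        if self.pending:
            self.mds.batched_create(self.pending)  # one RPC for all
            self.pending = []

mds = MDSStub()
client = DirWriteLockClient(mds)
for i in range(100):
    client.create("file%d" % i)
client.flush()
print(mds.rpcs)   # 1 RPC instead of 100
```

The point of the sketch is only the RPC count: 100 creates become a single server round trip while the lock is held.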
The WRITE directory lock only helps us to aggregate metadata RPCs if we
just created the empty directory OR if we have the entire list of entries
in that directory. If we do not have the entire directory content, we
must issue a synchronous create RPC to avoid cases where we locally
create a file that already exists in that dir, for example. So perhaps
in a lot of cases obtaining a write lock on a dir would need to be
followed by some sort of bulk directory read (readdir+ of sorts). This
is also not always possible, as I can imagine there could be directories
much bigger than what we would like to cache, in which case we would
need to resort to one-by-one creates.
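The rule above (asynchronous creates are safe only with complete directory knowledge) can be sketched as a small decision function. This is purely illustrative, not Lustre's actual logic:

```python
# Illustrative only: decide whether a create under a directory WRITE
# lock may be deferred, per the rule described above.

def create_mode(have_full_listing, cached_entries, name):
    """Return 'async' if the create can safely be deferred and
    batched, or 'sync' if it must go to the MDS immediately."""
    if have_full_listing:
        # Complete cache: a local existence check is authoritative.
        if name in cached_entries:
            raise FileExistsError(name)   # caught locally, zero RPCs
        return "async"
    # Incomplete cache: the name might already exist on the MDS
    # without us knowing, so the create must be synchronous.
    return "sync"

print(create_mode(True, {"a", "b"}, "c"))   # async
print(create_mode(False, {"a", "b"}, "c"))  # sync
```

With a full listing the client can even reject duplicate names without any server round trip, which is exactly what makes the readdir+ prefetch worthwhile.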
Another important thing we would need is lock conversion (try-up
conversion) so that we do not lose our entire cached directory content
after a conflicting ls comes in and we have written it out. (We do not
care all that much about writing out the entire content of the dirty
metadata cache at that point, since we still achieve the aggregation and
asynchronous creation; even just asynchronous creation would help.)
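The benefit of conversion over plain cancellation can be shown with a toy model (again hypothetical, not Lustre's DLM): a WRITE-to-READ downgrade lets the client keep its cached entries after writing out the dirty ones, while a full cancel throws the whole cache away.

```python
# Sketch: conversion vs. cancellation of a directory WRITE lock.

class DirLock:
    def __init__(self):
        self.mode = "WRITE"
        self.cache = {"a", "b", "c"}   # cached directory entries
        self.dirty = {"c"}             # created locally, not yet on MDS

    def writeout(self):
        self.dirty = set()             # flush pending creates to MDS

    def cancel(self):
        # Conflicting access without conversion support: write out
        # dirty state, then drop the lock AND the whole cache.
        self.writeout()
        self.cache = set()
        self.mode = None

    def convert_to_read(self):
        # Conflicting ls with conversion support: write out dirty
        # state, downgrade the mode, but keep the cache for reads.
        self.writeout()
        self.mode = "READ"

a, b = DirLock(), DirLock()
a.cancel()
b.convert_to_read()
print(len(a.cache), len(b.cache))   # 0 3
```

Either way the dirty creates get flushed; what conversion saves is the re-fetch of the directory content afterwards.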
Perhaps another useful addition would be to deliver multiple blocking or
glimpse callbacks from server to the client in a single RPC (as a result
of a readdir+ sort of operation inside a dir where many files have an
"entire file lock") (we already have aggregated cancels in the other
direction).
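A minimal sketch of that server-side aggregation, mirroring the existing client-to-server aggregated cancels (all names here are made up for illustration):

```python
# Toy model: the server coalesces all blocking callbacks destined for
# one client into a single RPC instead of sending one per lock.

class CallbackBatcher:
    def __init__(self):
        self.pending = {}        # client -> list of lock handles
        self.rpcs_sent = 0

    def queue(self, client, lock_handle):
        self.pending.setdefault(client, []).append(lock_handle)

    def send_all(self):
        for client, handles in self.pending.items():
            # One RPC carries every queued callback for this client.
            self.rpcs_sent += 1
        self.pending = {}

b = CallbackBatcher()
for handle in range(50):         # e.g. readdir+ touching 50 locked files
    b.queue("client1", handle)
b.send_all()
print(b.rpcs_sent)   # 1
```

Fifty revocations triggered by one readdir+ collapse into a single callback RPC to the client.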
This WRITE metadata lock is in fact a reduced subset of the STL lock,
stripped of its advanced features, but perhaps easier to implement
because of that.