[Lustre-devel] readdir for striped dir
adilger at sun.com
Tue Mar 23 11:23:02 PDT 2010
On 2010-03-23, at 10:15, Nikita Danilov wrote:
> On 23 March 2010 14:29, Tom.Wang <Tom.Wang at sun.com> wrote:
>> LMV will use a new hash function to select the stripe object (mdc),
>> which can be independent of the one used in the storage. At the mdc
>> level, it just needs to map the entries of each dir stripe object
>> into the cache. We can index the cache any way we want; hash order
>> (as in the server storage) is probably a good choice, because the
>> client can easily find and cancel the pages by hash in a later
>> dir-extent lock. Note: even in this case, the client does not need
>> to know the server hash scheme at all, since the server will set the
>> hash-offset of these pages; the client just needs to put these pages
>> in the cache by hash-
>> Currently, the cache will only be touched by readdir. If the cache
>> is used by readdir-plus later, i.e. we need to locate an entry by
>> name, then the client must use the same hash as the server storage,
>> but the server will tell the client which hash function it uses.
>> Yes, a different hash per dir stripe should not be a problem here.
> If I understand correctly, the scheme has the advantage of being a
> cleaner solution for readdir scalability than the "hash adjust" hack
> of CMD3 (see the comment in lmv/lmv_obd.c:lmv_readpage()).
Yes, I saw this for the first time just a few days ago and had to
shield my eyes :-).
> As a reminder, the problem is that if a number of clients readdir the
> same split directory, they all hammer the same servers one after
> another, negating the advantages of meta-data clustering. The
> solution is to cyclically shift hash space by an offset depending on
> the client, so that clients load servers uniformly. With 2-level
> hashing, this shifting can be done entirely within new LMV hash
Sure. Even without the 2-level hashing I wondered why the readdir
pages weren't simply pre-fetched from a random MDT index (N %
nstripes) by each client into its cache.
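
To make the pre-fetch idea concrete, here is a minimal sketch (not LMV
code; the function name `stripe_order` and its parameters are
hypothetical) of a client starting at stripe N % nstripes and then
walking the remaining stripes cyclically:

```python
def stripe_order(n, nstripes):
    """Return the order in which one client would visit the dir stripes.

    n        -- per-client number (random, or derived from a client id);
                how N is chosen is an assumption left open here
    nstripes -- number of stripes (MDTs) the directory is split over
    """
    if nstripes <= 0:
        raise ValueError("nstripes must be positive")
    start = n % nstripes
    # Visit every stripe exactly once, wrapping around after the last one.
    return [(start + i) % nstripes for i in range(nstripes)]


if __name__ == "__main__":
    # Clients with different N begin on different stripes.
    for n in range(4):
        print("client N=%d visits stripes %s" % (n, stripe_order(n, 4)))
```

Each client still reads every stripe once; only the starting point
differs, so the first RPCs from different clients land on different
MDTs.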
One question is whether it is important for applications to get
entries back from readdir in a consistent order between invocations,
which would imply that N should be persistent across calls (e.g.
NID). If it is important for the application to get the entries in
the same order, it would mean higher latency on some clients to return
the first page, but thereafter the clients would all be pre-fetching
round-robin. In aggregate, however, it would improve performance by
distributing the RPC traffic more evenly, and also by starting the
disk IO on all of the MDTs concurrently instead of in series.
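
As a rough sketch of the "persistent N" variant (an illustration only:
`start_stripe`, `first_page_load`, the example NIDs and the use of
CRC32 as the mapping are all assumptions, not anything in the Lustre
code), N could be derived from the client NID so that a given client
always starts on the same stripe, while different clients spread their
first requests over the MDTs:

```python
import zlib
from collections import Counter


def start_stripe(nid, nstripes):
    """Map a client NID string to a stable starting stripe index."""
    if nstripes <= 0:
        raise ValueError("nstripes must be positive")
    # CRC32 is only a stand-in for "some stable function of the NID".
    return zlib.crc32(nid.encode("ascii")) % nstripes


def first_page_load(nids, nstripes):
    """Count how many clients send their first readdir RPC to each stripe."""
    return Counter(start_stripe(nid, nstripes) for nid in nids)


if __name__ == "__main__":
    # Hypothetical client NIDs, for illustration only.
    nids = ["192.168.0.%d@tcp" % i for i in range(1, 65)]
    load = first_page_load(nids, 4)
    for stripe in range(4):
        print("stripe %d: %d first-page requests" % (stripe, load[stripe]))
```

Because the mapping depends only on the NID, repeated readdir calls
from the same client start at the same stripe, so the order of
returned entries would stay the same between invocations as long as
the directory itself does not change.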
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.