[Lustre-devel] readdir for striped dir
Andreas Dilger
adilger at sun.com
Wed Mar 24 15:23:24 PDT 2010
On 2010-03-23, at 13:14, Nikita Danilov wrote:
> On 23 March 2010 21:23, Andreas Dilger <adilger at sun.com> wrote:
>>
>>> The problem, to recap, is that if a number of clients readdir the
>>> same split directory, they all hammer the same servers one after
>>> another, negating the advantages of metadata clustering. The
>>> solution is to cyclically shift the hash space by a per-client
>>> offset, so that clients load the servers uniformly. With 2-level
>>> hashing, this shifting can be done entirely within the new LMV
>>> hash function.
>>
>> Sure. Even without the 2-level hashing, I wondered why the readdir
>> pages weren't simply pre-fetched from a random MDT index
>> (N % nstripes) by each client into its cache.
>
> Do you mean that a client reads from servers in order N, N + 1, ...,
> N - 1? Or that all clients read pages from servers in the same order
> 1, 2, ..., nstripes, and in addition every client pre-fetches from
> servers N, N + 1, ..., N - 1?
What I meant was that each client starts its read on a different
stripe (e.g. dir_stripe_index = client_nid % num_stripes), and reads
(optionally with readahead) chunks in approximately round-robin order
from that starting index, but still returns the readdir data to
userspace as if it were reading starting at dir_stripe_index = 0.
That implies that, depending on the client NID, a client may buffer
up to (num_stripes - 1) reads (pages or MB, depending on how much is
read per RPC) until it gets to the 0th stripe index and can start
returning entries from that stripe to userspace. The readahead of
directory stripes should always be offset from the stripe currently
being processed, so that the clients continue to distribute load
across MDTs even when they are working from cache.
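A minimal sketch of that scheme (the function names and the NID-based
starting offset are illustrative, not actual Lustre code):

```python
def read_order(client_nid, num_stripes):
    """Stripes in the order this client fetches them: it starts at
    its own NID-derived offset to spread load across MDTs, then
    cycles round-robin through all stripes."""
    start = client_nid % num_stripes
    return [(start + i) % num_stripes for i in range(num_stripes)]

def return_order(num_stripes):
    """Entries are still handed to userspace starting at stripe 0,
    so chunks fetched before reaching stripe 0 must be buffered --
    up to (num_stripes - 1) of them in the worst case."""
    return list(range(num_stripes))

# Example: a client with NID 5 on a 4-stripe directory fetches
# stripes 1, 2, 3, 0 but returns entries in order 0, 1, 2, 3.
order = read_order(client_nid=5, num_stripes=4)
```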
> As to the first case, a directory entry hash value is used as a value
> of ->d_off field in struct dirent. Historically this field means byte
> offset in a directory file and hash adjustment tries to maintain its
> monotonicity through readdir iterations.
Agreed, though with DMU backing storage (or even ldiskfs with a
different hash function) the pages read from different MDTs may
contain entries with overlapping hash values. There will need to be
some way to disambiguate a hash X that was read from MDT 0 from the
same hash X read from MDT 1.
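One possible way to disambiguate (purely a sketch, not the actual
on-wire cookie format) is to fold the stripe index into the low bits
of the 64-bit readdir cookie, below the hash, so equal hashes from
different MDTs yield distinct cookies while cookie order still
follows hash order:

```python
STRIPE_BITS = 8  # assumption: supports up to 256 stripes

def make_cookie(hash_value, stripe_idx):
    """Combine a per-stripe directory hash with the stripe index:
    cookies stay monotone in the hash, and ties between stripes
    remain distinct and ordered by stripe index."""
    return (hash_value << STRIPE_BITS) | stripe_idx

def split_cookie(cookie):
    """Recover (hash, stripe index) from a combined cookie."""
    return cookie >> STRIPE_BITS, cookie & ((1 << STRIPE_BITS) - 1)
```

The trade-off is that the hash loses STRIPE_BITS bits of range, which
matters for telldir()/seekdir() cookies that must fit the d_off field.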
It may be that the prefetching described above will help here. If the
client is doing hash-ordered reads from each MDT, it could merge the
entries on the client and return them to userspace in strict hash
order, even though the client doesn't know the underlying hash
function used. Presumably, with a well-behaved hash function, the
entries in each stripe are uniformly distributed, so progress will be
made through all stripes in a relatively uniform manner (i.e. reads
will be going to all MDTs at about the same rate).
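That client-side merge is just a standard k-way merge over per-stripe,
hash-ordered entry streams; a sketch (the (hash, name) tuple layout is
hypothetical):

```python
import heapq

def merge_stripes(stripe_entries):
    """stripe_entries: one list per MDT stripe of (hash, name)
    tuples, each already sorted by hash as returned by that MDT.
    Yields (hash, stripe_idx, name) in global hash order; equal
    hashes are broken by stripe index, so the merged order is
    deterministic without knowing the hash function itself."""
    heap = [(entries[0][0], idx, 0)
            for idx, entries in enumerate(stripe_entries) if entries]
    heapq.heapify(heap)
    while heap:
        h, idx, pos = heapq.heappop(heap)
        yield h, idx, stripe_entries[idx][pos][1]
        pos += 1
        if pos < len(stripe_entries[idx]):
            heapq.heappush(heap, (stripe_entries[idx][pos][0], idx, pos))
```

Because each stream is consumed at the rate its hashes come up, a
uniform hash drains all stripes at about the same pace, matching the
load-balancing argument above.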
>> One question is whether it is important for applications to get
>> entries back from readdir in a consistent order between
>> invocations, which would imply that N should be persistent across
>> calls (e.g. NID). If it is important for the application to get the
>> entries in the same order, it would mean higher latency on some
>> clients to return the first page, but thereafter the clients would
>> all be pre-fetching round-robin. In aggregate it would speed up
>> performance, however, by distributing the RPC traffic more evenly,
>> and also by starting the disk IO on all of the MDTs concurrently
>> instead of in series.
>
> POSIX doesn't guarantee readdir repeatability, I am not sure about NT.
In SUSv2 I didn't see any mention of entry ordering per se, though
telldir() and seekdir() should presumably be repeatable for NFS
re-export, which implies that the user-visible ordering can't change
randomly between invocations, and shouldn't change between clients,
or clustered NFS export would fail.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.