[Lustre-devel] CMD directory split

Peter Braam Peter.Braam at Sun.COM
Tue Sep 16 09:00:28 PDT 2008

If these dynamic layout changes are being considered, doing so for file data
first or in parallel might make sense.

I think that fixed directory striping patterns may in fact be fine for a
long time to come. Spreading directory entries over a particular pool, or
over any nodes using a certain width, may be all we need.
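
As a rough illustration of a fixed striping pattern, the sketch below maps
each entry name onto one of a fixed number of MDTs drawn from a pool. The
function name, the use of CRC32 as the hash, and the "first `width` members
of the pool" rule are all assumptions made for the example, not the actual
CMD design.

```python
import zlib
from typing import Sequence


def stripe_for_name(name: str, mdt_pool: Sequence[int], width: int) -> int:
    """Return the MDT index that would hold directory entry `name`.

    Hypothetical layout: the directory is striped over the first `width`
    MDTs of `mdt_pool`, and an entry's stripe is chosen by hashing its name.
    """
    if not 1 <= width <= len(mdt_pool):
        raise ValueError("width must be between 1 and the pool size")
    stripes = mdt_pool[:width]
    # CRC32 is only a stand-in for whatever name hash a real layout uses.
    name_hash = zlib.crc32(name.encode("utf-8"))
    return stripes[name_hash % width]


# A directory striped two wide over a pool of four MDTs.
owner = stripe_for_name("results.dat", mdt_pool=[3, 5, 7, 9], width=2)
```

Because the width is fixed when the directory is created, a client can find
the right MDT for any name from the layout alone, with no entries ever
having to move.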


On 9/16/08 5:05 AM, "Eric Barton" <eeb at sun.com> wrote:

> Guys,
> I'm cc-ing lustre-devel - it's of general interest.
> I definitely think the first CMD product releases should stick to
> static directory layouts with all directories contained within a
> single MDT by default - i.e. you have to do something special to
> create a striped dir.  Meanwhile we should start ASAP on completing
> the design of automatic dir splitting to handle _all_ the recovery
> cases.
> IMHO, a one-time-only directory split seems a bit too all-or-nothing.
> What is the reasoning behind the assumption that further splitting is
> not required?  Also, directory size doesn't necessarily seem like the
> only or even the best clue about when to split and how.  So here are a
> couple of suggestions.
> 1. Consider the split from 1 MDT to several just as a special case of
>    migration - i.e. allow arbitrary n->m re-layout over MDTs.  We
>    would like to support metadata migration in any case for space
>    management.  Andreas' ideas about migration by mirroring seem
>    equally applicable to directories (furthermore mirrored directories
>    seem like a valuable component of a namespace availability and
>    resilience feature).
> 2. Keep the discussion on policy (when to split) separate from
>    mechanism (how to split).
>     Cheers,
>               Eric
>> -----Original Message-----
>> From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM] On Behalf Of
>> Andreas Dilger
>> Sent: 16 September 2008 9:23 AM
>> To: Nikita Danilov
>> Cc: Yury Umanets; Eric Barton; Alex Zhuravlev
>> Subject: Re: CMD directory split
>> On Sep 15, 2008  14:20 +0400, Nikita Danilov wrote:
>>> Andreas Dilger writes:
>>>> I think we can have a very simple directory split if we
>>>> consider the directory split like file IO instead of inserting
>>>> thousands of dirents.
>>> In some sense this is already done. Split (cmm/cmm_split.c) uses
>>> the same interface as readdir to construct a pageful of directory
>>> entries and to send them to a slave MDT, where they are
>>> inserted. It is not a raw `write', because the actual directory page
>>> format is encapsulated within osd.
>> Yes, I was thinking of something like this, I didn't know that is
>> how it is actually handled.  The other issue is that the directory
>> creation and insertions should be done in a single transaction in
>> order to simplify recovery.  In that case we don't have to worry
>> about the case where the directory is partially created and
>> populated.
>>> It's my impression that another source of trouble with split is a
>>> complicated locking scheme that it requires to keep clients happy.
>> How is the locking of the directory any different than the locking
>> on the LOV EA needed to restripe a file for migration?  The locks on
>> the inodes themselves do not change, because the inodes are not
>> moving.  The pages on the directory itself are revoked on a regular
>> basis whenever there is a new insertion in any case (i.e. when the
>> directory is split).
>> It would seem that the only thing which needs to be changed is the
>> LMV EA on the clients.
>>> It would be _much_ easier if directories were split at the time of
>>> creation like files are. That would also eliminate almost all
>>> recovery issues and page-shuffling mechanics.
>> This might be possible for a simple initial implementation, but it
>> isn't a good long-term solution.  Consider the problems we face even
>> today with widely-striped files - stat slowdowns to track the
>> size/mtime/ctime, readdir will always have to do RPCs to each MDT to
>> get the entries even if there are only a few entries, unlinks will
>> need multiple RPCs, etc.
>> In contrast, we can tune the split threshold so that the majority of
>> small directories remain on a single MDT, and only large directories
>> are split (with some small overhead for the split).
>> In the future we have to consider configurations with hundreds or
>> even thousands of MDTs, perhaps one on each OST, in order to scale
>> metadata and small file performance dramatically.
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.

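The separation of policy from mechanism suggested in the thread, together
with the tunable split threshold and the idea of arbitrary n->m re-layout,
could be sketched roughly as follows. Everything here is hypothetical: the
function names, the dictionary model of a directory, and the CRC32 name
hash are assumptions for illustration, and the sketch ignores the
transaction, locking, and recovery questions the thread is really about.

```python
import zlib
from typing import Dict, List, Sequence


def should_split(entry_count: int, threshold: int) -> bool:
    """Policy: decide when to split. This placeholder looks only at size,
    which the thread notes is not necessarily the best signal."""
    return entry_count > threshold


def relayout(shards: Dict[int, List[str]],
             new_mdts: Sequence[int]) -> Dict[int, List[str]]:
    """Mechanism: redistribute entries held on any n MDTs over any m MDTs.

    `shards` maps an MDT index to the entry names it currently holds.
    A 1 -> several split is just the case where `shards` has one key.
    """
    if not new_mdts:
        raise ValueError("need at least one target MDT")
    result: Dict[int, List[str]] = {mdt: [] for mdt in new_mdts}
    for names in shards.values():
        for name in names:
            # Stand-in name hash; a real layout would define its own.
            slot = zlib.crc32(name.encode("utf-8")) % len(new_mdts)
            result[new_mdts[slot]].append(name)
    return result


# A directory that starts on a single MDT and is split once it grows.
directory = {0: [f"file{i}" for i in range(10)]}
total = sum(len(names) for names in directory.values())
if should_split(total, threshold=4):
    directory = relayout(directory, new_mdts=[0, 1, 2])
```

Keeping `should_split` apart from `relayout` means the trigger can later take
other inputs, or the same mechanism can serve migration for space
management, without either side changing.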