[lustre-devel] [RPC] parallel directory operations for mainline Linux
NeilBrown
neilb at suse.com
Wed Dec 19 15:39:11 PST 2018
On Wed, Dec 19 2018, Andreas Dilger wrote:
> On Nov 22, 2018, at 21:44, NeilBrown <neilb at suse.com> wrote:
>>
>>
>> One of the remaining features of ldiskfs which is not in ext4fs is
>> parallel directory ops.
>> It would not be possible to get this upstream without VFS support
>> for parallel directory ops. Lustre doesn't use the VFS interfaces
>> so this lack is not an immediate problem for lustre, but it is a
>> real problem for upstreaming.
>>
>> This patch (which seems to work in my testing so far, but is probably
>> still buggy) adds VFS support for parallel dir ops - create and remove.
>> I haven't attempted rename - it would be complex for various reasons and
>> while I'm sure it is possible, I'm not sure it is worth the effort.
>>
>> With this patch a filesystem can indicate that it supports parallel ops
>> by setting a flag on a directory. The VFS will then get exclusive
>> access to the dentry - instead of the whole directory - when
>> performing the operation.
>>
>> A filesystem which supports this much have its own locking to ensure
>> that lookup, readdir, create, unlink can all happen in parallel.
>> For NFS this is easy as the server takes care of those details, so
>> this patch also adds parallel-ops support for NFS.
>> For a filesystem like ext4 it would mean adding some locking to
>> the internal data structures.
>>
>> I've had a bit of a look at the parallel-ops patch for ldiskfs and I
>> think it is over-engineered. We don't need a new locking primitive.
>>
>> I suspect I would start by adding a seqlock to each htree node.
>> This allows reads to proceed locklessly when no changes are happening
>> (if they are careful not to get confused by an inconsistent node).
>> A modification would normally find the relevant leaf with a similar
>> lockless walk, then lock the leaf, verify the seq-lock on the parent
>> hasn't changed, and perform the update.
>> In the rarer case when a leaf needs to split or merge something more
>> heavy handed would be needed - possibly lock the whole tree - possibly
>> just lock a higher node.
>>
>> I don't expect to look at ext4 parallel ops in more detail in the
>> immediate future, and I don't plan to post this upstream until we have
>> credible support in ext4. So I'm just posting it here now in case
>> anyone else want to explore how to make ext4 work with this.
Thanks for having a look.
>
> Why not start with the existing pdirops patch for ext4, which has been
> in production use for many years already and works very well? Since
> this doesn't affect the on-disk structures at all, it would be
> possible to improve/replace it in the future if needed.
Yes, that would be a good place to start. I suspect a lot of work would
be needed to make it upstreamable, but having a known-working base makes
sense.
>
> I had a quick pass through your patch, and it seems reasonable.
> The main question is whether the VFS interface is suitable for
> use with ext4 with (some) pdirops patch, or does it need some more
> changes?
This is exactly the question - can pdirops and the VFS interface meeting
in the middle? What changes are required to each to make that work?
I'd like to dig into that question, but not for a while yet.
> Unfortunately, even with the pdirops patch applied, ext4
> does not use this locking directly from the VFS methods, since
> there was no reason to add overhead to the normal interface. The
> htree locks are only supplied when calling into the code directly
> from Lustre.
Ideally, locking is cheap when there is no conflict - that is certainly
the aim of RCU and seq-locks. Hopefully we can add the locking without
introducing noticeable cost.
Thanks,
NeilBrown
>
> Cheers, Andreas
>
>> From 827c01aee1cb74b72e5dbb2f40c01666b914bc15 Mon Sep 17 00:00:00 2001
>> From: NeilBrown <neilb at suse.com>
>> Date: Fri, 16 Nov 2018 19:58:53 +1100
>> Subject: [PATCH] VFS: support parallel updates in the one directory.
>>
>> Some filesystems can support parallel modifications to a directory,
>> either because the modification happen on a remote server which does
>> its own locking (e.g. NFS) or because they can internally lock just
>> a part of a directory (e.g. many local filesystems, with a bit of
>> work).
>>
>> To support these, we introduce support for parallel modification:
>> unlink (including rmdir) and create.
>>
>> If a filesystem supports parallel modification in a given directory,
>> it sets S_PAR_UNLINK on the inode for that directory.
>> lookup_open() and the new lookup_hash_modify() (similar to
>> __lookup_hash()) notice the flag and take a shared
>> lock on the directory.
>>
>> Once a dentry for the target name has been obtained,
>> DCACHE_PAR_UPDATE is set on it, waiting if necessary.
>> Once this is set, the thread has exclusive access to the
>> name and can call into the filesystem to perform
>> the required action.
>>
>> Some files do *not* complete the lookup that precedes
>> a create, but leave the dentry d_in_lookup() and unhashed,
>> so often a dentry will have both DCACHE_PAR_LOOKUP and
>> DCACHE_PAR_UPDATE set at the same time. To allow
>> for this, we need the 'wq' that is used when DCACHE_PAR_LOOKUP is
>> cleared, to exist until the creation is complete. We also
>> need to re-initialize it if it might get re-used.
>>
>> As NFS trivially supports parallel unlinks, this patch also adds the
>> flag to all NFS directories.
>>
>> Signed-off-by: NeilBrown <neilb at suse.com>
>> ---
>> fs/dcache.c | 37 ++++++++++
>> fs/namei.c | 189 ++++++++++++++++++++++++++++++++++++++++++-------
>> fs/nfs/dir.c | 2 +-
>> fs/nfs/inode.c | 2 +
>> fs/nfs/unlink.c | 4 +-
>> include/linux/dcache.h | 43 +++++++++++
>> include/linux/fs.h | 1 +
>> 7 files changed, 249 insertions(+), 29 deletions(-)
>>
>> diff --git a/fs/dcache.c b/fs/dcache.c
>> index 2593153471cf..3821ce0bc37f 100644
>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -3030,6 +3030,43 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode)
>> }
>> EXPORT_SYMBOL(d_tmpfile);
>>
>> +/*
>> + * Lock a dentry to unlink or create in an S_PAR_UPDATE directory.
>> + * After a successful return the dentry will not be modified by any other
>> + * thread, and still has the given name and parent.
>> + * On unsuccessful return it is not locked, because it was either unlinked
>> + * or renamed in the mean-time. If it was instantiated or created,
>> + * we still return success, so caller might need to test if the dentry
>> + * is negative or positive.
>> + */
>> +bool d_lock_modify(struct dentry *dentry,
>> + struct dentry *base, const struct qstr *name)
>> +{
>> + bool ret = true;
>> +
>> + spin_lock(&dentry->d_lock);
>> + if (dentry->d_flags & DCACHE_PAR_UPDATE)
>> + ___wait_var_event(&dentry->d_flags,
>> + !(dentry->d_flags & DCACHE_PAR_UPDATE),
>> + TASK_UNINTERRUPTIBLE, 0, 0,
>> + (spin_unlock(&dentry->d_lock),
>> + schedule(),
>> + spin_lock(&dentry->d_lock))
>> + );
>> + if (d_unhashed(dentry) && d_is_positive(dentry))
>> + /* name was unlinked while we waited */
>> + ret = false;
>> + else if (dentry->d_parent != base ||
>> + dentry->d_name.hash != name->hash ||
>> + !d_same_name(dentry, base, name))
>> + /* dentry was renamed - possibly silly-rename */
>> + ret = false;
>> + else
>> + dentry->d_flags |= DCACHE_PAR_UPDATE;
>> + spin_unlock(&dentry->d_lock);
>> + return ret;
>> +}
>> +
>> static __initdata unsigned long dhash_entries;
>> static int __init set_dhash_entries(char *str)
>> {
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 0cab6494978c..ab6ccc03b9f4 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -1551,6 +1551,91 @@ static struct dentry *__lookup_hash(const struct qstr *name,
>> return dentry;
>> }
>>
>> +/*
>> + * Parent directory is not locked. We take either an exclusive
>> + * or shared lock depending on the fs preference, and then
>> + * get the DCACHE_PAR_UPDATE bit.
>> + */
>> +static struct dentry *lookup_hash_modify(const struct qstr *name,
>> + struct dentry *base, unsigned int flags,
>> + wait_queue_head_t *wq)
>> +{
>> + struct dentry *dentry;
>> + struct inode *dir = base->d_inode;
>> + bool shared = (dir->i_flags & S_PAR_UPDATE) && wq;
>> + int err;
>> +
>> + if (shared)
>> + inode_lock_shared_nested(dir, I_MUTEX_PARENT);
>> + else
>> + inode_lock_nested(dir, I_MUTEX_PARENT);
>> +
>> +retry:
>> + dentry = lookup_dcache(name, base, flags);
>> + if (!dentry) {
>> + /* Don't create child dentry for a dead directory. */
>> + err = -ENOENT;
>> + if (unlikely(IS_DEADDIR(dir)))
>> + goto out_err;
>> +
>> + if (shared)
>> + dentry = d_alloc_parallel(base, name, wq);
>> + else
>> + dentry = d_alloc(base, name);
>> +
>> + if (!IS_ERR(dentry) &&
>> + (!shared || d_in_lookup(dentry))) {
>> + struct dentry *old;
>> +
>> + old = dir->i_op->lookup(dir, dentry, flags);
>> + /*
>> + * Note: dentry might still be d_unhashed() and
>> + * d_in_lookup() if the fs will do the lookup
>> + * at 'create' time.
>> + */
>> + if (unlikely(old)) {
>> + d_lookup_done(dentry);
>> + dput(dentry);
>> + dentry = old;
>> + }
>> + }
>> + }
>> + if (IS_ERR(dentry)) {
>> + err = PTR_ERR(dentry);
>> + goto out_err;
>> + }
>> + if (!shared || d_lock_modify(dentry, base, name))
>> + return dentry;
>> +
>> + /* Failed to get lock due to race with unlink or rename */
>> + d_lookup_done(dentry);
>> + init_waitqueue_head(wq);
>> + dput(dentry);
>> + goto retry;
>> +
>> +out_err:
>> + if (shared)
>> + inode_unlock_shared(dir);
>> + else
>> + inode_unlock(dir);
>> + return ERR_PTR(err);
>> +}
>> +
>> +static void lookup_done_modify(struct path *path, struct dentry *dentry,
>> + wait_queue_head_t *wq)
>> +{
>> + struct inode *dir = path->dentry->d_inode;
>> + bool shared = (dir->i_flags & S_PAR_UPDATE) && wq;
>> +
>> + if (shared) {
>> + d_lookup_done(dentry);
>> + d_unlock_modify(dentry);
>> + inode_unlock_shared(dir);
>> + } else {
>> + inode_unlock(dir);
>> + }
>> +}
>> +
>> static int lookup_fast(struct nameidata *nd,
>> struct path *path, struct inode **inode,
>> unsigned *seqp)
>> @@ -3136,6 +3221,7 @@ static int lookup_open(struct nameidata *nd, struct path *path,
>> int error, create_error = 0;
>> umode_t mode = op->mode;
>> DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>> + bool have_par_update = false;
>>
>> if (unlikely(IS_DEADDIR(dir_inode)))
>> return -ENOENT;
>> @@ -3201,10 +3287,22 @@ static int lookup_open(struct nameidata *nd, struct path *path,
>> }
>>
>> if (dir_inode->i_op->atomic_open) {
>> + /* dentry is negative or d_in_lookup(). If this is a shared-lock
>> + * create we need to get DCACHE_PAR_UPDATE to ensure exclusion
>> + */
>> + if ((open_flag & O_CREAT) &&
>> + (dir->d_inode->i_flags & S_PAR_UPDATE)) {
>> + if (!d_lock_create(dentry))
>> + /* already exists, non-atomic open */
>> + goto out_no_open;
>> + have_par_update = true;
>> + }
>> error = atomic_open(nd, dentry, path, file, op, open_flag,
>> mode);
>> if (unlikely(error == -ENOENT) && create_error)
>> error = create_error;
>> + if (have_par_update)
>> + d_unlock_modify(dentry);
>> return error;
>> }
>>
>> @@ -3222,6 +3320,13 @@ static int lookup_open(struct nameidata *nd, struct path *path,
>> dentry = res;
>> }
>> }
>> + /* If dentry is negative and this is a shared-lock
>> + * create we need to get DCACHE_PAR_UPDATE to ensure exclusion
>> + */
>> + if ((open_flag & O_CREAT) &&
>> + !dentry->d_inode &&
>> + (dir->d_inode->i_flags & S_PAR_UPDATE))
>> + have_par_update = d_lock_create(dentry);
>>
>> /* Negative dentry, just create the file */
>> if (!dentry->d_inode && (open_flag & O_CREAT)) {
>> @@ -3242,11 +3347,15 @@ static int lookup_open(struct nameidata *nd, struct path *path,
>> goto out_dput;
>> }
>> out_no_open:
>> + if (have_par_update)
>> + d_unlock_modify(dentry);
>> path->dentry = dentry;
>> path->mnt = nd->path.mnt;
>> return 0;
>>
>> out_dput:
>> + if (have_par_update)
>> + d_unlock_modify(dentry);
>> dput(dentry);
>> return error;
>> }
>> @@ -3266,6 +3375,7 @@ static int do_last(struct nameidata *nd,
>> struct inode *inode;
>> struct path path;
>> int error;
>> + bool shared;
>>
>> nd->flags &= ~LOOKUP_PARENT;
>> nd->flags |= op->intent;
>> @@ -3317,12 +3427,13 @@ static int do_last(struct nameidata *nd,
>> * dropping this one anyway.
>> */
>> }
>> - if (open_flag & O_CREAT)
>> + shared = !!(dir->d_inode->i_flags & S_PAR_UPDATE);
>> + if ((open_flag & O_CREAT) && !shared)
>> inode_lock(dir->d_inode);
>> else
>> inode_lock_shared(dir->d_inode);
>> error = lookup_open(nd, &path, file, op, got_write);
>> - if (open_flag & O_CREAT)
>> + if ((open_flag & O_CREAT) && !shared)
>> inode_unlock(dir->d_inode);
>> else
>> inode_unlock_shared(dir->d_inode);
>> @@ -3600,7 +3711,8 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
>> }
>>
>> static struct dentry *filename_create(int dfd, struct filename *name,
>> - struct path *path, unsigned int lookup_flags)
>> + struct path *path, unsigned int lookup_flags,
>> + wait_queue_head_t *wq)
>> {
>> struct dentry *dentry = ERR_PTR(-EEXIST);
>> struct qstr last;
>> @@ -3632,8 +3744,7 @@ static struct dentry *filename_create(int dfd, struct filename *name,
>> * Do the final lookup.
>> */
>> lookup_flags |= LOOKUP_CREATE | LOOKUP_EXCL;
>> - inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
>> - dentry = __lookup_hash(&last, path->dentry, lookup_flags);
>> + dentry = lookup_hash_modify(&last, path->dentry, lookup_flags, wq);
>> if (IS_ERR(dentry))
>> goto unlock;
>>
>> @@ -3658,10 +3769,10 @@ static struct dentry *filename_create(int dfd, struct filename *name,
>> putname(name);
>> return dentry;
>> fail:
>> + lookup_done_modify(path, dentry, wq);
>> dput(dentry);
>> dentry = ERR_PTR(error);
>> unlock:
>> - inode_unlock(path->dentry->d_inode);
>> if (!err2)
>> mnt_drop_write(path->mnt);
>> out:
>> @@ -3674,23 +3785,34 @@ struct dentry *kern_path_create(int dfd, const char *pathname,
>> struct path *path, unsigned int lookup_flags)
>> {
>> return filename_create(dfd, getname_kernel(pathname),
>> - path, lookup_flags);
>> + path, lookup_flags, NULL);
>> }
>> EXPORT_SYMBOL(kern_path_create);
>>
>> -void done_path_create(struct path *path, struct dentry *dentry)
>> +void __done_path_create(struct path *path, struct dentry *dentry,
>> + wait_queue_head_t *wq)
>> {
>> + lookup_done_modify(path, dentry, wq);
>> dput(dentry);
>> - inode_unlock(path->dentry->d_inode);
>> mnt_drop_write(path->mnt);
>> path_put(path);
>> }
>> +void done_path_create(struct path *path, struct dentry *dentry)
>> +{
>> + __done_path_create(path, dentry, NULL);
>> +}
>> EXPORT_SYMBOL(done_path_create);
>>
>> +inline struct dentry *user_path_create_wq(int dfd, const char __user *pathname,
>> + struct path *path, unsigned int lookup_flags,
>> + wait_queue_head_t *wq)
>> +{
>> + return filename_create(dfd, getname(pathname), path, lookup_flags, wq);
>> +}
>> inline struct dentry *user_path_create(int dfd, const char __user *pathname,
>> - struct path *path, unsigned int lookup_flags)
>> + struct path *path, unsigned int lookup_flags)
>> {
>> - return filename_create(dfd, getname(pathname), path, lookup_flags);
>> + return filename_create(dfd, getname(pathname), path, lookup_flags, NULL);
>> }
>> EXPORT_SYMBOL(user_path_create);
>>
>> @@ -3747,12 +3869,13 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
>> struct path path;
>> int error;
>> unsigned int lookup_flags = 0;
>> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>>
>> error = may_mknod(mode);
>> if (error)
>> return error;
>> retry:
>> - dentry = user_path_create(dfd, filename, &path, lookup_flags);
>> + dentry = user_path_create_wq(dfd, filename, &path, lookup_flags, &wq);
>> if (IS_ERR(dentry))
>> return PTR_ERR(dentry);
>>
>> @@ -3776,7 +3899,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
>> break;
>> }
>> out:
>> - done_path_create(&path, dentry);
>> + __done_path_create(&path, dentry, &wq);
>> if (retry_estale(error, lookup_flags)) {
>> lookup_flags |= LOOKUP_REVAL;
>> goto retry;
>> @@ -3827,9 +3950,10 @@ long do_mkdirat(int dfd, const char __user *pathname, umode_t mode)
>> struct path path;
>> int error;
>> unsigned int lookup_flags = LOOKUP_DIRECTORY;
>> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>>
>> retry:
>> - dentry = user_path_create(dfd, pathname, &path, lookup_flags);
>> + dentry = user_path_create_wq(dfd, pathname, &path, lookup_flags, &wq);
>> if (IS_ERR(dentry))
>> return PTR_ERR(dentry);
>>
>> @@ -3838,7 +3962,7 @@ long do_mkdirat(int dfd, const char __user *pathname, umode_t mode)
>> error = security_path_mkdir(&path, dentry, mode);
>> if (!error)
>> error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
>> - done_path_create(&path, dentry);
>> + __done_path_create(&path, dentry, &wq);
>> if (retry_estale(error, lookup_flags)) {
>> lookup_flags |= LOOKUP_REVAL;
>> goto retry;
>> @@ -3904,6 +4028,7 @@ long do_rmdir(int dfd, const char __user *pathname)
>> struct qstr last;
>> int type;
>> unsigned int lookup_flags = 0;
>> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>> retry:
>> name = filename_parentat(dfd, getname(pathname), lookup_flags,
>> &path, &last, &type);
>> @@ -3926,8 +4051,7 @@ long do_rmdir(int dfd, const char __user *pathname)
>> if (error)
>> goto exit1;
>>
>> - inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
>> - dentry = __lookup_hash(&last, path.dentry, lookup_flags);
>> + dentry = lookup_hash_modify(&last, path.dentry, lookup_flags, &wq);
>> error = PTR_ERR(dentry);
>> if (IS_ERR(dentry))
>> goto exit2;
>> @@ -3940,9 +4064,9 @@ long do_rmdir(int dfd, const char __user *pathname)
>> goto exit3;
>> error = vfs_rmdir(path.dentry->d_inode, dentry);
>> exit3:
>> + lookup_done_modify(&path, dentry, &wq);
>> dput(dentry);
>> exit2:
>> - inode_unlock(path.dentry->d_inode);
>> mnt_drop_write(path.mnt);
>> exit1:
>> path_put(&path);
>> @@ -3965,7 +4089,8 @@ SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
>> * @dentry: victim
>> * @delegated_inode: returns victim inode, if the inode is delegated.
>> *
>> - * The caller must hold dir->i_mutex.
>> + * The caller must either hold a write-lock on dir->i_rwsem, or
>> + * a readlock having atomically cleared DCACHE_PAR_UNLINK.
>> *
>> * If vfs_unlink discovers a delegation, it will return -EWOULDBLOCK and
>> * return a reference to the inode in delegated_inode. The caller
>> @@ -4022,6 +4147,11 @@ EXPORT_SYMBOL(vfs_unlink);
>> * directory's i_mutex. Truncate can take a long time if there is a lot of
>> * writeout happening, and we don't want to prevent access to the directory
>> * while waiting on the I/O.
>> + * If both the directory and the target dentry have DCACHE_PAR_UNLINK set,
>> + * we do the unlink with a shared lock on the directory while clearing
>> + * DCACHE_PAR_UNLINK on the target. IF the flags is not set, then either
>> + * the filesystem doesn't support parallel unlinks, or we are racing with
>> + * another thread unlinking the same name.
>> */
>> long do_unlinkat(int dfd, struct filename *name)
>> {
>> @@ -4033,6 +4163,7 @@ long do_unlinkat(int dfd, struct filename *name)
>> struct inode *inode = NULL;
>> struct inode *delegated_inode = NULL;
>> unsigned int lookup_flags = 0;
>> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>> retry:
>> name = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
>> if (IS_ERR(name))
>> @@ -4045,9 +4176,9 @@ long do_unlinkat(int dfd, struct filename *name)
>> error = mnt_want_write(path.mnt);
>> if (error)
>> goto exit1;
>> +
>> retry_deleg:
>> - inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
>> - dentry = __lookup_hash(&last, path.dentry, lookup_flags);
>> + dentry = lookup_hash_modify(&last, path.dentry, lookup_flags, &wq);
>> error = PTR_ERR(dentry);
>> if (!IS_ERR(dentry)) {
>> /* Why not before? Because we want correct error value */
>> @@ -4062,14 +4193,15 @@ long do_unlinkat(int dfd, struct filename *name)
>> goto exit2;
>> error = vfs_unlink(path.dentry->d_inode, dentry, &delegated_inode);
>> exit2:
>> + lookup_done_modify(&path, dentry, &wq);
>> dput(dentry);
>> }
>> - inode_unlock(path.dentry->d_inode);
>> if (inode)
>> iput(inode); /* truncate the inode here */
>> inode = NULL;
>> if (delegated_inode) {
>> error = break_deleg_wait(&delegated_inode);
>> + init_waitqueue_head(&wq);
>> if (!error)
>> goto retry_deleg;
>> }
>> @@ -4079,6 +4211,7 @@ long do_unlinkat(int dfd, struct filename *name)
>> if (retry_estale(error, lookup_flags)) {
>> lookup_flags |= LOOKUP_REVAL;
>> inode = NULL;
>> + init_waitqueue_head(&wq);
>> goto retry;
>> }
>> putname(name);
>> @@ -4139,12 +4272,13 @@ long do_symlinkat(const char __user *oldname, int newdfd,
>> struct dentry *dentry;
>> struct path path;
>> unsigned int lookup_flags = 0;
>> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>>
>> from = getname(oldname);
>> if (IS_ERR(from))
>> return PTR_ERR(from);
>> retry:
>> - dentry = user_path_create(newdfd, newname, &path, lookup_flags);
>> + dentry = user_path_create_wq(newdfd, newname, &path, lookup_flags, &wq);
>> error = PTR_ERR(dentry);
>> if (IS_ERR(dentry))
>> goto out_putname;
>> @@ -4152,7 +4286,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
>> error = security_path_symlink(&path, dentry, from->name);
>> if (!error)
>> error = vfs_symlink(path.dentry->d_inode, dentry, from->name);
>> - done_path_create(&path, dentry);
>> + __done_path_create(&path, dentry, &wq);
>> if (retry_estale(error, lookup_flags)) {
>> lookup_flags |= LOOKUP_REVAL;
>> goto retry;
>> @@ -4270,6 +4404,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
>> struct inode *delegated_inode = NULL;
>> int how = 0;
>> int error;
>> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>>
>> if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0)
>> return -EINVAL;
>> @@ -4291,8 +4426,8 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
>> if (error)
>> return error;
>>
>> - new_dentry = user_path_create(newdfd, newname, &new_path,
>> - (how & LOOKUP_REVAL));
>> + new_dentry = user_path_create_wq(newdfd, newname, &new_path,
>> + (how & LOOKUP_REVAL), &wq);
>> error = PTR_ERR(new_dentry);
>> if (IS_ERR(new_dentry))
>> goto out;
>> @@ -4308,7 +4443,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
>> goto out_dput;
>> error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
>> out_dput:
>> - done_path_create(&new_path, new_dentry);
>> + __done_path_create(&new_path, new_dentry, &wq);
>> if (delegated_inode) {
>> error = break_deleg_wait(&delegated_inode);
>> if (!error) {
>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>> index 71b2e390becf..9a3c51f3040b 100644
>> --- a/fs/nfs/dir.c
>> +++ b/fs/nfs/dir.c
>> @@ -57,7 +57,7 @@ static void nfs_readdir_clear_array(struct page*);
>> const struct file_operations nfs_dir_operations = {
>> .llseek = nfs_llseek_dir,
>> .read = generic_read_dir,
>> - .iterate = nfs_readdir,
>> + .iterate_shared = nfs_readdir,
>> .open = nfs_opendir,
>> .release = nfs_closedir,
>> .fsync = nfs_fsync_dir,
>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>> index 5b1eee4952b7..9a6dac08cc79 100644
>> --- a/fs/nfs/inode.c
>> +++ b/fs/nfs/inode.c
>> @@ -453,6 +453,8 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr, st
>>
>> /* We can't support update_atime(), since the server will reset it */
>> inode->i_flags |= S_NOATIME|S_NOCMTIME;
>> + /* Parallel updates to directory are trivial */
>> + inode->i_flags |= S_PAR_UPDATE;
>> inode->i_mode = fattr->mode;
>> nfsi->cache_validity = 0;
>> if ((fattr->valid & NFS_ATTR_FATTR_MODE) == 0
>> diff --git a/fs/nfs/unlink.c b/fs/nfs/unlink.c
>> index fd61bf0fce63..2ac7a7be10e6 100644
>> --- a/fs/nfs/unlink.c
>> +++ b/fs/nfs/unlink.c
>> @@ -467,6 +467,7 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
>> sdentry = NULL;
>> do {
>> int slen;
>> + d_unlock_modify(sdentry);
>> dput(sdentry);
>> sillycounter++;
>> slen = scnprintf(silly, sizeof(silly),
>> @@ -484,7 +485,7 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
>> */
>> if (IS_ERR(sdentry))
>> goto out;
>> - } while (d_inode(sdentry) != NULL); /* need negative lookup */
>> + } while (!d_lock_create(sdentry)); /* need negative lookup */
>>
>> ihold(inode);
>>
>> @@ -529,6 +530,7 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
>> rpc_put_task(task);
>> out_dput:
>> iput(inode);
>> + d_unlock_modify(sdentry);
>> dput(sdentry);
>> out:
>> return error;
>> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
>> index ef4b70f64f33..befa52ec4f6b 100644
>> --- a/include/linux/dcache.h
>> +++ b/include/linux/dcache.h
>> @@ -12,7 +12,9 @@
>> #include <linux/rcupdate.h>
>> #include <linux/lockref.h>
>> #include <linux/stringhash.h>
>> +#include <linux/sched.h>
>> #include <linux/wait.h>
>> +#include <linux/wait_bit.h>
>>
>> struct path;
>> struct vfsmount;
>> @@ -216,6 +218,7 @@ struct dentry_operations {
>>
>> #define DCACHE_PAR_LOOKUP 0x10000000 /* being looked up (with parent locked shared) */
>> #define DCACHE_DENTRY_CURSOR 0x20000000
>> +#define DCACHE_PAR_UPDATE 0x40000000 /* Being created or unlinked with shared lock */
>>
>> extern seqlock_t rename_lock;
>>
>> @@ -599,4 +602,44 @@ struct name_snapshot {
>> void take_dentry_name_snapshot(struct name_snapshot *, struct dentry *);
>> void release_dentry_name_snapshot(struct name_snapshot *);
>>
>> +/*
>> + * Lock a dentry in an S_PAR_UPDATE directory prior to creating
>> + * an object with that name. Will fail if the object is created
>> + * or instantiated while we waited.
>> + */
>> +static inline bool d_lock_create(struct dentry *dentry)
>> +{
>> + spin_lock(&dentry->d_lock);
>> + if (dentry->d_flags & DCACHE_PAR_UPDATE)
>> + ___wait_var_event(&dentry->d_flags,
>> + !(dentry->d_flags & DCACHE_PAR_UPDATE),
>> + TASK_UNINTERRUPTIBLE, 0, 0,
>> + (spin_unlock(&dentry->d_lock),
>> + schedule(),
>> + spin_lock(&dentry->d_lock))
>> + );
>> + if (d_inode(dentry) == NULL) {
>> + dentry->d_flags |= DCACHE_PAR_UPDATE;
>> + spin_unlock(&dentry->d_lock);
>> + return true;
>> + } else {
>> + spin_unlock(&dentry->d_lock);
>> + return false;
>> + }
>> +}
>> +bool d_lock_modify(struct dentry *dentry,
>> + struct dentry *base, const struct qstr *name);
>> +
>> +static inline void d_unlock_modify(struct dentry *dentry)
>> +{
>> + if (IS_ERR_OR_NULL(dentry))
>> + return;
>> + if (dentry->d_flags & DCACHE_PAR_UPDATE) {
>> + spin_lock(&dentry->d_lock);
>> + dentry->d_flags &= ~DCACHE_PAR_UPDATE;
>> + spin_unlock(&dentry->d_lock);
>> + wake_up_var(&dentry->d_flags);
>> + }
>> +}
>> +
>> #endif /* __LINUX_DCACHE_H */
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index c95c0807471f..15b1c3223438 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -1947,6 +1947,7 @@ struct super_operations {
>> #define S_DAX 0 /* Make all the DAX code disappear */
>> #endif
>> #define S_ENCRYPTED 16384 /* Encrypted file (using fs/crypto/) */
>> +#define S_PAR_UPDATE 32768 /* Parallel directory operations supported */
>>
>> /*
>> * Note that nosuid etc flags are inode-specific: setting some file-system
>> --
>> 2.14.0.rc0.dirty
>>
>> <signature.asc>_______________________________________________
>> lustre-devel mailing list
>> lustre-devel at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
>
> Cheers, Andreas
> ---
> Andreas Dilger
> CTO Whamcloud
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20181220/9416ecf1/attachment-0001.sig>
More information about the lustre-devel
mailing list