[lustre-devel] [RPC] parallel directory operations for mainline Linux

Wed Dec 19 15:39:11 PST 2018

On Wed, Dec 19 2018, Andreas Dilger wrote:

> On Nov 22, 2018, at 21:44, NeilBrown <neilb at suse.com> wrote:
>> 
>> 
>> One of the remaining features of ldiskfs which is not in ext4fs is
>> parallel directory ops.
>> It would not be possible to get this upstream without VFS support
>> for parallel directory ops.  Lustre doesn't use the VFS interfaces
>> so this lack is not an immediate problem for lustre, but it is a
>> real problem for upstreaming.
>> 
>> This patch (which seems to work in my testing so far, but is probably
>> still buggy) adds VFS support for parallel dir ops - create and remove.
>> I haven't attempted rename - it would be complex for various reasons and
>> while I'm sure it is possible, I'm not sure it is worth the effort.
>> 
>> With this patch a filesystem can indicate that it supports parallel ops
>> by setting a flag on a directory.  The VFS will then get exclusive
>> access to the dentry - instead of the whole directory - when
>> performing the operation.
>> 
>> A filesystem which supports this much have its own locking to ensure
>> that lookup, readdir, create, unlink can all happen in parallel.
>> For NFS this is easy as the server takes care of those details, so
>> this patch also adds parallel-ops support for NFS.
>> For a filesystem like ext4 it would mean adding some locking to
>> the internal data structures.
>> 
>> I've had a bit of a look at the parallel-ops patch for ldiskfs and I
>> think it is over-engineered.  We don't need a new locking primitive.
>> 
>> I suspect I would start by adding a seqlock to each htree node.
>> This allows reads to proceed locklessly when no changes are happening
>> (if they are careful not to get confused by an inconsistent node).
>> A modification would normally find the relevant leaf with a similar
>> lockless walk, then lock the leaf, verify the seq-lock on the parent
>> hasn't changed, and perform the update.
>> In the rarer case when a leaf needs to split or merge something more
>> heavy handed would be needed - possibly lock the whole tree - possibly
>> just lock a higher node.
>> 
>> I don't expect to look at ext4 parallel ops in more detail in the
>> immediate future, and I don't plan to post this upstream until we have
>> credible support in ext4.  So I'm just posting it here now in case
>> anyone else want to explore how to make ext4 work with this.

Thanks for having a look.

>
> Why not start with the existing pdirops patch for ext4, which has been
> in production use for many years already and works very well?  Since
> this doesn't affect the on-disk structures at all, it would be
> possible to improve/replace it in the future if needed.

Yes, that would be a good place to start.  I suspect a lot of work would
be needed to make it upstreamable, but having a known-working base makes
sense.

>
> I had a quick pass through your patch, and it seems reasonable.
> The main question is whether the VFS interface is suitable for
> use with ext4 with (some) pdirops patch, or does it need some more
> changes?

This is exactly the question - can pdirops and the VFS interface meeting
in the middle?  What changes are required to each to make that work?
I'd like to dig into that question, but not for a while yet.

> Unfortunately, even with the pdirops patch applied, ext4
> does not use this locking directly from the VFS methods, since
> there was no reason to add overhead to the normal interface.  The
> htree locks are only supplied when calling into the code directly
> from Lustre.

Ideally, locking is cheap when there is no conflict - that is certainly
the aim of RCU and seq-locks.  Hopefully we can add the locking without
introducing noticeable cost.

Thanks,
NeilBrown

>
> Cheers, Andreas
>
>> From 827c01aee1cb74b72e5dbb2f40c01666b914bc15 Mon Sep 17 00:00:00 2001
>> From: NeilBrown <neilb at suse.com>
>> Date: Fri, 16 Nov 2018 19:58:53 +1100
>> Subject: [PATCH] VFS: support parallel updates in the one directory.
>> 
>> Some filesystems can support parallel modifications to a directory,
>> either because the modification happen on a remote server which does
>> its own locking (e.g. NFS) or because they can internally lock just
>> a part of a directory (e.g. many local filesystems, with a bit of
>> work).
>> 
>> To support these, we introduce support for parallel modification:
>> unlink (including rmdir) and create.
>> 
>> If a filesystem supports parallel modification in a given directory,
>> it sets S_PAR_UNLINK on the inode for that directory.
>> lookup_open() and the new lookup_hash_modify() (similar to
>> __lookup_hash()) notice the flag and take a shared
>> lock on the directory.
>> 
>> Once a dentry for the target name has been obtained,
>> DCACHE_PAR_UPDATE is set on it, waiting if necessary.
>> Once this is set, the thread has exclusive access to the
>> name and can call into the filesystem to perform
>> the required action.
>> 
>> Some files do *not* complete the lookup that precedes
>> a create, but leave the dentry d_in_lookup() and unhashed,
>> so often a dentry will have both DCACHE_PAR_LOOKUP and
>> DCACHE_PAR_UPDATE set at the same time.  To allow
>> for this, we need the 'wq' that is used when DCACHE_PAR_LOOKUP is
>> cleared, to exist until the creation is complete.  We also
>> need to re-initialize it if it might get re-used.
>> 
>> As NFS trivially supports parallel unlinks, this patch also adds the
>> flag to all NFS directories.
>> 
>> Signed-off-by: NeilBrown <neilb at suse.com>
>> ---
>> fs/dcache.c            |  37 ++++++++++
>> fs/namei.c             | 189 ++++++++++++++++++++++++++++++++++++++++++-------
>> fs/nfs/dir.c           |   2 +-
>> fs/nfs/inode.c         |   2 +
>> fs/nfs/unlink.c        |   4 +-
>> include/linux/dcache.h |  43 +++++++++++
>> include/linux/fs.h     |   1 +
>> 7 files changed, 249 insertions(+), 29 deletions(-)
>> 
>> diff --git a/fs/dcache.c b/fs/dcache.c
>> index 2593153471cf..3821ce0bc37f 100644
>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -3030,6 +3030,43 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode)
>> }
>> EXPORT_SYMBOL(d_tmpfile);
>> 
>> +/*
>> + * Lock a dentry to unlink or create in an S_PAR_UPDATE directory.
>> + * After a successful return the dentry will not be modified by any other
>> + * thread, and still has the given name and parent.
>> + * On unsuccessful return it is not locked, because it was either unlinked
>> + * or renamed in the mean-time.  If it was instantiated or created,
>> + * we still return success, so caller might need to test if the dentry
>> + * is negative or positive.
>> + */
>> +bool d_lock_modify(struct dentry *dentry,
>> +		   struct dentry *base, const struct qstr *name)
>> +{
>> +	bool ret = true;
>> +
>> +	spin_lock(&dentry->d_lock);
>> +	if (dentry->d_flags & DCACHE_PAR_UPDATE)
>> +		___wait_var_event(&dentry->d_flags,
>> +				  !(dentry->d_flags & DCACHE_PAR_UPDATE),
>> +				  TASK_UNINTERRUPTIBLE, 0, 0,
>> +				  (spin_unlock(&dentry->d_lock),
>> +				   schedule(),
>> +				   spin_lock(&dentry->d_lock))
>> +			);
>> +	if (d_unhashed(dentry) && d_is_positive(dentry))
>> +		/* name was unlinked while we waited */
>> +		ret = false;
>> +	else if (dentry->d_parent != base ||
>> +		 dentry->d_name.hash != name->hash ||
>> +		 !d_same_name(dentry, base, name))
>> +		/* dentry was renamed - possibly silly-rename */
>> +		ret = false;
>> +	else
>> +		dentry->d_flags |= DCACHE_PAR_UPDATE;
>> +	spin_unlock(&dentry->d_lock);
>> +	return ret;
>> +}
>> +
>> static __initdata unsigned long dhash_entries;
>> static int __init set_dhash_entries(char *str)
>> {
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 0cab6494978c..ab6ccc03b9f4 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -1551,6 +1551,91 @@ static struct dentry *__lookup_hash(const struct qstr *name,
>> 	return dentry;
>> }
>> 
>> +/*
>> + * Parent directory is not locked.  We take either an exclusive
>> + * or shared lock depending on the fs preference, and then
>> + * get the DCACHE_PAR_UPDATE bit.
>> + */
>> +static struct dentry *lookup_hash_modify(const struct qstr *name,
>> +					 struct dentry *base, unsigned int flags,
>> +					 wait_queue_head_t *wq)
>> +{
>> +	struct dentry *dentry;
>> +	struct inode *dir = base->d_inode;
>> +	bool shared = (dir->i_flags & S_PAR_UPDATE) && wq;
>> +	int err;
>> +
>> +	if (shared)
>> +		inode_lock_shared_nested(dir, I_MUTEX_PARENT);
>> +	else
>> +		inode_lock_nested(dir, I_MUTEX_PARENT);
>> +
>> +retry:
>> +	dentry = lookup_dcache(name, base, flags);
>> +	if (!dentry) {
>> +		/* Don't create child dentry for a dead directory. */
>> +		err = -ENOENT;
>> +		if (unlikely(IS_DEADDIR(dir)))
>> +			goto out_err;
>> +
>> +		if (shared)
>> +			dentry = d_alloc_parallel(base, name, wq);
>> +		else
>> +			dentry = d_alloc(base, name);
>> +
>> +		if (!IS_ERR(dentry) &&
>> +		    (!shared || d_in_lookup(dentry))) {
>> +			struct dentry *old;
>> +
>> +			old = dir->i_op->lookup(dir, dentry, flags);
>> +			/*
>> +			 * Note: dentry might still be d_unhashed() and
>> +			 * d_in_lookup() if the fs will do the lookup
>> +			 * at 'create' time.
>> +			 */
>> +			if (unlikely(old)) {
>> +				d_lookup_done(dentry);
>> +				dput(dentry);
>> +				dentry = old;
>> +			}
>> +		}
>> +	}
>> +	if (IS_ERR(dentry)) {
>> +		err = PTR_ERR(dentry);
>> +		goto out_err;
>> +	}
>> +	if (!shared || d_lock_modify(dentry, base, name))
>> +		return dentry;
>> +
>> +	/* Failed to get lock due to race with unlink or rename */
>> +	d_lookup_done(dentry);
>> +	init_waitqueue_head(wq);
>> +	dput(dentry);
>> +	goto retry;
>> +
>> +out_err:
>> +	if (shared)
>> +		inode_unlock_shared(dir);
>> +	else
>> +		inode_unlock(dir);
>> +	return ERR_PTR(err);
>> +}
>> +
>> +static void lookup_done_modify(struct path *path, struct dentry *dentry,
>> +			       wait_queue_head_t *wq)
>> +{
>> +	struct inode *dir = path->dentry->d_inode;
>> +	bool shared = (dir->i_flags & S_PAR_UPDATE) && wq;
>> +
>> +	if (shared) {
>> +		d_lookup_done(dentry);
>> +		d_unlock_modify(dentry);
>> +		inode_unlock_shared(dir);
>> +	} else {
>> +		inode_unlock(dir);
>> +	}
>> +}
>> +
>> static int lookup_fast(struct nameidata *nd,
>> 		       struct path *path, struct inode **inode,
>> 		       unsigned *seqp)
>> @@ -3136,6 +3221,7 @@ static int lookup_open(struct nameidata *nd, struct path *path,
>> 	int error, create_error = 0;
>> 	umode_t mode = op->mode;
>> 	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>> +	bool have_par_update = false;
>> 
>> 	if (unlikely(IS_DEADDIR(dir_inode)))
>> 		return -ENOENT;
>> @@ -3201,10 +3287,22 @@ static int lookup_open(struct nameidata *nd, struct path *path,
>> 	}
>> 
>> 	if (dir_inode->i_op->atomic_open) {
>> +		/* dentry is negative or d_in_lookup().  If this is a shared-lock
>> +		 * create we need to get DCACHE_PAR_UPDATE to ensure exclusion
>> +		 */
>> +		if ((open_flag & O_CREAT) &&
>> +		    (dir->d_inode->i_flags & S_PAR_UPDATE)) {
>> +			if (!d_lock_create(dentry))
>> +				/* already exists, non-atomic open */
>> +				goto out_no_open;
>> +			have_par_update = true;
>> +		}
>> 		error = atomic_open(nd, dentry, path, file, op, open_flag,
>> 				    mode);
>> 		if (unlikely(error == -ENOENT) && create_error)
>> 			error = create_error;
>> +		if (have_par_update)
>> +			d_unlock_modify(dentry);
>> 		return error;
>> 	}
>> 
>> @@ -3222,6 +3320,13 @@ static int lookup_open(struct nameidata *nd, struct path *path,
>> 			dentry = res;
>> 		}
>> 	}
>> +	/* If dentry is negative and this is a shared-lock
>> +	 * create we need to get DCACHE_PAR_UPDATE to ensure exclusion
>> +	 */
>> +	if ((open_flag & O_CREAT) &&
>> +	    !dentry->d_inode &&
>> +	    (dir->d_inode->i_flags & S_PAR_UPDATE))
>> +		have_par_update = d_lock_create(dentry);
>> 
>> 	/* Negative dentry, just create the file */
>> 	if (!dentry->d_inode && (open_flag & O_CREAT)) {
>> @@ -3242,11 +3347,15 @@ static int lookup_open(struct nameidata *nd, struct path *path,
>> 		goto out_dput;
>> 	}
>> out_no_open:
>> +	if (have_par_update)
>> +		d_unlock_modify(dentry);
>> 	path->dentry = dentry;
>> 	path->mnt = nd->path.mnt;
>> 	return 0;
>> 
>> out_dput:
>> +	if (have_par_update)
>> +		d_unlock_modify(dentry);
>> 	dput(dentry);
>> 	return error;
>> }
>> @@ -3266,6 +3375,7 @@ static int do_last(struct nameidata *nd,
>> 	struct inode *inode;
>> 	struct path path;
>> 	int error;
>> +	bool shared;
>> 
>> 	nd->flags &= ~LOOKUP_PARENT;
>> 	nd->flags |= op->intent;
>> @@ -3317,12 +3427,13 @@ static int do_last(struct nameidata *nd,
>> 		 * dropping this one anyway.
>> 		 */
>> 	}
>> -	if (open_flag & O_CREAT)
>> +	shared = !!(dir->d_inode->i_flags & S_PAR_UPDATE);
>> +	if ((open_flag & O_CREAT) && !shared)
>> 		inode_lock(dir->d_inode);
>> 	else
>> 		inode_lock_shared(dir->d_inode);
>> 	error = lookup_open(nd, &path, file, op, got_write);
>> -	if (open_flag & O_CREAT)
>> +	if ((open_flag & O_CREAT) && !shared)
>> 		inode_unlock(dir->d_inode);
>> 	else
>> 		inode_unlock_shared(dir->d_inode);
>> @@ -3600,7 +3711,8 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
>> }
>> 
>> static struct dentry *filename_create(int dfd, struct filename *name,
>> -				struct path *path, unsigned int lookup_flags)
>> +				      struct path *path, unsigned int lookup_flags,
>> +				      wait_queue_head_t *wq)
>> {
>> 	struct dentry *dentry = ERR_PTR(-EEXIST);
>> 	struct qstr last;
>> @@ -3632,8 +3744,7 @@ static struct dentry *filename_create(int dfd, struct filename *name,
>> 	 * Do the final lookup.
>> 	 */
>> 	lookup_flags |= LOOKUP_CREATE | LOOKUP_EXCL;
>> -	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
>> -	dentry = __lookup_hash(&last, path->dentry, lookup_flags);
>> +	dentry = lookup_hash_modify(&last, path->dentry, lookup_flags, wq);
>> 	if (IS_ERR(dentry))
>> 		goto unlock;
>> 
>> @@ -3658,10 +3769,10 @@ static struct dentry *filename_create(int dfd, struct filename *name,
>> 	putname(name);
>> 	return dentry;
>> fail:
>> +	lookup_done_modify(path, dentry, wq);
>> 	dput(dentry);
>> 	dentry = ERR_PTR(error);
>> unlock:
>> -	inode_unlock(path->dentry->d_inode);
>> 	if (!err2)
>> 		mnt_drop_write(path->mnt);
>> out:
>> @@ -3674,23 +3785,34 @@ struct dentry *kern_path_create(int dfd, const char *pathname,
>> 				struct path *path, unsigned int lookup_flags)
>> {
>> 	return filename_create(dfd, getname_kernel(pathname),
>> -				path, lookup_flags);
>> +			       path, lookup_flags, NULL);
>> }
>> EXPORT_SYMBOL(kern_path_create);
>> 
>> -void done_path_create(struct path *path, struct dentry *dentry)
>> +void __done_path_create(struct path *path, struct dentry *dentry,
>> +			wait_queue_head_t *wq)
>> {
>> +	lookup_done_modify(path, dentry, wq);
>> 	dput(dentry);
>> -	inode_unlock(path->dentry->d_inode);
>> 	mnt_drop_write(path->mnt);
>> 	path_put(path);
>> }
>> +void done_path_create(struct path *path, struct dentry *dentry)
>> +{
>> +	__done_path_create(path, dentry, NULL);
>> +}
>> EXPORT_SYMBOL(done_path_create);
>> 
>> +inline struct dentry *user_path_create_wq(int dfd, const char __user *pathname,
>> +					  struct path *path, unsigned int lookup_flags,
>> +					  wait_queue_head_t *wq)
>> +{
>> +	return filename_create(dfd, getname(pathname), path, lookup_flags, wq);
>> +}
>> inline struct dentry *user_path_create(int dfd, const char __user *pathname,
>> -				struct path *path, unsigned int lookup_flags)
>> +				       struct path *path, unsigned int lookup_flags)
>> {
>> -	return filename_create(dfd, getname(pathname), path, lookup_flags);
>> +	return filename_create(dfd, getname(pathname), path, lookup_flags, NULL);
>> }
>> EXPORT_SYMBOL(user_path_create);
>> 
>> @@ -3747,12 +3869,13 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
>> 	struct path path;
>> 	int error;
>> 	unsigned int lookup_flags = 0;
>> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>> 
>> 	error = may_mknod(mode);
>> 	if (error)
>> 		return error;
>> retry:
>> -	dentry = user_path_create(dfd, filename, &path, lookup_flags);
>> +	dentry = user_path_create_wq(dfd, filename, &path, lookup_flags, &wq);
>> 	if (IS_ERR(dentry))
>> 		return PTR_ERR(dentry);
>> 
>> @@ -3776,7 +3899,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
>> 			break;
>> 	}
>> out:
>> -	done_path_create(&path, dentry);
>> +	__done_path_create(&path, dentry, &wq);
>> 	if (retry_estale(error, lookup_flags)) {
>> 		lookup_flags |= LOOKUP_REVAL;
>> 		goto retry;
>> @@ -3827,9 +3950,10 @@ long do_mkdirat(int dfd, const char __user *pathname, umode_t mode)
>> 	struct path path;
>> 	int error;
>> 	unsigned int lookup_flags = LOOKUP_DIRECTORY;
>> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>> 
>> retry:
>> -	dentry = user_path_create(dfd, pathname, &path, lookup_flags);
>> +	dentry = user_path_create_wq(dfd, pathname, &path, lookup_flags, &wq);
>> 	if (IS_ERR(dentry))
>> 		return PTR_ERR(dentry);
>> 
>> @@ -3838,7 +3962,7 @@ long do_mkdirat(int dfd, const char __user *pathname, umode_t mode)
>> 	error = security_path_mkdir(&path, dentry, mode);
>> 	if (!error)
>> 		error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
>> -	done_path_create(&path, dentry);
>> +	__done_path_create(&path, dentry, &wq);
>> 	if (retry_estale(error, lookup_flags)) {
>> 		lookup_flags |= LOOKUP_REVAL;
>> 		goto retry;
>> @@ -3904,6 +4028,7 @@ long do_rmdir(int dfd, const char __user *pathname)
>> 	struct qstr last;
>> 	int type;
>> 	unsigned int lookup_flags = 0;
>> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>> retry:
>> 	name = filename_parentat(dfd, getname(pathname), lookup_flags,
>> 				&path, &last, &type);
>> @@ -3926,8 +4051,7 @@ long do_rmdir(int dfd, const char __user *pathname)
>> 	if (error)
>> 		goto exit1;
>> 
>> -	inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
>> -	dentry = __lookup_hash(&last, path.dentry, lookup_flags);
>> +	dentry = lookup_hash_modify(&last, path.dentry, lookup_flags, &wq);
>> 	error = PTR_ERR(dentry);
>> 	if (IS_ERR(dentry))
>> 		goto exit2;
>> @@ -3940,9 +4064,9 @@ long do_rmdir(int dfd, const char __user *pathname)
>> 		goto exit3;
>> 	error = vfs_rmdir(path.dentry->d_inode, dentry);
>> exit3:
>> +	lookup_done_modify(&path, dentry, &wq);
>> 	dput(dentry);
>> exit2:
>> -	inode_unlock(path.dentry->d_inode);
>> 	mnt_drop_write(path.mnt);
>> exit1:
>> 	path_put(&path);
>> @@ -3965,7 +4089,8 @@ SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
>>  * @dentry:	victim
>>  * @delegated_inode: returns victim inode, if the inode is delegated.
>>  *
>> - * The caller must hold dir->i_mutex.
>> + * The caller must either hold a write-lock on dir->i_rwsem, or
>> + * a readlock having atomically cleared DCACHE_PAR_UNLINK.
>>  *
>>  * If vfs_unlink discovers a delegation, it will return -EWOULDBLOCK and
>>  * return a reference to the inode in delegated_inode.  The caller
>> @@ -4022,6 +4147,11 @@ EXPORT_SYMBOL(vfs_unlink);
>>  * directory's i_mutex.  Truncate can take a long time if there is a lot of
>>  * writeout happening, and we don't want to prevent access to the directory
>>  * while waiting on the I/O.
>> + * If both the directory and the target dentry have DCACHE_PAR_UNLINK set,
>> + * we do the unlink with a shared lock on the directory while clearing
>> + * DCACHE_PAR_UNLINK on the target.  IF the flags is not set, then either
>> + * the filesystem doesn't support parallel unlinks, or we are racing with
>> + * another thread unlinking the same name.
>>  */
>> long do_unlinkat(int dfd, struct filename *name)
>> {
>> @@ -4033,6 +4163,7 @@ long do_unlinkat(int dfd, struct filename *name)
>> 	struct inode *inode = NULL;
>> 	struct inode *delegated_inode = NULL;
>> 	unsigned int lookup_flags = 0;
>> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>> retry:
>> 	name = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
>> 	if (IS_ERR(name))
>> @@ -4045,9 +4176,9 @@ long do_unlinkat(int dfd, struct filename *name)
>> 	error = mnt_want_write(path.mnt);
>> 	if (error)
>> 		goto exit1;
>> +
>> retry_deleg:
>> -	inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
>> -	dentry = __lookup_hash(&last, path.dentry, lookup_flags);
>> +	dentry = lookup_hash_modify(&last, path.dentry, lookup_flags, &wq);
>> 	error = PTR_ERR(dentry);
>> 	if (!IS_ERR(dentry)) {
>> 		/* Why not before? Because we want correct error value */
>> @@ -4062,14 +4193,15 @@ long do_unlinkat(int dfd, struct filename *name)
>> 			goto exit2;
>> 		error = vfs_unlink(path.dentry->d_inode, dentry, &delegated_inode);
>> exit2:
>> +		lookup_done_modify(&path, dentry, &wq);
>> 		dput(dentry);
>> 	}
>> -	inode_unlock(path.dentry->d_inode);
>> 	if (inode)
>> 		iput(inode);	/* truncate the inode here */
>> 	inode = NULL;
>> 	if (delegated_inode) {
>> 		error = break_deleg_wait(&delegated_inode);
>> +		init_waitqueue_head(&wq);
>> 		if (!error)
>> 			goto retry_deleg;
>> 	}
>> @@ -4079,6 +4211,7 @@ long do_unlinkat(int dfd, struct filename *name)
>> 	if (retry_estale(error, lookup_flags)) {
>> 		lookup_flags |= LOOKUP_REVAL;
>> 		inode = NULL;
>> +		init_waitqueue_head(&wq);
>> 		goto retry;
>> 	}
>> 	putname(name);
>> @@ -4139,12 +4272,13 @@ long do_symlinkat(const char __user *oldname, int newdfd,
>> 	struct dentry *dentry;
>> 	struct path path;
>> 	unsigned int lookup_flags = 0;
>> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>> 
>> 	from = getname(oldname);
>> 	if (IS_ERR(from))
>> 		return PTR_ERR(from);
>> retry:
>> -	dentry = user_path_create(newdfd, newname, &path, lookup_flags);
>> +	dentry = user_path_create_wq(newdfd, newname, &path, lookup_flags, &wq);
>> 	error = PTR_ERR(dentry);
>> 	if (IS_ERR(dentry))
>> 		goto out_putname;
>> @@ -4152,7 +4286,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
>> 	error = security_path_symlink(&path, dentry, from->name);
>> 	if (!error)
>> 		error = vfs_symlink(path.dentry->d_inode, dentry, from->name);
>> -	done_path_create(&path, dentry);
>> +	__done_path_create(&path, dentry, &wq);
>> 	if (retry_estale(error, lookup_flags)) {
>> 		lookup_flags |= LOOKUP_REVAL;
>> 		goto retry;
>> @@ -4270,6 +4404,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
>> 	struct inode *delegated_inode = NULL;
>> 	int how = 0;
>> 	int error;
>> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>> 
>> 	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0)
>> 		return -EINVAL;
>> @@ -4291,8 +4426,8 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
>> 	if (error)
>> 		return error;
>> 
>> -	new_dentry = user_path_create(newdfd, newname, &new_path,
>> -					(how & LOOKUP_REVAL));
>> +	new_dentry = user_path_create_wq(newdfd, newname, &new_path,
>> +				      (how & LOOKUP_REVAL), &wq);
>> 	error = PTR_ERR(new_dentry);
>> 	if (IS_ERR(new_dentry))
>> 		goto out;
>> @@ -4308,7 +4443,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
>> 		goto out_dput;
>> 	error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
>> out_dput:
>> -	done_path_create(&new_path, new_dentry);
>> +	__done_path_create(&new_path, new_dentry, &wq);
>> 	if (delegated_inode) {
>> 		error = break_deleg_wait(&delegated_inode);
>> 		if (!error) {
>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>> index 71b2e390becf..9a3c51f3040b 100644
>> --- a/fs/nfs/dir.c
>> +++ b/fs/nfs/dir.c
>> @@ -57,7 +57,7 @@ static void nfs_readdir_clear_array(struct page*);
>> const struct file_operations nfs_dir_operations = {
>> 	.llseek		= nfs_llseek_dir,
>> 	.read		= generic_read_dir,
>> -	.iterate	= nfs_readdir,
>> +	.iterate_shared	= nfs_readdir,
>> 	.open		= nfs_opendir,
>> 	.release	= nfs_closedir,
>> 	.fsync		= nfs_fsync_dir,
>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>> index 5b1eee4952b7..9a6dac08cc79 100644
>> --- a/fs/nfs/inode.c
>> +++ b/fs/nfs/inode.c
>> @@ -453,6 +453,8 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr, st
>> 
>> 		/* We can't support update_atime(), since the server will reset it */
>> 		inode->i_flags |= S_NOATIME|S_NOCMTIME;
>> +		/* Parallel updates to directory are trivial */
>> +		inode->i_flags |= S_PAR_UPDATE;
>> 		inode->i_mode = fattr->mode;
>> 		nfsi->cache_validity = 0;
>> 		if ((fattr->valid & NFS_ATTR_FATTR_MODE) == 0
>> diff --git a/fs/nfs/unlink.c b/fs/nfs/unlink.c
>> index fd61bf0fce63..2ac7a7be10e6 100644
>> --- a/fs/nfs/unlink.c
>> +++ b/fs/nfs/unlink.c
>> @@ -467,6 +467,7 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
>> 	sdentry = NULL;
>> 	do {
>> 		int slen;
>> +		d_unlock_modify(sdentry);
>> 		dput(sdentry);
>> 		sillycounter++;
>> 		slen = scnprintf(silly, sizeof(silly),
>> @@ -484,7 +485,7 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
>> 		 */
>> 		if (IS_ERR(sdentry))
>> 			goto out;
>> -	} while (d_inode(sdentry) != NULL); /* need negative lookup */
>> +	} while (!d_lock_create(sdentry)); /* need negative lookup */
>> 
>> 	ihold(inode);
>> 
>> @@ -529,6 +530,7 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
>> 	rpc_put_task(task);
>> out_dput:
>> 	iput(inode);
>> +	d_unlock_modify(sdentry);
>> 	dput(sdentry);
>> out:
>> 	return error;
>> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
>> index ef4b70f64f33..befa52ec4f6b 100644
>> --- a/include/linux/dcache.h
>> +++ b/include/linux/dcache.h
>> @@ -12,7 +12,9 @@
>> #include <linux/rcupdate.h>
>> #include <linux/lockref.h>
>> #include <linux/stringhash.h>
>> +#include <linux/sched.h>
>> #include <linux/wait.h>
>> +#include <linux/wait_bit.h>
>> 
>> struct path;
>> struct vfsmount;
>> @@ -216,6 +218,7 @@ struct dentry_operations {
>> 
>> #define DCACHE_PAR_LOOKUP		0x10000000 /* being looked up (with parent locked shared) */
>> #define DCACHE_DENTRY_CURSOR		0x20000000
>> +#define DCACHE_PAR_UPDATE		0x40000000 /* Being created or unlinked with shared lock */
>> 
>> extern seqlock_t rename_lock;
>> 
>> @@ -599,4 +602,44 @@ struct name_snapshot {
>> void take_dentry_name_snapshot(struct name_snapshot *, struct dentry *);
>> void release_dentry_name_snapshot(struct name_snapshot *);
>> 
>> +/*
>> + * Lock a dentry in an S_PAR_UPDATE directory prior to creating
>> + * an object with that name.  Will fail if the object is created
>> + * or instantiated while we waited.
>> + */
>> +static inline bool d_lock_create(struct dentry *dentry)
>> +{
>> +	spin_lock(&dentry->d_lock);
>> +	if (dentry->d_flags & DCACHE_PAR_UPDATE)
>> +		___wait_var_event(&dentry->d_flags,
>> +				  !(dentry->d_flags & DCACHE_PAR_UPDATE),
>> +				  TASK_UNINTERRUPTIBLE, 0, 0,
>> +				  (spin_unlock(&dentry->d_lock),
>> +				   schedule(),
>> +				   spin_lock(&dentry->d_lock))
>> +			);
>> +	if (d_inode(dentry) == NULL) {
>> +		dentry->d_flags |= DCACHE_PAR_UPDATE;
>> +		spin_unlock(&dentry->d_lock);
>> +		return true;
>> +	} else {
>> +		spin_unlock(&dentry->d_lock);
>> +		return false;
>> +	}
>> +}
>> +bool d_lock_modify(struct dentry *dentry,
>> +		   struct dentry *base, const struct qstr *name);
>> +
>> +static inline void d_unlock_modify(struct dentry *dentry)
>> +{
>> +	if (IS_ERR_OR_NULL(dentry))
>> +		return;
>> +	if (dentry->d_flags & DCACHE_PAR_UPDATE) {
>> +		spin_lock(&dentry->d_lock);
>> +		dentry->d_flags &= ~DCACHE_PAR_UPDATE;
>> +		spin_unlock(&dentry->d_lock);
>> +		wake_up_var(&dentry->d_flags);
>> +	}
>> +}
>> +
>> #endif	/* __LINUX_DCACHE_H */
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index c95c0807471f..15b1c3223438 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -1947,6 +1947,7 @@ struct super_operations {
>> #define S_DAX		0	/* Make all the DAX code disappear */
>> #endif
>> #define S_ENCRYPTED	16384	/* Encrypted file (using fs/crypto/) */
>> +#define S_PAR_UPDATE	32768	/* Parallel directory operations supported */
>> 
>> /*
>>  * Note that nosuid etc flags are inode-specific: setting some file-system
>> -- 
>> 2.14.0.rc0.dirty
>> 
>> <signature.asc>_______________________________________________
>> lustre-devel mailing list
>> lustre-devel at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
>
> Cheers, Andreas
> ---
> Andreas Dilger
> CTO Whamcloud
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20181220/9416ecf1/attachment-0001.sig>