[lustre-devel] [PATCH 16/28] lustre: statahead: missing barrier before wake_up
James Simmons
jsimmons at infradead.org
Sun Nov 4 12:52:17 PST 2018
> On Sun, Oct 21 2018, James Simmons wrote:
>
> >> On Sun, Oct 14 2018, James Simmons wrote:
> >>
> >> > From: Lai Siyao <lai.siyao at whamcloud.com>
> >> >
> >> > A barrier is missing before wake_up() in ll_statahead_interpret(),
> >> > which may cause 'ls' hang. Under the right conditions a basic 'ls'
> >> > can fail. The debug logs show:
> >> >
> >> > statahead.c:683:ll_statahead_interpret()) sa_entry software rc -13
> >> > statahead.c:1666:ll_statahead()) revalidate statahead software: -11.
> >> >
> >> > Obviously statahead failure didn't notify 'ls' process in time.
> >> > The mi_cbdata can be stale so add a barrier before calling
> >> > wake_up().
> >> >
> >> > Signed-off-by: Lai Siyao <lai.siyao at whamcloud.com>
> >> > Signed-off-by: Bob Glossman <bob.glossman at intel.com>
> >> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9210
> >> > Reviewed-on: https://review.whamcloud.com/27330
> >> > Reviewed-by: Nathaniel Clark <nclark at whamcloud.com>
> >> > Reviewed-by: Oleg Drokin <green at whamcloud.com>
> >> > Signed-off-by: James Simmons <jsimmons at infradead.org>
> >> > ---
> >> > drivers/staging/lustre/lustre/llite/statahead.c | 8 +++++++-
> >> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/drivers/staging/lustre/lustre/llite/statahead.c b/drivers/staging/lustre/lustre/llite/statahead.c
> >> > index 1ad308c..0174a4c 100644
> >> > --- a/drivers/staging/lustre/lustre/llite/statahead.c
> >> > +++ b/drivers/staging/lustre/lustre/llite/statahead.c
> >> > @@ -680,8 +680,14 @@ static int ll_statahead_interpret(struct ptlrpc_request *req,
> >> >
> >> > spin_lock(&lli->lli_sa_lock);
> >> > if (rc) {
> >> > - if (__sa_make_ready(sai, entry, rc))
> >> > + if (__sa_make_ready(sai, entry, rc)) {
> >> > + /* LU-9210 : Under the right conditions even 'ls'
> >> > + * can cause the statahead to fail. Using a memory
> >> > + * barrier resolves this issue.
> >> > + */
> >> > + smp_mb();
> >> > wake_up(&sai->sai_waitq);
> >> > + }
> >> > } else {
> >> > int first = 0;
> >> > entry->se_minfo = minfo;
> >> > --
> >> > 1.8.3.1
> >>
> >> Again, this is a fairly lame comment to justify the smp_mb().
> >> It appears to me that the issue is most likely the value of
> >> entry->se_state.
> >> __sa_make_ready() sets this and revalidate_statahead_dentry tests it
> >> after waiting on sai_waitq.
> >> So I think it would be best if we changed __sa_make_ready() to
> >>
> >> smp_store_release(&entry->se_state, ret < 0 ? SA_ENTRY_INVA : SA_ENTRY_SUCC)
> >>
> >> and in ll_statahead_interpret() have
> >>
> >> if (smp_load_acquire(&entry->se_state) == SA_ENTRY_SUCC &&
> >> entry->se_inode) {
> >>
> >> This would make it obvious which variable was important, and would show
> >> the paired synchronization points.
> >
> > If you think this is lame you should be the JIRA ticket and the original
> > patch. It had zero info so I attempted to extract what I could out of the
> > ticket. Hopefully Lai can fill in the details. I have no problems fixing
> > this another way. I don't see a way in the ticket to easily reproduce this
> > problem to see the new approach would fix it :-(
>
> I have imposed the version that I think is correct. See below.
I opened a ticket - https://jira.whamcloud.com/browse/LU-11616 and pushed
this to the OpenSFS branch for full testing. The patch is at:
https://review.whamcloud.com/#/c/33571
BTW I did not know about (total store order) TSO platforms and how some
architectures don't support this property. An audit of the smp barriers
might be a good idea for Lustre. For people not aware of TSO a good
article on this is at:
https://lwn.net/Articles/576486
> Thanks,
> NeilBrown
>
> --- a/drivers/staging/lustre/lustre/llite/statahead.c
> +++ b/drivers/staging/lustre/lustre/llite/statahead.c
> @@ -322,7 +322,11 @@ __sa_make_ready(struct ll_statahead_info *sai, struct sa_entry *entry, int ret)
> }
> }
> list_add(&entry->se_list, pos);
> - entry->se_state = ret < 0 ? SA_ENTRY_INVA : SA_ENTRY_SUCC;
> + /*
> + * LU-9210: ll_statahead_interpet must be able to see this before
> + * we wake it up
> + */
> + smp_store_release(&entry->se_state, ret < 0 ? SA_ENTRY_INVA : SA_ENTRY_SUCC);
>
> return (index == sai->sai_index_wait);
> }
> @@ -1390,7 +1394,12 @@ static int revalidate_statahead_dentry(struct inode *dir,
> }
> }
>
> - if (entry->se_state == SA_ENTRY_SUCC && entry->se_inode) {
> + /*
> + * We need to see the value that was set immediately before we
> + * were woken up.
> + */
> + if (smp_load_acquire(&entry->se_state) == SA_ENTRY_SUCC &&
> + entry->se_inode) {
> struct inode *inode = entry->se_inode;
> struct lookup_intent it = { .it_op = IT_GETATTR,
> .it_lock_handle = entry->se_handle };
>
More information about the lustre-devel
mailing list