<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
Yes, I assume it’s a comment about stack usage (or even more out of date and unrelated to the current code entirely. :p)  Like you, I can’t imagine what else it would be...<br>
<br>
You’re probably right about the locking issue, I’m happy to defer.  (In any case, this timed out path should be very far from hot...)<br>
<br>
Otherwise that fix looks good.  I’ll review the rest of these tomorrow.<br>
<br>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> NeilBrown <neilb@suse.com><br>
<b>Sent:</b> Monday, February 12, 2018 6:17:22 PM<br>
<b>To:</b> Patrick Farrell; Oleg Drokin; Andreas Dilger; James Simmons; Greg Kroah-Hartman<br>
<b>Cc:</b> lkml; lustre<br>
<b>Subject:</b> Re: [lustre-devel] [PATCH 08/19] staging: lustre: simplify waiting in ldlm_completion_ast()</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">On Mon, Feb 12 2018, Patrick Farrell wrote:<br>
<br>
> Neil,<br>
><br>
> I didn't get anything after 8/19 in this series.  Is this just me?  (I'd keep waiting, except I also found a few things in this patch.)<br>
<br>
Not just you.  My fault.  They are appearing now.<br>
<br>
><br>
> Minor:<br>
> The line XXX ALLOCATE is out of date and could go.  (It refers to a<br>
> mix of things you eliminated and things that were already gone.)<br>
<br>
What does the line even mean?  Some comment about stack usage?<br>
I think we have a look that looks for large stack frames.  I wonder how<br>
to run it...<br>
<br>
><br>
> Less minor:<br>
> You remove use of the imp_lock when reading the connection count.  While that'll work on x86, it's probably wrong on some architecture to read that without taking the lock...?<br>
<br>
It was my understanding that on all architectures which Linux support, a<br>
32bit aligned read is atomic wrt any 32bit write.  I have trouble imagining how<br>
it could be otherwise.<br>
<br>
I probably should have highlighted the removal of the spinlock in the<br>
patch description though - it was intentional.<br>
<br>
><br>
> Bug:<br>
> The existing code uses the imp_conn_cnt from *before* the wait, rather<br>
> than after.  I think that's quite important.  So you'll want to read<br>
> it out before the wait.  I think the main reason we'd hit the timeout<br>
> is a disconnect, which should cause a reconnect, so it's very<br>
> important to use the value from *before* the wait.  (See comment on<br>
> ptlrpc_set_import_discon for more of an explanation.  Basically it's<br>
> tracking a connection 'epoch', if it's changed, someone else already<br>
> went through the reconnect code for this 'connection epoch' and we<br>
> shouldn't start that process.) <br>
><br>
<br>
That wasn't intentional though - thanks for catching!<br>
Looking at  ptlrpc_set_import_discon(), which is where the number<br>
eventually gets used, it is only used to compare with the new value of<br>
imp->imp_conn_cnt. <br>
<br>
This would fix both (assuming the locking issue needs fixing).<br>
<br>
Thanks,<br>
NeilBrown<br>
<br>
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c<br>
index f1233d844bbd..c3c9186b74ce 100644<br>
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c<br>
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c<br>
@@ -103,7 +103,7 @@ static int ldlm_request_bufsize(int count, int type)<br>
         return sizeof(struct ldlm_request) + avail;<br>
 }<br>
 <br>
-static void ldlm_expired_completion_wait(struct ldlm_lock *lock, struct obd_import *imp2)<br>
+static void ldlm_expired_completion_wait(struct ldlm_lock *lock, __u32 conn_cnt)<br>
 {<br>
         struct obd_import *imp;<br>
         struct obd_device *obd;<br>
@@ -129,7 +129,7 @@ static void ldlm_expired_completion_wait(struct ldlm_lock *lock, struct obd_impo<br>
 <br>
         obd = lock->l_conn_export->exp_obd;<br>
         imp = obd->u.cli.cl_import;<br>
-       ptlrpc_fail_import(imp, imp2 ? imp2->imp_conn_cnt : 0);<br>
+       ptlrpc_fail_import(imp, conn_cnt);<br>
         LDLM_ERROR(lock,<br>
                    "lock timed out (enqueued at %lld, %llds ago), entering recovery for %s@%s",<br>
                    (s64)lock->l_last_activity,<br>
@@ -241,6 +241,7 @@ int ldlm_completion_ast(struct ldlm_lock *lock, __u64 flags, void *data)<br>
         struct obd_device *obd;<br>
         struct obd_import *imp = NULL;<br>
         __u32 timeout;<br>
+       __u32 conn_cnt = 0;<br>
         int rc = 0;<br>
 <br>
         if (flags == LDLM_FL_WAIT_NOREPROC) {<br>
@@ -268,6 +269,11 @@ int ldlm_completion_ast(struct ldlm_lock *lock, __u64 flags, void *data)<br>
 <br>
         lock->l_last_activity = ktime_get_real_seconds();<br>
 <br>
+       if (imp) {<br>
+               spin_lock(&imp->imp_lock);<br>
+               conn_cnt = imp->imp_conn_cnt;<br>
+               spin_unlock(&imp->imp_lock);<br>
+       }<br>
         if (OBD_FAIL_CHECK_RESET(OBD_FAIL_LDLM_INTR_CP_AST,<br>
                                  OBD_FAIL_LDLM_CP_BL_RACE | OBD_FAIL_ONCE)) {<br>
                 ldlm_set_fail_loc(lock);<br>
@@ -280,7 +286,7 @@ int ldlm_completion_ast(struct ldlm_lock *lock, __u64 flags, void *data)<br>
                                                      is_granted_or_cancelled(lock),<br>
                                                      timeout * HZ);<br>
                         if (rc == 0)<br>
-                               ldlm_expired_completion_wait(lock, imp);<br>
+                               ldlm_expired_completion_wait(lock, conn_cnt);<br>
                 }<br>
                 /* Now wait abortable */<br>
                 if (rc == 0)<br>
</div>
</span></font></div>
</body>
</html>