<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
Yes, I assume it’s a comment about stack usage (or even more out of date and unrelated to the current code entirely. :p) Like you, I can’t imagine what else it would be...<br>
<br>
You’re probably right about the locking issue, I’m happy to defer. (In any case, this timed out path should be very far from hot...)<br>
<br>
Otherwise that fix looks good. I’ll review the rest of these tomorrow.<br>
<br>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> NeilBrown <neilb@suse.com><br>
<b>Sent:</b> Monday, February 12, 2018 6:17:22 PM<br>
<b>To:</b> Patrick Farrell; Oleg Drokin; Andreas Dilger; James Simmons; Greg Kroah-Hartman<br>
<b>Cc:</b> lkml; lustre<br>
<b>Subject:</b> Re: [lustre-devel] [PATCH 08/19] staging: lustre: simplify waiting in ldlm_completion_ast()</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">On Mon, Feb 12 2018, Patrick Farrell wrote:<br>
<br>
> Neil,<br>
><br>
> I didn't get anything after 8/19 in this series. Is this just me? (I'd keep waiting, except I also found a few things in this patch.)<br>
<br>
Not just you. My fault. They are appearing now.<br>
<br>
><br>
> Minor:<br>
> The line XXX ALLOCATE is out of date and could go. (It refers to a<br>
> mix of things you eliminated and things that were already gone.)<br>
<br>
What does the line even mean? Some comment about stack usage?<br>
I think we have a look that looks for large stack frames. I wonder how<br>
to run it...<br>
<br>
><br>
> Less minor:<br>
> You remove use of the imp_lock when reading the connection count. While that'll work on x86, it's probably wrong on some architecture to read that without taking the lock...?<br>
<br>
It was my understanding that on all architectures which Linux support, a<br>
32bit aligned read is atomic wrt any 32bit write. I have trouble imagining how<br>
it could be otherwise.<br>
<br>
I probably should have highlighted the removal of the spinlock in the<br>
patch description though - it was intentional.<br>
<br>
><br>
> Bug:<br>
> The existing code uses the imp_conn_cnt from *before* the wait, rather<br>
> than after. I think that's quite important. So you'll want to read<br>
> it out before the wait. I think the main reason we'd hit the timeout<br>
> is a disconnect, which should cause a reconnect, so it's very<br>
> important to use the value from *before* the wait. (See comment on<br>
> ptlrpc_set_import_discon for more of an explanation. Basically it's<br>
> tracking a connection 'epoch', if it's changed, someone else already<br>
> went through the reconnect code for this 'connection epoch' and we<br>
> shouldn't start that process.) <br>
><br>
<br>
That wasn't intentional though - thanks for catching!<br>
Looking at ptlrpc_set_import_discon(), which is where the number<br>
eventually gets used, it is only used to compare with the new value of<br>
imp->imp_conn_cnt. <br>
<br>
This would fix both (assuming the locking issue needs fixing).<br>
<br>
Thanks,<br>
NeilBrown<br>
<br>
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c<br>
index f1233d844bbd..c3c9186b74ce 100644<br>
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c<br>
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c<br>
@@ -103,7 +103,7 @@ static int ldlm_request_bufsize(int count, int type)<br>
return sizeof(struct ldlm_request) + avail;<br>
}<br>
<br>
-static void ldlm_expired_completion_wait(struct ldlm_lock *lock, struct obd_import *imp2)<br>
+static void ldlm_expired_completion_wait(struct ldlm_lock *lock, __u32 conn_cnt)<br>
{<br>
struct obd_import *imp;<br>
struct obd_device *obd;<br>
@@ -129,7 +129,7 @@ static void ldlm_expired_completion_wait(struct ldlm_lock *lock, struct obd_impo<br>
<br>
obd = lock->l_conn_export->exp_obd;<br>
imp = obd->u.cli.cl_import;<br>
- ptlrpc_fail_import(imp, imp2 ? imp2->imp_conn_cnt : 0);<br>
+ ptlrpc_fail_import(imp, conn_cnt);<br>
LDLM_ERROR(lock,<br>
"lock timed out (enqueued at %lld, %llds ago), entering recovery for %s@%s",<br>
(s64)lock->l_last_activity,<br>
@@ -241,6 +241,7 @@ int ldlm_completion_ast(struct ldlm_lock *lock, __u64 flags, void *data)<br>
struct obd_device *obd;<br>
struct obd_import *imp = NULL;<br>
__u32 timeout;<br>
+ __u32 conn_cnt = 0;<br>
int rc = 0;<br>
<br>
if (flags == LDLM_FL_WAIT_NOREPROC) {<br>
@@ -268,6 +269,11 @@ int ldlm_completion_ast(struct ldlm_lock *lock, __u64 flags, void *data)<br>
<br>
lock->l_last_activity = ktime_get_real_seconds();<br>
<br>
+ if (imp) {<br>
+ spin_lock(&imp->imp_lock);<br>
+ conn_cnt = imp->imp_conn_cnt;<br>
+ spin_unlock(&imp->imp_lock);<br>
+ }<br>
if (OBD_FAIL_CHECK_RESET(OBD_FAIL_LDLM_INTR_CP_AST,<br>
OBD_FAIL_LDLM_CP_BL_RACE | OBD_FAIL_ONCE)) {<br>
ldlm_set_fail_loc(lock);<br>
@@ -280,7 +286,7 @@ int ldlm_completion_ast(struct ldlm_lock *lock, __u64 flags, void *data)<br>
is_granted_or_cancelled(lock),<br>
timeout * HZ);<br>
if (rc == 0)<br>
- ldlm_expired_completion_wait(lock, imp);<br>
+ ldlm_expired_completion_wait(lock, conn_cnt);<br>
}<br>
/* Now wait abortable */<br>
if (rc == 0)<br>
</div>
</span></font></div>
</body>
</html>