[lustre-devel] [PATCH 326/622] lnet: Ensure md is detached when msg is not committed

James Simmons jsimmons at infradead.org
Thu Feb 27 13:13:14 PST 2020


From: Chris Horn <hornc at cray.com>

It's possible for lnet_is_health_check() to return "true" when the
message has not hit the network. In this situation the message is
freed without detaching the MD. As a result, the affected requests
never receive their unlink events and are stuck forever.

A little cleanup is included here:
 - The return value of lnet_is_health_check() is only used in one
   place, so we don't need to save it in a variable.
 - We don't need separate logic to detach the MD when the send was
   successful. We'll fall through to the finalizing code after
   incrementing the health counters.
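
The ordering problem described above can be illustrated with a small
userspace model. This is only a sketch: struct toy_md, struct toy_msg,
toy_detach_md(), toy_finalize_old() and toy_finalize_new() are
hypothetical stand-ins invented for illustration, not LNet code. The
wants_health_check and resend flags are assumed simplifications of
lnet_is_health_check() and of lnet_health_check() queuing a resend,
and locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; these are not the LNet types. */
struct toy_md {
	int unlink_events;		/* unlink events delivered so far */
};

struct toy_msg {
	struct toy_md *md;		/* attached MD, or NULL */
	bool committed;			/* has the message hit the network? */
	bool wants_health_check;	/* models lnet_is_health_check() */
	bool resend;			/* models a resend being queued */
	bool freed;			/* set instead of calling kfree() */
};

/* Detach the MD and deliver its unlink event (locking omitted). */
static void toy_detach_md(struct toy_msg *msg)
{
	assert(msg->md != NULL);
	msg->md->unlink_events++;
	msg->md = NULL;
}

/* Ordering before the patch: the uncommitted check can free the
 * message while an MD is still attached.
 */
static void toy_finalize_old(struct toy_msg *msg, int status)
{
	if (msg->md && !status)
		toy_detach_md(msg);
	if (msg->md && !msg->wants_health_check)
		toy_detach_md(msg);
	if (!msg->committed) {
		msg->freed = true;	/* MD may still be attached here */
		return;
	}
	if (msg->wants_health_check) {
		if (msg->resend)
			return;		/* queued for resend, keep the MD */
		if (msg->md)
			toy_detach_md(msg);
	}
	msg->freed = true;
}

/* Ordering after the patch: unless the message is queued for resend,
 * the MD is always detached before the uncommitted check.
 */
static void toy_finalize_new(struct toy_msg *msg, int status)
{
	(void)status;			/* unused in this simplified model */
	if (msg->wants_health_check && msg->resend)
		return;			/* queued for resend, keep the MD */
	if (msg->md)
		toy_detach_md(msg);
	if (!msg->committed) {
		msg->freed = true;
		return;
	}
	msg->freed = true;		/* full finalization elided */
}
```

With an uncommitted message, a non-zero status and
wants_health_check set, toy_finalize_old() frees the message without
delivering an unlink event, while toy_finalize_new() delivers it
first.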

Cray-bug-id: LUS-7239
WC-bug-id: https://jira.whamcloud.com/browse/LU-12199
Lustre-commit: b65f3a1767ae ("LU-12199 lnet: Ensure md is detached when msg is not committed")
Signed-off-by: Chris Horn <hornc at cray.com>
Reviewed-on: https://review.whamcloud.com/34885
Reviewed-by: Olaf Weber <olaf.weber at hpe.com>
Reviewed-by: Amir Shehata <ashehata at whamcloud.com>
Signed-off-by: James Simmons <jsimmons at infradead.org>
---
 net/lnet/lnet/lib-msg.c | 66 +++++++++++++++----------------------------------
 1 file changed, 20 insertions(+), 46 deletions(-)

diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index ad35c3d..dbd8de4 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -784,16 +784,6 @@
 	msg->msg_md = NULL;
 }
 
-static void
-lnet_detach_md(struct lnet_msg *msg, int status)
-{
-	int cpt = lnet_cpt_of_cookie(msg->msg_md->md_lh.lh_cookie);
-
-	lnet_res_lock(cpt);
-	lnet_msg_detach_md(msg, cpt, status);
-	lnet_res_unlock(cpt);
-}
-
 static bool
 lnet_is_health_check(struct lnet_msg *msg)
 {
@@ -881,7 +871,6 @@
 	int cpt;
 	int rc;
 	int i;
-	bool hc;
 
 	LASSERT(!in_interrupt());
 
@@ -890,36 +879,7 @@
 
 	msg->msg_ev.status = status;
 
-	/* if the message is successfully sent, no need to keep the MD around */
-	if (msg->msg_md && !status)
-		lnet_detach_md(msg, status);
-
-again:
-	hc = lnet_is_health_check(msg);
-
-	/* the MD would've been detached from the message if it was
-	 * successfully sent. However, if it wasn't successfully sent the
-	 * MD would be around. And since we recalculate whether to
-	 * health check or not, it's possible that we change our minds and
-	 * we don't want to health check this message. In this case also
-	 * free the MD.
-	 *
-	 * If the message is successful we're going to
-	 * go through the lnet_health_check() function, but that'll just
-	 * increment the appropriate health value and return.
-	 */
-	if (msg->msg_md && !hc)
-		lnet_detach_md(msg, status);
-
-	rc = 0;
-	if (!msg->msg_tx_committed && !msg->msg_rx_committed) {
-		/* not committed to network yet */
-		LASSERT(!msg->msg_onactivelist);
-		kfree(msg);
-		return;
-	}
-
-	if (hc) {
+	if (lnet_is_health_check(msg)) {
 		/* Check the health status of the message. If it has one
 		 * of the errors that we're supposed to handle, and it has
 		 * not timed out, then
@@ -932,13 +892,26 @@
 		 * put on the resend queue.
 		 */
 		if (!lnet_health_check(msg))
+			/* Message is queued for resend */
 			return;
+	}
 
-		/* if we get here then we need to clean up the md because we're
-		 * finalizing the message.
-		 */
-		if (msg->msg_md)
-			lnet_detach_md(msg, status);
+	/* We're not going to resend this message so detach its MD and invoke
+	 * the appropriate callbacks
+	 */
+	if (msg->msg_md) {
+		cpt = lnet_cpt_of_cookie(msg->msg_md->md_lh.lh_cookie);
+		lnet_res_lock(cpt);
+		lnet_msg_detach_md(msg, cpt, status);
+		lnet_res_unlock(cpt);
+	}
+
+again:
+	if (!msg->msg_tx_committed && !msg->msg_rx_committed) {
+		/* not committed to network yet */
+		LASSERT(!msg->msg_onactivelist);
+		kfree(msg);
+		return;
 	}
 
 	/*
@@ -972,6 +945,7 @@
 
 	container->msc_finalizers[my_slot] = current;
 
+	rc = 0;
 	while ((msg = list_first_entry_or_null(&container->msc_finalizing,
 					       struct lnet_msg,
 					       msg_list)) != NULL) {
-- 
1.8.3.1


