[lustre-devel] [PATCH 17/22] lnet: Signal completion on ping send failure

James Simmons jsimmons at infradead.org
Sun Nov 20 06:17:03 PST 2022


From: Chris Horn <chris.horn at hpe.com>

Call complete() on the ping_data::completion if we get
LNET_EVENT_SEND with non-zero status. Otherwise the thread which
issued the ping is stuck waiting for the full ping timeout.

A pd_unlinked member is added to struct ping_data to indicate whether
the associated MD has been unlinked. This is checked by lnet_ping() to
determine whether it needs to explicitly call LNetMDUnlink().

Lastly, in cases where we do not receive a reply, we now return the
value of pd.rc, if it is non-zero, rather than -EIO. This can provide
more information about the underlying ping failure.

HPE-bug-id: LUS-11317
WC-bug-id: https://jira.whamcloud.com/browse/LU-16290
Lustre-commit: 48c34c71de65e8a25 ("LU-16290 lnet: Signal completion on ping send failure")
Signed-off-by: Chris Horn <chris.horn at hpe.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49020
Reviewed-by: Serguei Smirnov <ssmirnov at whamcloud.com>
Reviewed-by: Frank Sehr <fsehr at whamcloud.com>
Reviewed-by: Oleg Drokin <green at whamcloud.com>
Signed-off-by: James Simmons <jsimmons at infradead.org>
---
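The control flow described above can be approximated in user space as
follows. This is only a minimal sketch, not the kernel code: every
sketch_* name is hypothetical, the completion is modelled as a plain
counter rather than a struct completion, and the branch that stores
event->status into rc is an assumption about handler code that lies
outside the hunks shown in this patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the LNet event types (not the kernel API). */
enum sketch_event_type {
	SKETCH_EVENT_SEND,
	SKETCH_EVENT_REPLY,
	SKETCH_EVENT_UNLINK,
};

struct sketch_event {
	enum sketch_event_type type;
	int status;	/* 0 on success, negative errno on failure */
	int mlength;	/* reply length, only meaningful for a REPLY */
	bool unlinked;	/* true if the MD was unlinked by this event */
};

struct sketch_ping_data {
	int rc;
	bool replied;
	bool unlinked;		/* models pd_unlinked */
	int completions;	/* models calls to complete() */
};

/* Models the event handler after this patch. */
static void sketch_ping_event_handler(struct sketch_ping_data *pd,
				      const struct sketch_event *ev)
{
	if (ev->status) {
		/* Assumption: the first failure status is kept in rc. */
		if (!pd->rc)
			pd->rc = ev->status;
	} else if (ev->type == SKETCH_EVENT_REPLY) {
		pd->replied = true;
		pd->rc = ev->mlength;
	}

	if (ev->unlinked)
		pd->unlinked = true;

	/* Wake the waiter on unlink, or as soon as the send fails. */
	if (ev->unlinked ||
	    (ev->type == SKETCH_EVENT_SEND && ev->status))
		pd->completions++;
}

/* Models the waiter's check: unlink explicitly only if not yet unlinked. */
static bool sketch_needs_unlink(const struct sketch_ping_data *pd)
{
	return !pd->unlinked;
}

/* Models the no-reply path: prefer the recorded error over -EIO. */
static int sketch_no_reply_rc(const struct sketch_ping_data *pd)
{
	return pd->rc ? pd->rc : -EIO;
}
```

In this model a failed send wakes the waiter at once while leaving
unlinked unset, so the waiter would still unlink the MD and wait for
the final unlink event, and the no-reply path reports the send error
instead of a generic -EIO.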
 net/lnet/lnet/api-ni.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 935c848..8b53adf 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -5333,6 +5333,7 @@ void LNetDebugPeer(struct lnet_processid *id)
 struct ping_data {
 	int rc;
 	int replied;
+	int pd_unlinked;
 	struct lnet_handle_md mdh;
 	struct completion completion;
 };
@@ -5353,7 +5354,12 @@ struct ping_data {
 		pd->replied = 1;
 		pd->rc = event->mlength;
 	}
+
 	if (event->unlinked)
+		pd->pd_unlinked = 1;
+
+	if (event->unlinked ||
+	    (event->type == LNET_EVENT_SEND && event->status))
 		complete(&pd->completion);
 }
 
@@ -5424,13 +5430,14 @@ static int lnet_ping(struct lnet_process_id id4, struct lnet_nid *src_nid,
 		/* NB must wait for the UNLINK event below... */
 	}
 
-	if (wait_for_completion_timeout(&pd.completion, timeout) == 0) {
-		/* Ensure completion in finite time... */
+	/* Ensure completion in finite time... */
+	wait_for_completion_timeout(&pd.completion, timeout);
+	if (!pd.pd_unlinked) {
 		LNetMDUnlink(pd.mdh);
 		wait_for_completion(&pd.completion);
 	}
 	if (!pd.replied) {
-		rc = -EIO;
+		rc = pd.rc ?: -EIO;
 		goto fail_ping_buffer_decref;
 	}
 
-- 
1.8.3.1


