[lustre-devel] [PATCH 107/622] lnet: router handling

James Simmons jsimmons at infradead.org
Thu Feb 27 13:09:35 PST 2020


From: Amir Shehata <ashehata at whamcloud.com>

Re-create the md and mdh if the router checker ping times out.
When retransmitting a message, send it even if the peer is marked
down, so that the message's retry quota is fulfilled.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11272
Lustre-commit: 05becd69bc0c ("LU-11272 lnet: router handling")
Signed-off-by: Amir Shehata <ashehata at whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33043
Reviewed-by: Olaf Weber <olaf.weber at hpe.com>
Reviewed-by: Sonia Sharma <sharmaso at whamcloud.com>
Reviewed-by: Oleg Drokin <green at whamcloud.com>
Signed-off-by: James Simmons <jsimmons at infradead.org>
---
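Note (illustration only, not part of the patch): the two decisions
changed here can be sketched as below. This is a simplified model under
stated assumptions. The struct and function names (peer_model,
msg_model, rc_data_model, peer_alive_model, rc_data_needs_update) are
hypothetical stand-ins for the real LNet types, and the periodic NI
query logic in lnet_peer_alive_locked() is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the liveness state of struct lnet_peer_ni. */
struct peer_model {
	bool alive;		/* last known liveness of the next hop */
};

/* Hypothetical stand-in for struct lnet_msg. */
struct msg_model {
	int retry_count;	/* mirrors msg->msg_retry_count */
};

/* Hypothetical stand-in for the router checker data (rcd). */
struct rc_data_model {
	int nnis;		/* mirrors rcd->rcd_nnis */
	int pb_nnis;		/* mirrors rcd->rcd_pingbuffer->pb_nnis */
	bool mdh_invalid;	/* stands in for LNetMDHandleIsInvalid() */
};

/* Returns 1 if a send should be attempted, 0 if the message is dropped. */
static int peer_alive_model(const struct peer_model *lp,
			    const struct msg_model *msg)
{
	if (lp->alive)
		return 1;

	/* A resend is attempted even though the peer looks down. */
	if (msg->retry_count > 0)
		return 1;

	return 0;
}

/* Returns true if the router checker data has to be (re)created. */
static bool rc_data_needs_update(const struct rc_data_model *rcd)
{
	return !rcd || rcd->nnis > rcd->pb_nnis || rcd->mdh_invalid;
}
```

In this model a first transmission to a down peer is still dropped,
while any retry goes out, and a timed-out ping (invalid mdh) forces the
router checker data to be rebuilt even when the NI count is unchanged.
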
 net/lnet/lnet/lib-move.c | 12 ++++++++++--
 net/lnet/lnet/router.c   |  8 +++++++-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index eb0b48d..3cab970 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -678,7 +678,8 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
  *     may drop the lnet_net_lock
  */
 static int
-lnet_peer_alive_locked(struct lnet_ni *ni, struct lnet_peer_ni *lp)
+lnet_peer_alive_locked(struct lnet_ni *ni, struct lnet_peer_ni *lp,
+		       struct lnet_msg *msg)
 {
 	time64_t now = ktime_get_seconds();
 
@@ -689,6 +690,13 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		return 1;
 
 	/*
+	 * If we're resending a message, let's attempt to send it even if
+	 * the peer is down to fulfill our resend quota on the message
+	 */
+	if (msg->msg_retry_count > 0)
+		return 1;
+
+	/*
 	 * Peer appears dead, but we should avoid frequent NI queries (at
 	 * most once per lnet_queryinterval seconds).
 	 */
@@ -746,7 +754,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 
 	/* NB 'lp' is always the next hop */
 	if (!(msg->msg_target.pid & LNET_PID_USERFLAG) &&
-	    !lnet_peer_alive_locked(ni, lp)) {
+	    !lnet_peer_alive_locked(ni, lp, msg)) {
 		the_lnet.ln_counters[cpt]->drop_count++;
 		the_lnet.ln_counters[cpt]->drop_length += msg->msg_len;
 		lnet_net_unlock(cpt);
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 7c3bbd8..66a116c 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -1042,7 +1042,13 @@ int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg)
 	}
 
 	rcd = rtr->lpni_rcd;
-	if (!rcd || rcd->rcd_nnis > rcd->rcd_pingbuffer->pb_nnis)
+
+	/* The response to the router checker ping could've timed out and
+	 * the mdh might've been invalidated, so we need to update it
+	 * again.
+	 */
+	if (!rcd || rcd->rcd_nnis > rcd->rcd_pingbuffer->pb_nnis ||
+	    LNetMDHandleIsInvalid(rcd->rcd_mdh))
 		rcd = lnet_update_rc_data_locked(rtr);
 	if (!rcd)
 		return;
-- 
1.8.3.1


