[lustre-devel] [PATCH 415/622] lnet: Sync the start of discovery and monitor threads
James Simmons
jsimmons at infradead.org
Thu Feb 27 13:14:43 PST 2020
From: Chris Horn <hornc at cray.com>
The discovery thread starts before the monitor thread, so it may
issue PUTs or GETs before the monitor thread has had a chance to
initialize its data structures (namely the_lnet.ln_mt_rstq). This can
result in an oops when we attempt to attach response trackers to MDs.
Introduce a completion to synchronize the startup of these threads.
WC-bug-id: https://jira.whamcloud.com/browse/LU-12537
Lustre-commit: 9283e2ed6655 ("LU-12537 lnet: Sync the start of discovery and monitor threads")
Signed-off-by: Chris Horn <hornc at cray.com>
Reviewed-on: https://review.whamcloud.com/35478
Reviewed-by: Alexandr Boyko <c17825 at cray.com>
Reviewed-by: Amir Shehata <ashehata at whamcloud.com>
Reviewed-by: Oleg Drokin <green at whamcloud.com>
Signed-off-by: James Simmons <jsimmons at infradead.org>
---
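Note: the startup ordering described in the commit message can be
illustrated outside the kernel. The following is only a user-space sketch
under stated assumptions: it emulates the kernel's struct completion with
a pthread mutex and condition variable, and the names demo_state,
demo_discovery() and run_demo() are hypothetical and do not appear in the
patch. It shows a worker blocking in wait_for_completion() until the
starter has finished setting up shared state and called complete_all().

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-in for the kernel's struct completion (illustrative). */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

/* Mark the completion done and release every current and future waiter. */
static void complete_all(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Block until complete_all() has been called. */
static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Hypothetical shared state; rstq stands in for the_lnet.ln_mt_rstq. */
struct demo_state {
	struct completion started;
	int *rstq;
	int seen_initialized;
};

/* Plays the role of the discovery thread: wait first, then use rstq. */
static void *demo_discovery(void *arg)
{
	struct demo_state *st = arg;

	wait_for_completion(&st->started);
	st->seen_initialized = (st->rstq != NULL);
	return NULL;
}

/* Returns 1 if the worker observed initialized state, -1 on thread error. */
int run_demo(void)
{
	static int queue_storage;
	struct demo_state st = { .rstq = NULL, .seen_initialized = -1 };
	pthread_t dc;

	init_completion(&st.started);
	if (pthread_create(&dc, NULL, demo_discovery, &st) != 0)
		return -1;

	/* Setup that must finish before the worker enters its loop. */
	st.rstq = &queue_storage;
	complete_all(&st.started);

	pthread_join(dc, NULL);
	assert(st.seen_initialized == 1);
	return st.seen_initialized;
}
```

Calling run_demo() from a small main() and building with -pthread is
enough to try it. The worker thread is created before the shared pointer
is set, which mirrors the window described above; the wait on the
completion is what keeps it from reading the pointer too early.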
include/linux/lnet/lib-types.h | 5 +++++
net/lnet/lnet/api-ni.c | 3 +++
net/lnet/lnet/lib-move.c | 1 +
net/lnet/lnet/peer.c | 11 ++++++++++-
4 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index b240361..1009a69 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -1161,6 +1161,11 @@ struct lnet {
/* recovery eq handler */
struct lnet_handle_eq ln_mt_eqh;
+ /*
+ * Completed when the discovery and monitor threads can enter their
+ * work loops
+ */
+ struct completion ln_started;
};
#endif
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 65f1f17..aa5ca52 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -1062,6 +1062,7 @@ struct lnet_libhandle *
INIT_LIST_HEAD(&the_lnet.ln_mt_peerNIRecovq);
init_waitqueue_head(&the_lnet.ln_dc_waitq);
LNetInvalidateEQHandle(&the_lnet.ln_mt_eqh);
+ init_completion(&the_lnet.ln_started);
rc = lnet_descriptor_setup();
if (rc != 0)
@@ -2583,6 +2584,8 @@ void lnet_lib_exit(void)
mutex_unlock(&the_lnet.ln_api_mutex);
+ complete_all(&the_lnet.ln_started);
+
/* wait for all routers to start */
lnet_wait_router_start();
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 9a4c426..413397c 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3529,6 +3529,7 @@ void lnet_monitor_thr_stop(void)
lnet_build_msg_event(msg, LNET_EVENT_PUT);
+ wait_for_completion(&the_lnet.ln_started);
/*
* Must I ACK? If so I'll grab the ack_wmd out of the header and put
* it back into the ACK during lnet_finalize()
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index b0ca1de..49da7a1 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -3258,6 +3258,8 @@ static int lnet_peer_discovery(void *arg)
struct lnet_peer *lp;
int rc;
+ wait_for_completion(&the_lnet.ln_started);
+
CDEBUG(D_NET, "started\n");
for (;;) {
@@ -3429,7 +3431,14 @@ void lnet_peer_discovery_stop(void)
LASSERT(the_lnet.ln_dc_state == LNET_DC_STATE_RUNNING);
the_lnet.ln_dc_state = LNET_DC_STATE_STOPPING;
- wake_up(&the_lnet.ln_dc_waitq);
+
+ /* In the LNetNIInit() path we may be stopping discovery before it
+ * entered its work loop
+ */
+ if (!completion_done(&the_lnet.ln_started))
+ complete(&the_lnet.ln_started);
+ else
+ wake_up(&the_lnet.ln_dc_waitq);
wait_event(the_lnet.ln_dc_waitq,
the_lnet.ln_dc_state == LNET_DC_STATE_SHUTDOWN);
--
1.8.3.1