[lustre-devel] [PATCH 04/39] lustre: ldlm: Do not hang if recovery restarted during lock replay

James Simmons jsimmons at infradead.org
Thu Jan 21 09:16:27 PST 2021

From: Oleg Drokin <green at whamcloud.com>

LU-13600 introduced lock rate-limiting logic, but it did not take
into account that if a disconnection occurs in the REPLAY_LOCKS
phase, locks not yet sent get stuck in the sending queue, so
the replay locks thread hangs with imp_replay_inflight elevated
above zero.

The direct consequence is that the recovery state machine never
advances from REPLAY to REPLAY_LOCKS status while imp_replay_inflight
is non-zero.

Adjust __ldlm_replay_locks() to check whether the import state has
changed before attempting to send any more requests.
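
The drain-loop behaviour described above can be illustrated with a small
standalone model. This is only a sketch, not the Lustre code: the sketch_*
names, the simplified state enum and the disconnect_after knob are all
hypothetical stand-ins, and the queue is reduced to a counter. It shows the
one idea the patch relies on: once the state is no longer "replay locks"
(or an earlier send failed), the remaining queued locks are released
instead of sent, so the loop always terminates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for import states; not the Lustre definitions. */
enum sketch_imp_state {
	SKETCH_IMP_REPLAY,
	SKETCH_IMP_REPLAY_LOCKS,
	SKETCH_IMP_DISCON,
};

/* Hypothetical, heavily simplified model of an import. */
struct sketch_import {
	enum sketch_imp_state state;
	int sent;             /* locks actually replayed */
	int released;         /* locks dropped without being sent */
	int disconnect_after; /* simulate a disconnect after N sends; -1 = never */
};

/* Pretend to send one lock; flip the state to simulate a disconnect. */
static int sketch_replay_one(struct sketch_import *imp)
{
	imp->sent++;
	if (imp->disconnect_after >= 0 && imp->sent >= imp->disconnect_after)
		imp->state = SKETCH_IMP_DISCON;
	return 0;
}

/* Drain nlocks queued locks. After an error or a state change, the
 * remaining locks are only released, never sent.
 */
static int sketch_replay_locks(struct sketch_import *imp, int nlocks)
{
	int rc = 0;
	int i;

	assert(imp != NULL);
	for (i = 0; i < nlocks; i++) {
		if (rc || imp->state != SKETCH_IMP_REPLAY_LOCKS) {
			imp->released++; /* clean up and skip the send */
			continue;
		}
		rc = sketch_replay_one(imp);
	}
	return rc;
}
```

In this model, a disconnect after the second send leaves the other queued
locks released rather than stuck, which is the property the real change is
meant to restore for imp_replay_inflight.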

Add a testcase.

Fixes: 8cc7f22847 ("lustre: ptlrpc: limit rate of lock replays")
WC-bug-id: https://jira.whamcloud.com/browse/LU-14027
Lustre-commit: 7ca495ec67f474 ("LU-14027 ldlm: Do not hang if recovery restarted during lock replay")
Signed-off-by: Oleg Drokin <green at whamcloud.com>
Reviewed-on: https://review.whamcloud.com/40238
Reviewed-by: Mike Pershin <mpershin at whamcloud.com>
Reviewed-by: Andreas Dilger <adilger at whamcloud.com>
Signed-off-by: James Simmons <jsimmons at infradead.org>
 fs/lustre/ldlm/ldlm_request.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index a2e1969..86b10a7 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -2271,9 +2271,12 @@ int __ldlm_replay_locks(struct obd_import *imp, bool rate_limit)
 		lock = list_first_entry(&list, struct ldlm_lock,
-		if (rc) {
+		/* If we disconnected in the middle - cleanup and let
+		 * reconnection to happen again. LU-14027
+		 */
+		if (rc || (imp->imp_state != LUSTRE_IMP_REPLAY_LOCKS)) {
-			continue; /* or try to do the rest? */
+			continue;
 		rc = replay_one_lock(imp, lock);
