[lustre-devel] [PATCH 15/25] lustre: o2iblnd: reconnect peer for REJ_INVALID_SERVICE_ID

James Simmons jsimmons at infradead.org
Tue Sep 25 19:48:07 PDT 2018


From: Sergey Cheremencev <c17829 at cray.com>

Don't kill the peer in case of INVALID_SERVICE_ID. This produces
huge number of peers for the same nid and may cause an OOM.

The OOM was frequently seen with mlnx-ofa-kernel-2.3 where used
RCU mechanism in mlx4_cq_free. In older mlnx4 versions to mitigate
the issue RCU was changed with spin locks.

Signed-off-by: Sergey Cheremencev <c17829 at cray.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9094
Seagate-bug-id: MRP-4056
Reviewed-on: https://review.whamcloud.com/25378
Reviewed-by: Doug Oucharek <dougso at me.com>
Reviewed-by: Amir Shehata <ashehata at whamcloud.com>
Reviewed-by: Oleg Drokin <green at whamcloud.com>
Signed-off-by: James Simmons <jsimmons at infradead.org>
---
 drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    | 1 +
 drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index a3d89ec..de04355 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -460,6 +460,7 @@ struct kib_rej {
 #define IBLND_REJECT_RDMA_FRAGS	    6
 /* peer_ni's msg queue size doesn't match mine */
 #define IBLND_REJECT_MSG_QUEUE_SIZE 7
+#define IBLND_REJECT_INVALID_SRV_ID 8
 
 /***********************************************************************/
 
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
index a6b261a..dc71554 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -2611,6 +2611,10 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	case IBLND_REJECT_CONN_UNCOMPAT:
 		reason = "version negotiation";
 		break;
+
+	case IBLND_REJECT_INVALID_SRV_ID:
+		reason = "invalid service id";
+		break;
 	}
 
 	conn->ibc_reconnect = 1;
@@ -2648,6 +2652,8 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 		break;
 
 	case IB_CM_REJ_INVALID_SERVICE_ID:
+		kiblnd_check_reconnect(conn, IBLND_MSG_VERSION, 0,
+				       IBLND_REJECT_INVALID_SRV_ID, NULL);
 		CNETERR("%s rejected: no listener at %d\n",
 			libcfs_nid2str(peer_ni->ibp_nid),
 			*kiblnd_tunables.kib_service);
-- 
1.8.3.1



More information about the lustre-devel mailing list