[lustre-discuss] o2ib nid connections timeout until an snmp ping

Christian Kuntz c.kuntz at opendrives.com
Tue Mar 9 11:13:02 PST 2021


Hello all,

Requisite preamble: This is Debian 10.7 with Lustre 2.13.0 (compiled by
yours truly).

We've been observing some odd behavior recently with o2ib NIDs. All machines
are connected through the same switch (cards and switch are all Mellanox),
and each machine has a single network card connected in a bond up to the
switch. Whenever a 'new' machine connects to the others over LNet, `lctl
ping` and other operations fail against some subset of the existing hosts.
Curiously, after an SNMP ping is issued, all o2ib operations succeed and
things stabilize.
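
To pin down which existing hosts are affected when a new machine joins, a
loop over `lctl ping` can help. The following is only a minimal sketch: the
helper name `ping_peers` and the timeout value are made up for illustration,
and it assumes that `lctl ping <nid>` exits non-zero when the ping fails.

```python
import subprocess


def ping_peers(nids, run=subprocess.run, timeout_s=10):
    """Run `lctl ping` against each NID and return {nid: reachable}.

    `run` is injectable so the logic can be exercised without Lustre
    installed. Assumption: a non-zero exit status means the ping failed.
    """
    results = {}
    for nid in nids:
        try:
            proc = run(
                ["lctl", "ping", nid],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
            results[nid] = proc.returncode == 0
        except (subprocess.TimeoutExpired, FileNotFoundError):
            # A hung ping or a missing lctl binary both count as unreachable.
            results[nid] = False
    return results


# Example call on a Lustre node, using two peer NIDs from this setup:
# ping_peers(["10.100.101.32@o2ib1", "10.100.101.36@o2ib1"])
```

Running it once right after the new machine comes up and once after the
SNMP ping would show exactly which peers change state.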

We've tested the stability with the ib_ suite of tools and the fabric
itself appears stable. We have not yet attempted to reproduce the behavior
with tcp NIDs, but we haven't encountered this issue in approximately one
year of using Lustre over tcp NIDs.

Here are the relevant dmesg portions:
[72768.234745] LNetError: 16138:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.100.101.210@o2ib1 added to recovery queue. Health = 0
[72792.235556] LNetError: 16138:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.100.101.210@o2ib1 added to recovery queue. Health = 0
[72829.229280] LNetError: 16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.100.101.32@o2ib1: -125
[72829.231426] LNetError: 16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message
[72966.226069] LNetError: 16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.100.101.32@o2ib1: -125
[72966.228366] LNetError: 16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Skipped 3 previous similar messages
[73006.226876] LNetError: 16138:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
[73006.228085] LNetError: 16138:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 31 previous similar messages
[73006.229140] LNetError: 16138:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 10.100.101.32@o2ib1 (7): c: 6, oc: 0, rc: 8
[73006.231045] LNetError: 16138:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 31 previous similar messages
[73016.243190] LNet: 16138:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.100.101.36@o2ib1: 9 seconds
[73016.243195] LNet: 16138:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 60 previous similar messages
[73032.243722] LNetError: 16138:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.100.101.210@o2ib1 added to recovery queue. Health = 0
[73261.244179] LNetError: 16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.100.101.32@o2ib1: -125
[73261.246265] LNetError: 16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Skipped 11 previous similar messages
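
For anyone triaging a longer capture, the log above can be tallied per NID
and per reporting function. This is a rough sketch, not a Lustre tool: the
names `join_wrapped` and `summarize` are made up for illustration, and the
parser assumes each record starts with a bracketed dmesg timestamp. It
rejoins lines wrapped by a mail client and accepts NIDs written with either
`@` or the list archive's ` at ` substitution.

```python
import re
from collections import Counter

# Short excerpt of the dmesg output, left hard-wrapped on purpose.
SAMPLE = """\
[72829.229280] LNetError:
16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Error sending
GET to 12345-10.100.101.32 at o2ib1: -125
[73006.229140] LNetError: 16138:0:(o2iblnd_cb.c:3426:kiblnd_check_conns())
Timed out RDMA with 10.100.101.32 at o2ib1 (7): c: 6, oc: 0, rc: 8
[73016.243190] LNet: 16138:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed
out tx for 10.100.101.36 at o2ib1: 9 seconds
[73032.243722] LNetError:
16138:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.100.101.210 at o2ib1
added to recovery queue. Health = 0
"""

# IPv4 address, then "@" or " at ", then the LNet network name.
NID_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})(?:@| at )(\w+)")
# Function name inside the "(file.c:line:function())" location tag.
FUNC_RE = re.compile(r":(\w+)\(\)\)")


def join_wrapped(text):
    """Merge continuation lines into the record that precedes them."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def summarize(text):
    """Count records per (NID, reporting function) pair."""
    counts = Counter()
    for record in join_wrapped(text):
        nid = NID_RE.search(record)
        func = FUNC_RE.search(record)
        if nid and func:
            counts[(f"{nid.group(1)}@{nid.group(2)}", func.group(1))] += 1
    return counts


for (nid, func), n in sorted(summarize(SAMPLE).items()):
    print(f"{nid:24} {func:36} {n}")
```

"Skipped N previous similar messages" records carry no NID and are ignored
by this sketch. If the -125 is a negated Linux errno, as kernel return codes
usually are, it would correspond to ECANCELED.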

Is this normal/known behavior with 2.13, or have I missed some part of
the o2ib network setup?

Please let me know if further information is needed.

Cheers, and thanks for your time,
Christian