[lustre-discuss] RDMA too many fragments/timed out - clients slowing entire filesystem performance

Brian W. Johanson bjohanso at psc.edu
Tue Nov 1 16:08:42 PDT 2016


CentOS 7.2
Lustre 2.8.0
ZFS 0.6.5.5
OPA 10.2.0.0.158


The clients and servers are on the same OPA network, with no routing.  Once a 
client gets into the state shown in the logs below, filesystem performance drops 
to a fraction of what it is capable of.
The client must be rebooted to clear the issue.

I imagine I am missing an existing bug in Jira for this issue.  Does this look 
like a known issue?


Pertinent debug messages from the server:

00000800:00020000:34.0:1478026118.277782:0:29892:0:(o2iblnd_cb.c:3109:kiblnd_check_txs_locked()) Timed out tx: active_txs, 4 seconds
00000800:00020000:34.0:1478026118.277785:0:29892:0:(o2iblnd_cb.c:3172:kiblnd_check_conns()) Timed out RDMA with 10.4.119.112@o2ib (3): c: 112, oc: 0, rc: 66
00000800:00000100:34.0:1478026118.277787:0:29892:0:(o2iblnd_cb.c:1913:kiblnd_close_conn_locked()) Closing conn to 10.4.119.112@o2ib: error -110(waiting)
00000100:00020000:34.0:1478026118.277844:0:29892:0:(events.c:447:server_bulk_callback()) event type 5, status -103, desc ffff883e8e8bcc00
00000100:00020000:34.0:1478026118.288714:0:29892:0:(events.c:447:server_bulk_callback()) event type 3, status -103, desc ffff883e8e8bcc00
00000100:00020000:34.0:1478026118.299574:0:29892:0:(events.c:447:server_bulk_callback()) event type 5, status -103, desc ffff8810e92e9c00
00000100:00020000:34.0:1478026118.310434:0:29892:0:(events.c:447:server_bulk_callback()) event type 3, status -103, desc ffff8810e92e9c00
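
The negative numbers in these messages (-110, -103) are kernel errno values. 
A minimal sketch for decoding them follows, assuming the generic Linux errno 
numbering; the table and the decode_status() helper are illustrative only and 
are not part of Lustre.

```python
# Generic Linux errno numbers that show up in the server and client logs
# in this post. Hard-coded on purpose so the result does not depend on the
# operating system that runs this script.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    90: ("EMSGSIZE", "Message too long"),
    103: ("ECONNABORTED", "Software caused connection abort"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def decode_status(status):
    """Return 'NAME (description)' for a kernel-style negative status."""
    name, text = LINUX_ERRNO.get(abs(status), ("UNKNOWN", "not in table"))
    return f"{name} ({text})"


for status in (-110, -103):
    print(status, decode_status(status))
```

Read this way, the server closes the connection after a timeout (-110), and 
the bulk callbacks that follow report the transfers as aborted (-103).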


And from the client:

00000400:00000100:8.0:1477949860.565777:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949860.565782:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000800:00020000:8.0:1477949860.702666:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.4.108.81@o2ib (32), src idx/frags: 16/27 dst idx/frags: 16/27
00000800:00020000:8.0:1477949860.702667:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.4.108.81@o2ib: -90
00000100:00020000:8.0:1477949860.702669:0:3629:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff880fd5d9bc00
00000800:00020000:8.0:1477949860.816666:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.4.108.81@o2ib (32), src idx/frags: 16/27 dst idx/frags: 16/27
00000800:00020000:8.0:1477949860.816668:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.4.108.81@o2ib: -90
00000100:00020000:8.0:1477949860.816669:0:3629:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff880fd5d9bc00
00000400:00000100:8.0:1477949861.573660:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949861.573664:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949861.573667:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949861.573669:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949861.573671:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949861.573673:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949861.573675:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949861.573677:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000800:00020000:8.0:1477949861.721668:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.4.108.81@o2ib (32), src idx/frags: 16/27 dst idx/frags: 16/27
00000800:00020000:8.0:1477949861.721669:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.4.108.81@o2ib: -90
00000100:00020000:8.0:1477949861.721670:0:3629:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff880fd5d9bc00
00000800:00020000:8.0:1477949861.836668:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.4.108.81@o2ib (32), src idx/frags: 16/27 dst idx/frags: 16/27
00000800:00020000:8.0:1477949861.836669:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.4.108.81@o2ib: -90
00000100:00020000:8.0:1477949861.836670:0:3629:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff880fd5d9bc00
00000800:00020000:8.0:1477949862.061668:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.4.108.81@o2ib (32), src idx/frags: 16/27 dst idx/frags: 16/27
00000800:00020000:8.0:1477949862.061669:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.4.108.81@o2ib: -90
00000100:00020000:8.0:1477949862.061670:0:3629:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff880fd5d9bc00
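
For anyone comparing counts across many clients, here is a rough sketch that 
pulls the fields out of a "too many fragments" entry. The field names and the 
parse_entry() helper are made up for illustration. Reading the number in 
parentheses as the connection's fragment limit is an assumption on my part, 
not something confirmed in this thread. The pattern accepts both the native 
"@o2ib" NID form and the " at o2ib" form that list archives produce.

```python
import re
from datetime import datetime, timezone

# Pattern for the kiblnd_init_rdma() message quoted in this post.
# Group names are illustrative; "limit" is ASSUMED to be the per-connection
# fragment limit.
FRAG_RE = re.compile(
    r"RDMA has too many fragments for peer (?P<peer>\S+?)(?:@| at )o2ib "
    r"\((?P<limit>\d+)\), src idx/frags: (?P<src_idx>\d+)/(?P<src_frags>\d+) "
    r"dst idx/frags: (?P<dst_idx>\d+)/(?P<dst_frags>\d+)"
)


def parse_entry(entry):
    """Parse one unwrapped debug-log entry; return None if it does not match."""
    match = FRAG_RE.search(entry)
    if match is None:
        return None
    # The fourth colon-separated field of the entry is the epoch timestamp.
    epoch = float(entry.split(":")[3])
    info = {
        key: (value if key == "peer" else int(value))
        for key, value in match.groupdict().items()
    }
    info["time_utc"] = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return info


# One entry as it may arrive wrapped over several lines in a mail archive.
wrapped = """00000800:00020000:8.0:1477949860.702666:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma())
RDMA has too many fragments for peer 10.4.108.81 at o2ib (32), src idx/frags: 16/27
dst idx/frags: 16/27"""

# Rejoin wrapped lines into a single entry before parsing.
entry = " ".join(line.strip() for line in wrapped.splitlines())
print(parse_entry(entry))
```

The timestamp field is seconds since the epoch, so the same helper can be 
used to line up client and server entries in time.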
