[lustre-discuss] Clients lose IB connection to OSS

Thomas Stibor t.stibor at gsi.de
Mon May 1 08:59:28 PDT 2017


Hi,

See JIRA: https://jira.hpdd.intel.com/browse/LU-5718

What seems to work as a quick fix (for older versions) is to set the
parameter max_pages_per_rpc=64.
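A rough way to see why a smaller value helps: one full RPC carries max_pages_per_rpc times the page size, and on each side of the transfer the bulk buffer cannot be split into more RDMA fragments than it has pages. The sketch below assumes 4 KiB pages and takes the limit of 256 from the kiblnd_init_rdma() error in the quoted log; it is a back-of-the-envelope check, not a model of how o2iblnd actually counts fragments. On clients the parameter is normally set at runtime with `lctl set_param osc.*.max_pages_per_rpc=64`, which does not survive a remount.

```python
PAGE_SIZE = 4096       # assumption: 4 KiB pages, typical for x86_64 clients
PEER_MAX_FRAGS = 256   # limit printed in the kiblnd_init_rdma() error in the log


def rpc_bytes(max_pages_per_rpc: int, page_size: int = PAGE_SIZE) -> int:
    """Bulk payload carried by one full-sized RPC."""
    return max_pages_per_rpc * page_size


def max_frags_per_side(max_pages_per_rpc: int) -> int:
    """Upper bound: each page contributes at most one RDMA fragment."""
    return max_pages_per_rpc


if __name__ == "__main__":
    for pages in (64, 256, 1024):  # example settings only
        print(f"max_pages_per_rpc={pages}: {rpc_bytes(pages) // 1024} KiB per RPC, "
              f"at most {max_frags_per_side(pages)} fragments per side "
              f"(peer limit {PEER_MAX_FRAGS})")
```

With 64 pages the per-side bound stays far below 256, at the cost of 256 KiB RPCs instead of larger ones.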

As noted in https://jira.hpdd.intel.com/browse/LU-5718, the issue is
resolved, but only in the upcoming 2.10.0 release.
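To decide which clients still need the workaround, compare release numbers numerically: as plain strings, "2.9.0" sorts after "2.10.0". A minimal sketch, assuming purely numeric dotted version strings (no "-rc" or vendor suffixes) and taking 2.10.0 as the first fixed release, as stated above; `needs_workaround` is a hypothetical helper name.

```python
FIXED_IN = "2.10.0"  # first release with the LU-5718 fix, per this thread


def parse_version(version: str, width: int = 3) -> tuple:
    """Turn '2.9' or '2.9.0' into a zero-padded tuple such as (2, 9, 0)."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def needs_workaround(client_version: str) -> bool:
    """True if the version predates FIXED_IN (numeric, not string, comparison)."""
    return parse_version(client_version) < parse_version(FIXED_IN)


if __name__ == "__main__":
    print("2.9.0" < "2.10.0")         # False: string order is wrong here
    print(needs_workaround("2.9.0"))  # True
```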

Cheers
 Thomas

On Mon, May 01, 2017 at 04:47:32PM +0200, Hans Henrik Happe wrote:
> Hi,
> 
> We have experienced problems with losing the connection to the OSS. It starts with:
> 
> May  1 03:35:46 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116@o2ib (256), src idx/frags: 128/236 dst idx/frags: 128/236
> May  1 03:35:46 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116@o2ib: -90
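Both messages can be pulled apart mechanically. The sketch below extracts the fields of the two lines above, assuming the NID is written in its normal addr@net form (the list archive renders "@" as " at "). The -90 returned by kiblnd_reply() is EMSGSIZE on Linux, and the -5 in the client_bulk_callback() lines of the attached log is EIO; the small errno table is hardcoded so the result does not depend on the platform running the script. Function names are illustrative.

```python
import re
from typing import Optional

# Patterns for the two LNetError messages quoted above.
FRAG_RE = re.compile(
    r"RDMA has too many fragments for peer (?P<peer>\S+) \((?P<limit>\d+)\), "
    r"src idx/frags: (?P<src_idx>\d+)/(?P<src_frags>\d+) "
    r"dst idx/frags: (?P<dst_idx>\d+)/(?P<dst_frags>\d+)"
)
RC_RE = re.compile(r"Can't setup rdma for (?P<op>\w+) from (?P<peer>\S+): (?P<rc>-\d+)")

# Linux errno values seen in this log, hardcoded for portability.
LINUX_ERRNO = {5: "EIO", 90: "EMSGSIZE"}


def parse_frag_error(line: str) -> Optional[dict]:
    """Return peer NID, fragment limit and src/dst index/fragment counts."""
    m = FRAG_RE.search(line)
    if m is None:
        return None
    d = m.groupdict()
    return {"peer": d["peer"], **{k: int(v) for k, v in d.items() if k != "peer"}}


def parse_rdma_rc(line: str) -> Optional[tuple]:
    """Return (operation, peer NID, errno name) from a kiblnd_reply() error."""
    m = RC_RE.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    return m.group("op"), m.group("peer"), LINUX_ERRNO.get(-rc, f"errno {-rc}")


if __name__ == "__main__":
    print(parse_frag_error("RDMA has too many fragments for peer 10.21.10.116@o2ib "
                           "(256), src idx/frags: 128/236 dst idx/frags: 128/236"))
    print(parse_rdma_rc("Can't setup rdma for GET from 10.21.10.116@o2ib: -90"))
```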
> 
> The rest of the log is attached.
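To see how often each OST connection flaps in a log like the attached one, counting the "was lost" and "Connection restored" messages per target is usually enough. A minimal sketch; note that Lustre rate-limits console messages ("Skipped N previous similar messages"), so these counts are lower bounds. `count_connection_events` is a hypothetical helper.

```python
import re
from collections import Counter

# Device names in the log look like "astro-OST0002-osc-ffff881070c95c00:".
LOST_RE = re.compile(r"(?P<target>[\w-]+?)-osc-[0-9a-f]+: Connection to .* was lost")
RESTORED_RE = re.compile(r"(?P<target>[\w-]+?)-osc-[0-9a-f]+: Connection restored to")


def count_connection_events(lines) -> Counter:
    """Count (target, 'lost'|'restored') events over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for event, pattern in (("lost", LOST_RE), ("restored", RESTORED_RE)):
            m = pattern.search(line)
            if m:
                counts[(m.group("target"), event)] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage: python flaps.py < messages
    for (target, event), n in sorted(count_connection_events(sys.stdin).items()):
        print(f"{target:16s} {event:9s} {n}")
```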
> 
> After this, Lustre access is very slow; e.g. a 'df' can take minutes.
> Also, 'lctl ping' to the OSS gives I/O errors. Doing 'lnet net del/add'
> makes ping work again until file I/O starts, and then the I/O errors
> return.
> 
> The servers are on both IB and TCP, so no LNet routers are involved.
> 
> In the attached log, astro-OST0001 has been moved to the other server in
> the HA pair. We did this because 'lctl dl -t' showed strange output while
> it was running on its normal server:
> 
> # lctl dl -t
>   0 UP mgc MGC10.21.10.102@o2ib 0b0bbbce-63b6-bf47-403c-28f0c53e8307 5
>   1 UP lov astro-clilov-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
>   2 UP lmv astro-clilmv-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
>   3 UP mdc astro-MDT0000-mdc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.102@o2ib
>   4 UP osc astro-OST0002-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.116@o2ib
>   5 UP osc astro-OST0001-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 172.20.10.115@tcp1
>   6 UP osc astro-OST0003-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.117@o2ib
>   7 UP osc astro-OST0000-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.114@o2ib
> 
> So astro-OST0001 appears to be connected through 172.20.10.115@tcp1,
> even though it actually uses 10.21.10.115@o2ib (verified by a performance
> test and by disabling tcp1 on the IB nodes).
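A mismatch like this can be spotted automatically. The sketch below assumes the `lctl dl -t` column layout seen in the output above (index, state, type, name, UUID, reference count, NID) and flags any OSC whose NID is not on the expected network; `off_network` and the default `o2ib` are illustrative choices, not part of lctl.

```python
SAMPLE = """\
  3 UP mdc astro-MDT0000-mdc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.102@o2ib
  4 UP osc astro-OST0002-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.116@o2ib
  5 UP osc astro-OST0001-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 172.20.10.115@tcp1
  6 UP osc astro-OST0003-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.117@o2ib
  7 UP osc astro-OST0000-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.114@o2ib
"""


def osc_nids(dl_output: str) -> dict:
    """Map each OSC device name to the NID column of `lctl dl -t` output."""
    nids = {}
    for line in dl_output.splitlines():
        fields = line.split()
        # Assumed layout: index, state, type, name, UUID, refcount, NID
        if len(fields) >= 7 and fields[2] == "osc":
            nids[fields[3]] = fields[6]
    return nids


def off_network(nids: dict, expected_net: str = "o2ib") -> dict:
    """Keep only devices whose LNet network (the part after '@') differs."""
    return {name: nid for name, nid in nids.items()
            if nid.rpartition("@")[2] != expected_net}


if __name__ == "__main__":
    for name, nid in off_network(osc_nids(SAMPLE)).items():
        print(f"{name} is connected via {nid}")
```

On a live client the same functions could be fed the output of `lctl dl -t` captured with `subprocess.run(..., capture_output=True, text=True)`.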
> 
> Please ask for more details if needed.
> 
> Cheers,
> Hans Henrik
> 

> May  1 03:35:46 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 128/236 dst idx/frags: 128/236
> May  1 03:35:46 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116 at o2ib: -90
> May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:35:46 node872 kernel: Lustre: 5606:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1493602541/real 1493602541]  req at ffff880e99cea080 x1565604440535580/t0(0) o4->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.116@o2ib:6/4 lens 608/448 e 0 to 1 dl 1493602585 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
> May  1 03:35:46 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at 10.21.10.116 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May  1 03:35:46 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection restored to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib)
> May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:35:52 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1493602546/real 1493602546]  req at ffff88103e0f10c0 x1565604440535684/t0(0) o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.116@o2ib:28/4 lens 520/544 e 0 to 1 dl 1493602552 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> May  1 03:35:52 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
> May  1 03:36:17 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1493602571/real 1493602571]  req at ffff881056dd39c0 x1565604440535728/t0(0) o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.115@o2ib:28/4 lens 520/544 e 0 to 1 dl 1493602577 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> May  1 03:36:18 node872 kernel: Lustre: astro-OST0001-osc-ffff881070c95c00: Connection to astro-OST0001 (at 10.21.10.116 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May  1 03:36:18 node872 kernel: Lustre: Skipped 7 previous similar messages
> May  1 03:36:24 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1493602578/real 1493602578]  req at ffff8808cf3c6380 x1565604440535756/t0(0) o8->astro-OST0001-osc-ffff881070c95c00 at 10.21.10.116@o2ib:28/4 lens 520/544 e 0 to 1 dl 1493602584 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> May  1 03:36:24 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 1 previous similar message
> May  1 03:36:43 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection restored to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib)
> May  1 03:36:43 node872 kernel: Lustre: Skipped 6 previous similar messages
> May  1 03:36:43 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 128/236 dst idx/frags: 128/236
> May  1 03:36:43 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 7 previous similar messages
> May  1 03:36:43 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116 at o2ib: -90
> May  1 03:36:43 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous similar messages
> May  1 03:36:43 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:36:43 node872 kernel: Lustre: 5606:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1493602603/real 1493602603]  req at ffff880e99cea080 x1565604440535580/t0(0) o4->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.116@o2ib:6/4 lens 608/448 e 0 to 1 dl 1493602647 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
> May  1 03:36:43 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at 10.21.10.116 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May  1 03:36:43 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:36:43 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:36:43 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:36:43 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:36:43 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:36:43 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:36:43 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:37:14 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1493602628/real 1493602628]  req at ffff880d375b46c0 x1565604440535888/t0(0) o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.116@o2ib:28/4 lens 520/544 e 0 to 1 dl 1493602634 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> May  1 03:37:14 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
> May  1 03:37:39 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1493602653/real 0]  req at ffff880e99cea380 x1565604440535928/t0(0) o8->astro-OST0001-osc-ffff881070c95c00 at 172.20.10.116@tcp1:28/4 lens 520/544 e 0 to 1 dl 1493602659 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
> May  1 03:37:39 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 1 previous similar message
> May  1 03:38:48 node872 kernel: Lustre: astro-OST0001-osc-ffff881070c95c00: Connection restored to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib)
> May  1 03:38:48 node872 kernel: Lustre: Skipped 7 previous similar messages
> May  1 03:38:54 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1493602728/real 1493602728]  req at ffff880e99ceac80 x1565604440536052/t0(0) o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.115@o2ib:28/4 lens 520/544 e 0 to 1 dl 1493602734 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> May  1 03:38:54 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
> May  1 03:39:13 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection restored to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib)
> May  1 03:39:13 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 128/236 dst idx/frags: 128/236
> May  1 03:39:13 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 7 previous similar messages
> May  1 03:39:13 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116 at o2ib: -90
> May  1 03:39:13 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous similar messages
> May  1 03:39:13 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:39:13 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at 10.21.10.116 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May  1 03:39:13 node872 kernel: Lustre: Skipped 7 previous similar messages
> May  1 03:39:13 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:39:13 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:39:13 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:39:13 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:39:13 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:39:13 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:39:13 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:39:45 node872 kernel: Lustre: astro-OST0001-osc-ffff881070c95c00: Connection to astro-OST0001 (at 10.21.10.116 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May  1 03:39:45 node872 kernel: Lustre: Skipped 7 previous similar messages
> May  1 03:40:16 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1493602810/real 1493602810]  req at ffff881037b230c0 x1565604440536252/t0(0) o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.115@o2ib:28/4 lens 520/544 e 0 to 1 dl 1493602816 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> May  1 03:40:16 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
> May  1 03:41:50 node872 kernel: Lustre: astro-OST0001-osc-ffff881070c95c00: Connection restored to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib)
> May  1 03:41:50 node872 kernel: Lustre: Skipped 7 previous similar messages
> May  1 03:42:15 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection restored to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib)
> May  1 03:42:15 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 128/236 dst idx/frags: 128/236
> May  1 03:42:15 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 7 previous similar messages
> May  1 03:42:15 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116 at o2ib: -90
> May  1 03:42:15 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous similar messages
> May  1 03:42:15 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:42:15 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at 10.21.10.116 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May  1 03:42:15 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:42:15 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:42:15 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:42:15 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:42:15 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:42:15 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:42:15 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:42:46 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1493602960/real 1493602960]  req at ffff881056dd33c0 x1565604440536568/t0(0) o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.116@o2ib:28/4 lens 520/544 e 0 to 1 dl 1493602966 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> May  1 03:42:46 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
> May  1 03:42:47 node872 kernel: Lustre: astro-OST0001-osc-ffff881070c95c00: Connection to astro-OST0001 (at 10.21.10.116 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May  1 03:42:47 node872 kernel: Lustre: Skipped 7 previous similar messages
> May  1 03:44:52 node872 kernel: Lustre: astro-OST0001-osc-ffff881070c95c00: Connection restored to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib)
> May  1 03:44:52 node872 kernel: Lustre: Skipped 7 previous similar messages
> May  1 03:45:17 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 128/236 dst idx/frags: 128/236
> May  1 03:45:17 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 7 previous similar messages
> May  1 03:45:17 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116 at o2ib: -90
> May  1 03:45:17 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous similar messages
> May  1 03:45:17 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:45:17 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at 10.21.10.116 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May  1 03:45:17 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:45:17 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:45:17 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:45:17 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:45:17 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:45:17 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:45:17 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:47:11 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1493603224/real 0]  req at ffff880d375b43c0 x1565604440537072/t0(0) o8->astro-OST0001-osc-ffff881070c95c00 at 172.20.10.116@tcp1:28/4 lens 520/544 e 0 to 1 dl 1493603230 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
> May  1 03:47:11 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 24 previous similar messages
> May  1 03:47:54 node872 kernel: Lustre: astro-OST0001-osc-ffff881070c95c00: Connection restored to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib)
> May  1 03:47:54 node872 kernel: Lustre: Skipped 8 previous similar messages
> May  1 03:48:20 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 249/256 dst idx/frags: 249/256
> May  1 03:48:20 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116 at o2ib: -90
> May  1 03:48:20 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous similar messages
> May  1 03:48:20 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:48:20 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at 10.21.10.116 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May  1 03:48:20 node872 kernel: Lustre: Skipped 8 previous similar messages
> May  1 03:48:20 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:48:20 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:48:20 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:48:20 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:48:20 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:48:20 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:48:20 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 14 previous similar messages
> May  1 03:48:20 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 03:49:17 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection restored to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib)
> May  1 03:49:17 node872 kernel: Lustre: Skipped 7 previous similar messages
> May  1 03:49:17 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 249/256 dst idx/frags: 249/256
> May  1 03:49:17 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116 at o2ib: -90
> May  1 03:49:17 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous similar messages
> May  1 03:49:17 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:49:17 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:49:17 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:49:17 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:49:17 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:49:17 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:49:17 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:49:17 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 7 previous similar messages
> May  1 03:49:17 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 03:51:47 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection restored to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib)
> May  1 03:51:47 node872 kernel: Lustre: Skipped 7 previous similar messages
> May  1 03:51:47 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 249/256 dst idx/frags: 249/256
> May  1 03:51:47 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116 at o2ib: -90
> May  1 03:51:47 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous similar messages
> May  1 03:51:47 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:51:47 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at 10.21.10.116 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May  1 03:51:47 node872 kernel: Lustre: Skipped 14 previous similar messages
> May  1 03:51:47 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:51:47 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:51:47 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:51:47 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:51:47 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:51:47 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:51:47 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 7 previous similar messages
> May  1 03:51:47 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 03:52:50 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:52:50 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 03:52:50 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 03:52:50 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:52:50 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 03:52:50 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:52:50 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 03:52:50 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:55:20 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 249/256 dst idx/frags: 249/256
> May  1 03:55:20 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116 at o2ib: -90
> May  1 03:55:20 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 15 previous similar messages
> May  1 03:55:20 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:55:20 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:55:20 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:55:20 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:55:20 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:55:20 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:55:20 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:55:20 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 14 previous similar messages
> May  1 03:55:20 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 03:55:51 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1493603745/real 1493603745]  req at ffff880d375b49c0 x1565604440538216/t0(0) o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.116@o2ib:28/4 lens 520/544 e 0 to 1 dl 1493603751 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> May  1 03:55:51 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 67 previous similar messages
> May  1 03:57:57 node872 kernel: Lustre: astro-OST0001-osc-ffff881070c95c00: Connection restored to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib)
> May  1 03:57:57 node872 kernel: Lustre: Skipped 18 previous similar messages
> May  1 03:58:22 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 249/256 dst idx/frags: 249/256
> May  1 03:58:22 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116 at o2ib: -90
> May  1 03:58:22 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous similar messages
> May  1 03:58:22 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:58:22 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at 10.21.10.116 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May  1 03:58:22 node872 kernel: Lustre: Skipped 19 previous similar messages
> May  1 03:58:22 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:58:22 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:58:22 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:58:22 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:58:22 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:58:22 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 03:58:22 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 7 previous similar messages
> May  1 03:58:22 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:01:24 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:01:24 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:01:24 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:01:24 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:01:24 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:01:24 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:01:24 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:01:24 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:04:26 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 249/256 dst idx/frags: 249/256
> May  1 04:04:26 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116 at o2ib: -90
> May  1 04:04:26 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 15 previous similar messages
> May  1 04:04:26 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:04:26 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:04:26 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:04:26 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:04:26 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:04:26 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:04:26 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:04:26 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 15 previous similar messages
> May  1 04:04:26 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:05:54 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1493604348/real 0]  req at ffff880d375b49c0 x1565604440539376/t0(0) o8->astro-OST0002-osc-ffff881070c95c00 at 172.20.10.116@tcp1:28/4 lens 520/544 e 0 to 1 dl 1493604354 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
> May  1 04:05:54 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 58 previous similar messages
> May  1 04:07:03 node872 kernel: Lustre: astro-OST0001-osc-ffff881070c95c00: Connection restored to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib)
> May  1 04:07:03 node872 kernel: Lustre: Skipped 20 previous similar messages
> May  1 04:07:28 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:07:28 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:07:28 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at 10.21.10.116 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May  1 04:07:28 node872 kernel: Lustre: Skipped 20 previous similar messages
> May  1 04:07:28 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:07:28 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:07:28 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:07:28 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:07:28 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:07:28 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:10:30 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:10:30 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:10:30 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:10:30 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:10:30 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:10:30 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:10:30 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:10:30 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:13:32 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 249/256 dst idx/frags: 249/256
> May  1 04:13:32 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116 at o2ib: -90
> May  1 04:13:32 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 23 previous similar messages
> May  1 04:13:32 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:13:32 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:13:32 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:13:32 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:13:32 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:13:32 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:13:32 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:13:32 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 23 previous similar messages
> May  1 04:13:32 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:14:30 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:14:30 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:14:30 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:14:30 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:14:30 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:14:30 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:14:30 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:14:30 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:16:41 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1493604995/real 1493604995]  req at ffff881056dd30c0 x1565604440540644/t0(0) o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.115@o2ib:28/4 lens 520/544 e 0 to 1 dl 1493605001 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> May  1 04:16:41 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 66 previous similar messages
> May  1 04:17:00 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:17:00 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:17:00 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:17:00 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:17:00 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:17:00 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:17:00 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:17:00 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:17:32 node872 kernel: Lustre: astro-OST0001-osc-ffff881070c95c00: Connection to astro-OST0001 (at 10.21.10.116 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May  1 04:17:32 node872 kernel: Lustre: Skipped 25 previous similar messages
> May  1 04:19:37 node872 kernel: Lustre: astro-OST0001-osc-ffff881070c95c00: Connection restored to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib)
> May  1 04:19:37 node872 kernel: Lustre: Skipped 26 previous similar messages
> May  1 04:20:02 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:20:02 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:20:02 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:20:02 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:20:02 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:20:02 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:20:02 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:20:02 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:23:04 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:23:04 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:23:04 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:23:04 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:23:04 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:23:04 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:23:04 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:23:04 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:26:06 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 249/256 dst idx/frags: 249/256
> May  1 04:26:06 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116 at o2ib: -90
> May  1 04:26:06 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 39 previous similar messages
> May  1 04:26:06 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:26:06 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:26:06 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:26:06 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:26:06 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:26:06 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:26:06 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:26:06 node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 39 previous similar messages
> May  1 04:26:06 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:26:44 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1493605598/real 1493605598]  req at ffff88081c534080 x1565604440541864/t0(0) o8->astro-OST0001-osc-ffff881070c95c00 at 10.21.10.116@o2ib:28/4 lens 520/544 e 0 to 1 dl 1493605604 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> May  1 04:26:44 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 65 previous similar messages
> May  1 04:27:08 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:27:08 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:27:08 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:27:08 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:27:08 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:27:08 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:27:08 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:27:08 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:29:38 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection restored to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib)
> May  1 04:29:38 node872 kernel: Lustre: Skipped 22 previous similar messages
> May  1 04:29:38 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:29:38 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:29:38 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at 10.21.10.116 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May  1 04:29:38 node872 kernel: Lustre: Skipped 22 previous similar messages
> May  1 04:29:38 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:29:38 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:29:38 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:29:38 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:29:38 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:29:38 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:32:40 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:32:40 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> May  1 04:32:40 node872 kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88081d982000
> May  1 04:32:40 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff88103dd63000
> 
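A log this repetitive is easier to reason about once it is tallied. The sketch below is only an assumption-laden helper, not part of Lustre or LNet. It assumes the syslog line format shown above and accepts NIDs written either as `10.21.10.116@o2ib` or in this archive's `10.21.10.116 at o2ib` form. It counts three things:

- the `kiblnd_init_rdma()` "too many fragments" errors per peer,
- the `client_bulk_callback()` status codes,
- the "Skipped N previous similar messages" totals per function.

On Linux, status -5 is -EIO and the -90 returned from `kiblnd_reply()` is -EMSGSIZE. The function name `summarize` and the input path `/var/log/messages` are hypothetical.

```python
import re
import sys
from collections import Counter

# Assumed message shapes, taken from the quoted log. The NID may appear as
# "addr@net" or, in this archive, as "addr at net".
FRAG_RE = re.compile(
    r"RDMA has too many fragments for peer (\S+?)(?:@| at )(\w+) \((\d+)\), "
    r"src idx/frags: (\d+)/(\d+) dst idx/frags: (\d+)/(\d+)"
)
BULK_RE = re.compile(
    r"client_bulk_callback\(\)\) event type (\d+), status (-?\d+), desc (\w+)"
)
SKIPPED_RE = re.compile(
    r"\((\w+\.c):\d+:(\w+)\(\)\) Skipped (\d+) previous similar messages"
)


def summarize(lines):
    """Hypothetical helper: tally the LNet/Lustre errors seen in syslog lines.

    Returns (frag_peaks, bulk_status, skipped):
      frag_peaks  -- peer NID -> (highest src frag count seen, limit in parens)
      bulk_status -- bulk callback status code -> number of occurrences
      skipped     -- function name -> total suppressed duplicate messages
    """
    frag_peaks = {}
    bulk_status = Counter()
    skipped = Counter()
    for line in lines:
        m = FRAG_RE.search(line)
        if m:
            nid = f"{m.group(1)}@{m.group(2)}"
            limit = int(m.group(3))
            src_frags = int(m.group(5))
            prev_peak = frag_peaks.get(nid, (0, limit))[0]
            frag_peaks[nid] = (max(prev_peak, src_frags), limit)
            continue
        m = BULK_RE.search(line)
        if m:
            bulk_status[int(m.group(2))] += 1
            continue
        m = SKIPPED_RE.search(line)
        if m:
            # Only "Skipped" lines that name a source function are counted.
            skipped[m.group(2)] += int(m.group(3))
    return frag_peaks, bulk_status, skipped


if __name__ == "__main__":
    # Hypothetical default path; pass another log file as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    with open(path, errors="replace") as fh:
        peaks, status, skipped = summarize(fh)
    for nid, (peak, limit) in peaks.items():
        print(f"{nid}: max src frags {peak} (limit reported {limit})")
    print("bulk callback status counts:", dict(status))
    print("suppressed duplicates:", dict(skipped))
```

Counting the "Skipped" totals matters here. The raw log badly undercounts the failures, because the kernel rate-limits identical messages.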




> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


