[lustre-discuss] Clients looses IB connection to OSS.

Harald van Pee pee at hiskp.uni-bonn.de
Wed Jul 19 09:06:43 PDT 2017


Hi, 

I am just wondering whether there is also a problem with Lustre 2.7 and
Mellanox IB even if there is no InfiniBand router.
We are using Lustre with InfiniBand as the only LNet connection.
Lustre 2.5.3 on the server side and 2.6 on the clients ran stably for months,
but we see a memory leak (SUnreclaim grows, and neither freeing the cache nor
unmounting helps),
and I just tried 2.7 on the client side this week.
Now we very often get
Connection to OST000X (at ... at o2ib) was lost
messages,
but no messages about RDMA problems. Only a reboot helps, on both the client
and the server side.

The workaround Thomas mentioned does not help on the client side.
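
For reference, the workaround meant here is the max_pages_per_rpc=64 setting from the quoted message below. A rough sketch of what that value means in bytes per bulk RPC follows. The 4 KiB page size is an assumption (typical for x86_64, not stated in this thread), and 256 is included only for comparison because it matches the fragment limit reported in the quoted LNetError lines. Fewer pages per RPC presumably means fewer RDMA fragments per transfer, but that link is not confirmed here.

```python
# Rough size of one bulk RPC for a given max_pages_per_rpc value.
# PAGE_SIZE is an assumption (4 KiB, typical for x86_64); the thread does not state it.
PAGE_SIZE = 4096


def rpc_bytes(max_pages_per_rpc, page_size=PAGE_SIZE):
    """Return the largest bulk RPC payload in bytes for this setting."""
    return max_pages_per_rpc * page_size


# 64 is the workaround value from the thread; 256 is only a comparison point.
for pages in (64, 256):
    print(f"max_pages_per_rpc={pages}: {rpc_bytes(pages) // 1024} KiB per RPC")
```

On a running client the value is normally changed with 'lctl set_param osc.*.max_pages_per_rpc=64'; the osc.* wildcard (all OSC devices) is an assumption about which devices one wants to target.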

Is there any workaround or solution?
I cannot do an update before the holidays, so I think it would be best
to go back to 2.6?

Any help would be welcome.

Harald


On Monday 01 May 2017 17:59:28 Thomas Stibor wrote:
> Hi,
> 
> see JIRA: https://jira.hpdd.intel.com/browse/LU-5718
> 
> What seems to work as a quick fix (for older versions) is to set the
> parameter max_pages_per_rpc=64.
> 
> According to https://jira.hpdd.intel.com/browse/LU-5718
> the issue is resolved, but only for the upcoming version 2.10.0.
> 
> Cheers
>  Thomas
> 
> On Mon, May 01, 2017 at 04:47:32PM +0200, Hans Henrik Happe wrote:
> > Hi,
> > 
> > We have experienced problems with losing the connection to the OSS. It
> > starts with:
> > 
> > May  1 03:35:46 node872 kernel: LNetError:
> > 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many
> > fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 128/236 dst
> > idx/frags: 128/236
> > May  1 03:35:46 node872 kernel: LNetError:
> > 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from
> > 10.21.10.116 at o2ib: -90
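
A small sketch for reading the two messages above: it pulls the peer, the fragment limit and the fragment counts out of the kiblnd_init_rdma() line and names the status codes. The regular expression is written against the message text as it appears in this excerpt only, and the errno table uses Linux numbering (kernel functions return the negated value). The function names are made up for illustration.

```python
import re

# Linux errno numbers for the two status codes that appear in the log.
LINUX_ERRNO = {5: "EIO", 90: "EMSGSIZE"}

# Pattern for the kiblnd_init_rdma() message as it appears in the excerpt.
RDMA_RE = re.compile(
    r"RDMA has too many fragments for peer (?P<peer>\S+) \((?P<limit>\d+)\), "
    r"src idx/frags: (?P<src_idx>\d+)/(?P<src_frags>\d+) "
    r"dst idx/frags: (?P<dst_idx>\d+)/(?P<dst_frags>\d+)"
)


def parse_rdma_error(text):
    """Return the fields of the first 'too many fragments' message, or None."""
    # Collapse the line wrapping introduced by mail clients before matching.
    flat = " ".join(text.split())
    # The list archive renders '@' in NIDs as ' at '; undo that.
    flat = re.sub(r"(\d) at (o2ib|tcp)", r"\1@\2", flat)
    match = RDMA_RE.search(flat)
    if match is None:
        return None
    fields = match.groupdict()
    return {k: (v if k == "peer" else int(v)) for k, v in fields.items()}


def errno_name(status):
    """Name a (possibly negative) status code using the Linux table above."""
    return LINUX_ERRNO.get(abs(status), "unknown")


sample = """LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many
fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 128/236 dst
idx/frags: 128/236"""

print(parse_rdma_error(sample))
print(errno_name(-90), errno_name(-5))
```

Read this way, the -90 from kiblnd_reply() is EMSGSIZE ("message too long") and the -5 in the later client_bulk_callback() lines is EIO.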
> > 
> > The rest of the log is attached.
> > 
> > After this, Lustre access is very slow; e.g. a 'df' can take minutes.
> > Also, 'lctl ping' to the OSS gives I/O errors. Doing 'lnet net del/add'
> > makes ping work again until file I/O starts; then the I/O errors return.
> > 
> > We use both IB and TCP on servers, so no routers.
> > 
> > In the attached log astro-OST0001 has been moved to the other server in
> > the HA pair. This is because 'lctl dl -t' showed strange output when on
> > the right server:
> > 
> > # lctl dl -t
> >   0 UP mgc MGC10.21.10.102 at o2ib 0b0bbbce-63b6-bf47-403c-28f0c53e8307 5
> >   1 UP lov astro-clilov-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
> >   2 UP lmv astro-clilmv-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
> >   3 UP mdc astro-MDT0000-mdc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.102 at o2ib
> >   4 UP osc astro-OST0002-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.116 at o2ib
> >   5 UP osc astro-OST0001-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 172.20.10.115 at tcp1
> >   6 UP osc astro-OST0003-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.117 at o2ib
> >   7 UP osc astro-OST0000-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.114 at o2ib
> > 
> > So astro-OST0001 seems to be connected through 172.20.10.115 at tcp1, even
> > though it uses 10.21.10.115 at o2ib (verified by performance test and
> > disabling tcp1 on IB nodes).
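
The check described above (which OSC device shows a non-IB address) can be sketched as a small parser over 'lctl dl -t' output. The column layout (index, status, type, name, UUID, refcount, NID) is inferred from the listing in this message and is not taken from any documentation, and both function names are hypothetical. The sample uses the real '@' form of the NIDs; the list archive shows it as ' at '.

```python
def osc_connections(lctl_dl_output):
    """Map each osc device name to the NID shown in the last column."""
    result = {}
    for line in lctl_dl_output.splitlines():
        # The list archive renders '@' as ' at '; undo that before splitting.
        fields = line.replace(" at ", "@").split()
        # Assumed columns: index, status, type, name, uuid, refcount, nid
        if len(fields) == 7 and fields[2] == "osc":
            result[fields[3]] = fields[6]
    return result


def not_on_network(connections, network="o2ib"):
    """Return the device names whose NID is not on the given LNet network."""
    return sorted(
        name for name, nid in connections.items()
        if not nid.endswith("@" + network)
    )


sample = """\
  4 UP osc astro-OST0002-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.116@o2ib
  5 UP osc astro-OST0001-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 172.20.10.115@tcp1
"""

print(not_on_network(osc_connections(sample)))
```

On this sample only the astro-OST0001 device is reported, which matches the observation above. Note that this only reflects what 'lctl dl -t' prints, which per the message may differ from the network actually carrying the traffic.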
> > 
> > Please ask for more details if needed.
> > 
> > Cheers,
> > Hans Henrik
> > 
> > 
> > May  1 03:35:46 node872 kernel: LNetError:
> > 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many
> > fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 128/236 dst
> > idx/frags: 128/236 May  1 03:35:46 node872 kernel: LNetError:
> > 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from
> > 10.21.10.116 at o2ib: -90 May  1 03:35:46 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:35:46 node872 kernel: Lustre:
> > 5606:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has
> > failed due to network error: [sent 1493602541/real 1493602541] 
> > req at ffff880e99cea080 x1565604440535580/t0(0)
> > o4->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.116@o2ib:6/4 lens
> > 608/448 e 0 to 1 dl 1493602585 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 May  1
> > 03:35:46 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00:
> > Connection to astro-OST0002 (at 10.21.10.116 at o2ib) was lost; in progress
> > operations using this service will wait for recovery to complete May  1
> > 03:35:46 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00:
> > Connection restored to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib) May  1
> > 03:35:46 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:35:46 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:35:46 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:35:46 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:35:46 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:35:46 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:35:46 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:35:52 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has
> > timed out for slow reply: [sent 1493602546/real 1493602546] 
> > req at ffff88103e0f10c0 x1565604440535684/t0(0)
> > o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.116@o2ib:28/4 lens
> > 520/544 e 0 to 1 dl 1493602552 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May  1
> > 03:35:52 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 7 previous
> > similar messages May  1 03:36:17 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has
> > timed out for slow reply: [sent 1493602571/real 1493602571] 
> > req at ffff881056dd39c0 x1565604440535728/t0(0)
> > o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.115@o2ib:28/4 lens
> > 520/544 e 0 to 1 dl 1493602577 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May  1
> > 03:36:18 node872 kernel: Lustre: astro-OST0001-osc-ffff881070c95c00:
> > Connection to astro-OST0001 (at 10.21.10.116 at o2ib) was lost; in progress
> > operations using this service will wait for recovery to complete May  1
> > 03:36:18 node872 kernel: Lustre: Skipped 7 previous similar messages May
> >  1 03:36:24 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has
> > timed out for slow reply: [sent 1493602578/real 1493602578] 
> > req at ffff8808cf3c6380 x1565604440535756/t0(0)
> > o8->astro-OST0001-osc-ffff881070c95c00 at 10.21.10.116@o2ib:28/4 lens
> > 520/544 e 0 to 1 dl 1493602584 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May  1
> > 03:36:24 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 1 previous
> > similar message May  1 03:36:43 node872 kernel: Lustre:
> > astro-OST0002-osc-ffff881070c95c00: Connection restored to
> > 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib) May  1 03:36:43 node872 kernel:
> > Lustre: Skipped 6 previous similar messages May  1 03:36:43 node872
> > kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA
> > has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags:
> > 128/236 dst idx/frags: 128/236 May  1 03:36:43 node872 kernel:
> > LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 7
> > previous similar messages May  1 03:36:43 node872 kernel: LNetError:
> > 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from
> > 10.21.10.116 at o2ib: -90 May  1 03:36:43 node872 kernel: LNetError:
> > 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous similar
> > messages May  1 03:36:43 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:36:43 node872 kernel: Lustre:
> > 5606:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has
> > failed due to network error: [sent 1493602603/real 1493602603] 
> > req at ffff880e99cea080 x1565604440535580/t0(0)
> > o4->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.116@o2ib:6/4 lens
> > 608/448 e 0 to 1 dl 1493602647 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 May  1
> > 03:36:43 node872 kernel: Lustre: astro-OST0002-osc-ffff881070c95c00:
> > Connection to astro-OST0002 (at 10.21.10.116 at o2ib) was lost; in progress
> > operations using this service will wait for recovery to complete May  1
> > 03:36:43 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:36:43 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:36:43 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:36:43 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:36:43 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:36:43 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:36:43 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:37:14 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has
> > timed out for slow reply: [sent 1493602628/real 1493602628] 
> > req at ffff880d375b46c0 x1565604440535888/t0(0)
> > o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.116@o2ib:28/4 lens
> > 520/544 e 0 to 1 dl 1493602634 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May  1
> > 03:37:14 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 9 previous
> > similar messages May  1 03:37:39 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has
> > timed out for sent delay: [sent 1493602653/real 0]  req at ffff880e99cea380
> > x1565604440535928/t0(0)
> > o8->astro-OST0001-osc-ffff881070c95c00 at 172.20.10.116@tcp1:28/4 lens
> > 520/544 e 0 to 1 dl 1493602659 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 May  1
> > 03:37:39 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 1 previous
> > similar message May  1 03:38:48 node872 kernel: Lustre:
> > astro-OST0001-osc-ffff881070c95c00: Connection restored to
> > 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib) May  1 03:38:48 node872 kernel:
> > Lustre: Skipped 7 previous similar messages May  1 03:38:54 node872
> > kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@
> > Request sent has timed out for slow reply: [sent 1493602728/real
> > 1493602728]  req at ffff880e99ceac80 x1565604440536052/t0(0)
> > o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.115@o2ib:28/4 lens
> > 520/544 e 0 to 1 dl 1493602734 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May  1
> > 03:38:54 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 2 previous
> > similar messages May  1 03:39:13 node872 kernel: Lustre:
> > astro-OST0002-osc-ffff881070c95c00: Connection restored to
> > 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib) May  1 03:39:13 node872 kernel:
> > LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too
> > many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 128/236
> > dst idx/frags: 128/236 May  1 03:39:13 node872 kernel: LNetError:
> > 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 7 previous similar
> > messages May  1 03:39:13 node872 kernel: LNetError:
> > 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from
> > 10.21.10.116 at o2ib: -90 May  1 03:39:13 node872 kernel: LNetError:
> > 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous similar
> > messages May  1 03:39:13 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:39:13 node872 kernel: Lustre:
> > astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at
> > 10.21.10.116 at o2ib) was lost; in progress operations using this service
> > will wait for recovery to complete May  1 03:39:13 node872 kernel:
> > Lustre: Skipped 7 previous similar messages May  1 03:39:13 node872
> > kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event
> > type 1, status -5, desc ffff88103dd63000 May  1 03:39:13 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:39:13 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:39:13 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:39:13 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:39:13 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:39:13 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:39:45 node872 kernel: Lustre:
> > astro-OST0001-osc-ffff881070c95c00: Connection to astro-OST0001 (at
> > 10.21.10.116 at o2ib) was lost; in progress operations using this service
> > will wait for recovery to complete May  1 03:39:45 node872 kernel:
> > Lustre: Skipped 7 previous similar messages May  1 03:40:16 node872
> > kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@
> > Request sent has timed out for slow reply: [sent 1493602810/real
> > 1493602810]  req at ffff881037b230c0 x1565604440536252/t0(0)
> > o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.115@o2ib:28/4 lens
> > 520/544 e 0 to 1 dl 1493602816 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May  1
> > 03:40:16 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 12 previous
> > similar messages May  1 03:41:50 node872 kernel: Lustre:
> > astro-OST0001-osc-ffff881070c95c00: Connection restored to
> > 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib) May  1 03:41:50 node872 kernel:
> > Lustre: Skipped 7 previous similar messages May  1 03:42:15 node872
> > kernel: Lustre: astro-OST0002-osc-ffff881070c95c00: Connection restored
> > to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib) May  1 03:42:15 node872
> > kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA
> > has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags:
> > 128/236 dst idx/frags: 128/236 May  1 03:42:15 node872 kernel:
> > LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 7
> > previous similar messages May  1 03:42:15 node872 kernel: LNetError:
> > 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from
> > 10.21.10.116 at o2ib: -90 May  1 03:42:15 node872 kernel: LNetError:
> > 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous similar
> > messages May  1 03:42:15 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:42:15 node872 kernel: Lustre:
> > astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at
> > 10.21.10.116 at o2ib) was lost; in progress operations using this service
> > will wait for recovery to complete May  1 03:42:15 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:42:15 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:42:15 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:42:15 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:42:15 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:42:15 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:42:15 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:42:46 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has
> > timed out for slow reply: [sent 1493602960/real 1493602960] 
> > req at ffff881056dd33c0 x1565604440536568/t0(0)
> > o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.116@o2ib:28/4 lens
> > 520/544 e 0 to 1 dl 1493602966 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May  1
> > 03:42:46 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 14 previous
> > similar messages May  1 03:42:47 node872 kernel: Lustre:
> > astro-OST0001-osc-ffff881070c95c00: Connection to astro-OST0001 (at
> > 10.21.10.116 at o2ib) was lost; in progress operations using this service
> > will wait for recovery to complete May  1 03:42:47 node872 kernel:
> > Lustre: Skipped 7 previous similar messages May  1 03:44:52 node872
> > kernel: Lustre: astro-OST0001-osc-ffff881070c95c00: Connection restored
> > to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib) May  1 03:44:52 node872
> > kernel: Lustre: Skipped 7 previous similar messages May  1 03:45:17
> > node872 kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma())
> > RDMA has too many fragments for peer 10.21.10.116 at o2ib (256), src
> > idx/frags: 128/236 dst idx/frags: 128/236 May  1 03:45:17 node872
> > kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped
> > 7 previous similar messages May  1 03:45:17 node872 kernel: LNetError:
> > 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from
> > 10.21.10.116 at o2ib: -90 May  1 03:45:17 node872 kernel: LNetError:
> > 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous similar
> > messages May  1 03:45:17 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:45:17 node872 kernel: Lustre:
> > astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at
> > 10.21.10.116 at o2ib) was lost; in progress operations using this service
> > will wait for recovery to complete May  1 03:45:17 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:45:17 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:45:17 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:45:17 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:45:17 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:45:17 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:45:17 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:47:11 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has
> > timed out for sent delay: [sent 1493603224/real 0]  req at ffff880d375b43c0
> > x1565604440537072/t0(0)
> > o8->astro-OST0001-osc-ffff881070c95c00 at 172.20.10.116@tcp1:28/4 lens
> > 520/544 e 0 to 1 dl 1493603230 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 May  1
> > 03:47:11 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 24 previous
> > similar messages May  1 03:47:54 node872 kernel: Lustre:
> > astro-OST0001-osc-ffff881070c95c00: Connection restored to
> > 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib) May  1 03:47:54 node872 kernel:
> > Lustre: Skipped 8 previous similar messages May  1 03:48:20 node872
> > kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA
> > has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags:
> > 249/256 dst idx/frags: 249/256 May  1 03:48:20 node872 kernel:
> > LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma
> > for GET from 10.21.10.116 at o2ib: -90 May  1 03:48:20 node872 kernel:
> > LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous
> > similar messages May  1 03:48:20 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:48:20 node872 kernel: Lustre:
> > astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at
> > 10.21.10.116 at o2ib) was lost; in progress operations using this service
> > will wait for recovery to complete May  1 03:48:20 node872 kernel:
> > Lustre: Skipped 8 previous similar messages May  1 03:48:20 node872
> > kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event
> > type 1, status -5, desc ffff88103dd63000 May  1 03:48:20 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:48:20 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:48:20 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:48:20 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:48:20 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:48:20 node872 kernel:
> > LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 14
> > previous similar messages May  1 03:48:20 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 03:49:17 node872 kernel: Lustre:
> > astro-OST0002-osc-ffff881070c95c00: Connection restored to
> > 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib) May  1 03:49:17 node872 kernel:
> > Lustre: Skipped 7 previous similar messages May  1 03:49:17 node872
> > kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA
> > has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags:
> > 249/256 dst idx/frags: 249/256 May  1 03:49:17 node872 kernel:
> > LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma
> > for GET from 10.21.10.116 at o2ib: -90 May  1 03:49:17 node872 kernel:
> > LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous
> > similar messages May  1 03:49:17 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:49:17 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:49:17 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:49:17 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:49:17 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:49:17 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:49:17 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:49:17 node872 kernel: LNetError:
> > 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 7 previous similar
> > messages May  1 03:49:17 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 03:51:47 node872 kernel: Lustre:
> > astro-OST0002-osc-ffff881070c95c00: Connection restored to
> > 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib) May  1 03:51:47 node872 kernel:
> > Lustre: Skipped 7 previous similar messages May  1 03:51:47 node872
> > kernel: LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA
> > has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags:
> > 249/256 dst idx/frags: 249/256 May  1 03:51:47 node872 kernel:
> > LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma
> > for GET from 10.21.10.116 at o2ib: -90 May  1 03:51:47 node872 kernel:
> > LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous
> > similar messages May  1 03:51:47 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:51:47 node872 kernel: Lustre:
> > astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at
> > 10.21.10.116 at o2ib) was lost; in progress operations using this service
> > will wait for recovery to complete May  1 03:51:47 node872 kernel:
> > Lustre: Skipped 14 previous similar messages May  1 03:51:47 node872
> > kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event
> > type 1, status -5, desc ffff88103dd63000 May  1 03:51:47 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:51:47 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:51:47 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:51:47 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:51:47 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:51:47 node872 kernel:
> > LNetError: 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 7
> > previous similar messages May  1 03:51:47 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 03:52:50 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:52:50 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 03:52:50 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 03:52:50 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:52:50 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 03:52:50 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:52:50 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 03:52:50 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:55:20 node872 kernel: LNetError:
> > 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many
> > fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 249/256 dst
> > idx/frags: 249/256 May  1 03:55:20 node872 kernel: LNetError:
> > 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from
> > 10.21.10.116 at o2ib: -90 May  1 03:55:20 node872 kernel: LNetError:
> > 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 15 previous similar
> > messages May  1 03:55:20 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:55:20 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:55:20 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:55:20 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:55:20 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:55:20 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:55:20 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:55:20 node872 kernel: LNetError:
> > 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 14 previous
> > similar messages May  1 03:55:20 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 03:55:51 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has
> > timed out for slow reply: [sent 1493603745/real 1493603745] 
> > req at ffff880d375b49c0 x1565604440538216/t0(0)
> > o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.116@o2ib:28/4 lens
> > 520/544 e 0 to 1 dl 1493603751 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May  1
> > 03:55:51 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 67 previous
> > similar messages May  1 03:57:57 node872 kernel: Lustre:
> > astro-OST0001-osc-ffff881070c95c00: Connection restored to
> > 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib) May  1 03:57:57 node872 kernel:
> > Lustre: Skipped 18 previous similar messages May  1 03:58:22 node872
> > kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA
> > has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags:
> > 249/256 dst idx/frags: 249/256 May  1 03:58:22 node872 kernel:
> > LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma
> > for GET from 10.21.10.116 at o2ib: -90 May  1 03:58:22 node872 kernel:
> > LNetError: 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 7 previous
> > similar messages May  1 03:58:22 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 03:58:22 node872 kernel: Lustre:
> > astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at
> > 10.21.10.116 at o2ib) was lost; in progress operations using this service
> > will wait for recovery to complete May  1 03:58:22 node872 kernel:
> > Lustre: Skipped 19 previous similar messages May  1 03:58:22 node872
> > kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event
> > type 1, status -5, desc ffff88103dd63000 May  1 03:58:22 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:58:22 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:58:22 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:58:22 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:58:22 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 03:58:22 node872 kernel:
> > LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 7
> > previous similar messages May  1 03:58:22 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:01:24 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:01:24 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:01:24 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:01:24 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:01:24 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:01:24 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:01:24 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:01:24 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:04:26 node872 kernel: LNetError:
> > 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many
> > fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 249/256 dst
> > idx/frags: 249/256 May  1 04:04:26 node872 kernel: LNetError:
> > 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from
> > 10.21.10.116 at o2ib: -90 May  1 04:04:26 node872 kernel: LNetError:
> > 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 15 previous similar
> > messages May  1 04:04:26 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:04:26 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:04:26 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:04:26 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:04:26 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:04:26 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:04:26 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:04:26 node872 kernel: LNetError:
> > 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 15 previous
> > similar messages May  1 04:04:26 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:05:54 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has
> > timed out for sent delay: [sent 1493604348/real 0]  req at ffff880d375b49c0
> > x1565604440539376/t0(0)
> > o8->astro-OST0002-osc-ffff881070c95c00 at 172.20.10.116@tcp1:28/4 lens
> > 520/544 e 0 to 1 dl 1493604354 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 May  1
> > 04:05:54 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 58 previous
> > similar messages May  1 04:07:03 node872 kernel: Lustre:
> > astro-OST0001-osc-ffff881070c95c00: Connection restored to
> > 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib) May  1 04:07:03 node872 kernel:
> > Lustre: Skipped 20 previous similar messages May  1 04:07:28 node872
> > kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event
> > type 1, status -5, desc ffff88081d982000 May  1 04:07:28 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 04:07:28 node872 kernel: Lustre:
> > astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at
> > 10.21.10.116 at o2ib) was lost; in progress operations using this service
> > will wait for recovery to complete May  1 04:07:28 node872 kernel:
> > Lustre: Skipped 20 previous similar messages May  1 04:07:28 node872
> > kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event
> > type 1, status -5, desc ffff88081d982000 May  1 04:07:28 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 04:07:28 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88081d982000 May  1 04:07:28 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 04:07:28 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88081d982000 May  1 04:07:28 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 04:10:30 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88081d982000 May  1 04:10:30 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 04:10:30 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88081d982000 May  1 04:10:30 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 04:10:30 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88081d982000 May  1 04:10:30 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 04:10:30 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88081d982000 May  1 04:10:30 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 04:13:32 node872 kernel:
> > LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too
> > many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 249/256
> > dst idx/frags: 249/256 May  1 04:13:32 node872 kernel: LNetError:
> > 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from
> > 10.21.10.116 at o2ib: -90 May  1 04:13:32 node872 kernel: LNetError:
> > 5544:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 23 previous similar
> > messages May  1 04:13:32 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:13:32 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:13:32 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:13:32 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:13:32 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:13:32 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:13:32 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:13:32 node872 kernel: LNetError:
> > 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 23 previous
> > similar messages May  1 04:13:32 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:14:30 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:14:30 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:14:30 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:14:30 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:14:30 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:14:30 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:14:30 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:14:30 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:16:41 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has
> > timed out for slow reply: [sent 1493604995/real 1493604995] 
> > req at ffff881056dd30c0 x1565604440540644/t0(0)
> > o8->astro-OST0002-osc-ffff881070c95c00 at 10.21.10.115@o2ib:28/4 lens
> > 520/544 e 0 to 1 dl 1493605001 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May  1
> > 04:16:41 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 66 previous
> > similar messages May  1 04:17:00 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:17:00 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:17:00 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:17:00 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:17:00 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:17:00 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:17:00 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:17:00 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:17:32 node872 kernel: Lustre:
> > astro-OST0001-osc-ffff881070c95c00: Connection to astro-OST0001 (at
> > 10.21.10.116 at o2ib) was lost; in progress operations using this service
> > will wait for recovery to complete May  1 04:17:32 node872 kernel:
> > Lustre: Skipped 25 previous similar messages May  1 04:19:37 node872
> > kernel: Lustre: astro-OST0001-osc-ffff881070c95c00: Connection restored
> > to 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib) May  1 04:19:37 node872
> > kernel: Lustre: Skipped 26 previous similar messages May  1 04:20:02
> > node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:20:02 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:20:02 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:20:02 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:20:02 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:20:02 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:20:02 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:20:02 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:23:04 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:23:04 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:23:04 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:23:04 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:23:04 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:23:04 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:23:04 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:23:04 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:26:06 node872 kernel: LNetError:
> > 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many
> > fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 249/256 dst
> > idx/frags: 249/256 May  1 04:26:06 node872 kernel: LNetError:
> > 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from
> > 10.21.10.116 at o2ib: -90 May  1 04:26:06 node872 kernel: LNetError:
> > 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Skipped 39 previous similar
> > messages May  1 04:26:06 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:26:06 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:26:06 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:26:06 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:26:06 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:26:06 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:26:06 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:26:06 node872 kernel: LNetError:
> > 5544:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) Skipped 39 previous
> > similar messages May  1 04:26:06 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:26:44 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has
> > timed out for slow reply: [sent 1493605598/real 1493605598] 
> > req at ffff88081c534080 x1565604440541864/t0(0)
> > o8->astro-OST0001-osc-ffff881070c95c00 at 10.21.10.116@o2ib:28/4 lens
> > 520/544 e 0 to 1 dl 1493605604 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May  1
> > 04:26:44 node872 kernel: Lustre:
> > 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 65 previous
> > similar messages May  1 04:27:08 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:27:08 node872 kernel: LustreError:
> > 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:27:08 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:27:08 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:27:08 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:27:08 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:27:08 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88081d982000 May  1 04:27:08 node872 kernel: LustreError:
> > 5544:0:(events.c:201:client_bulk_callback()) event type 1, status -5,
> > desc ffff88103dd63000 May  1 04:29:38 node872 kernel: Lustre:
> > astro-OST0002-osc-ffff881070c95c00: Connection restored to
> > 10.21.10.116 at o2ib (at 10.21.10.116 at o2ib) May  1 04:29:38 node872 kernel:
> > Lustre: Skipped 22 previous similar messages May  1 04:29:38 node872
> > kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event
> > type 1, status -5, desc ffff88081d982000 May  1 04:29:38 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 04:29:38 node872 kernel: Lustre:
> > astro-OST0002-osc-ffff881070c95c00: Connection to astro-OST0002 (at
> > 10.21.10.116 at o2ib) was lost; in progress operations using this service
> > will wait for recovery to complete May  1 04:29:38 node872 kernel:
> > Lustre: Skipped 22 previous similar messages May  1 04:29:38 node872
> > kernel: LustreError: 5544:0:(events.c:201:client_bulk_callback()) event
> > type 1, status -5, desc ffff88081d982000 May  1 04:29:38 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 04:29:38 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88081d982000 May  1 04:29:38 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 04:29:38 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88081d982000 May  1 04:29:38 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 04:32:40 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88081d982000 May  1 04:32:40 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000 May  1 04:32:40 node872 kernel:
> > LustreError: 5544:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88081d982000 May  1 04:32:40 node872 kernel:
> > LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1,
> > status -5, desc ffff88103dd63000
> > 
> > _______________________________________________
> > lustre-discuss mailing list
> > lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


