[lustre-discuss] Added OSTs, now lnet errors

Steve Barnet barnet at icecube.wisc.edu
Sun Dec 11 17:05:13 PST 2016


Hi Brett,


On 12/11/16 4:46 PM, Brett Lee wrote:
> Steve, It might be the network that LNet is running on.  Have you run
> some bandwidth tests without LNet to check for network problems?


It's running over a 10 Gb/s Ethernet network that is carrying
other OSS traffic successfully. No routers or other fancy LNet
features in play. However, it is quite possible that there are
issues with the networking on the host side. Definitely on my
list of things to test out.
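
For a quick check of the raw TCP path with LNet out of the picture, a
minimal sketch along these lines may be enough as a first pass. The
function names, the 1 MiB chunk size, and the 5 second duration are all
illustrative assumptions, and a single Python stream will probably not
saturate a 10 Gb/s link, so treat a low number as a reason to follow up
with a dedicated tool rather than as proof of a fault.

```python
import socket
import threading
import time


def start_sink(host="127.0.0.1", port=0):
    """Start a TCP sink that reads and discards data.

    Returns (listening_socket, port). Port 0 lets the OS pick a free port.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)

    def serve():
        try:
            conn, _ = srv.accept()
        except OSError:
            return  # listening socket was closed before a client connected
        with conn:
            # Drain until the sender closes the connection.
            while conn.recv(1 << 20):
                pass

    threading.Thread(target=serve, daemon=True).start()
    return srv, srv.getsockname()[1]


def measure_send_rate(host, port, seconds=5.0, chunk_size=1 << 20):
    """Send zero-filled chunks for `seconds`; return the average Gbit/s.

    This counts bytes accepted by the local kernel, so it is only an
    approximation of what actually crossed the wire.
    """
    payload = bytes(chunk_size)
    sent = 0
    with socket.create_connection((host, port)) as sock:
        start = time.monotonic()
        while time.monotonic() - start < seconds:
            sock.sendall(payload)
            sent += chunk_size
        elapsed = time.monotonic() - start
    return sent * 8 / elapsed / 1e9
```

The idea would be to run start_sink() bound to the OSS's address on one
of the new servers and measure_send_rate() from an affected client, then
compare against the same test between hosts that are known to be healthy.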

   At this point, I'm just trying to narrow the search space.
I didn't find anything particularly revealing when I searched
around, so I'm hoping some expert eyes can shine a bit of
light on the situation.

Thanks for the tip!

Best,

---Steve

>
> On Dec 11, 2016 3:37 PM, "Steve Barnet" <barnet at icecube.wisc.edu>
> wrote:
>
>     Hi all,
>
>       Seeing something very strange. I recently added two OSSes
>     and 10 OSTs to one of our filesystems. Things look OK under
>     light loads, but when we load them up, we start seeing lots
>     of LNet errors.
>
>     OS: Scientific Linux 6.7
>     Lustre - Server: 2.8.0 Community version
>     Lustre - Client: 2.5.3
>
>     The errors are below. Do these narrow the range of possible
>     problems?
>
>
>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: LNetError:
>     7732:0:(socklnd_cb.c:2509:ksocknal_check_peer_timeouts()) Total 4
>     stale ZC_REQs for peer 10.128.10.29@tcp1 detected; the
>     oldest(ffff880f6a90e000) timed out 7 secs ago, resid: 0, wmem: 0
>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>     7732:0:(events.c:447:server_bulk_callback()) event type 5, status
>     -5, desc ffff8805379f8000
>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>     7732:0:(events.c:447:server_bulk_callback()) event type 5, status
>     -5, desc ffff880f375dc000
>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>     8234:0:(ldlm_lib.c:3175:target_bulk_io()) @@@ network error on bulk
>     READ  req@ffff880e506263c0 x1551187318090340/t0(0)
>     o3->092e941d-272a-09e3-502b-9338dbf387d3@10.128.10.29@tcp1:587/0
>     lens 488/432 e 3 to 0 dl 1481476687 ref 1 fl Interpret:/0/0 rc 0/0
>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>     8234:0:(ldlm_lib.c:3175:target_bulk_io()) Skipped 1 previous similar
>     message
>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: Lustre: lfs2-OST0024: Bulk IO
>     read error with 092e941d-272a-09e3-502b-9338dbf387d3 (at
>     10.128.10.29@tcp1), client will retry: rc -110
>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>     7732:0:(events.c:447:server_bulk_callback()) event type 5, status
>     -5, desc ffff8804db0ce000
>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>     7732:0:(events.c:447:server_bulk_callback()) event type 5, status
>     -5, desc ffff880aa4374000
>
>
>     Thanks much!
>
>     Best,
>
>     ---Steve
>
>     _______________________________________________
>     lustre-discuss mailing list
>     lustre-discuss at lists.lustre.org
>     http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
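
One way to narrow the search space from the kernel log quoted above is to
tally which functions report errors and which peers they name: if a single
client NID dominates, that points toward that client or its path rather
than the new OSSes. The sketch below is only an illustration; the regular
expressions are assumptions based on the excerpt in this thread, not a
complete parser for Lustre log output. For reference, status -5 and
rc -110 are the Linux errno values EIO and ETIMEDOUT.

```python
import re
from collections import Counter

# Source location as printed in the excerpt,
# e.g. "(events.c:447:server_bulk_callback())".
SITE_RE = re.compile(r"\((\w+\.c):\d+:(\w+)\(\)\)")
# Peer NID such as "10.128.10.29@tcp1". List archives often render "@"
# as " at ", so both spellings are accepted.
NID_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})(?:@| at )(tcp\d*)")


def tally(log_text):
    """Count reporting functions and peer NIDs found in kernel log text."""
    # Collapse whitespace so records wrapped across lines still match.
    flat = " ".join(log_text.split())
    sites = Counter(func for _file, func in SITE_RE.findall(flat))
    peers = Counter(f"{ip}@{net}" for ip, net in NID_RE.findall(flat))
    return sites, peers


# Three records abridged from the excerpt in this thread.
sample = """
LNetError: 7732:0:(socklnd_cb.c:2509:ksocknal_check_peer_timeouts()) Total 4
stale ZC_REQs for peer 10.128.10.29@tcp1 detected
LustreError: 7732:0:(events.c:447:server_bulk_callback()) event type 5, status
-5, desc ffff8805379f8000
LustreError: 7732:0:(events.c:447:server_bulk_callback()) event type 5, status
-5, desc ffff880f375dc000
"""

sites, peers = tally(sample)
print(sites.most_common())
print(peers.most_common())
```

Feeding it the full /var/log/messages from each OSS and comparing the peer
counts across servers should show whether the errors follow particular
clients or particular servers.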


