[lustre-discuss] Clients lose IB connection to OSS.

Oucharek, Doug S doug.s.oucharek at intel.com
Mon May 1 08:52:48 PDT 2017

For the “RDMA has too many fragments” issue, you need the newly landed patch: http://review.whamcloud.com/12451.  As for the slow access, I am not sure whether it is related to the too-many-fragments error.  Once a node hits that error, it usually needs to unload and reload the LNet module to recover.


On May 1, 2017, at 7:47 AM, Hans Henrik Happe <happe at nbi.ku.dk> wrote:


We have experienced problems with losing the connection to the OSS. It starts with:

May  1 03:35:46 node872 kernel: LNetError:
5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many
fragments for peer at o2ib (256), src idx/frags: 128/236 dst
idx/frags: 128/236
May  1 03:35:46 node872 kernel: LNetError:
5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from at o2ib: -90

The rest of the log is attached.
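
When scanning a longer log for this condition, a small parser can help. The following is a minimal sketch in Python, not part of any Lustre tooling: the function name and field names are hypothetical, and reading the number in parentheses as a fragment limit is an assumption that the log line itself does not confirm. The pattern tolerates the line wrapping and the archive's "at" address scrubbing seen above.

```python
import re

# Message text as quoted in this mail; \s+ tolerates wrapped lines.
PATTERN = re.compile(
    r"RDMA has too many\s+fragments for peer\s+(?P<peer>.*?)\s*\((?P<limit>\d+)\),"
    r"\s+src idx/frags:\s+(?P<src_idx>\d+)/(?P<src_frags>\d+)"
    r"\s+dst\s+idx/frags:\s+(?P<dst_idx>\d+)/(?P<dst_frags>\d+)"
)

NUMERIC_FIELDS = ("limit", "src_idx", "src_frags", "dst_idx", "dst_frags")


def find_fragment_errors(log_text):
    """Return one dict per 'too many fragments' message found in log_text."""
    records = []
    for match in PATTERN.finditer(log_text):
        record = {"peer": match.group("peer")}
        for name in NUMERIC_FIELDS:
            # "limit" is ASSUMED to be a fragment limit; the log does not say so.
            record[name] = int(match.group(name))
        records.append(record)
    return records


# The excerpt from this mail, with its wrapping and scrubbed peer address.
SAMPLE = """May  1 03:35:46 node872 kernel: LNetError:
5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many
fragments for peer at o2ib (256), src idx/frags: 128/236 dst
idx/frags: 128/236
"""

errors = find_fragment_errors(SAMPLE)
for err in errors:
    print(err)
```

Feeding it the full attached log instead of SAMPLE would list every occurrence with its index and fragment counts.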

After this, Lustre access is very slow; e.g. a 'df' can take minutes.
Also, 'lctl ping' to the OSS gives I/O errors. Doing 'lnet net del/add'
makes ping work again until file I/O starts. Then the I/O errors return.

We use both IB and TCP on servers, so no routers.

In the attached log astro-OST0001 has been moved to the other server in
the HA pair. This is because 'lctl dl -t' showed strange output when on
the right server:

# lctl dl -t
 0 UP mgc MGC10.21.10.102 at o2ib 0b0bbbce-63b6-bf47-403c-28f0c53e8307 5
 1 UP lov astro-clilov-ffff88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
 2 UP lmv astro-clilmv-ffff88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
 3 UP mdc astro-MDT0000-mdc-ffff88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 at o2ib
 4 UP osc astro-OST0002-osc-ffff88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 at o2ib
 5 UP osc astro-OST0001-osc-ffff88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 at tcp1
 6 UP osc astro-OST0003-osc-ffff88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 at o2ib
 7 UP osc astro-OST0000-osc-ffff88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 at o2ib

So astro-OST0001 seems to be connected through at tcp1, even
though it uses at o2ib (verified by performance test and
disabling tcp1 on IB nodes).
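
A check like the one done by eye above can be sketched in Python. This is an illustration only: the function names are hypothetical, and it assumes, as this mail does, that the last column of 'lctl dl -t' names the network of the current connection. The pattern accepts both a real NID of the form address@network and the "at network" form left by the archive's address scrubbing.

```python
import re

# One osc entry: "osc <target>-osc-<instance> <uuid> <refcount> <nid>".
# \s+ tolerates the wrapping seen in the mail.
OSC_LINE = re.compile(
    r"\bosc\s+(?P<target>\S+?)-osc-\S+\s+(?P<uuid>\S+)\s+(?P<refs>\d+)"
    r"\s+(?:\S*@|at\s+)(?P<net>\w+)"
)


def osc_networks(dl_output):
    """Map each OSC target name to the network shown in its last column."""
    return {m.group("target"): m.group("net") for m in OSC_LINE.finditer(dl_output)}


def targets_off_network(dl_output, expected="o2ib"):
    """List targets whose reported network differs from the expected one."""
    return sorted(t for t, net in osc_networks(dl_output).items() if net != expected)


# The osc entries from the output quoted above, unchanged.
SAMPLE = """ 4 UP osc astro-OST0002-osc-ffff88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 at o2ib
 5 UP osc astro-OST0001-osc-ffff88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 at tcp1
 6 UP osc astro-OST0003-osc-ffff88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 at o2ib
 7 UP osc astro-OST0000-osc-ffff88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 at o2ib
"""

networks = osc_networks(SAMPLE)
suspects = targets_off_network(SAMPLE)
print(suspects)
```

Note that this only reports what the listing shows; as described above, the listing and the network actually carrying the traffic may disagree.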

Please ask for more details if needed.

Hans Henrik

lustre-discuss mailing list
lustre-discuss at lists.lustre.org
