[lustre-discuss] Clients lose IB connection to OSS.

Hans Henrik Happe happe at nbi.ku.dk
Mon May 1 07:47:32 PDT 2017


Hi,

We have experienced problems with losing the connection to an OSS. It starts with:

May  1 03:35:46 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116 at o2ib (256), src idx/frags: 128/236 dst idx/frags: 128/236
May  1 03:35:46 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116 at o2ib: -90
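
To see how often this message occurs and for which peers, the fields can be pulled out of a client log with a short script. This is only a minimal Python sketch: the function name and the field names are illustrative assumptions, and reading the number in parentheses as a fragment limit is also an assumption. It accepts both the " at " spelling used in this archive and the usual "@" form of a NID.

```python
import re

# Matches the kiblnd_init_rdma() message in the form quoted above.
# The list archive rewrites "@" as " at ", so both spellings are accepted.
FRAG_ERROR_RE = re.compile(
    r"RDMA has too many fragments for peer (?P<peer>[^\s@]+)(?: at |@)(?P<net>\S+) "
    r"\((?P<limit>\d+)\), src idx/frags: (?P<src_idx>\d+)/(?P<src_frags>\d+) "
    r"dst idx/frags: (?P<dst_idx>\d+)/(?P<dst_frags>\d+)"
)

# Fields converted to int; "limit" is the parenthesised number (assumed meaning).
INT_FIELDS = ("limit", "src_idx", "src_frags", "dst_idx", "dst_frags")


def parse_fragment_errors(log_text):
    """Return one dict per 'too many fragments' message found in log_text."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    errors = []
    for match in FRAG_ERROR_RE.finditer(flat):
        fields = match.groupdict()
        for key in INT_FIELDS:
            fields[key] = int(fields[key])
        errors.append(fields)
    return errors
```

Calling parse_fragment_errors() on the contents of a saved client log returns a list that can be grouped by peer to see which servers are affected.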

The rest of the log is attached.

After this, Lustre access is very slow; for example, a 'df' can take
minutes. 'lctl ping' to the OSS also gives I/O errors. Doing
'lnet net del/add' makes ping work again until file I/O starts, and then
the I/O errors return.

We use both IB and TCP on the servers, so there are no routers.

In the attached log, astro-OST0001 has been moved to the other server in
the HA pair. This is because 'lctl dl -t' showed strange output when the
OST was on its usual server:

# lctl dl -t
  0 UP mgc MGC10.21.10.102 at o2ib 0b0bbbce-63b6-bf47-403c-28f0c53e8307 5
  1 UP lov astro-clilov-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
  2 UP lmv astro-clilmv-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
  3 UP mdc astro-MDT0000-mdc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.102 at o2ib
  4 UP osc astro-OST0002-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.116 at o2ib
  5 UP osc astro-OST0001-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 172.20.10.115 at tcp1
  6 UP osc astro-OST0003-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.117 at o2ib
  7 UP osc astro-OST0000-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.114 at o2ib

So astro-OST0001 appears to be connected through 172.20.10.115 at tcp1,
even though it actually uses 10.21.10.115 at o2ib (verified by a
performance test and by disabling tcp1 on the IB nodes).
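
Spotting such a mismatch by eye gets tedious on many clients, so a script that flags osc devices listed on an unexpected network may help. The following is a minimal Python sketch under stated assumptions: the function names and dictionary keys are hypothetical, the column layout is inferred only from the listing above, and the trailing number is assumed to be a reference count. It tolerates wrapped lines and both the " at " and "@" spellings of a NID.

```python
import re

UUID = r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"

# One device entry: index, state, type, name, uuid, count, optional NID.
# Column meanings are inferred from the listing above (assumption).
DEVICE_RE = re.compile(
    r"(?<!\S)(?P<index>\d+) (?P<state>\S+) (?P<type>\S+) (?P<name>\S+) "
    r"(?P<uuid>" + UUID + r") (?P<refcount>\d+)(?: (?P<nid>[^\s@]+@\S+))?(?!\S)"
)


def parse_device_list(text):
    """Parse 'lctl dl -t' style output into a list of dicts."""
    # Join wrapped lines, then undo the archive's "@" -> " at " rewriting.
    flat = " ".join(text.split())
    flat = re.sub(r"(?<=\S) at (?=\S)", "@", flat)
    devices = []
    for match in DEVICE_RE.finditer(flat):
        dev = match.groupdict()
        dev["index"] = int(dev["index"])
        dev["refcount"] = int(dev["refcount"])
        # Split the NID into address and network, if a NID was listed.
        dev["net"] = dev["nid"].split("@", 1)[1] if dev["nid"] else None
        devices.append(dev)
    return devices


def oscs_off_network(devices, expected_net="o2ib"):
    """Return osc devices whose listed NID is not on expected_net."""
    return [
        dev for dev in devices
        if dev["type"] == "osc" and dev["net"] != expected_net
    ]
```

Given the listing above as input, oscs_off_network(parse_device_list(output)) would report only the astro-OST0001 osc, since it is the one shown with a tcp1 NID.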

Please ask for more details if needed.

Cheers,
Hans Henrik

-------------- next part --------------
A non-text attachment was scrubbed...
Name: client.log
Type: text/x-log
Size: 46406 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20170501/7d9cab07/attachment-0001.bin>

