<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

</head>

<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">

For the “RDMA has too many fragments” issue, you need the newly landed patch: <a href="http://review.whamcloud.com/12451" class="">http://review.whamcloud.com/12451</a>.  As for the slow access, I am not sure whether it is related to the “too many fragments” error.  Once a node hits that error, it usually needs to unload and reload the LNet module to recover.
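<div class=""><br class=""></div>
<div class="">To see which nodes have hit the error, a minimal Python sketch along these lines could scan a kernel log and report the affected peers. It is not part of Lustre: the function names are hypothetical, the pattern is modeled only on the kiblnd_init_rdma() message quoted below, and reading the number in parentheses as the fragment limit is an assumption.</div>
<pre class="">
```python
import re

# Pattern modeled on the kiblnd_init_rdma() message quoted in this thread.
# Groups: peer NID, number in parentheses (assumed to be the fragment limit),
# src idx, src frags, dst idx, dst frags.
FRAG_RE = re.compile(
    r"RDMA has too many fragments for peer (\S+) \((\d+)\), "
    r"src idx/frags: (\d+)/(\d+) dst idx/frags: (\d+)/(\d+)"
)


def find_fragment_errors(log_text):
    """Return one dict per 'too many fragments' message found in log_text."""
    # Mail clients wrap long log lines, so collapse all whitespace runs first.
    flat = " ".join(log_text.split())
    hits = []
    for m in FRAG_RE.finditer(flat):
        peer, limit, s_idx, s_frags, d_idx, d_frags = m.groups()
        hits.append({
            "peer": peer,
            "limit": int(limit),
            "src": (int(s_idx), int(s_frags)),
            "dst": (int(d_idx), int(d_frags)),
        })
    return hits


def peers_with_fragment_errors(log_text):
    """Sorted, de-duplicated peer NIDs that triggered the error."""
    return sorted({h["peer"] for h in find_fragment_errors(log_text)})


# Sample copied from the report quoted in this thread (wrapping included).
SAMPLE = """May  1 03:35:46 node872 kernel: LNetError:
5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many
fragments for peer 10.21.10.116@o2ib (256), src idx/frags: 128/236 dst
idx/frags: 128/236
"""

if __name__ == "__main__":
    for hit in find_fragment_errors(SAMPLE):
        print(hit)
```
</pre>
<div class="">A node whose log produces hits is a candidate for the LNet unload/reload mentioned above.</div>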

<div class=""><br class="">

</div>

<div class="">Doug</div>

<div class=""><br class="">

<div>

<blockquote type="cite" class="">

<div class="">On May 1, 2017, at 7:47 AM, Hans Henrik Happe <<a href="mailto:happe@nbi.ku.dk" class="">happe@nbi.ku.dk</a>> wrote:</div>

<br class="Apple-interchange-newline">

<div class="">

<div class="">Hi,<br class="">

<br class="">

We have experienced problems with losing the connection to the OSS. It starts with:<br class="">

<br class="">

May  1 03:35:46 node872 kernel: LNetError:<br class="">

5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many<br class="">

fragments for peer 10.21.10.116@o2ib (256), src idx/frags: 128/236 dst<br class="">

idx/frags: 128/236<br class="">

May  1 03:35:46 node872 kernel: LNetError:<br class="">

5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from<br class="">

10.21.10.116@o2ib: -90<br class="">

<br class="">

The rest of the log is attached.<br class="">

<br class="">

After this, Lustre access is very slow; for example, a 'df' can take minutes.<br class="">

Also, 'lctl ping' to the OSS gives I/O errors. Doing 'lnet net del/add'<br class="">

makes ping work again until file I/O starts; then the I/O errors return.<br class="">

<br class="">

We use both IB and TCP on servers, so no routers.<br class="">

<br class="">

In the attached log, astro-OST0001 has been moved to the other server in<br class="">

the HA pair. This is because 'lctl dl -t' showed strange output while it<br class="">

was on its proper server:<br class="">

<br class="">

# lctl dl -t<br class="">

 0 UP mgc MGC10.21.10.102@o2ib 0b0bbbce-63b6-bf47-403c-28f0c53e8307 5<br class="">

 1 UP lov astro-clilov-ffff88107412e800<br class="">

53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4<br class="">

 2 UP lmv astro-clilmv-ffff88107412e800<br class="">

53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4<br class="">

 3 UP mdc astro-MDT0000-mdc-ffff88107412e800<br class="">

53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.102@o2ib<br class="">

 4 UP osc astro-OST0002-osc-ffff88107412e800<br class="">

53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.116@o2ib<br class="">

 5 UP osc astro-OST0001-osc-ffff88107412e800<br class="">

53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 172.20.10.115@tcp1<br class="">

 6 UP osc astro-OST0003-osc-ffff88107412e800<br class="">

53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.117@o2ib<br class="">

 7 UP osc astro-OST0000-osc-ffff88107412e800<br class="">

53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.114@o2ib<br class="">

<br class="">

So astro-OST0001 seems to be connected through 172.20.10.115@tcp1, even<br class="">

though it uses 10.21.10.115@o2ib (verified by performance test and<br class="">

disabling tcp1 on IB nodes).<br class="">
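<br class="">
A minimal Python sketch for spotting this kind of mismatch automatically, assuming the column layout shown above (index, status, type, name, UUID, refcount, optional NID). The function names are hypothetical, and the input must contain only the device entries, not the shell prompt line:<br class="">
<pre class="">
```python
def parse_device_list(text):
    """Parse 'lctl dl -t' style output into a list of dicts.

    Assumes six whitespace-separated fields per device, optionally
    followed by a NID (recognized here by the '@' it contains).
    """
    # Mail wrapping splits entries across lines, so work on a flat token list.
    tokens = text.split()
    devices = []
    i = 0
    while True:
        chunk = tokens[i:i + 6]
        if len(chunk) != 6:
            break  # end of input, or a truncated trailing entry
        index, status, dev_type, name, uuid, refs = chunk
        i += 6
        nid = None
        nxt = tokens[i:i + 1]
        if nxt and "@" in nxt[0]:
            nid = nxt[0]
            i += 1
        devices.append({
            "index": int(index),
            "status": status,
            "type": dev_type,
            "name": name,
            "uuid": uuid,
            "refs": int(refs),
            "nid": nid,
        })
    return devices


def oscs_off_network(devices, expected_net="o2ib"):
    """Return (name, nid) for OSC devices whose NID is not on expected_net."""
    flagged = []
    for dev in devices:
        if dev["type"] != "osc" or dev["nid"] is None:
            continue
        net = dev["nid"].split("@", 1)[1]
        if net != expected_net:
            flagged.append((dev["name"], dev["nid"]))
    return flagged


# Excerpt of the listing above, wrapping included.
SAMPLE = """ 0 UP mgc MGC10.21.10.102@o2ib 0b0bbbce-63b6-bf47-403c-28f0c53e8307 5
 4 UP osc astro-OST0002-osc-ffff88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.116@o2ib
 5 UP osc astro-OST0001-osc-ffff88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 172.20.10.115@tcp1
"""

if __name__ == "__main__":
    for name, nid in oscs_off_network(parse_device_list(SAMPLE)):
        print(name, "is listed with", nid)
```
</pre>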

<br class="">

Please ask for more details if needed.<br class="">

<br class="">

Cheers,<br class="">

Hans Henrik<br class="">

<br class="">

<span id="cid:73C79D1E-B86B-4BC1-BD68-B71ADE864896@amr.corp.intel.com"><client.log></span>_______________________________________________<br class="">

lustre-discuss mailing list<br class="">

<a href="mailto:lustre-discuss@lists.lustre.org" class="">lustre-discuss@lists.lustre.org</a><br class="">

http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org<br class="">

</div>

</div>

</blockquote>

</div>

<br class="">

</div>

</body>

</html>