<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Dear all,</p>
<p>I have set up a Lustre file system. The regular Intel Xeon nodes (with
InfiniBand or Omni-Path) work well, but the KNL nodes respond
very slowly, even when running a simple ls command.<br>
</p>
<p>The Lustre servers use Mellanox 56 Gbps FDR InfiniBand, and the
Lustre clients on Intel Xeon Phi KNL use 100 Gbps Intel Omni-Path.
Two LNet router nodes (each with an InfiniBand and an OPA card) sit between
them.</p>
<ul>
<li>OS: CentOS 7.3.1611, 3.10.0-514.21.1.el7.x86_64</li>
<li>Lustre version: 2.9.58</li>
<li> lustre.conf:</li>
<ul>
<li>Server: options lnet networks="o2ib(ib0),tcp0(enp4s0f0)"
routes="o2ib2 10.10.100.[11-12]@o2ib0"<br>
</li>
<li>LNet: options lnet networks="o2ib(ib0),o2ib2(ib1)"
forwarding="enabled"<br>
</li>
<li>KNL Client: options lnet networks="o2ib2(ib0),tcp0(eno1)"
routes="o2ib 10.11.100.[11-12]@o2ib2"</li>
</ul>
<li>KNL client:</li>
<ul>
<li>mount -t lustre bio1@o2ib:bio2@o2ib:/sgfs /home</li>
<li>messages:<br>
5400 x1570792395523952/t0(0)
o3->sgfs-OST0000-osc-ffff8817c1591800@10.10.100.1@o2ib:6/4
lens 608/432 e 0 to 1 dl 1498025346 ref 2 fl Rpc:X/2/ffffffff
rc 0/-1<br>
[ 1149.934346] Lustre: sgfs-OST0000-osc-ffff8817c1591800:
Connection to sgfs-OST0000 (at 10.10.100.1@o2ib) was lost; in
progress operations using this service will wait for recovery
to complete<br>
[ 1149.935869] Lustre: sgfs-OST0000-osc-ffff8817c1591800:
Connection restored to 10.10.100.1@o2ib (at 10.10.100.1@o2ib)<br>
[ 1425.937894] Lustre:
4458:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request
sent has timed out for slow reply: [sent 1498025484/real
1498025484] req@ffff8817aae2d700 x1570792395526160/t0(0)
o3->sgfs-OST0001-osc-ffff8817c1591800@10.10.100.2@o2ib:6/4
lens 608/432 e 0 to 1 dl 1498025622 ref 2 fl Rpc:X/2/ffffffff
rc 0/-1<br>
[ 1425.937910] Lustre:
4458:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1
previous similar message<br>
[ 1425.937946] Lustre: sgfs-OST0001-osc-ffff8817c1591800:
Connection to sgfs-OST0001 (at 10.10.100.2@o2ib) was lost; in
progress operations using this service will wait for recovery
to complete<br>
[ 1425.937952] Lustre: Skipped 1 previous similar message<br>
[ 1425.939062] Lustre: sgfs-OST0001-osc-ffff8817c1591800:
Connection restored to 10.10.100.2@o2ib (at 10.10.100.2@o2ib)<br>
[ 1425.939074] Lustre: Skipped 1 previous similar message<br>
[ 5417.993680] Lustre:
4451:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request
sent has timed out for slow reply: [sent 1498029494/real
1498029494] req@ffff8817acb78000 x1570792395563856/t0(0)
o3->sgfs-OST0001-osc-ffff8817c1591800@10.10.100.2@o2ib:6/4
lens 608/432 e 3 to 1 dl 1498029614 ref 2 fl Rpc:X/0/ffffffff
rc 0/-1<br>
[ 5417.993697] Lustre:
4451:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1
previous similar message<br>
[ 5417.993733] Lustre: sgfs-OST0001-osc-ffff8817c1591800:
Connection to sgfs-OST0001 (at 10.10.100.2@o2ib) was lost; in
progress operations using this service will wait for recovery
to complete<br>
[ 5417.993741] Lustre: Skipped 1 previous similar message<br>
[ 5417.995023] Lustre: sgfs-OST0001-osc-ffff8817c1591800:
Connection restored to 10.10.100.2@o2ib (at 10.10.100.2@o2ib)<br>
</li>
</ul>
</ul>
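<p>As a sanity check of the lnet options listed above, here is a minimal
Python sketch (my own illustration, not a Lustre tool; the function names
are hypothetical). It assumes only the simple "network(interface)" and
"target-net gateway@net" forms used in my lustre.conf, without hop counts
or priorities, and it relies on LNet treating o2ib and o2ib0 as the same
network. It expands the bracketed gateway range and checks that every
gateway sits on a network the node is attached to directly.</p>
<pre>
```python
import re


def normalize_net(net):
    # LNet treats a network name without a number as number 0 (o2ib == o2ib0).
    return net if net[-1].isdigit() else net + "0"


def parse_networks(spec):
    # "o2ib(ib0),tcp0(eno1)" gives {"o2ib0": "ib0", "tcp0": "eno1"}
    nets = {}
    for item in spec.split(","):
        m = re.fullmatch(r"\s*(\w+)\((\w+)\)\s*", item)
        if m is None:
            raise ValueError("unsupported networks entry: " + item)
        nets[normalize_net(m.group(1))] = m.group(2)
    return nets


def expand_route(route):
    # "o2ib2 10.10.100.[11-12]@o2ib0" gives the target network plus
    # one gateway NID per address in the bracketed range.
    target, gateway = route.split()
    addr, gw_net = gateway.split("@")
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", addr)
    if m is None:
        addrs = [addr]
    else:
        prefix, first, last, suffix = m.groups()
        addrs = [prefix + str(i) + suffix
                 for i in range(int(first), int(last) + 1)]
    nids = [a + "@" + normalize_net(gw_net) for a in addrs]
    return normalize_net(target), nids


def gateways_reachable(networks, route):
    # Every gateway must be on a network that is configured locally.
    local = parse_networks(networks)
    _, gateways = expand_route(route)
    return all(gw.split("@")[1] in local for gw in gateways)


# Values copied from the server line of lustre.conf above.
server_nets = "o2ib(ib0),tcp0(enp4s0f0)"
server_route = "o2ib2 10.10.100.[11-12]@o2ib0"
print(expand_route(server_route))
print(gateways_reachable(server_nets, server_route))
```
</pre>
<p>The server and KNL client lines both pass this check, so the route
definitions at least look self-consistent to me. It says nothing about
whether the routers actually forward traffic correctly.</p>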
<ul>
<li>Server:<br>
</li>
</ul>
<blockquote>
<blockquote>
<p>Jun 21 15:19:59 io2 kernel: LustreError:
32679:0:(ldlm_lib.c:3237:target_bulk_io()) @@@ timeout on bulk
READ after 100+0s req@ffff8801048ac450
x1570792395563856/t0(0)
o3->460639e5-a63b-dcbb-5608-f2d814c8397c@10.11.151.1@o2ib2:142/0
lens 608/432 e 3 to 0 dl 1498029617 ref 1 fl Interpret:/0/0 rc
0/0<br>
Jun 21 15:19:59 io2 kernel: LustreError:
32679:0:(ldlm_lib.c:3237:target_bulk_io()) Skipped 1 previous
similar message<br>
Jun 21 15:19:59 io2 kernel: Lustre: sgfs-OST0001: Bulk IO read
error with 460639e5-a63b-dcbb-5608-f2d814c8397c (at
10.11.151.1@o2ib2), client will retry: rc -110<br>
I have also tried Ethernet, but the result is the same.<br>
</p>
</blockquote>
</blockquote>
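<p>To compare the client-side wait with the "100+0s" bulk READ timeout
reported by the server, here is a small Python sketch that extracts the
send time and the deadline from a client message. This is my own
illustration: I am assuming that "sent" and "dl" are both epoch seconds
and that "dl" is the request deadline. The sample is an abridged copy of
one of the client messages above, and the function name is hypothetical.</p>
<pre>
```python
import re


def reply_window(message):
    # Collapse the line wrapping introduced by the mail client.
    text = " ".join(message.split())
    sent = re.search(r"\[sent (\d+)/real (\d+)\]", text)
    deadline = re.search(r" dl (\d+) ", text)
    if sent is None or deadline is None:
        return None
    # Seconds between sending the request and its assumed deadline.
    return int(deadline.group(1)) - int(sent.group(1))


# Abridged copy of the client message logged at [ 5417.993680].
sample = """Request sent has timed out for slow reply: [sent 1498029494/real
1498029494] req@ffff8817acb78000 x1570792395563856/t0(0)
lens 608/432 e 3 to 1 dl 1498029614 ref 2 fl Rpc:X/0/ffffffff rc 0/-1"""

print(reply_window(sample))  # prints 120
```
</pre>
<p>If my reading of the fields is right, the client waited 120 seconds
for this request, while the server gave up on the bulk READ after about
100 seconds.</p>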
<p>Can you help me? Thank you very much.<br>
</p>
</body>
</html>