[lustre-discuss] o2ib (ib_qib) with 2.7.0 rpms on centos 6.6

Chris Hunter chris.hunter at yale.edu
Thu Nov 19 09:18:56 PST 2015


FYI,
We encountered another issue when using the RHEL IB kernel drivers. There 
have been changes to the ko2iblnd module parameters on lustre clients that 
are not compatible with the RHEL IB stack.
The changes are intended to improve lustre performance when using 
truescale hardware, and to support newer generations of mellanox adapters. 
Details are in LU-3322, LU-6723, and LU-7101. From the ticket notes, it's 
pretty clear the RHEL rdma stack is not part of their testing recipe.

General overview of performance tuning is in openfabrics presentation:
http://downloads.openfabrics.org/downloads/Media/OFSUG_2015/Friday/friday_02.pdf
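As a starting point for that kind of tuning, the parameters a node is actually running with can be dumped from sysfs once ko2iblnd is loaded. A minimal sketch (the sysfs module-parameters path is standard; the parameter names and values you'll see depend on your build, nothing here is from the original post):

```shell
# Dump the module parameters ko2iblnd is currently using; comparing this
# output between clients and servers is useful in a mixed-HCA setup.
# Degrades quietly if the module is not loaded.
dir=/sys/module/ko2iblnd/parameters
if [ -d "$dir" ]; then
    for p in "$dir"/*; do
        printf '%s = %s\n' "$(basename "$p")" "$(cat "$p")"
    done
else
    echo "ko2iblnd not loaded"
fi
```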

FWIW, ko2iblnd tuning failed for us in a mixed IB environment where the 
lustre servers use different HCAs from the lustre clients. Our solution 
was to modify the ko2iblnd modprobe config and remove the new parameters 
(i.e. use the defaults).
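Concretely, that just means dropping the tuned options line from the modprobe config so the module falls back to its compiled-in defaults. A hedged sketch of what the before/after might look like (the option names are real ko2iblnd tunables discussed in the LU tickets, but the values shown are made-up examples, not our actual config):

```shell
# /etc/modprobe.d/ko2iblnd.conf -- illustrative only
#
# A tuned line like this (example values, hypothetical) is what we removed:
#   options ko2iblnd peer_credits=128 concurrent_sends=256 map_on_demand=32
#
# Replacement: no "options ko2iblnd" line at all, so the module loads with
# its compiled-in defaults, which is what worked for us against the RHEL
# IB stack. Remember to reload the module (or reboot) for this to apply.
```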

regards,
chris hunter
chris.hunter at yale.edu

On 11/19/2015 10:33 AM, Lassus, Magnus wrote:
> Thank you very much Chris. True scale it is and using 2.6.32-504.23.4 solved it.
>
> Regards,
> Magnus
>
> -----Original Message-----
> From: Chris Hunter [mailto:chris.hunter at yale.edu]
> Sent: 18 November 2015 23:14
> To: Lassus, Magnus <magnus.lassus at wartsila.com>
> Cc: lustre-discuss at lists.lustre.org
> Subject: Re: [lustre-discuss] o2ib (ib_qib) with 2.7.0 rpms on centos 6.6
>
> Are you using truescale IB interfaces ?
>
> There is a known truescale bug in rhel/centos 6.6 kernels. You should
> try kernel 2.6.32-504.23.4 or newer. Some details of the bug are in
> LU-6698 and RHSA-2015-1081.
>
> regards,
> chris hunter
> yale hpc group
>
>> From: "Lassus, Magnus" <magnus.lassus at wartsila.com>
>> To: "lustre-discuss at lists.lustre.org"
>> 	<lustre-discuss at lists.lustre.org>
>> Subject: [lustre-discuss] o2ib (ib_qib) with 2.7.0 rpms on centos 6.6:
>> 	LNetError: kiblnd_init_rdma: Src buffer exhausted: 1 frags
>>
>> Hi,
>>
>> I fail to understand where I go wrong in getting o2ib working using 2.7.0 rpms on top of CentOS 6.6. Running selftest I see:
>>
>> Nov 17 18:22:40 ss08 kernel: LNet: Added LNI 10.165.32.18 at o2ib [8/256/0/180]
>> Nov 17 18:24:40 ss08 kernel: LNetError: 12532:0:(o2iblnd_cb.c:1123:kiblnd_init_rdma()) Src buffer exhausted: 1 frags
>> Nov 17 18:24:40 ss08 kernel: LustreError: 12553:0:(brw_test.c:212:brw_check_page()) Bad data in page ffffea0070c20800: 0xbeefbeefbeefbeef, 0xeeb0eeb1eeb2eeb3 expec
>> Nov 17 18:24:40 ss08 kernel: LustreError: 12553:0:(brw_test.c:238:brw_check_bulk()) Bulk page ffffea0070c20800 (0/256) is corrupted!
>> Nov 17 18:24:40 ss08 kernel: LustreError: 12553:0:(brw_test.c:343:brw_client_done_rpc()) Bulk data from 12345-10.165.32.18 at o2ib is corrupted!
>> Nov 17 18:24:40 ss08 kernel: LNetError: 12532:0:(o2iblnd_cb.c:1690:kiblnd_reply()) Can't setup rdma for GET from 10.165.32.18 at o2ib: -71
>> Nov 17 18:25:31 ss08 kernel: LNetError: 12529:0:(o2iblnd_cb.c:3036:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
>> Nov 17 18:25:31 ss08 kernel: LNetError: 12529:0:(o2iblnd_cb.c:3099:kiblnd_check_conns()) Timed out RDMA with 10.165.32.18 at o2ib (0): c: 7, oc: 0, rc: 7
>> Nov 17 18:25:31 ss08 kernel: LustreError: 12558:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.165.32.18 at o2ib: -103
>> Nov 17 18:25:31 ss08 kernel: LustreError: 12558:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer from 12345-10.165.32.18 at o2ib has failed: -5
>> Nov 17 18:25:48 ss08 kernel: LNet: 12581:0:(rpc.c:1077:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.165.32.18 at o2ib, timeout 64.
>> Nov 17 18:25:48 ss08 kernel: LustreError: 12555:0:(brw_test.c:318:brw_client_done_rpc()) BRW RPC to 12345-10.165.32.18 at o2ib failed with -110
>>
>> # rpm -qa | egrep 'lustre|kernel' | sort
>> dracut-kernel-004-356.el6.noarch
>> kernel-2.6.32-504.8.1.el6_lustre.x86_64
>> kernel-devel-2.6.32-504.8.1.el6_lustre.x86_64
>> kernel-firmware-2.6.32-504.8.1.el6_lustre.x86_64
>> kernel-headers-2.6.32-504.8.1.el6_lustre.x86_64
>> lustre-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
>> lustre-iokit-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
>> lustre-modules-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
>> lustre-osd-ldiskfs-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
>> lustre-osd-ldiskfs-mount-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
>> lustre-tests-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
>> perf-2.6.32-504.8.1.el6_lustre.x86_64
>> python-perf-2.6.32-504.8.1.el6_lustre.x86_64
>>
>> Using latest 2.7.63 build on 6.7 works.
>>
>> Any pointers are warmly welcome as I'd prefer to use 2.7.0.
>>
>> Regards,
>> Magnus
>>

