[lustre-discuss] Lustre poor performance
Mannthey, Keith
keith.mannthey at intel.com
Tue Aug 22 09:22:43 PDT 2017
You may want to file a Jira ticket if the ko2iblnd-opa settings were being used automatically on your Mellanox setup. That is not expected.
On another note: as you observe, your NVMe backend is much faster than the QDR link speed. You may want to look at using the new Multi-Rail LNet feature to boost network bandwidth. You can add a second QDR HCA/port and get more LNet bandwidth from your OSS server. It is a new feature that takes a bit of work to use, but if you are chasing bandwidth it might be worth the effort.
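For reference, a minimal Multi-Rail sketch using lnetctl (assuming Lustre 2.10+, and that the second port shows up as ib1; the interface names are placeholders):

# configure LNet, then attach both HCA ports to the same network
lnetctl lnet configure
lnetctl net add --net o2ib5 --if ib0,ib1
# verify that both network interfaces are now configured
lnetctl net show -v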
Thanks,
Keith
From: lustre-discuss [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Chris Horn
Sent: Monday, August 21, 2017 12:40 PM
To: Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it>; Arman Khalatyan <arm2arm at gmail.com>
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [lustre-discuss] Lustre poor performance
The ko2iblnd-opa settings are tuned specifically for Intel OmniPath. Take a look at the /usr/sbin/ko2iblnd-probe script to see how OPA hardware is detected and the “ko2iblnd-opa” settings get used.
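A rough sketch of the detection idea, loosely modeled on that script (the details here are assumptions; check the installed copy for the real logic):

#!/bin/bash
# If an OPA HCA (hfi1 device) is present, load the module under the
# ko2iblnd-opa alias so the OPA-specific option line applies.
alias_name=ko2iblnd
for dev in /sys/class/infiniband/*; do
    case "$(basename "$dev")" in
        hfi1*) alias_name=ko2iblnd-opa ;;
    esac
done
# --ignore-install avoids re-running this probe script recursively
exec modprobe --ignore-install "$alias_name"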
Chris Horn
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it>
Date: Saturday, August 19, 2017 at 5:00 PM
To: Arman Khalatyan <arm2arm at gmail.com>
Cc: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] Lustre poor performance
I ran my LNet self-test again, this time adding --concurrency=16, and I can now use all of the IB bandwidth (3.5 GB/s).
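For anyone reproducing this, the knob goes on the lst add_test line; a minimal sketch based on the script quoted later in this thread:

# same bulk_rw batch as below, but with 16 RPCs in flight per test
lst add_test --batch bulk_rw --concurrency=16 --from readers --to servers \
brw read check=simple size=1M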
The only thing I do not understand is why ko2iblnd.conf is not loaded properly, and why I had to remove the alias in the config file to allow the proper peer_credits settings to be loaded.
Thanks to everyone for helping,
Riccardo
On 8/19/17 8:54 AM, Riccardo Veraldi wrote:
I found out that ko2iblnd is not getting its settings from /etc/modprobe.d/ko2iblnd.conf:
alias ko2iblnd-opa ko2iblnd
options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
install ko2iblnd /usr/sbin/ko2iblnd-probe
But if I modify ko2iblnd.conf like this, then the settings are loaded:
options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
install ko2iblnd /usr/sbin/ko2iblnd-probe
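Note that a new option line only takes effect once the module is reloaded; a minimal sketch, assuming nothing is mounted on the node:

lustre_rmmod        # unload the Lustre/LNet module stack
modprobe lustre     # reload; ko2iblnd picks up the new options line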
The LNet tests show better behaviour, but I would still expect more than this.
Is it possible to tune the parameters in /etc/modprobe.d/ko2iblnd.conf so that the Mellanox ConnectX-3 will work more efficiently?
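One way to confirm which values the module actually picked up is to read them back from sysfs; a small sketch (parameter names taken from the options line above):

for p in peer_credits peer_credits_hiw credits concurrent_sends ntx map_on_demand; do
    printf '%s = %s\n' "$p" "$(cat /sys/module/ko2iblnd/parameters/$p)"
done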
[LNet Rates of servers]
[R] Avg: 2286 RPC/s Min: 0 RPC/s Max: 4572 RPC/s
[W] Avg: 3322 RPC/s Min: 0 RPC/s Max: 6643 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 625.23 MiB/s Min: 0.00 MiB/s Max: 1250.46 MiB/s
[W] Avg: 1035.85 MiB/s Min: 0.00 MiB/s Max: 2071.69 MiB/s
[LNet Rates of servers]
[R] Avg: 2286 RPC/s Min: 1 RPC/s Max: 4571 RPC/s
[W] Avg: 3321 RPC/s Min: 1 RPC/s Max: 6641 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 625.55 MiB/s Min: 0.00 MiB/s Max: 1251.11 MiB/s
[W] Avg: 1035.05 MiB/s Min: 0.00 MiB/s Max: 2070.11 MiB/s
[LNet Rates of servers]
[R] Avg: 2291 RPC/s Min: 0 RPC/s Max: 4581 RPC/s
[W] Avg: 3329 RPC/s Min: 0 RPC/s Max: 6657 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 626.55 MiB/s Min: 0.00 MiB/s Max: 1253.11 MiB/s
[W] Avg: 1038.05 MiB/s Min: 0.00 MiB/s Max: 2076.11 MiB/s
session is ended
./lnet_test.sh: line 17: 23394 Terminated lst stat servers
On 8/19/17 4:20 AM, Arman Khalatyan wrote:
Just a minor comment: you should push up the performance of your nodes; they are not running at their maximum CPU frequencies, so all tests might be inconsistent. In order to get the most out of IB, run the following:
tuned-adm profile latency-performance
for more options use:
tuned-adm list
It will be interesting to see the difference.
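A quick way to check the current governor and per-core clocks before and after switching profiles (standard cpufreq sysfs paths):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
grep MHz /proc/cpuinfo | sort | uniq -c
# or, if the cpupower tool is installed:
cpupower frequency-info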
On 19 Aug 2017 at 3:57 AM, "Riccardo Veraldi" <Riccardo.Veraldi at cnaf.infn.it> wrote:
Hello Keith and Dennis, these are the tests I ran.
* obdfilter-survey shows that I can saturate disk performance; the NVMe/ZFS backend is performing very well and is faster than my InfiniBand network:
                capacity     operations     bandwidth
pool          alloc   free   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
drpffb-ost01  3.31T  3.19T      3  35.7K  16.0K  7.03G
  raidz1      3.31T  3.19T      3  35.7K  16.0K  7.03G
    nvme0n1       -      -      1  5.95K  7.99K  1.17G
    nvme1n1       -      -      0  6.01K      0  1.18G
    nvme2n1       -      -      0  5.93K      0  1.17G
    nvme3n1       -      -      0  5.88K      0  1.16G
    nvme4n1       -      -      1  5.95K  7.99K  1.17G
    nvme5n1       -      -      0  5.96K      0  1.17G
------------  -----  -----  -----  -----  -----  -----
These are the test results:
Fri Aug 18 16:54:48 PDT 2017 Obdfilter-survey for case=disk from drp-tst-ffb01
ost 1 sz 10485760K rsz 1024K obj 1 thr 1 write 7633.08 SHORT rewrite 7558.78 SHORT read 3205.24 [3213.70, 3226.78]
ost 1 sz 10485760K rsz 1024K obj 1 thr 2 write 7996.89 SHORT rewrite 7903.42 SHORT read 5264.70 SHORT
ost 1 sz 10485760K rsz 1024K obj 2 thr 2 write 7718.94 SHORT rewrite 7977.84 SHORT read 5802.17 SHORT
* LNet self-test, and here I see the problem. For reference, 172.21.52.[83,84] are the two OSSes and 172.21.52.86 is the reader/writer. Here is the script that I ran:
#!/bin/bash
export LST_SESSION=$$
lst new_session read_write
lst add_group servers 172.21.52.[83,84]@o2ib5
lst add_group readers 172.21.52.86@o2ib5
lst add_group writers 172.21.52.86@o2ib5
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from readers --to servers \
brw read check=simple size=1M
lst add_test --batch bulk_rw --from writers --to servers \
brw write check=full size=1M
# start running
lst run bulk_rw
# display server stats for 30 seconds
lst stat servers & sleep 30; kill $!
# tear down
lst end_session
Here are the results:
SESSION: read_write FEATURES: 1 TIMEOUT: 300 FORCE: No
172.21.52.[83,84]@o2ib5 are added to session
172.21.52.86@o2ib5 are added to session
172.21.52.86@o2ib5 are added to session
Test was added successfully
Test was added successfully
bulk_rw is running now
[LNet Rates of servers]
[R] Avg: 1751 RPC/s Min: 0 RPC/s Max: 3502 RPC/s
[W] Avg: 2525 RPC/s Min: 0 RPC/s Max: 5050 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 488.79 MiB/s Min: 0.00 MiB/s Max: 977.59 MiB/s
[W] Avg: 773.99 MiB/s Min: 0.00 MiB/s Max: 1547.99 MiB/s
[LNet Rates of servers]
[R] Avg: 1718 RPC/s Min: 0 RPC/s Max: 3435 RPC/s
[W] Avg: 2479 RPC/s Min: 0 RPC/s Max: 4958 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 478.19 MiB/s Min: 0.00 MiB/s Max: 956.39 MiB/s
[W] Avg: 761.74 MiB/s Min: 0.00 MiB/s Max: 1523.47 MiB/s
[LNet Rates of servers]
[R] Avg: 1734 RPC/s Min: 0 RPC/s Max: 3467 RPC/s
[W] Avg: 2506 RPC/s Min: 0 RPC/s Max: 5012 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 480.79 MiB/s Min: 0.00 MiB/s Max: 961.58 MiB/s
[W] Avg: 772.49 MiB/s Min: 0.00 MiB/s Max: 1544.98 MiB/s
[LNet Rates of servers]
[R] Avg: 1722 RPC/s Min: 0 RPC/s Max: 3444 RPC/s
[W] Avg: 2486 RPC/s Min: 0 RPC/s Max: 4972 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 479.09 MiB/s Min: 0.00 MiB/s Max: 958.18 MiB/s
[W] Avg: 764.19 MiB/s Min: 0.00 MiB/s Max: 1528.38 MiB/s
[LNet Rates of servers]
[R] Avg: 1741 RPC/s Min: 0 RPC/s Max: 3482 RPC/s
[W] Avg: 2513 RPC/s Min: 0 RPC/s Max: 5025 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 484.59 MiB/s Min: 0.00 MiB/s Max: 969.19 MiB/s
[W] Avg: 771.94 MiB/s Min: 0.00 MiB/s Max: 1543.87 MiB/s
session is ended
./lnet_test.sh: line 17: 4940 Terminated lst stat servers
So it looks like LNet is really underperforming, reaching half or less of the InfiniBand capability.
How can I find out what is causing this?
Running tests with the InfiniBand perf tools, I get good results:
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : mlx4_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
RX depth : 512
CQ Moderation : 100
Mtu : 2048[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x07 QPN 0x020f PSN 0xacc37a
remote address: LID 0x0a QPN 0x020f PSN 0x91a069
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 1249.234000 != 1326.000000. CPU Frequency is not max.
2 1000 0.00 11.99 6.285330
Conflicting CPU frequency values detected: 1314.910000 != 1395.460000. CPU Frequency is not max.
4 1000 0.00 28.26 7.409324
Conflicting CPU frequency values detected: 1314.910000 != 1460.207000. CPU Frequency is not max.
8 1000 0.00 54.47 7.139164
Conflicting CPU frequency values detected: 1314.910000 != 1244.320000. CPU Frequency is not max.
16 1000 0.00 113.13 7.413889
Conflicting CPU frequency values detected: 1314.910000 != 1460.207000. CPU Frequency is not max.
32 1000 0.00 226.07 7.407811
Conflicting CPU frequency values detected: 1469.703000 != 1301.031000. CPU Frequency is not max.
64 1000 0.00 452.12 7.407465
Conflicting CPU frequency values detected: 1469.703000 != 1301.031000. CPU Frequency is not max.
128 1000 0.00 845.45 6.925918
Conflicting CPU frequency values detected: 1469.703000 != 1362.257000. CPU Frequency is not max.
256 1000 0.00 1746.93 7.155406
Conflicting CPU frequency values detected: 1469.703000 != 1362.257000. CPU Frequency is not max.
512 1000 0.00 2766.93 5.666682
Conflicting CPU frequency values detected: 1296.714000 != 1204.675000. CPU Frequency is not max.
1024 1000 0.00 3516.26 3.600646
Conflicting CPU frequency values detected: 1296.714000 != 1325.535000. CPU Frequency is not max.
2048 1000 0.00 3630.93 1.859035
Conflicting CPU frequency values detected: 1296.714000 != 1331.312000. CPU Frequency is not max.
4096 1000 0.00 3702.39 0.947813
Conflicting CPU frequency values detected: 1296.714000 != 1200.027000. CPU Frequency is not max.
8192 1000 0.00 3724.82 0.476777
Conflicting CPU frequency values detected: 1384.902000 != 1314.113000. CPU Frequency is not max.
16384 1000 0.00 3731.21 0.238798
Conflicting CPU frequency values detected: 1578.078000 != 1200.027000. CPU Frequency is not max.
32768 1000 0.00 3735.32 0.119530
Conflicting CPU frequency values detected: 1578.078000 != 1200.027000. CPU Frequency is not max.
65536 1000 0.00 3736.98 0.059792
Conflicting CPU frequency values detected: 1578.078000 != 1200.027000. CPU Frequency is not max.
131072 1000 0.00 3737.80 0.029902
Conflicting CPU frequency values detected: 1578.078000 != 1200.027000. CPU Frequency is not max.
262144 1000 0.00 3738.43 0.014954
Conflicting CPU frequency values detected: 1570.507000 != 1200.027000. CPU Frequency is not max.
524288 1000 0.00 3738.50 0.007477
Conflicting CPU frequency values detected: 1457.019000 != 1236.152000. CPU Frequency is not max.
1048576 1000 0.00 3738.65 0.003739
Conflicting CPU frequency values detected: 1411.597000 != 1234.957000. CPU Frequency is not max.
2097152 1000 0.00 3738.65 0.001869
Conflicting CPU frequency values detected: 1369.828000 != 1516.851000. CPU Frequency is not max.
4194304 1000 0.00 3738.80 0.000935
Conflicting CPU frequency values detected: 1564.664000 != 1247.574000. CPU Frequency is not max.
8388608 1000 0.00 3738.76 0.000467
---------------------------------------------------------------------------------------
RDMA modules are loaded
rpcrdma 90366 0
rdma_ucm 26837 0
ib_uverbs 51854 2 ib_ucm,rdma_ucm
rdma_cm 53755 5 rpcrdma,ko2iblnd,ib_iser,rdma_ucm,ib_isert
ib_cm 47149 5 rdma_cm,ib_srp,ib_ucm,ib_srpt,ib_ipoib
iw_cm 46022 1 rdma_cm
ib_core 210381 15 rdma_cm,ib_cm,iw_cm,rpcrdma,ko2iblnd,mlx4_ib,ib_srp,ib_ucm,ib_iser,ib_srpt,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib,ib_isert
sunrpc 334343 17 nfs,nfsd,rpcsec_gss_krb5,auth_rpcgss,lockd,nfsv4,rpcrdma,nfs_acl
I do not know where to look to make LNet perform faster. I am running my ib0 interface in connected mode with a 65520-byte MTU.
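As a sanity check, both settings can be read straight from sysfs:

cat /sys/class/net/ib0/mode   # IPoIB transport mode: connected or datagram
cat /sys/class/net/ib0/mtu    # effective MTU on the interface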
Any hint will be much appreciated.
Thank you,
Rick
On 8/18/17 9:05 AM, Mannthey, Keith wrote:
I would suggest a few other tests to help isolate where the issue might be.
1. What is the single-threaded "dd" write speed? (A minimal sketch follows this list.)
2. LNet self-test: please see "Chapter 28. Testing Lustre Network Performance (LNet Self-Test)" in the Lustre manual if this is a new test for you.
This will help show how much LNet bandwidth you have from your single client. There are tunables in the LNet layer that can affect things. Which QDR HCA are you using?
3. obdfilter-survey: please see "29.3. Testing OST Performance (obdfilter-survey)" in the Lustre manual. This test will help demonstrate what the backend NVMe/ZFS setup can do at the OBD layer in Lustre.
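For item 1, a minimal single-stream write sketch (the mount point, file name, and size are placeholders; oflag=direct bypasses the client page cache):

dd if=/dev/zero of=/mnt/lustre/ddtest bs=1M count=10000 oflag=direct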
Thanks,
Keith
-----Original Message-----
From: lustre-discuss [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Riccardo Veraldi
Sent: Thursday, August 17, 2017 10:48 PM
To: Dennis Nelson <dnelson at ddn.com>; lustre-discuss at lists.lustre.org
Subject: Re: [lustre-discuss] Lustre poor performance
This is my lustre.conf:
[drp-tst-ffb01:~]$ cat /etc/modprobe.d/lustre.conf
options lnet networks=o2ib5(ib0),tcp5(enp1s0f0)
Data transfer is over InfiniBand:
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65520
inet 172.21.52.83 netmask 255.255.252.0 broadcast 172.21.55.255
On 8/17/17 10:45 PM, Riccardo Veraldi wrote:
On 8/17/17 9:22 PM, Dennis Nelson wrote:
It appears that you are running iozone on a single client? What kind of network is tcp5? Have you looked at the network to make sure it is not the bottleneck?
Yes, the data transfer is on the ib0 interface, and I did a memory-to-memory
test through InfiniBand QDR resulting in 3.7 GB/s.
TCP is used to connect to the MDS. It is tcp5 to differentiate it from my
many other Lustre clusters. I could have called it tcp, but it makes no
difference performance-wise.
Yes, I ran the test from one single node; I also ran the same test locally
on a zpool identical to the one on the Lustre OSS.
I have 4 identical servers, each of them with the same NVMe disks:
server1: OSS - OST1 Lustre/ZFS raidz1
server2: OSS - OST2 Lustre/ZFS raidz1
server3: local ZFS raidz1
server4: Lustre client
_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org