[lustre-discuss] Lnet Self Test
Pinkesh Valdria
pinkesh.valdria at oracle.com
Sat Dec 7 23:11:46 PST 2019
Thanks @Moreno Diego (ID SIS) for the detailed response to my email. It gave me a lot of options to further tune my cluster. I have not yet applied those changes, but I thought I would share what I plan to do. I also have some follow-up questions to make sure the changes I am considering collectively make sense and do not conflict with each other.
On lnet:
Before
/usr/sbin/lnetctl net add --net tcp1 --if eno2 --peer-timeout 180 --peer-credits 8 --credits 1024
After
/usr/sbin/lnetctl net add --net tcp1 --if eno2 --peer-timeout 180 --peer-credits 128 --credits 1024 --peer-buffer-credits 0
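Once applied, the tunables actually in effect for the net can be read back with lnetctl (a quick sanity check; tcp1 is the net name from the commands above):
lnetctl net show --net tcp1 --verbose    # shows peer_timeout, peer_credits, peer_buffer_credits and credits per interface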
Do you have an example of how to set the PCI config to performance? I tried a Google search but was unable to find one.
Currently the RPC size is 4M and the related RPC settings are below:
lctl set_param obdfilter.lfsbv-*.brw_size=4
lctl set_param osc.*.max_pages_per_rpc=1024
lctl set_param osc.*.max_rpcs_in_flight=256
lctl set_param osc.*.max_dirty_mb=2048
Should I update brw_size to 16M and the related settings to higher values for better performance? If yes, does it also require changes to the credits and peer_credits values in lnet.conf and ksocklnd.conf, to make sure there are enough credits to send that many RPC requests? And should max_rpcs_in_flight be less than the peer_credits value in lnet.conf, or are the two unrelated? The values I plan to set are below, followed by a quick way to verify them.
lctl set_param obdfilter.lfsbv-*.brw_size=16
lctl set_param osc.*.max_pages_per_rpc=4096
lctl set_param osc.*.max_rpcs_in_flight=256
lctl set_param osc.*.max_dirty_mb=8092
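For reference, brw_size is in MB while max_pages_per_rpc is in 4 KB pages, so 16 MB corresponds to the 4096 pages above. A quick way to confirm what is actually in effect (a sketch; the first command runs on the OSS, the second on a client):
lctl get_param obdfilter.*.brw_size
lctl get_param osc.*.max_pages_per_rpc osc.*.max_rpcs_in_flight osc.*.max_dirty_mb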
On ksocklnd module options: more schedulers (10; the default of 6 was not enough for my server); I also changed some of the buffers (tx_buffer_size and rx_buffer_size set to 1073741824), but you need to be very careful with these
Response: I had none before. I plan to add the line below, based on recommendations in various Lustre presentations at LUG meetings.
echo "options ksocklnd sock_timeout=100 credits=2560 peer_credits=63 enable_irq_affinity=0 concurrent_sends=63 fmr_pool_size=1280 pmr_pool_size=1280 fmr_flush_trigger=1024 nscheds=10 tx_buffer_size=1073741824 rx_buffer_size=1073741824" > /etc/modprobe.d/ksocklnd.conf
Sysctl.conf: increase buffers (tcp_rmem, tcp_wmem), check window_scaling, check net.core rmem/wmem max and default, and consider disabling timestamps if you can afford it
Given below are my current settings. My OSS and MDS nodes have 768 GB of memory and 52 physical cores (104 vCPUs), and my Lustre clients have 320 GB of memory and 24 physical cores.
echo "net.ipv4.tcp_window_scaling = 1" >> /etc/sysctl.conf
echo "net.ipv4.tcp_adv_win_scale=2" >> /etc/sysctl.conf
echo "net.ipv4.tcp_low_latency=1" >> /etc/sysctl.conf
echo "net.core.wmem_max=16777216" >> /etc/sysctl.conf
echo "net.core.rmem_max=16777216" >> /etc/sysctl.conf
echo "net.core.wmem_default=16777216" >> /etc/sysctl.conf
echo "net.core.rmem_default=16777216" >> /etc/sysctl.conf
echo "net.core.optmem_max=16777216" >> /etc/sysctl.conf
echo "net.core.netdev_max_backlog=27000" >> /etc/sysctl.conf
echo "kernel.sysrq=1" >> /etc/sysctl.conf
echo "kernel.shmmax=18446744073692774399" >> /etc/sysctl.conf
echo "net.core.somaxconn=8192" >> /etc/sysctl.conf
echo "net.ipv4.tcp_rmem = 212992 87380 16777216" >> /etc/sysctl.conf
echo "net.ipv4.tcp_sack = 1" >> /etc/sysctl.conf
echo "net.ipv4.tcp_timestamps = 1" >> /etc/sysctl.conf
echo "net.ipv4.tcp_window_scaling = 1" >> /etc/sysctl.conf
echo "net.ipv4.tcp_wmem = 212992 65536 16777216" >> /etc/sysctl.conf
echo "vm.min_free_kbytes = 65536" >> /etc/sysctl.conf
echo "net.ipv4.tcp_no_metrics_save = 0" >> /etc/sysctl.conf
echo "net.ipv4.tcp_timestamps = 0" >> /etc/sysctl.conf
echo "net.ipv4.tcp_congestion_control = htcp" >> /etc/sysctl.conf
I am running Lustre 2.12.3, and Lustre 2.12.1 already has a fix for the single-threaded issue with ksocklnd:
http://wiki.lustre.org/Lustre_2.12.1_Changelog has LU-11415: ksocklnd performance improvement on 40Gbps ethernet
[opc at lustre-oss-server-nic0-4 ~]$ top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
552 root 20 0 0 0 0 S 4.0 0.0 39:37.43 kswapd1
60869 root 20 0 0 0 0 S 4.0 0.0 81:25.20 socknal_sd01_04
60870 root 20 0 0 0 0 S 4.0 0.0 81:14.20 socknal_sd01_05
60865 root 20 0 0 0 0 S 3.6 0.0 81:33.27 socknal_sd01_00
60866 root 20 0 0 0 0 S 3.6 0.0 81:09.03 socknal_sd01_01
60867 root 20 0 0 0 0 S 3.6 0.0 81:11.95 socknal_sd01_02
60868 root 20 0 0 0 0 S 3.6 0.0 81:30.26 socknal_sd01_03
551 root 20 0 0 0 0 S 2.6 0.0 39:24.00 kswapd0
60860 root 20 0 0 0 0 S 2.3 0.0 30:54.35 socknal_sd00_01
60864 root 20 0 0 0 0 S 2.3 0.0 30:58.20 socknal_sd00_05
64426 root 20 0 0 0 0 S 2.3 0.0 7:28.65 ll_ost_io01_102
60859 root 20 0 0 0 0 S 2.0 0.0 30:56.70 socknal_sd00_00
60861 root 20 0 0 0 0 S 2.0 0.0 30:54.97 socknal_sd00_02
60862 root 20 0 0 0 0 S 2.0 0.0 30:56.06 socknal_sd00_03
60863 root 20 0 0 0 0 S 2.0 0.0 30:56.32 socknal_sd00_04
64334 root 20 0 0 0 0 D 1.3 0.0 7:19.46 ll_ost_io01_010
64329 root 20 0 0 0 0 S 1.0 0.0 7:46.48 ll_ost_io01_005
From: "Moreno Diego (ID SIS)" <diego.moreno at id.ethz.ch>
Date: Wednesday, December 4, 2019 at 11:12 PM
To: Pinkesh Valdria <pinkesh.valdria at oracle.com>, Jongwoo Han <jongwoohan at gmail.com>
Cc: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet Self Test
I recently did some work on 40Gb and 100Gb ethernet interfaces and these are a few of the things that helped me during lnet_selftest:
On lnet: credits set higher than the default (e.g. 1024 or more), peer_credits to at least 128 for network testing (it is just 8 by default, which is fine for a big cluster but maybe not for lnet_selftest with 2 clients).
On ksocklnd module options: more schedulers (10; the default of 6 was not enough for my server); I also changed some of the buffers (tx_buffer_size and rx_buffer_size set to 1073741824), but you need to be very careful with these
Sysctl.conf: increase buffers (tcp_rmem, tcp_wmem), check window_scaling, check net.core rmem/wmem max and default, and consider disabling timestamps if you can afford it
Other: cpupower governor (set to performance, at least for testing) and BIOS settings (e.g. on my AMD routers it was better to disable HT, disable a few virtualization-oriented features, and set the PCI config to performance). Basically, be aware that Lustre's ethernet performance will take CPU resources, so it is better to optimize for that.
Last but not least, be aware that Lustre's ethernet driver (ksocklnd) does not load-balance as well as InfiniBand's (ko2iblnd). I have sometimes seen several Lustre peers using the same socklnd thread on the destination while the other socklnd threads stayed idle, which means the entire load depends on just one core. The best thing is to try with more clients and check the per-thread CPU load on your node with top; 2 clients do not seem enough to me. With the proper configuration you should be perfectly able to saturate a 25Gb link in lnet_selftest.
Regards,
Diego
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Pinkesh Valdria <pinkesh.valdria at oracle.com>
Date: Thursday, 5 December 2019 at 06:14
To: Jongwoo Han <jongwoohan at gmail.com>
Cc: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet Self Test
Thanks Jongwoo.
I have the MTU set to 9000 and the ring buffers set to their maximum:
ip link set dev $primaryNICInterface mtu 9000
ethtool -G $primaryNICInterface rx 2047 tx 2047 rx-jumbo 8191
I read about changing interrupt coalescing, but I was unable to find which values should be changed, or whether it really helps.
# Several packets in a rapid sequence can be coalesced into one interrupt passed up to the CPU, providing more CPU time for application processing.
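For reference, coalescing is queried and changed with ethtool; the values below are only illustrative, not a recommendation, and not every NIC/driver supports every knob:
ethtool -c $primaryNICInterface                                # show current coalescing settings
ethtool -C $primaryNICInterface adaptive-rx on adaptive-tx on  # let the driver adapt, if supported
ethtool -C $primaryNICInterface rx-usecs 50 rx-frames 64       # or pin explicit (example) values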
Thanks,
Pinkesh valdria
Oracle Cloud
From: Jongwoo Han <jongwoohan at gmail.com>
Date: Wednesday, December 4, 2019 at 8:07 PM
To: Pinkesh Valdria <pinkesh.valdria at oracle.com>
Cc: Andreas Dilger <adilger at whamcloud.com>, "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet Self Test
Have you tried MTU >= 9000 bytes (AKA jumbo frame) on the 25G ethernet and the switch?
If it is set to 1500 bytes, the Ethernet + IP + TCP headers take up a significant share of each packet, reducing the bandwidth available for data.
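(For reference, jumbo frames can be verified end to end with a do-not-fragment ping sized for a 9000-byte MTU, i.e. 8972 bytes of payload plus 28 bytes of IP/ICMP headers; the address below is one of the test nodes from this thread:)
ping -M do -s 8972 -c 3 10.0.3.6    # fails with "message too long" if any hop's MTU is below 9000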
Jongwoo Han
On Thursday, November 28, 2019 at 3:44 AM, Pinkesh Valdria <pinkesh.valdria at oracle.com> wrote:
Thanks Andreas for your response.
I ran another LNet self-test with 48 concurrent processes, since the nodes have 52 physical cores, and I achieved the same throughput (2052.71 MiB/s = 2152 MB/s).
Is it expected to lose almost 600 MB/s (2750 - 2150 = 600) to overhead on ethernet with LNet?
Thanks,
Pinkesh Valdria
Oracle Cloud Infrastructure
From: Andreas Dilger <adilger at whamcloud.com>
Date: Wednesday, November 27, 2019 at 1:25 AM
To: Pinkesh Valdria <pinkesh.valdria at oracle.com>
Cc: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet Self Test
The first thing to note is that lst reports results in binary units
(MiB/s) while iperf reports results in decimal units (Gbps). If you do the
conversion you get 2055.31 MiB/s = 2155 MB/s.
The other thing to check is the CPU usage. For TCP the CPU usage can
be high. You should try RoCE+o2iblnd instead.
Cheers, Andreas
On Nov 26, 2019, at 21:26, Pinkesh Valdria <pinkesh.valdria at oracle.com> wrote:
Hello All,
I created a new Lustre cluster on CentOS 7.6 and I am running lnet_selftest_wrapper.sh to measure network throughput. The nodes are connected to each other with 25 Gbps ethernet, so the theoretical max is 25 Gbps * 125 = 3125 MB/s. Using iperf3, I get 22 Gbps (2750 MB/s) between the nodes.
[root at lustre-client-2 ~]# for c in 1 2 4 8 12 16 20 24 ; do echo $c ; ST=lst-output-$(date +%Y-%m-%d-%H:%M:%S) CN=$c SZ=1M TM=30 BRW=write CKSUM=simple LFROM="10.0.3.7 at tcp1" LTO="10.0.3.6 at tcp1" /root/lnet_selftest_wrapper.sh; done ;
When I run lnet_selftest_wrapper.sh (from the Lustre wiki) between 2 nodes, I get a max of 2055.31 MiB/s. Is that expected at the LNet level, or can I further tune the network and OS kernel (the tuning I applied is below) to get better throughput?
Result Snippet from lnet_selftest_wrapper.sh
[LNet Rates of lfrom]
[R] Avg: 4112 RPC/s Min: 4112 RPC/s Max: 4112 RPC/s
[W] Avg: 4112 RPC/s Min: 4112 RPC/s Max: 4112 RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 0.31 MiB/s Min: 0.31 MiB/s Max: 0.31 MiB/s
[W] Avg: 2055.30 MiB/s Min: 2055.30 MiB/s Max: 2055.30 MiB/s
[LNet Rates of lto]
[R] Avg: 4136 RPC/s Min: 4136 RPC/s Max: 4136 RPC/s
[W] Avg: 4136 RPC/s Min: 4136 RPC/s Max: 4136 RPC/s
[LNet Bandwidth of lto]
[R] Avg: 2055.31 MiB/s Min: 2055.31 MiB/s Max: 2055.31 MiB/s
[W] Avg: 0.32 MiB/s Min: 0.32 MiB/s Max: 0.32 MiB/s
Tuning applied:
Ethernet NICs:
ip link set dev ens3 mtu 9000
ethtool -G ens3 rx 2047 tx 2047 rx-jumbo 8191
less /etc/sysctl.conf
net.core.wmem_max=16777216
net.core.rmem_max=16777216
net.core.wmem_default=16777216
net.core.rmem_default=16777216
net.core.optmem_max=16777216
net.core.netdev_max_backlog=27000
kernel.sysrq=1
kernel.shmmax=18446744073692774399
net.core.somaxconn=8192
net.ipv4.tcp_adv_win_scale=2
net.ipv4.tcp_low_latency=1
net.ipv4.tcp_rmem = 212992 87380 16777216
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_wmem = 212992 65536 16777216
vm.min_free_kbytes = 65536
net.ipv4.tcp_congestion_control = cubic
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_no_metrics_save = 0
echo "#
# tuned configuration
#
[main]
summary=Broadly applicable tuning that provides excellent performance across a variety of common server workloads
[disk]
devices=!dm-*, !sda1, !sda2, !sda3
readahead=>4096
[cpu]
force_latency=1
governor=performance
energy_perf_bias=performance
min_perf_pct=100
[vm]
transparent_huge_pages=never
[sysctl]
kernel.sched_min_granularity_ns = 10000000
kernel.sched_wakeup_granularity_ns = 15000000
vm.dirty_ratio = 30
vm.dirty_background_ratio = 10
vm.swappiness=30
" > lustre-performance/tuned.conf
tuned-adm profile lustre-performance
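To confirm the profile is active and its settings applied (standard tuned-adm usage):
tuned-adm active    # should report: Current active profile: lustre-performance
tuned-adm verify    # checks that the running system matches the profile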
Thanks,
Pinkesh Valdria
--
Jongwoo Han
+82-505-227-6108