[lustre-discuss] Lustre tuning - help

Pinkesh Valdria pinkesh.valdria at oracle.com
Fri Aug 9 12:11:09 PDT 2019

Lustre experts, 


I recently installed Lustre for the first time. It's working (so I am happy), but now I am trying to do some performance testing and tuning. My goal is to run a SAS workload and use Lustre as the shared file system for SAS Grid, and later to tune Lustre for generic HPC workloads.



Through Google searches, I have read articles on Lustre and tuning recommendations from LUG conference slides, etc.





I have results for IBM Spectrum Scale (GPFS) running on the same hardware/software stack, and with the Lustre tuning I have done so far I am not getting optimal performance. My understanding was that Lustre can deliver better performance compared to GPFS if tuned correctly.

I have tried changing the following:

Stripe count: 1, 4, 8, 16, 24, -1 (to stripe across all OSTs), and a progressive file layout: lfs setstripe -E 256M -c 1 -E 4G -c 4 -E -1 -c -1 -S 4M /mnt/mdt_bv

Stripe size: default (1M), 4M, 64K (since SAS apps use this).


SAS Grid uses large-block, sequential I/O patterns (block sizes 64K, 128K, 256K; 64K is their preferred value).


Question 1: How should I tune the stripe count and stripe size for the above? Also, should I use a progressive file layout?
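For example, one directory-default layout I am considering for the SAS work files (the directory path is just a placeholder):

```shell
# Candidate directory-default layout for SAS work files (path is a placeholder).
# 64K matches SAS's preferred block size and is the smallest stripe size
# Lustre accepts; -c -1 stripes across all 30 OSTs.
lfs setstripe -S 64K -c -1 /mnt/mdt_bv/saswork

# Verify the layout that new files created in the directory will inherit:
lfs getstripe -d /mnt/mdt_bv/saswork
```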



I would appreciate feedback on the tuning I have done: whether it is correct, and whether I am missing anything.




It's a cloud-based solution (Oracle Cloud Infrastructure). Lustre was installed using the instructions on the WhamCloud site.

All running CentOS 7.

MGS: 1 node (shared with the MDS), MDS: 1 node, OSS: 3 nodes. All server nodes are bare-metal machines (no VMs) with 52 physical cores, 768 GB RAM and 2 NICs (2x25 Gbps Ethernet, no bonding). The 1st NIC connects to the Block Storage disks; the 2nd NIC talks to the clients, so LNet is configured on the 2nd NIC. Each OSS is connected to 10 Block Volume disks of 800 GB each, i.e. 10 OSTs per OSS and 30 OSTs in total (21 TB of storage). One 800 GB MDT is attached to the MDS.


Clients are VMs with 24 physical cores, 320 GB RAM and 1 NIC (24.6 Gbps). I am using 3 clients in this setup.



On all nodes (MDS/OSS/Clients): 



### OS Performance tuning



setenforce 0

echo "

*          hard   memlock           unlimited

*          soft    memlock           unlimited

" >> /etc/security/limits.conf


# The below applies for both compute and server nodes (storage)

cd /usr/lib/tuned/

cp -r throughput-performance/ sas-performance


echo "#

# tuned configuration




summary=Broadly applicable tuning that provides excellent performance across a variety of common server workloads


devices=!dm-*, !sda1, !sda2, !sda3











kernel.sched_min_granularity_ns = 10000000

kernel.sched_wakeup_granularity_ns = 15000000

vm.dirty_ratio = 30

vm.dirty_background_ratio = 10


" > sas-performance/tuned.conf


tuned-adm profile sas-performance


# Display active profile

tuned-adm active
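To double-check that the profile's sysctl values took effect (the expected values are the ones set in the tuned.conf above):

```shell
# Should report 10000000 / 15000000 / 30 / 10 if the profile is active:
sysctl kernel.sched_min_granularity_ns kernel.sched_wakeup_granularity_ns
sysctl vm.dirty_ratio vm.dirty_background_ratio
```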




All NICs are configured with MTU 9000.
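One way I verify jumbo frames end-to-end between two nodes ($peer_ip is a placeholder for the other node's address):

```shell
# A 9000-byte MTU leaves 8972 bytes of ICMP payload after the 20-byte IP
# and 8-byte ICMP headers; -M do forbids fragmentation, so this fails
# loudly if any hop is not passing 9000-byte frames.
ping -c 3 -M do -s 8972 $peer_ip
```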


Block Volumes/Disks 

For all OSTs/MDT: 

cat /sys/block/$disk/queue/max_hw_sectors_kb    # upper bound: max_sectors_kb must not exceed this


echo "32767" > /sys/block/$disk/queue/max_sectors_kb ;

echo "192" > /sys/block/$disk/queue/nr_requests ;

echo "deadline" > /sys/block/$disk/queue/scheduler ;

echo "0" > /sys/block/$disk/queue/read_ahead_kb ;

echo "68" > /sys/block/$disk/device/timeout ;


Only OSTs: 

lctl set_param osd-ldiskfs.*.readcache_max_filesize=2M



Lustre clients:

lctl set_param osc.*.checksums=0

lctl set_param timeout=600

#lctl set_param ldlm_timeout=200  - This fails on the clients with the error below

#error: set_param: param_path 'ldlm_timeout': No such file or directory

# (ldlm_timeout appears to exist only on the servers, so presumably it has to be set on the MDS/OSS nodes instead)

lctl set_param ldlm_timeout=200

lctl set_param at_min=250

lctl set_param at_max=600

lctl set_param ldlm.namespaces.*.lru_size=128

lctl set_param osc.*.max_rpcs_in_flight=32

lctl set_param osc.*.max_dirty_mb=256

lctl set_param debug="+neterror"
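One thing I am aware of: plain set_param values do not survive a remount or reboot. If I am reading the lctl documentation correctly, Lustre 2.5+ can make them persistent from the MGS with -P, e.g.:

```shell
# Run on the MGS node; -P stores the value persistently so current and
# future clients pick it up (per my reading of the lctl man page).
lctl set_param -P osc.*.max_rpcs_in_flight=32
lctl set_param -P osc.*.max_dirty_mb=256
lctl set_param -P osc.*.checksums=0
```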



# https://cpb-us-e1.wpmucdn.com/blogs.rice.edu/dist/0/2327/files/2014/03/Fragalla.pdf - says turn off checksum at network level

ethtool -K ens3 rx off tx off


Lustre is mounted with the -o flock option:

mount -t lustre -o flock ${mgs_ip}@tcp1:/$fsname $mount_point
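To make the client mount persistent across reboots, the equivalent /etc/fstab entry would be something like the following template (same placeholders as the mount command):

```shell
# /etc/fstab template -- substitute the real MGS IP, fsname and mount point.
# _netdev delays the mount until the network is up.
${mgs_ip}@tcp1:/$fsname  $mount_point  lustre  defaults,flock,_netdev  0 0
```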



Once again, I appreciate any guidance or help you can provide; pointers to docs or articles that would be helpful are also welcome.




Pinkesh Valdria

Principal Solutions Architect – Big Data & HPC

Oracle Cloud Infrastructure – Seattle 



