[lustre-discuss] Lustre tuning - help

Pinkesh Valdria pinkesh.valdria at oracle.com
Fri Aug 9 12:11:09 PDT 2019


Lustre experts, 

I recently installed Lustre for the first time. It's working (so I am happy), but now I am trying to do some performance testing and tuning. My goal is to run a SAS workload and use Lustre as the shared file system for SAS Grid, and later to tune Lustre for generic HPC workloads.

Through Google searches I have read articles on Lustre and tuning recommendations from LUG conference slides and similar sources:

https://cpb-us-e1.wpmucdn.com/blogs.rice.edu/dist/0/2327/files/2014/03/Fragalla.pdf

http://cdn.opensfs.org/wp-content/uploads/2019/07/LUG2019-Sysadmin-tutorial.pdf

http://support.sas.com/rnd/scalability/grid/SGMonAWS.pdf

I have results from IBM Spectrum Scale (GPFS) running on the same hardware/software stack, and with the Lustre tuning I have done so far I am not getting optimal performance. My understanding was that Lustre can deliver better performance compared to GPFS if tuned correctly.

I have tried changing the following:

Use stripe count = 1, 4, 8, 16, 24, -1 (to stripe across all OSTs), and a progressive file layout: lfs setstripe -E 256M -c 1 -E 4G -c 4 -E -1 -c -1 -S 4M /mnt/mdt_bv

Use stripe size: default (1M), 4M, 64K (since SAS applications use this).
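
For reference, a minimal sketch of how I apply and verify these layouts on a test directory (the path /mnt/mdt_bv/sasdata and the file testfile are just examples under my mount point):

# fixed layout: 64K stripe size across 8 OSTs (one of the combinations I tried)
lfs setstripe -c 8 -S 64K /mnt/mdt_bv/sasdata

# or the progressive file layout from above
lfs setstripe -E 256M -c 1 -E 4G -c 4 -E -1 -c -1 -S 4M /mnt/mdt_bv/sasdata

# confirm the directory default layout and what a new file actually got
lfs getstripe -d /mnt/mdt_bv/sasdata
lfs getstripe /mnt/mdt_bv/sasdata/testfile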

SAS Grid uses large-block, sequential IO patterns (block sizes: 64K, 128K, 256K; 64K is their preferred value).

Question 1: How should I tune the stripe count and stripe size for the above? Also, should I use a progressive file layout (PFL)?

I would appreciate feedback on the tuning I have done: whether it is correct and whether I am missing anything.

Details:

It's a cloud-based solution on Oracle Cloud Infrastructure. Lustre was installed using the instructions from WhamCloud.

All running CentOS 7.

MGS: 1 node (shared with the MDS); MDS: 1 node; OSS: 3 nodes. All server nodes are bare-metal machines (no VMs) with 52 physical cores, 768 GB RAM, and 2 NICs (2x 25 Gbps Ethernet, not bonded). One NIC connects to the Block Storage disks; the second NIC talks to the clients, so LNet is configured on the second NIC. Each OSS is connected to 10 Block Volume disks of 800 GB each, i.e. 10 OSTs per OSS and 30 OSTs in total (21 TB of storage). There is 1 MDT (800 GB) attached to the MDS.
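
For completeness, a sketch of how LNet is pointed at the client-facing NIC on each node (via /etc/modprobe.d/lnet.conf; the interface name ens5 is just a placeholder for whatever the 2nd NIC is called on a given node):

# /etc/modprobe.d/lnet.conf
options lnet networks="tcp1(ens5)"

# verify once the lnet module is loaded
lctl list_nids
lnetctl net show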

Clients are VMs with 24 physical cores, 320 GB RAM, and 1 NIC (24.6 Gbps). I am using 3 clients in the above setup.

On all nodes (MDS/OSS/Clients): 

###########################
### OS Performance tuning
###########################

setenforce 0

echo "
*          hard   memlock           unlimited
*          soft   memlock           unlimited
" >> /etc/security/limits.conf

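A quick check that the memlock limits are picked up (they only show up in a fresh login shell):

ulimit -l    # expect: unlimited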

# The below applies for both compute (client) and server (storage) nodes
cd /usr/lib/tuned/
cp -r throughput-performance/ sas-performance

echo "#
# tuned configuration
#
[main]
include=throughput-performance
summary=Broadly applicable tuning that provides excellent performance across a variety of common server workloads

[disk]
devices=!dm-*, !sda1, !sda2, !sda3
readahead=>4096

[cpu]
force_latency=1
governor=performance
energy_perf_bias=performance
min_perf_pct=100

[vm]
transparent_huge_pages=never

[sysctl]
kernel.sched_min_granularity_ns = 10000000
kernel.sched_wakeup_granularity_ns = 15000000
vm.dirty_ratio = 30
vm.dirty_background_ratio = 10
vm.swappiness=30
" > sas-performance/tuned.conf

tuned-adm profile sas-performance

# Display active profile
tuned-adm active
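
And a quick sanity check that the profile's settings actually took effect (only the values I care most about):

tuned-adm verify
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.swappiness
cat /sys/kernel/mm/transparent_hugepage/enabled            # expect [never]
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor  # expect performance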

Networking: 

All NICs are configured with MTU 9000.
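
To confirm jumbo frames work end to end between clients and servers (not just on the local link), I use a do-not-fragment ping with an 8972-byte payload (9000 minus 28 bytes of IP/ICMP headers); ${oss_ip} is a placeholder:

ip link show ens3 | grep mtu
ping -M do -s 8972 -c 3 ${oss_ip}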

Block Volumes/Disks 

For all OSTs/MDT: 

cat /sys/block/$disk/queue/max_hw_sectors_kb   # reports 32767 on these Block Volumes
echo "32767" > /sys/block/$disk/queue/max_sectors_kb ;
echo "192" > /sys/block/$disk/queue/nr_requests ;
echo "deadline" > /sys/block/$disk/queue/scheduler ;
echo "0" > /sys/block/$disk/queue/read_ahead_kb ;
echo "68" > /sys/block/$disk/device/timeout ;
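
These are applied in a loop over the OST/MDT block devices; a minimal sketch, assuming the Block Volumes show up as sdb..sdk (the settings are not persistent across reboots unless repeated from a udev rule or rc script):

for disk in sdb sdc sdd sde sdf sdg sdh sdi sdj sdk ; do
    echo "32767"    > /sys/block/$disk/queue/max_sectors_kb
    echo "192"      > /sys/block/$disk/queue/nr_requests
    echo "deadline" > /sys/block/$disk/queue/scheduler
    echo "0"        > /sys/block/$disk/queue/read_ahead_kb
    echo "68"       > /sys/block/$disk/device/timeout
done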

Only OSTs: 

lctl set_param osd-ldiskfs.*.readcache_max_filesize=2M
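
My understanding (please correct me if this is wrong) is that server-side parameters like this can also be set persistently from the MGS so they survive target remounts:

# run on the MGS node
lctl set_param -P osd-ldiskfs.*.readcache_max_filesize=2M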

Lustre clients:

lctl set_param osc.*.checksums=0
lctl set_param timeout=600
# lctl set_param ldlm_timeout=200 fails with:
#   error: set_param: param_path 'ldlm_timeout': No such file or directory
lctl set_param at_min=250
lctl set_param at_max=600
lctl set_param ldlm.namespaces.*.lru_size=128
lctl set_param osc.*.max_rpcs_in_flight=32
lctl set_param osc.*.max_dirty_mb=256
lctl set_param debug="+neterror"
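
After mounting I re-read the values to make sure they stuck, since plain set_param is not persistent across remounts:

lctl get_param osc.*.checksums osc.*.max_rpcs_in_flight osc.*.max_dirty_mb
lctl get_param timeout at_min at_max
lctl get_param ldlm.namespaces.*.lru_size | head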

# https://cpb-us-e1.wpmucdn.com/blogs.rice.edu/dist/0/2327/files/2014/03/Fragalla.pdf recommends turning off checksums at the network level

ethtool -K ens3 rx off tx off
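
And to confirm what the NIC actually ended up with (ens3 is the client-side interface in my setup):

ethtool -k ens3 | egrep 'rx-checksumming|tx-checksumming'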

Lustre is mounted with the -o flock option:

mount -t lustre -o flock ${mgs_ip}@tcp1:/$fsname $mount_point
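
So the mount comes back after a client reboot, I also put the equivalent line in /etc/fstab (the variables are placeholders that my scripts substitute with real values):

${mgs_ip}@tcp1:/${fsname}   ${mount_point}   lustre   flock,_netdev   0 0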

Once again, I appreciate any guidance or help you can provide, or pointers to docs and articles that would be helpful.

Thanks,

Pinkesh Valdria

Principal Solutions Architect – Big Data & HPC

Oracle Cloud Infrastructure – Seattle 

+1-206-234-4314. 
