[lustre-discuss] lustre 2.10.5 or 2.11.0

Riccardo Veraldi Riccardo.Veraldi at cnaf.infn.it
Tue Oct 30 16:35:19 PDT 2018


Sorry for replying late; I answered inline.

On 10/21/18 6:00 AM, Andreas Dilger wrote:
> It would be useful to post information like this on wiki.lustre.org so they can be found more easily by others.  There are already some ZFS tunings there (I don't have the URL handy, just on a plane), so it might be useful to include some information about the hardware and workload to give context to what this is tuned for.
>
> Even more interesting would be to see if there is a general set of tunings that people agree should be made the default?  It is even better when new users don't have to seek out the various tuning parameters, and instead get good performance out of the box.
>
> A few comments inline...
>
> On Oct 19, 2018, at 17:52, Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it> wrote:
>> On 10/19/18 12:37 PM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
>>>> On Oct 17, 2018, at 7:30 PM, Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it> wrote:
>>>>
>>>> Anyway, especially on the OSSes, you may need to tune some ZFS module parameters, in particular raising the vdev read/write max_active values above their defaults. You may also disable the ZIL, set redundant_metadata to "most", and turn atime off.
>>>>
>>>> I could send you a list of parameters that in my case work well.
>>> Riccardo,
>>>
>>> Would you mind sharing your ZFS parameters with the mailing list?  I would be interested to see which options you have changed.
>>>
>> This worked for me on my high-performance cluster:
>>
>> options zfs zfs_prefetch_disable=1
> This matches what I've seen in the past - at high bandwidth under concurrent client load the prefetched data on the server is lost, and just causes needless disk IO that is discarded.
>
>> options zfs zfs_txg_history=120
>> options zfs metaslab_debug_unload=1
>> #
>> options zfs zfs_vdev_scheduler=deadline
>> options zfs zfs_vdev_async_write_active_min_dirty_percent=20
>> #
>> options zfs zfs_vdev_scrub_min_active=48
>> options zfs zfs_vdev_scrub_max_active=128
>> #
>> options zfs zfs_vdev_sync_write_min_active=8
>> options zfs zfs_vdev_sync_write_max_active=32
>> options zfs zfs_vdev_sync_read_min_active=8
>> options zfs zfs_vdev_sync_read_max_active=32
>> options zfs zfs_vdev_async_read_min_active=8
>> options zfs zfs_vdev_async_read_max_active=32
>> options zfs zfs_top_maxinflight=320
>> options zfs zfs_txg_timeout=30
> This is interesting.  Is this actually setting the maximum TXG age up to 30s?

Yes, I think the default is 5 seconds.
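
A quick way to double-check the value the module actually picked up is to read it back from sysfs (the path should be the same on any ZFS-on-Linux install), and I believe zfs_txg_timeout can also be changed at runtime the same way:

cat /sys/module/zfs/parameters/zfs_txg_timeout
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout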


>
>> options zfs zfs_dirty_data_max_percent=40
>> options zfs zfs_vdev_async_write_min_active=8
>> options zfs zfs_vdev_async_write_max_active=32
>>
>> ##############
>>
>> these are the ZFS attributes that I changed on the OSSes:
>>
>> zfs set mountpoint=none $ostpool
>> zfs set sync=disabled $ostpool
>> zfs set atime=off $ostpool
>> zfs set redundant_metadata=most $ostpool
>> zfs set xattr=sa $ostpool
>> zfs set recordsize=1M $ostpool
> The recordsize=1M is already the default for Lustre OSTs.
>
> Did you disable multimount, or just not include it here?  That is fairly
> important for any multi-homed ZFS storage, to prevent multiple imports.
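
For what it's worth, on ZFS 0.7+ the pool-level protection against double imports can be turned on with the multihost property; a minimal sketch, assuming the pool is still called $ostpool and each server has a unique /etc/hostid:

zpool set multihost=on $ostpool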
>
>> #################
>>
>>
>> these are the ko2iblnd parameters for the FDR Mellanox IB interfaces:
>>
>> options ko2iblnd timeout=100 peer_credits=63 credits=2560 concurrent_sends=63 ntx=2048 fmr_pool_size=1280 fmr_flush_trigger=1024 ntx=5120
> You have ntx= in there twice...

Yes, that is a mistake; I typed it twice.
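
To see which of the two values the module actually took, the loaded parameter can be read back from sysfs (assuming ko2iblnd exposes it there, as most LNet module parameters are):

cat /sys/module/ko2iblnd/parameters/ntx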



>
> If this provides a significant improvement for FDR, it might make sense to add in machinery to lustre/conf/{ko2iblnd-probe,ko2iblnd.conf} to have a new alias "ko2iblnd-fdr" set these values on Mellanox FDR IB cards by default?

I found it works better with FDR.

Anyway, most of the tunings I did were picked up here and there from reading what other people did, mostly from here:

  * https://lustre.ornl.gov/lustre101-courses/content/C1/L5/LustreTuning.pdf
  * https://www.eofs.eu/_media/events/lad15/15_chris_horn_lad_2015_lnet.pdf
  * https://lustre.ornl.gov/ecosystem-2015/documents/LustreEco2015-Tutorial2.pdf

And by the way, the most effective tweaks came after reading Rick Mohr's advice in LustreTuning.pdf. Thanks, Rick!

>
>> ############
>>
>> these are the ksocklnd parameters:
>>
>> options ksocklnd sock_timeout=100 credits=2560 peer_credits=63
>>
>> ##############
>>
>> these are other parameters that I tweaked:
>>
>> echo 32 > /sys/module/ptlrpc/parameters/max_ptlrpcds
>> echo 3 > /sys/module/ptlrpc/parameters/ptlrpcd_bind_policy
> This parameter is marked as obsolete in the code.

Yes, I should fix my configuration and use the new parameters.
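
If I remember correctly, the replacement is the CPT-based binding parameters of the ptlrpc module, something along these lines (parameter names from memory, better to double-check with "modinfo ptlrpc" on your Lustre version):

options ptlrpc ptlrpcd_cpts=[0-3] ptlrpcd_per_cpt_max=8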


>> lctl set_param timeout=600
>> lctl set_param ldlm_timeout=200
>> lctl set_param at_min=250
>> lctl set_param at_max=600
>>
>> ###########
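
One note on the lctl set_param values above: as far as I know they do not survive a reboot, so they either need to go into a boot script or be made persistent from the MGS node with something like:

lctl set_param -P timeout=600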
>>
>> Also, I run this script at boot time to spread the hard drives' IRQ assignments across all CPUs; it is not needed with kernels > 4.4.
>>
>> #!/bin/sh
>> # numa_smp.sh - pin a device's IRQs to CPUs $2..$3, round robin
>> device=$1
>> cpu1=$2
>> cpu2=$3
>> cpu=$cpu1
>> # walk through every IRQ number belonging to the device
>> grep "$device" /proc/interrupts | awk '{print $1}' | sed 's/://' | while read int
>> do
>>   echo $cpu > /proc/irq/$int/smp_affinity_list
>>   echo "echo CPU $cpu > /proc/irq/$int/smp_affinity_list"
>>   # wrap around to the first CPU after the last one
>>   if [ $cpu = $cpu2 ]
>>   then
>>      cpu=$cpu1
>>   else
>>      cpu=$((cpu+1))
>>   fi
>> done
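
It is called once per device driver, with the name as it appears in /proc/interrupts plus the first and last CPU to use, for example (driver name and CPU range here are just placeholders):

sh numa_smp.sh mpt3sas 0 11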
> Cheers, Andreas
> ---
> Andreas Dilger
> Principal Lustre Architect
> Whamcloud
