[lustre-discuss] Lustre-2.9 / ZFS-0.7.0 tuning recommendations

Dilger, Andreas andreas.dilger at intel.com
Fri Mar 24 00:45:32 PDT 2017


On Mar 23, 2017, at 18:38, Nathan R.M. Crawford <nrcrawfo at uci.edu> wrote:
> 
> Hi All,
> 
>   I've been evaluating some of the newer options that should be available with Lustre 2.9 on top of ZFSonLinux 0.7.0 (currently at rc3). Specifically, trying 16MB RPCs/blocks on the OSTs and large dnodes on the MDTs.

While these features are available in ZFS 0.7.0, the _use_ of the large dnode feature is not enabled in Lustre 2.9.  The patch to leverage large dnodes is included in Lustre 2.10, which we are testing periodically against ZFS 0.7.0, but it isn't clear whether ZFS 0.7.0 will actually be released by the time Lustre 2.10 is.  Setting "dnodesize=auto" may still provide some benefit over 0.6.5, but it won't give you the full gain until the 2.10 code is in place.
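For reference, the ZFS side of this looks roughly like the following (the pool/dataset names are just placeholders for your MDT pools); whether Lustre actually exploits the large dnodes still depends on the 2.10 patch mentioned above:

  # confirm the pool has the large_dnode feature (ZFS 0.7.0+)
  zpool get feature@large_dnode mdt0pool
  # let ZFS size dnodes as needed, e.g. to keep Lustre xattrs in the dnode
  zfs set dnodesize=auto mdt0pool/mdt0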

The other potential issue is that while using a 16MB blocksize is _possible_, it isn't necessarily _recommended_.  The problem is that if you are writing anything smaller than 16MB chunks to the OSTs, you may have to write out the same block multiple times (write amplification), which can hurt performance.  The large blocksize can also cause a lot of memory pressure.  Using "recordsize=1M" has been shown to give good performance.  We haven't done much study of performance vs. recordsize, so any testing in this area would be welcome.  You'd also want to tune "lctl set_param obdfilter.*.brw_size=<recordsize in MB>" on all the OSS nodes so that clients will send RPCs large enough to cover a whole block.
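As a rough sketch of keeping the two settings matched (pool/dataset names are placeholders), the 16MB case you are testing would look something like this; the same pattern applies if you drop back to 1MB records, with brw_size=1:

  # on the OSS: 16MB records on the OST dataset
  # (requires zfs_max_recordsize=16777216, which you already set)
  zfs set recordsize=16M ostpool/ost0
  lctl set_param obdfilter.*.brw_size=16     # value is in MB
  # on a client, once it has reconnected, verify the negotiated RPC size:
  # 4096 pages x 4KB = 16MB
  lctl get_param osc.*.max_pages_per_rpc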

Finally, there are a bunch of metadata performance improvements with ZFS 0.7.0 + Lustre 2.9 above what is available in ZFS 0.6.5 + Lustre 2.8, and more are on the way for Lustre 2.10.

>   I've gathered bits and pieces from discussions in the zfs and lustre development areas, presentations for earlier versions, etc., but have probably missed some low-hanging fruit. Is there a central resource for good starting sets of parameters?
> 
> Brief summary of test system (all on CentOS 7.3x):
> 1 MDS (2x E5-2650 v4, 24 cores, 128GB RAM) with three MDTs.
>   Each MDT on its own 2-SSD mirror zpool (1TB each)

It's questionable whether you will get a significant benefit from 3 MDTs in this case, over just putting 3x VDEVs into a single zpool and having one MDT.  The reason is that Lustre DNE is still not great at load balancing across separate MDTs, so your users/apps will probably get more use out of a single 3TB MDT than three 1TB MDTs.  A 3TB MDT is also nowhere near the maximum MDT size.
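If you do go the single-MDT route, the layout would be roughly as below (device names and NIDs are only placeholders):

  # one zpool with three mirror VDEVs instead of three separate pools
  zpool create -o ashift=9 mdt0pool \
      mirror /dev/sda /dev/sdb \
      mirror /dev/sdc /dev/sdd \
      mirror /dev/sde /dev/sdf
  # format it as the single MDT of the filesystem
  mkfs.lustre --mdt --backfstype=zfs --fsname=testfs --index=0 \
      --mgsnode=mgs@o2ib mdt0pool/mdt0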

> pool options:
>   ashift=9 (for space reasons)
> dataset options:
>   recordsize=128K
>   compression=off
>   atime=off
>   xattr=sa
>   dnodesize=auto
> zfs module options:  
>   zfs_prefetch_disable=1
>   metaslab_debug_unload=1
>   zfs_dirty_data_max=2147483648
>   zfs_vdev_async_write_min_active=5
>   zfs_vdev_async_write_max_active=15
>   zfs_vdev_async_write_active_min_dirty_percent=20
>   zfs_vdev_scheduler=deadline
>   zfs_arc_max=103079215104
>   zfs_arc_meta_limit=103079215104
> 
> 1 OSS (2x E5-2650 v4, 24 cores, 768GB RAM) with three OSTs.
>   Each OST on its own 30-disk zpool (3x 10-disk raidz2 vdevs) using 8TB drives
> pool options:
>   ashift=12
> dataset options:
>   recordsize=16M
>   compression=lz4
>   atime=off
>   xattr=sa
>   dnodesize=auto
> zfs module options:
>   metaslab_aliquot=2097152
>   metaslab_debug_unload=1
>   zfs_dirty_data_max=2147483648
>   zfs_dirty_data_sync=134217728
>   zfs_max_recordsize=16777216
>   zfs_prefetch_disable=1
>   zfs_txg_history=10
>   zfs_vdev_aggregation_limit=16777216
>   zfs_vdev_async_write_min_active=5
>   zfs_vdev_async_write_max_active=15
>   zfs_vdev_async_write_active_min_dirty_percent=20
>   zfs_vdev_scheduler=deadline
>   zfs_arc_max=751619276800
> 
> All nodes use Mellanox EDR IB and the distribution's RDMA stack (mlx5).
> ko2iblnd module options:
>   concurrent_sends=63
>   credits=2560
>   fmr_flush_trigger=1024
>   fmr_pool_size=1280
>   map_on_demand=256
>   peer_buffer_credits=0
>   peer_credits=16
>   peer_credits_hiw=31
>   peer_timeout=100
> 
> Lustre properly picks up the OST block size, and clients connect with 16M. The obdfilter-survey tool gives 2-5GB/s write and rewrite over record sizes of 1M-16M. At 16MB, most of the blocks displayed with "zpool iostat -r 5" are the expected 2MB for 10-disk raidz2 vdevs.
> 
> The mds-survey tool also gives not-terrible results, although I don't have a good reference:
> dir_count=6 thrlo=12 thrhi=12 file_count=300000 stripe_count=1 mds-survey
> mdt 3 file 300000 dir 6 thr 12:
>   create      52875.87 [  47998.37,  57997.85]
>   lookup     579347.67 [ 370250.61, 919106.38]
>   md_getattr 204802.03 [ 180678.81, 218144.06]
>   setxattr   132886.29 [ 129995.32, 133994.51]
>   destroy     20594.50 [  11999.21,  34998.01]
> 
> I fully expect that some of the zfs module tuning above is either no longer needed (metaslab_debug_unload?), or counterproductive. If anyone has suggestions for changes, especially for making insane-numbers-of-tiny-files less deadly, I'd like to try them. This isn't going to go into production as scratch space for a while.
> 
> Thanks,
> Nate
> 
> -- 
> Dr. Nathan Crawford              nathan.crawford at uci.edu
> 
> Modeling Facility Director
> Department of Chemistry
> 1102 Natural Sciences II         Office: 2101 Natural Sciences II
> University of California, Irvine  Phone: 949-824-4508
> Irvine, CA 92697-2025, USA
> 

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation