[lustre-discuss] Lustre-2.9 / ZFS-0.7.0 tuning recommendations

Nathan R.M. Crawford nrcrawfo at uci.edu
Fri Mar 24 11:56:33 PDT 2017


Hi Andreas,

  Thanks for the clarification!

On Fri, Mar 24, 2017 at 12:45 AM, Dilger, Andreas <andreas.dilger at intel.com>
wrote:

> On Mar 23, 2017, at 18:38, Nathan R.M. Crawford <nrcrawfo at uci.edu> wrote:
> >
> > Hi All,
> >
> >  I've been evaluating some of the newer options that should be available
> > with Lustre 2.9 on top of ZFSonLinux 0.7.0 (currently at rc3).
> > Specifically, trying 16MB RPCs/blocks on the OSTs and large dnodes on the
> > MDTs.
>
> While these features are available in ZFS 0.7.0, the _use_ of the large
> dnode feature is not enabled in Lustre 2.9.  The patch to leverage large
> dnodes is included in Lustre 2.10, which we are testing periodically with
> ZFS 0.7.0, but it isn't clear whether ZFS 0.7.0 will actually be released
> when Lustre 2.10 is finally released.  It may be that "dnodesize=auto" will
> still provide some benefits over 0.6.5, but it isn't going to be the best.
>
Ah, I misunderstood LU-8068 <https://jira.hpdd.intel.com/browse/LU-8068> to
mean that large dnodes were being used in 2.9. Is it currently more correct
to say "large dnodes don't break anything in 2.9"?

> The other potential issue is that while using 16MB blocksize is _possible_,
> it isn't necessarily _recommended_ to use.  The problem is that if you are
> writing anything smaller than 16MB chunks to the OSTs then you will
> potentially have to write out the same block multiple times (write
> amplification) which can hurt performance.  Also, the large blocksize can
> cause a lot of memory pressure.  Using "recordsize=1024" has been shown to
> give good performance.  We haven't done much study of performance vs.
> recordsize, so any testing in this area would be welcome.  You'd also want
> to tune "lctl set_param obdfilter.*.brw_size=<recordsize in MB>" on all
> the OSS nodes so that clients will send RPCs large enough to cover a whole
> block.
>
I had not seen any magic performance improvements with 16MB (brw_size was
set on OSS and clients), but it did not cause obviously crazy behavior
either. I will drop back to 1M for an extra cushion of safety.
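
For reference, the settings I'll fall back to look roughly like this (a
sketch; "ostpool0/ost0" is a placeholder for the actual OST dataset names):

  # on each OSS: drop the OST dataset recordsize back to 1M
  zfs set recordsize=1M ostpool0/ost0

  # on each OSS: keep the Lustre bulk RPC size in step with the recordsize
  # (the value is in MB, per your note above)
  lctl set_param obdfilter.*.brw_size=1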


> Finally, there are a bunch of metadata performance improvements with ZFS
> 0.7.0 + Lustre 2.9 above what is available in ZFS 0.6.5 + Lustre 2.8, and
> more are on the way for Lustre 2.10.
>
> >  I've gathered bits and pieces from discussions in the zfs and lustre
> > development areas, presentations for earlier versions, etc., but have
> > probably missed some low-hanging fruit. Is there a central resource for
> > good starting sets of parameters?
> >
> > Brief summary of test system (all on CentOS 7.3x):
> > 1 MDS (2x E5-2650 v4, 24 cores, 128GB RAM) with three MDTs.
> >  Each MDT on its own 2-SSD mirror zpool (1TB each)
>
> It's questionable whether you will get a significant benefit from 3 MDTs
> in this case, over just putting 3x VDEVs in a single zpool and having one
> MDT.  The reason is that Lustre DNE is still not great at load balancing
> across separate MDTs, so your users/apps will probably get more use out of
> a single 3TB MDT than three 1TB MDTs.  That is not really close to the
> limit for maximum MDT size.
>
Our in-production Lustre 2.8 / ZFS 0.6.5.5 system (similar, but slightly
older/smaller/lesser HW) had slightly better aggregate performance with
3 MDTs on 3 pools vs. 1 MDT on a single pool (3 mirrored vdevs), and I just
carried over the configuration. A BIG caveat is that we only striped the
top few levels of the directory tree (e.g. for /lustre/SCRATCH/$group/$user,
each user directory could be on any MDT, but it would not itself be
striped). Doing work within a striped directory was less than ideal, and
moving files from one striped directory to another often caused server
lock-ups.
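
For context, the top-level placement was done with lfs along these lines (a
sketch; paths and MDT indices are made up for illustration, not our exact
layout):

  # pin a group directory to a specific MDT (round-robin by hand)
  lfs mkdir -i 0 /lustre/SCRATCH/groupA
  lfs mkdir -i 1 /lustre/SCRATCH/groupB

  # the striped-directory variant that gave us trouble spreads a single
  # directory's entries over all of the MDTs
  lfs setdirstripe -c 3 /lustre/SCRATCH/striped_example

  # check where a directory landed
  lfs getdirstripe /lustre/SCRATCH/groupA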

I will double-check that the single-metadata-pool performance is not worse.
Less complication means less potential breakage.
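
The single-pool variant I'll test is just the three mirrors collapsed into
one zpool backing one MDT, along the lines of (a sketch; device names are
placeholders, ashift=9 as in our current pools):

  zpool create -o ashift=9 mdt0pool \
      mirror sda sdb \
      mirror sdc sdd \
      mirror sde sdf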

Thanks,
Nate


> Cheers, Andreas
>
> > pool options:
> >  ashift=9 (for space reasons)
> > dataset options:
> >  recordsize=128K
> >  compression=off
> >  atime=off
> >  xattr=sa
> >  dnodesize=auto
> > zfs module options:
> >  zfs_prefetch_disable=1
> >  metaslab_debug_unload=1
> >  zfs_dirty_data_max=2147483648
> >  zfs_vdev_async_write_min_active=5
> >  zfs_vdev_async_write_max_active=15
> >  zfs_vdev_async_write_active_min_dirty_percent=20
> >  zfs_vdev_scheduler=deadline
> >  zfs_arc_max=103079215104
> >  zfs_arc_meta_limit=103079215104
> >
> > 1 OSS (2x E5-2650 v4, 24 cores, 768GB RAM) with three OSTs.
> >  Each OST on its own 30-disk pool (3x 10-disk raidz2 vdevs) using 8TB
> > drives
> > pool options:
> >  ashift=12
> > dataset options:
> >  recordsize=16M
> >  compression=lz4
> >  atime=off
> >  xattr=sa
> >  dnodesize=auto
> > zfs module options:
> >  metaslab_aliquot=2097152
> >  metaslab_debug_unload=1
> >  zfs_dirty_data_max=2147483648
> >  zfs_dirty_data_sync=134217728
> >  zfs_max_recordsize=16777216
> >  zfs_prefetch_disable=1
> >  zfs_txg_history=10
> >  zfs_vdev_aggregation_limit=16777216
> >  zfs_vdev_async_write_min_active=5
> >  zfs_vdev_async_write_max_active=15
> >  zfs_vdev_async_write_active_min_dirty_percent=20
> >  zfs_vdev_scheduler=deadline
> >  zfs_arc_max=751619276800
> >
> > All nodes use Mellanox EDR IB and the distribution's RDMA stack (mlx5).
> > ko2iblnd module options:
> >  concurrent_sends=63
> >  credits=2560
> >  fmr_flush_trigger=1024
> >  fmr_pool_size=1280
> >  map_on_demand=256
> >  peer_buffer_credits=0
> >  peer_credits=16
> >  peer_credits_hiw=31
> >  peer_timeout=100
> >
> > Lustre properly picks up the OST block size, and clients connect with
> > 16M. The obdfilter-survey tool gives 2-5GB/s write and rewrite over record
> > sizes of 1M-16M. At 16MB, most of the blocks displayed with "zpool iostat
> > -r 5" are the expected 2MB for 10-disk raidz2 vdevs.
> >
> > The mds-survey tool also gives not-terrible results, although I don't
> > have a good reference:
> > dir_count=6 thrlo=12 thrhi=12 file_count=300000 stripe_count=1 mds-survey
> > mdt 3 file 300000 dir 6 thr 12
> >   create      52875.87 [  47998.37,  57997.85]
> >   lookup     579347.67 [ 370250.61, 919106.38]
> >   md_getattr 204802.03 [ 180678.81, 218144.06]
> >   setxattr   132886.29 [ 129995.32, 133994.51]
> >   destroy     20594.50 [  11999.21,  34998.01]
> >
> > I fully expect that some of the zfs module tuning above is either no
> > longer needed (metaslab_debug_unload?), or counterproductive. If anyone has
> > suggestions for changes, especially for making insane-numbers-of-tiny-files
> > less deadly, I'd like to try them. This isn't going to go into production
> > as scratch space for a while.
> >
> > Thanks,
> > Nate
> >
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
>


-- 

Dr. Nathan Crawford              nathan.crawford at uci.edu
Modeling Facility Director
Department of Chemistry
1102 Natural Sciences II         Office: 2101 Natural Sciences II
University of California, Irvine  Phone: 949-824-4508
Irvine, CA 92697-2025, USA