[lustre-discuss] Lustre-2.9 / ZFS-0.7.0 tuning recommendations

Nathan R.M. Crawford nrcrawfo at uci.edu
Fri Mar 24 12:30:32 PDT 2017


Hi Keith,

  I'll go back to a 1M recordsize, mostly to avoid possible complications
for uncertain benefit.

  When testing our 60-disk Lustre 2.8 / ZFS 0.6.5 system a year ago, I tried
several configurations of vdevs and OSTs. I got annoyingly similar
performance with one big OST, many little OSTs on multiple pools, and most
arrangements in between. I went with 2 OSTs on their own 30-disk [3x(8+2)
raidz2] pools as a middle ground that also aligned with multipath load
balancing. I have not yet explored the configuration space on our 90-disk
system. Is the obdfilter-survey script going to be sensitive enough to show
the differences?
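
  For reference, the kind of comparison run I have in mind is roughly the
following (a sketch only; file system and OST names are placeholders, and
the exact knobs may need adjusting for your lustre-iokit version):

  # run on the OSS; sweep object and thread counts across the OSTs
  # rszlo/rszhi (KB): record size sweep bounds
  size=8192 rszlo=1024 rszhi=16384 nobjlo=1 nobjhi=16 thrlo=1 thrhi=64 \
    case=disk targets="test-OST0000 test-OST0001" obdfilter-survey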

  And yes, the EDR IB card is on a PCIe 3.0 x16 slot, with two SAS-12 HBAs
on x8 slots. Three half-filled JBODs (Dell MD3060e) with 2 SAS-6 (x4) ports
each. Multipath is configured, but there are lots of policy options there
too. Fun times!
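
  For completeness, the multipath policy I am leaning toward is along these
lines (a rough sketch, not settled):

  # /etc/multipath.conf
  defaults {
      # group all paths to a device together and spread I/O across them
      path_grouping_policy  multibus
      # choose the path with the lowest estimated service time
      path_selector         "service-time 0"
  }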

Thanks,
Nate



On Fri, Mar 24, 2017 at 10:26 AM, Mannthey, Keith <keith.mannthey at intel.com>
wrote:

> > 1 OSS (2x E5-2650 v4, 24 cores, 768GB RAM) with three OSTs.
> >   Each OST on its own 30-disk pool (3x 10-disk raidz2 vdevs) using 8TB drives
> > pool options:
> >   ashift=12
> > dataset options:
> >   recordsize=16M
> >   compression=lz4
> >   atime=off
> >   xattr=sa
> >   dnodesize=auto
>
> As Andreas said, not a lot of work has been done with 16MB RC, but there
> has been with 1M, and I expect some of it will carry over.
>
> The 1M RC design guidance is not to use 8+2 but to use more drives per
> vdev.  If you have 90 drives total, 8 vdevs of 9+2 would get you more write
> performance and leave 2 spares.
>
> Spend some time working with vdev/OST layouts to see what works best for
> your system with this new code.  If you have 90 drives I would suggest 4
> OSTs, each with 2 x (9+2) vdevs, as a starting configuration; a sketch
> follows.
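>
> Something along these lines for each of the 4 pools (device names, pool
> and file system names, and the MGS NID are placeholders):
>
>   # one OST pool: 2 x (9+2) raidz2 vdevs = 22 drives
>   zpool create -o ashift=12 ostpool0 \
>     raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk \
>     raidz2 sdl sdm sdn sdo sdp sdq sdr sds sdt sdu sdv
>   mkfs.lustre --ost --backfstype=zfs --fsname=testfs --index=0 \
>     --mgsnode=192.168.1.1@o2ib ostpool0/ost0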
>
> Also, for best performance you will need more than one SAS link to the
> drives, and you will need to deal with zoning or multipath.  On the face of
> it, your fabric is a PCIe x16 device; it will take at least two PCIe x8
> HBAs with both ports running to match the I/O potential.  Rough numbers
> below.
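>
> Approximate usable bandwidth, after encoding overhead:
>   EDR InfiniBand link (4x 25 Gb/s)   ~ 12   GB/s
>   PCIe 3.0 x16 slot                  ~ 15.8 GB/s
>   PCIe 3.0 x8 HBA                    ~  7.9 GB/s each
>   SAS-3 (12 Gb/s) x4 wide port       ~  4.8 GB/s
>   SAS-2 (6 Gb/s) x4 wide port        ~  2.4 GB/s
> A single x8 HBA tops out well below the EDR link, hence at least two x8
> HBAs with both ports in use.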
>
> Thanks,
>  Keith
>
> -----Original Message-----
> From: lustre-discuss [mailto:lustre-discuss-bounces at lists.lustre.org] On
> Behalf Of Dilger, Andreas
> Sent: Friday, March 24, 2017 12:46 AM
> To: Nathan R.M. Crawford <nrcrawfo at uci.edu>
> Cc: lustre-discuss at lists.lustre.org
> Subject: Re: [lustre-discuss] Lustre-2.9 / ZFS-0.7.0 tuning recommendations
>
> On Mar 23, 2017, at 18:38, Nathan R.M. Crawford <nrcrawfo at uci.edu> wrote:
> >
> > Hi All,
> >
> >   I've been evaluating some of the newer options that should be
> > available with Lustre 2.9 on top of ZFSonLinux 0.7.0 (currently at rc3).
> > Specifically, trying 16MB RPCs/blocks on the OSTs and large dnodes on the
> > MDTs.
>
> While these features are available in ZFS 0.7.0, the _use_ of the large
> dnode feature is not enabled in Lustre 2.9.  The patch to leverage large
> dnodes is included in Lustre 2.10, which we are testing periodically with
> ZFS 0.7.0, but it isn't clear whether ZFS 0.7.0 will actually be released
> by the time Lustre 2.10 is finally released.  It may be that
> "dnodesize=auto" will still provide some benefits over 0.6.5, but it won't
> give the full benefit.
>
> The other potential issue is that while using a 16MB blocksize is
> _possible_, it isn't necessarily _recommended_.  The problem is that if
> you are writing anything smaller than 16MB chunks to the OSTs then you
> will potentially have to write out the same block multiple times (write
> amplification), which can hurt performance.  Also, the large blocksize can
> cause a lot of memory pressure.  Using "recordsize=1M" has been shown to
> give good performance.  We haven't done much study of performance vs.
> recordsize, so any testing in this area would be welcome.  You'd also want
> to tune "lctl set_param obdfilter.*.brw_size=<recordsize in MB>" on all
> the OSS nodes so that clients will send RPCs large enough to cover a whole
> block.
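>
> As a concrete sketch of keeping the two in sync (pool/dataset names are
> placeholders):
>
>   # ZFS side: 1M blocks on the OST dataset
>   zfs set recordsize=1M ostpool0/ost0
>   # Lustre side: matching RPC size in MB, on every OSS
>   lctl set_param obdfilter.*.brw_size=1
>
> For the 16MB case that would be recordsize=16M and brw_size=16, with
> zfs_max_recordsize raised to 16777216 as you already have.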
>
> Finally, there are a bunch of metadata performance improvements with ZFS
> 0.7.0 + Lustre 2.9 above what is available in ZFS 0.6.5 + Lustre 2.8, and
> more are on the way for Lustre 2.10.
>
> >   I've gathered bits and pieces from discussions in the zfs and lustre
> > development areas, presentations for earlier versions, etc., but have
> > probably missed some low-hanging fruit. Is there a central resource for
> > good starting sets of parameters?
> >
> > Brief summary of test system (all on CentOS 7.3x):
> > 1 MDS (2x E5-2650 v4, 24 cores, 128GB RAM) with three MDTs.
> >   Each MDT on its own 2-SSD mirror zpool (1TB each)
>
> It's questionable whether you will get a significant benefit from 3 MDTs
> in this case over just putting 3x VDEVs in a single zpool and having one
> MDT.  The reason is that Lustre DNE is still not great at load balancing
> across separate MDTs, so your users/apps will probably get more use out of
> a single 3TB MDT than three 1TB MDTs.  A 3TB MDT is also nowhere near the
> maximum MDT size.
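>
> That is, rather than three pools, something along the lines of (device
> names and naming are placeholders, assuming a separate MGS):
>
>   zpool create mdtpool mirror ssd0 ssd1 mirror ssd2 ssd3 mirror ssd4 ssd5
>   mkfs.lustre --mdt --backfstype=zfs --fsname=testfs --index=0 \
>     --mgsnode=192.168.1.1@o2ib mdtpool/mdt0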
>
> Cheers, Andreas
>
> > pool options:
> >   ashift=9 (for space reasons)
> > dataset options:
> >   recordsize=128K
> >   compression=off
> >   atime=off
> >   xattr=sa
> >   dnodesize=auto
> > zfs module options:
> >   zfs_prefetch_disable=1
> >   metaslab_debug_unload=1
> >   zfs_dirty_data_max=2147483648
> >   zfs_vdev_async_write_min_active=5
> >   zfs_vdev_async_write_max_active=15
> >   zfs_vdev_async_write_active_min_dirty_percent=20
> >   zfs_vdev_scheduler=deadline
> >   zfs_arc_max=103079215104
> >   zfs_arc_meta_limit=103079215104
> >
> > 1 OSS (2x E5-2650 v4, 24 cores, 768GB RAM) with three OSTs.
> >   Each OST on its own 30-disk pool (3x 10-disk raidz2 vdevs) using 8TB drives
> > pool options:
> >   ashift=12
> > dataset options:
> >   recordsize=16M
> >   compression=lz4
> >   atime=off
> >   xattr=sa
> >   dnodesize=auto
> > zfs module options:
> >   metaslab_aliquot=2097152
> >   metaslab_debug_unload=1
> >   zfs_dirty_data_max=2147483648
> >   zfs_dirty_data_sync=134217728
> >   zfs_max_recordsize=16777216
> >   zfs_prefetch_disable=1
> >   zfs_txg_history=10
> >   zfs_vdev_aggregation_limit=16777216
> >   zfs_vdev_async_write_min_active=5
> >   zfs_vdev_async_write_max_active=15
> >   zfs_vdev_async_write_active_min_dirty_percent=20
> >   zfs_vdev_scheduler=deadline
> >   zfs_arc_max=751619276800
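> > (For reference, these are applied at module load time via lines in
> > /etc/modprobe.d/zfs.conf of the form
> >   options zfs zfs_max_recordsize=16777216 zfs_vdev_aggregation_limit=16777216
> > with the remaining parameters appended in the same way.)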
> >
> > All nodes use Mellanox EDR IB and the distribution's RDMA stack (mlx5).
> > ko2iblnd module options:
> >   concurrent_sends=63
> >   credits=2560
> >   fmr_flush_trigger=1024
> >   fmr_pool_size=1280
> >   map_on_demand=256
> >   peer_buffer_credits=0
> >   peer_credits=16
> >   peer_credits_hiw=31
> >   peer_timeout=100
> >
> >   Lustre properly picks up the OST block size, and clients connect with
> > 16M RPCs. The obdfilter-survey tool gives 2-5 GB/s write and rewrite over
> > record sizes of 1M-16M. At 16MB, most of the blocks displayed with "zpool
> > iostat -r 5" are the expected 2MB per disk (16MB spread over 8 data disks)
> > for 10-disk raidz2 vdevs.
> >
> >   The mds-survey tool also gives not-terrible results, although I don't
> > have a good reference:
> >
> > dir_count=6 thrlo=12 thrhi=12 file_count=300000 stripe_count=1 mds-survey
> > mdt 3 file 300000 dir 6 thr 12
> >   create      52875.87 [  47998.37,  57997.85]
> >   lookup     579347.67 [ 370250.61, 919106.38]
> >   md_getattr 204802.03 [ 180678.81, 218144.06]
> >   setxattr   132886.29 [ 129995.32, 133994.51]
> >   destroy     20594.50 [  11999.21,  34998.01]
> >
> >   I fully expect that some of the zfs module tuning above is either no
> > longer needed (metaslab_debug_unload?) or counterproductive. If anyone
> > has suggestions for changes, especially for making
> > insane-numbers-of-tiny-files less deadly, I'd like to try them. This
> > isn't going to go into production as scratch space for a while.
> >
> > Thanks,
> > Nate
> >
> > --
> > Dr. Nathan Crawford              nathan.crawford at uci.edu
> >
> > Modeling Facility Director
> > Department of Chemistry
> > 1102 Natural Sciences II         Office: 2101 Natural Sciences II
> > University of California, Irvine  Phone: 949-824-4508
> > Irvine, CA 92697-2025, USA
> >
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
>



-- 

Dr. Nathan Crawford              nathan.crawford at uci.edu
Modeling Facility Director
Department of Chemistry
1102 Natural Sciences II         Office: 2101 Natural Sciences II
University of California, Irvine  Phone: 949-824-4508
Irvine, CA 92697-2025, USA

