<div dir="ltr">Hi Keith,<div><br></div><div>  I'll go back to 1M RC, mostly to avoid possible complications for uncertain benefit.</div><div><br></div><div>  When testing our 60-disk Lustre 2.8/ZFS 0.6.5 system a year ago, I tried several configurations of vdevs and OSTs. I got annoyingly similar performance with one big OST, many little OSTs on multiple pools, and several layouts in between. I went with 2 OSTs on their own 30-disk [3x(8+2) raidz2] pools as a middle ground that also aligned with multipath load balancing. I have not yet explored the configuration space on our 90-disk system. Is the obdfilter-survey script going to be sensitive enough to show the differences?</div><div><br></div><div>  And yes, the EDR IB card is in a PCIe 3.0 x16 slot, with two SAS-12 HBAs in x8 slots. There are three half-filled JBODs (Dell MD3060e) with 2 SAS-6 (x4) ports each. Multipath is configured, but there are lots of policy options there too. Fun times!</div><div><br></div><div>Thanks,</div><div>Nate</div><div><br></div><div>  </div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 24, 2017 at 10:26 AM, Mannthey, Keith <span dir="ltr"><<a href="mailto:keith.mannthey@intel.com" target="_blank">keith.mannthey@intel.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">> 1 OSS (2x E5-2650 v4, 24 cores, 768GB RAM) with three OSTs.<br>

>   Each OST on its own 30-disk pool (3, 10-disk raidz2 vdevs) using 8TB drives<br>

> pool options:<br>

>   ashift=12<br>

> dataset options:<br>

>   recordsize=16M<br>

>   compression=lz4<br>

>   atime=off<br>

>   xattr=sa<br>

>   dnodesize=auto<br>

<br>

</span>As Andreas said, not a lot of work has been done with 16MB RC, but there has been with 1M, and I expect some of it will carry over.<br>

<br>

1M RC design guidance is to not use 8+2 but to use more drives in each vdev. If you have 90 drives total, 8 x (9+2) vdevs would get you more write performance and leave 2 spares.<br>

<br>

Spend some time working with vdev/OST layouts to see what works best for your system with this new code. If you have 90 drives, I would suggest 4 OSTs, each with 2 x (9+2) vdevs, as a starting configuration.<br>
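<br>
The drive arithmetic behind these layouts can be sketched as follows. This is only a counting exercise under the assumption of whole raidz2 vdevs; the helper name raidz2_layout is hypothetical, and it says nothing about measured performance.<br>

```python
# Sketch only: counts drives for raidz2 layouts, it does not predict throughput.
# raidz2_layout is a hypothetical helper, not part of any ZFS or Lustre tool.

def raidz2_layout(total_drives, data_per_vdev, parity_per_vdev=2):
    """Return (vdevs, spares, data_fraction) when filling the enclosure
    with as many whole vdevs as fit."""
    width = data_per_vdev + parity_per_vdev
    vdevs, spares = divmod(total_drives, width)
    data_fraction = vdevs * data_per_vdev / total_drives
    return vdevs, spares, data_fraction


for data in (8, 9):
    vdevs, spares, frac = raidz2_layout(90, data)
    print(f"{data}+2: {vdevs} vdevs, {spares} spares, "
          f"{frac:.0%} of drives hold data")

# Grouping the 9+2 vdevs into 4 OSTs gives 2 vdevs per OST.
vdevs_9p2, _, _ = raidz2_layout(90, 9)
print("vdevs per OST with 4 OSTs:", vdevs_9p2 // 4)
```

With 90 drives, both widths end up with 72 data drives; the 9+2 layout uses fewer, wider vdevs and leaves 2 drives over as spares.<br>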

<br>

Also, for best performance you will need more than 1 SAS link to the drives, and you will need to deal with zoning or multipath. On the face of it, your fabric is a PCIe x16 device, so it will take at least 2 PCIe x8 HBAs, each running both ports, to match the I/O potential.<br>
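<br>
A back-of-the-envelope version of that comparison, as a sketch only. It assumes PCIe 3.0 slots (8 GT/s per lane, 128b/130b encoding) and 12 Gb/s SAS x4 ports with 8b/10b encoding; those are nominal line rates brought in for illustration, not measurements, and the function names are hypothetical.<br>

```python
import math

# Assumptions (nominal line rates, not measurements):
PCIE3_GT_PER_S = 8.0        # PCIe 3.0 raw rate per lane
PCIE3_ENCODING = 128 / 130  # 128b/130b line encoding


def pcie3_gbytes_per_s(lanes):
    """Nominal one-direction PCIe 3.0 bandwidth in GB/s."""
    return lanes * PCIE3_GT_PER_S * PCIE3_ENCODING / 8


def sas_port_gbytes_per_s(gbit_per_lane, lanes=4):
    """Nominal SAS wide-port bandwidth in GB/s, assuming 8b/10b encoding
    (10 line bits per data byte)."""
    return lanes * gbit_per_lane / 10


fabric = pcie3_gbytes_per_s(16)           # slot feeding the network card
hba_slot = pcie3_gbytes_per_s(8)          # slot feeding one HBA
one_port = sas_port_gbytes_per_s(12)      # one x4 SAS-12 port
two_ports = 2 * one_port

# An HBA can move no more than the slower of its slot and its SAS ports.
hbas_both_ports = math.ceil(fabric / min(hba_slot, two_ports))
hbas_one_port = math.ceil(fabric / min(hba_slot, one_port))

print(f"x16 slot: {fabric:.2f} GB/s, x8 slot: {hba_slot:.2f} GB/s")
print("HBAs needed with both ports active:", hbas_both_ports)
print("HBAs needed with one port active:  ", hbas_one_port)
```

Under these assumptions a single x4 port is the bottleneck when only one is cabled, which is why both ports on each HBA matter.<br>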

<br>

Thanks,<br>

 Keith<br>

<div class="HOEnZb"><div class="h5"><br>

-----Original Message-----<br>

From: lustre-discuss [mailto:<a href="mailto:lustre-discuss-bounces@lists.lustre.org">lustre-discuss-bounces@lists.lustre.org</a>] On Behalf Of Dilger, Andreas<br>

Sent: Friday, March 24, 2017 12:46 AM<br>

To: Nathan R.M. Crawford <<a href="mailto:nrcrawfo@uci.edu">nrcrawfo@uci.edu</a>><br>

Cc: <a href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.org</a><br>

Subject: Re: [lustre-discuss] Lustre-2.9 / ZFS-0.7.0 tuning recommendations<br>

<br>

On Mar 23, 2017, at 18:38, Nathan R.M. Crawford <<a href="mailto:nrcrawfo@uci.edu">nrcrawfo@uci.edu</a>> wrote:<br>

><br>

> Hi All,<br>

><br>

>   I've been evaluating some of the newer options that should be available with Lustre 2.9 on top of ZFSonLinux 0.7.0 (currently at rc3). Specifically, trying 16MB RPCs/blocks on the OSTs and large dnodes on the MDTs.<br>

<br>

While these features are available in ZFS 0.7.0, the _use_ of the large dnode feature is not enabled in Lustre 2.9.  The patch to leverage large dnodes is included in Lustre 2.10, which we are testing periodically with ZFS 0.7.0, but it isn't clear whether ZFS 0.7.0 will actually be released when Lustre 2.10 is finally released.  It may be that "dnodesize=auto" will still provide some benefits over 0.6.5, but it isn't going to be the best.<br>

<br>

The other potential issue is that while using a 16MB blocksize is _possible_, it isn't necessarily _recommended_.  The problem is that if you are writing anything smaller than 16MB chunks to the OSTs, then you will potentially have to write out the same block multiple times (write amplification), which can hurt performance.  Also, the large blocksize can cause a lot of memory pressure.  Using "recordsize=1M" (1024KB) has been shown to give good performance.  We haven't done much study of performance vs. recordsize, so any testing in this area would be welcome.  You'd also want to tune "lctl set_param obdfilter.*.brw_size=&lt;recordsize in MB&gt;" on all the OSS nodes so that clients will send RPCs large enough to cover a whole block.<br>
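<br>
To make the write-amplification concern concrete, here is a minimal sketch. It uses a deliberately pessimistic model in which every write that covers only part of a record forces the whole record to be written again, and it assumes writes are record-aligned; the function names are hypothetical. The second helper only formats the lctl command quoted above.<br>

```python
MIB = 1024 * 1024


def worst_case_write_amplification(io_bytes, recordsize_bytes):
    """Bytes written per byte of payload, assuming aligned writes and that
    every touched record is rewritten in full (a pessimistic model)."""
    records_touched = -(-io_bytes // recordsize_bytes)  # ceiling division
    return records_touched * recordsize_bytes / io_bytes


def brw_size_command(recordsize_bytes):
    """Format the set_param command with brw_size given in MB."""
    return f"lctl set_param obdfilter.*.brw_size={recordsize_bytes // MIB}"


for record_mib in (1, 16):
    record = record_mib * MIB
    amp = worst_case_write_amplification(1 * MIB, record)
    print(f"recordsize={record_mib}M, 1M writes: up to {amp:.0f}x amplification")
    print("  " + brw_size_command(record))
```

In this model, 1MB writes into 16MB records can cost up to 16 times the payload, while matching the RPC size to the record size brings the factor back to 1.<br>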

<br>

Finally, there are a bunch of metadata performance improvements with ZFS 0.7.0 + Lustre 2.9 above what is available in ZFS 0.6.5 + Lustre 2.8, and more are on the way for Lustre 2.10.<br>

<br>

>   I've gathered bits and pieces from discussions in the zfs and lustre development areas, presentations for earlier versions, etc., but have probably missed some low-hanging fruit. Is there a central resource for good starting sets of parameters?<br>

><br>

> Brief summary of test system (all on CentOS 7.3x):<br>

> 1 MDS (2x E5-2650 v4, 24 cores, 128GB RAM) with three MDTs.<br>

>   Each MDT on its own 2-SSD mirror zpool (1TB each)<br>

<br>

It's questionable whether you will get a significant benefit from 3 MDTs in this case, compared with putting 3 VDEVs in a single zpool and having one MDT.  The reason is that Lustre DNE is still not great at load balancing across separate MDTs, so your users/apps will probably get more use out of a single 3TB MDT than three 1TB MDTs.  A 3TB MDT is still nowhere near the maximum MDT size.<br>

<br>

Cheers, Andreas<br>

<br>

> pool options:<br>

>   ashift=9 (for space reasons)<br>

> dataset options:<br>

>   recordsize=128K<br>

>   compression=off<br>

>   atime=off<br>

>   xattr=sa<br>

>   dnodesize=auto<br>

> zfs module options:<br>

>   zfs_prefetch_disable=1<br>

>   metaslab_debug_unload=1<br>

>   zfs_dirty_data_max=2147483648<br>

>   zfs_vdev_async_write_min_active=5<br>

>   zfs_vdev_async_write_max_active=15<br>

>   zfs_vdev_async_write_active_min_dirty_percent=20<br>

>   zfs_vdev_scheduler=deadline<br>

>   zfs_arc_max=103079215104<br>

>   zfs_arc_meta_limit=103079215104<br>

><br>

> 1 OSS (2x E5-2650 v4, 24 cores, 768GB RAM) with three OSTs.<br>

>   Each OST on its own 30-disk pool (3, 10-disk raidz2 vdevs) using 8TB drives<br>

> pool options:<br>

>   ashift=12<br>

> dataset options:<br>

>   recordsize=16M<br>

>   compression=lz4<br>

>   atime=off<br>

>   xattr=sa<br>

>   dnodesize=auto<br>

> zfs module options:<br>

>   metaslab_aliquot=2097152<br>

>   metaslab_debug_unload=1<br>

>   zfs_dirty_data_max=2147483648<br>

>   zfs_dirty_data_sync=134217728<br>

>   zfs_max_recordsize=16777216<br>

>   zfs_prefetch_disable=1<br>

>   zfs_txg_history=10<br>

>   zfs_vdev_aggregation_limit=16777216<br>

>   zfs_vdev_async_write_min_active=5<br>

>   zfs_vdev_async_write_max_active=15<br>

>   zfs_vdev_async_write_active_min_dirty_percent=20<br>

>   zfs_vdev_scheduler=deadline<br>

>   zfs_arc_max=751619276800<br>

><br>

> All nodes use Mellanox EDR IB and the distribution's RDMA stack (mlx5).<br>

> ko2iblnd module options:<br>

>   concurrent_sends=63<br>

>   credits=2560<br>

>   fmr_flush_trigger=1024<br>

>   fmr_pool_size=1280<br>

>   map_on_demand=256<br>

>   peer_buffer_credits=0<br>

>   peer_credits=16<br>

>   peer_credits_hiw=31<br>

>   peer_timeout=100<br>

><br>

> Lustre properly picks up the OST block size, and clients connect with 16M. The obdfilter-survey tool gives 2-5GB/s write and rewrite over record sizes of 1M-16M. At 16MB, most of the blocks displayed with "zpool iostat -r 5" are the expected 2MB for 10-disk raidz2 vdevs.<br>
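<br>
The "expected 2MB" figure follows from splitting one record across the data disks of a vdev. A small sketch of that division, ignoring raidz padding and sector rounding (the helper name is hypothetical):<br>

```python
MIB = 1024 * 1024


def raidz_data_chunk_bytes(recordsize_bytes, vdev_width, parity=2):
    """Approximate per-disk share of one full record on a raidz vdev,
    ignoring padding and sector rounding."""
    data_disks = vdev_width - parity
    return recordsize_bytes / data_disks


for record_mib in (1, 16):
    chunk = raidz_data_chunk_bytes(record_mib * MIB, vdev_width=10)
    print(f"recordsize={record_mib}M on 10-disk raidz2: "
          f"{chunk / 1024:.0f} KiB per data disk")
```

For a 16M record on a 10-disk raidz2 vdev this gives 2 MiB per data disk, and 128 KiB for a 1M record.<br>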

><br>

> The mds-survey tool also gives not-terrible results, although I don't have a good reference:<br>

> dir_count=6 thrlo=12 thrhi=12 file_count=300000 stripe_count=1 mds-survey<br>

> mdt 3 file  300000 dir    6 thr   12 create 52875.87 [ 47998.37, 57997.85] lookup 579347.67 [ 370250.61, 919106.38] md_getattr 204802.03 [ 180678.81, 218144.06] setxattr 132886.29 [ 129995.32, 133994.51] destroy 20594.50 [ 11999.21, 34998.01]<br>

><br>

> I fully expect that some of the zfs module tuning above is either no longer needed (metaslab_debug_unload?), or counterproductive. If anyone has suggestions for changes, especially for making insane-numbers-of-tiny-files less deadly, I'd like to try them. This isn't going to go into production as scratch space for a while.<br>

><br>

> Thanks,<br>

> Nate<br>

><br>

> --<br>

> Dr. Nathan Crawford              <a href="mailto:nathan.crawford@uci.edu">nathan.crawford@uci.edu</a><br>

><br>

> Modeling Facility Director<br>

> Department of Chemistry<br>

> 1102 Natural Sciences II         Office: 2101 Natural Sciences II<br>

> University of California, Irvine  Phone: <a href="tel:949-824-4508" value="+19498244508">949-824-4508</a><br>

> Irvine, CA 92697-2025, USA<br>

><br>

> _______________________________________________<br>

> lustre-discuss mailing list<br>

> <a href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.org</a><br>

> <a href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org" rel="noreferrer" target="_blank">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a><br>

<br>

Cheers, Andreas<br>

--<br>

Andreas Dilger<br>

Lustre Principal Architect<br>

Intel Corporation<br>

<br>


</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><pre>Dr. Nathan Crawford              <a href="mailto:nathan.crawford@uci.edu" target="_blank">nathan.crawford@uci.edu</a>

Modeling Facility Director

Department of Chemistry

1102 Natural Sciences II         Office: 2101 Natural Sciences II

University of California, Irvine  Phone: 949-824-4508

Irvine, CA 92697-2025, USA</pre></div></div>

</div>