[lustre-discuss] Lustre-2.9 / ZFS-0.7.0 tuning recommendations

Nathan R.M. Crawford nrcrawfo at uci.edu
Thu Mar 23 17:38:03 PDT 2017


Hi All,

  I've been evaluating some of the newer options that should be available
with Lustre 2.9 on top of ZFSonLinux 0.7.0 (currently at rc3), specifically
16MB RPCs/blocks on the OSTs and large dnodes on the MDTs.

  I've gathered bits and pieces from discussions in the zfs and lustre
development areas, presentations for earlier versions, etc., but have
probably missed some low-hanging fruit. Is there a central resource for
good starting sets of parameters?

Brief summary of the test system (all on CentOS 7.3.x):
1 MDS (2x E5-2650 v4, 24 cores, 128GB RAM) with three MDTs.
  Each MDT on its own 2-SSD mirror zpool (1TB each). A rough sketch of how
  these settings might be applied follows the option lists below.
pool options:
  ashift=9 (for space reasons)
dataset options:
  recordsize=128K
  compression=off
  atime=off
  xattr=sa
  dnodesize=auto
zfs module options:
  zfs_prefetch_disable=1
  metaslab_debug_unload=1
  zfs_dirty_data_max=2147483648
  zfs_vdev_async_write_min_active=5
  zfs_vdev_async_write_max_active=15
  zfs_vdev_async_write_active_min_dirty_percent=20
  zfs_vdev_scheduler=deadline
  zfs_arc_max=103079215104
  zfs_arc_meta_limit=103079215104
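
For reference, a minimal sketch of how one MDT pool could be set up with the
above (pool, dataset, and device names are placeholders; in a real setup the
MDT dataset would normally be created by mkfs.lustre, with the properties
applied to it afterwards):

  # 2-SSD mirror; ashift=9 for space reasons
  zpool create -o ashift=9 mdt0pool mirror /dev/sda /dev/sdb
  zfs create -o recordsize=128K -o compression=off -o atime=off \
      -o xattr=sa -o dnodesize=auto mdt0pool/mdt0

  # /etc/modprobe.d/zfs.conf on the MDS
  options zfs zfs_prefetch_disable=1 metaslab_debug_unload=1 \
      zfs_dirty_data_max=2147483648 zfs_vdev_async_write_min_active=5 \
      zfs_vdev_async_write_max_active=15 \
      zfs_vdev_async_write_active_min_dirty_percent=20 \
      zfs_vdev_scheduler=deadline zfs_arc_max=103079215104 \
      zfs_arc_meta_limit=103079215104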

1 OSS (2x E5-2650 v4, 24 cores, 768GB RAM) with three OSTs.
  Each OST on its own 30-disk zpool (three 10-disk raidz2 vdevs) of 8TB
  drives. A similar sketch for the OST pools follows the option lists below.
pool options:
  ashift=12
dataset options:
  recordsize=16M
  compression=lz4
  atime=off
  xattr=sa
  dnodesize=auto
zfs module options:
  metaslab_aliquot=2097152
  metaslab_debug_unload=1
  zfs_dirty_data_max=2147483648
  zfs_dirty_data_sync=134217728
  zfs_max_recordsize=16777216
  zfs_prefetch_disable=1
  zfs_txg_history=10
  zfs_vdev_aggregation_limit=16777216
  zfs_vdev_async_write_min_active=5
  zfs_vdev_async_write_max_active=15
  zfs_vdev_async_write_active_min_dirty_percent=20
  zfs_vdev_scheduler=deadline
  zfs_arc_max=751619276800
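
And the rough equivalent for one OST pool (same caveats on names; the bash
brace expansion is just shorthand for the 30 drives, and recordsize=16M is
only accepted once zfs_max_recordsize has been raised to 16777216 as in the
module options above):

  # three 10-disk raidz2 vdevs of 8TB drives; ashift=12 for 4K sectors
  zpool create -o ashift=12 ost0pool \
      raidz2 /dev/sd{a..j} \
      raidz2 /dev/sd{k..t} \
      raidz2 /dev/sd{u..z} /dev/sda{a..d}
  zfs create -o recordsize=16M -o compression=lz4 -o atime=off \
      -o xattr=sa -o dnodesize=auto ost0pool/ost0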

All nodes use Mellanox EDR IB and the distribution's RDMA stack (mlx5).
ko2iblnd module options (collected into a modprobe.d snippet after the list):
  concurrent_sends=63
  credits=2560
  fmr_flush_trigger=1024
  fmr_pool_size=1280
  map_on_demand=256
  peer_buffer_credits=0
  peer_credits=16
  peer_credits_hiw=31
  peer_timeout=100
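
In modprobe.d form (file name is just an example), that is:

  # /etc/modprobe.d/ko2iblnd.conf
  options ko2iblnd concurrent_sends=63 credits=2560 fmr_flush_trigger=1024 \
      fmr_pool_size=1280 map_on_demand=256 peer_buffer_credits=0 \
      peer_credits=16 peer_credits_hiw=31 peer_timeout=100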

Lustre properly picks up the OST block size, and clients connect with a 16MB
RPC size. The obdfilter-survey tool gives 2-5 GB/s for write and rewrite
across record sizes of 1M-16M. At 16MB, most of the blocks shown by "zpool
iostat -r 5" are the expected 2MB for a 10-disk raidz2 vdev (16MB spread
across the 8 data disks per vdev).
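
For anyone wanting to run something similar, an obdfilter-survey sweep over
that record-size range would look roughly like the following. Treat it as a
sketch rather than the exact command used here: parameter support (in
particular rszlo/rszhi) depends on the obdfilter-survey version, rsz values
are in KB, and the OST names are just examples.

  # run locally on the OSS against its three OSTs
  nobjhi=16 thrhi=16 size=65536 rszlo=1024 rszhi=16384 case=disk \
      targets="testfs-OST0000 testfs-OST0001 testfs-OST0002" obdfilter-survey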

The mds-survey tool also gives not-terrible results, although I don't have
a good reference:
dir_count=6 thrlo=12 thrhi=12 file_count=300000 stripe_count=1 mds-survey

mdt 3 file 300000 dir 6 thr 12
  create      52875.87 [  47998.37,  57997.85]
  lookup     579347.67 [ 370250.61, 919106.38]
  md_getattr 204802.03 [ 180678.81, 218144.06]
  setxattr   132886.29 [ 129995.32, 133994.51]
  destroy     20594.50 [  11999.21,  34998.01]

I fully expect that some of the zfs module tuning above is either no longer
needed (metaslab_debug_unload?) or counterproductive. If anyone has
suggestions for changes, especially for making insane numbers of tiny files
less deadly, I'd like to try them. This isn't going to go into production as
scratch space for a while.

Thanks,
Nate

-- 

Dr. Nathan Crawford              nathan.crawford at uci.edu
Modeling Facility Director
Department of Chemistry
1102 Natural Sciences II         Office: 2101 Natural Sciences II
University of California, Irvine  Phone: 949-824-4508
Irvine, CA 92697-2025, USA