[lustre-discuss] Lustre-2.9 / ZFS-0.7.0 tuning recommendations
Nathan R.M. Crawford
nrcrawfo at uci.edu
Thu Mar 23 17:38:03 PDT 2017
Hi All,
I've been evaluating some of the newer options that should be available
with Lustre 2.9 on top of ZFSonLinux 0.7.0 (currently at rc3).
Specifically, I'm trying 16MB RPCs/blocks on the OSTs and large dnodes on the
MDTs.
I've gathered bits and pieces from discussions in the zfs and lustre
development areas, presentations for earlier versions, etc., but have
probably missed some low-hanging fruit. Is there a central resource for
good starting sets of parameters?
Brief summary of test system (all on CentOS 7.3.x):
1 MDS (2x E5-2650 v4, 24 cores, 128GB RAM) with three MDTs.
Each MDT on its own 2-SSD mirror zpool (1TB each)
pool options:
ashift=9 (for space reasons)
dataset options:
recordsize=128K
compression=off
atime=off
xattr=sa
dnodesize=auto
zfs module options:
zfs_prefetch_disable=1
metaslab_debug_unload=1
zfs_dirty_data_max=2147483648
zfs_vdev_async_write_min_active=5
zfs_vdev_async_write_max_active=15
zfs_vdev_async_write_active_min_dirty_percent=20
zfs_vdev_scheduler=deadline
zfs_arc_max=103079215104
zfs_arc_meta_limit=103079215104
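
For anyone reproducing this, the module options end up in a modprobe config
and the dataset properties are set per MDT; the config path and the
pool/dataset name below (mdt0pool/mdt0) are just placeholders for my layout:

  # /etc/modprobe.d/zfs.conf on the MDS (same values as listed above)
  options zfs zfs_prefetch_disable=1 metaslab_debug_unload=1 \
      zfs_dirty_data_max=2147483648 \
      zfs_vdev_async_write_min_active=5 zfs_vdev_async_write_max_active=15 \
      zfs_vdev_async_write_active_min_dirty_percent=20 \
      zfs_vdev_scheduler=deadline \
      zfs_arc_max=103079215104 zfs_arc_meta_limit=103079215104

  # per-MDT dataset properties (set before populating, so new dnodes pick up dnodesize=auto)
  zfs set recordsize=128K mdt0pool/mdt0
  zfs set compression=off mdt0pool/mdt0
  zfs set atime=off       mdt0pool/mdt0
  zfs set xattr=sa        mdt0pool/mdt0
  zfs set dnodesize=auto  mdt0pool/mdt0
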
1 OSS (2x E5-2650 v4, 24 cores, 768GB RAM) with three OSTs.
Each OST on its own 30-disk zpool (3 x 10-disk raidz2 vdevs) of 8TB drives
pool options:
ashift=12
dataset options:
recordsize=16M
compression=lz4
atime=off
xattr=sa
dnodesize=auto
zfs module options:
metaslab_aliquot=2097152
metaslab_debug_unload=1
zfs_dirty_data_max=2147483648
zfs_dirty_data_sync=134217728
zfs_max_recordsize=16777216
zfs_prefetch_disable=1
zfs_txg_history=10
zfs_vdev_aggregation_limit=16777216
zfs_vdev_async_write_min_active=5
zfs_vdev_async_write_max_active=15
zfs_vdev_async_write_active_min_dirty_percent=20
zfs_vdev_scheduler=deadline
zfs_arc_max=751619276800
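
For the record, each OST pool is built roughly like this; the pool name and
device names are placeholders, and zfs_max_recordsize has to be raised (as
above) before recordsize=16M is accepted:

  # one OST pool: 3 x 10-disk raidz2 vdevs of 8TB drives, ashift=12
  zpool create -o ashift=12 ost0pool \
      raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj \
      raidz2 sdk sdl sdm sdn sdo sdp sdq sdr sds sdt \
      raidz2 sdu sdv sdw sdx sdy sdz sdaa sdab sdac sdad

  # properties on the pool root, so the Lustre OST dataset inherits them
  zfs set recordsize=16M  ost0pool
  zfs set compression=lz4 ost0pool
  zfs set atime=off       ost0pool
  zfs set xattr=sa        ost0pool
  zfs set dnodesize=auto  ost0pool
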
All nodes use Mellanox EDR IB and the distribution's RDMA stack (mlx5).
ko2iblnd module options:
concurrent_sends=63
credits=2560
fmr_flush_trigger=1024
fmr_pool_size=1280
map_on_demand=256
peer_buffer_credits=0
peer_credits=16
peer_credits_hiw=31
peer_timeout=100
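
These also live in a modprobe config; the file names below and the
networks="o2ib0(ib0)" line are assumptions about a single-rail EDR setup:

  # /etc/modprobe.d/ko2iblnd.conf (same values as listed above)
  options ko2iblnd concurrent_sends=63 credits=2560 \
      fmr_flush_trigger=1024 fmr_pool_size=1280 map_on_demand=256 \
      peer_buffer_credits=0 peer_credits=16 peer_credits_hiw=31 \
      peer_timeout=100

  # /etc/modprobe.d/lustre.conf
  options lnet networks="o2ib0(ib0)"
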
Lustre properly picks up the OST block size, and clients connect with 16M.
The obdfilter-survey tool gives 2-5GB/s write and rewrite over record sizes
of 1M-16M. At 16MB, most of the blocks displayed with "zpool iostat -r 5"
are the expected 2MB for 10-disk raidz2 vdevs.
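
In case it helps anyone checking the same thing, this is roughly what I look
at (parameter names from my 2.9 setup; verify on your build):

  # on the OSS: RPC size advertised by the OSTs (in MB)
  lctl get_param obdfilter.*.brw_size

  # on a client: 16MB RPCs show up as 4096 pages of 4KiB
  lctl get_param osc.*.max_pages_per_rpc

  # on the OSS, while a test runs: per-vdev request size histograms
  zpool iostat -r 5
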
The mds-survey tool also gives not-terrible results, although I don't have
a good reference:
dir_count=6 thrlo=12 thrhi=12 file_count=300000 stripe_count=1 mds-survey
mdt 3 file 300000 dir 6 thr 12
  create      52875.87 [  47998.37,  57997.85]
  lookup     579347.67 [ 370250.61, 919106.38]
  md_getattr 204802.03 [ 180678.81, 218144.06]
  setxattr   132886.29 [ 129995.32, 133994.51]
  destroy     20594.50 [  11999.21,  34998.01]
I fully expect that some of the zfs module tuning above is either no longer
needed (metaslab_debug_unload?) or counterproductive. If anyone has
suggestions for changes, especially for making insane numbers of tiny files
less deadly, I'd like to try them. This system won't go into production as
scratch space for a while, so there is time to experiment.
Thanks,
Nate
--
Dr. Nathan Crawford                  nathan.crawford at uci.edu
Modeling Facility Director
Department of Chemistry
1102 Natural Sciences II             Office: 2101 Natural Sciences II
University of California, Irvine     Phone: 949-824-4508
Irvine, CA 92697-2025, USA