[lustre-discuss] Lustre MDT with ZFS backend

Andreas Dilger adilger at whamcloud.com
Sat Jan 23 03:26:33 PST 2021


On Jan 22, 2021, at 02:01, Tung-Han Hsieh <thhsieh at twcp1.phys.ntu.edu.tw> wrote:

We recently read on this mailing list that a Lustre MDT with a ZFS
backend currently has poor performance. Unfortunately, we did not
notice this before deploying that configuration on many of our cluster
systems, and there is now too much data to reconfigure them back to an
MDT with an ldiskfs backend.

So now our only option is to look forward and do our best. I read
this article about improvements to the MDT with ZFS:

https://www.nextplatform.com/2017/01/11/bolstering-lustre-zfs-highlights-continuing-work/

It seems that the MDT with ZFS has already improved considerably over
time, but for our applications it is still not enough. Is there a
roadmap for further improving this part?

There are a number of different organizations that are using Lustre and
ZFS, and of course many developers working to improve ZFS itself.

You would definitely want to use all-flash storage for the MDT, if
you are not doing so already.
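
For example, a minimal sketch of formatting a new ZFS MDT on a mirrored
pair of NVMe devices (the fsname, pool name, device names, and MGS NID
below are placeholder values, not taken from this thread):

 mkfs.lustre --mdt --backfstype=zfs --fsname=testfs --index=0 \
     --mgsnode=mgs@o2ib mdtpool/mdt0 mirror /dev/nvme0n1 /dev/nvme1n1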

Furthermore, I saw that ZFS has already reached version 2.0.1. Are
there any plans on the Lustre side to integrate with and take
advantage of the new ZFS software?

The soon-to-be-released Lustre 2.14 is known to work with ZFS 2.0, so
you are welcome to test that out and report if it improves performance.
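
Assuming the servers are rebuilt against the newer ZFS, the versions
actually in use can be confirmed with something like (a sketch, not
part of the original message):

 lctl get_param version   # reports the running Lustre version
 zfs version              # OpenZFS 0.8+ reports userland and kernel module versions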

Cheers, Andreas

In the meantime, we currently have the following configurations and
tunings in our Lustre system to try to overcome various performance
bottlenecks (both the MDT and the OSTs use a ZFS backend):

- Linux kernel 4.19.126 + MLNX_OFED-4.6 + Lustre-2.12.6 + ZFS-0.7.3
 (P.S. We build all of the above software ourselves on Debian 9.12.)

- Loading the zfs module with the following options
 (many thanks to Riccardo Veraldi for the suggestions; see the sketch
 after this list for where these lines live and how they are applied):
 options zfs zfs_prefetch_disable=1
 options zfs zfs_txg_history=120
 options zfs metaslab_debug_unload=1
 options zfs zfs_vdev_async_write_active_min_dirty_percent=20
 options zfs zfs_vdev_scrub_min_active=48
 options zfs zfs_vdev_scrub_max_active=128
 options zfs zfs_vdev_sync_write_min_active=8
 options zfs zfs_vdev_sync_write_max_active=32
 options zfs zfs_vdev_sync_read_min_active=8
 options zfs zfs_vdev_sync_read_max_active=32
 options zfs zfs_vdev_async_read_min_active=8
 options zfs zfs_vdev_async_read_max_active=32
 options zfs zfs_top_maxinflight=320
 options zfs zfs_txg_timeout=30
 options zfs zfs_dirty_data_max_percent=40
 options zfs zfs_vdev_async_write_min_active=8
 options zfs zfs_vdev_async_write_max_active=32
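
 For anyone reproducing this: the "options zfs ..." lines above normally
 live in /etc/modprobe.d/zfs.conf, and most of these tunables can also
 be changed on a running system through sysfs. A sketch of the mechanism
 only (the specific values are the ones listed above):

 echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
 cat /sys/module/zfs/parameters/zfs_prefetch_disable   # confirm the new value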

- The ZFS pools are configured with the following options
 (a quick verification sketch follows the list):
 zfs set atime=off <pool>
 zfs set redundant_metadata=most <pool>
 zfs set xattr=sa <pool>
 zfs set recordsize=1M <pool>
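
 A quick way to confirm that these properties took effect (a sketch,
 using the same <pool> placeholder as above):

 zfs get atime,redundant_metadata,xattr,recordsize <pool>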

- Set the grant_shrink option to 0 on all Lustre clients
 (see the note below on making this persistent):
 lctl set_param osc.*.grant_shrink=0
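
 If the intent is for this to survive client remounts, one approach
 (a sketch; this form of the command must be run on the MGS node) is:

 lctl set_param -P osc.*.grant_shrink=0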

These are all the tunings we have learned so far. We are wondering
whether there is still something we have overlooked (e.g., would the
iommu and intel_iommu kernel parameters help)? We would greatly
appreciate any further suggestions.
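
For reference, the kernel-parameter change being asked about would look
roughly like this on a Debian/GRUB system (a sketch only; whether it
actually helps here is exactly the open question above):

 # in /etc/default/grub, append to the existing GRUB_CMDLINE_LINUX line:
 GRUB_CMDLINE_LINUX="... intel_iommu=on iommu=pt"
 # then regenerate the boot config and reboot:
 update-grub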

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud





