[lustre-discuss] Lustre MDT with ZFS backend

Tung-Han Hsieh thhsieh at twcp1.phys.ntu.edu.tw
Fri Jan 22 01:01:40 PST 2021


Dear All,

We recently read on this mailing list that the Lustre MDT currently
has poor performance with a ZFS backend. Unfortunately, we did not
notice this before deploying this configuration on many of our
cluster systems, and by now there is too much data on them to migrate
back to an ldiskfs-backed MDT.

So our only option now is to look forward and do our best. I read
this article about improvements to the ZFS-backed MDT:

https://www.nextplatform.com/2017/01/11/bolstering-lustre-zfs-highlights-continuing-work/

It seems that the ZFS-backed MDT has already improved substantially
over time, but for our applications it is still not enough. Is there
a roadmap for further improvements in this area?

Furthermore, I saw that ZFS has already reached version 2.0.1. Are
there any plans on the Lustre side to integrate with and take
advantage of the newer ZFS releases?

In the meantime, we currently have the following configurations and
tunings in our Lustre systems to work around various performance
bottlenecks (both the MDTs and OSTs use the ZFS backend):

- Linux kernel 4.19.126 + MLNX_OFED-4.6 + Lustre-2.12.6 + ZFS-0.7.3
  (P.S. We build all of the above software ourselves on Debian-9.12.)
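
  For reference, a minimal sketch of how such a build can be
  configured (the source paths below are placeholders for our local
  build tree, not authoritative locations):

    # Hypothetical source paths; adjust to the local build tree.
    ./configure --with-linux=/usr/src/linux-4.19.126 \
                --with-zfs=/usr/src/zfs-0.7.3
    make debs   # build Debian packages on a Debian host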

- Loading the zfs module with the following options
  (many thanks to Riccardo Veraldi for the suggestions):
  options zfs zfs_prefetch_disable=1
  options zfs zfs_txg_history=120
  options zfs metaslab_debug_unload=1
  options zfs zfs_vdev_async_write_active_min_dirty_percent=20
  options zfs zfs_vdev_scrub_min_active=48
  options zfs zfs_vdev_scrub_max_active=128
  options zfs zfs_vdev_sync_write_min_active=8
  options zfs zfs_vdev_sync_write_max_active=32
  options zfs zfs_vdev_sync_read_min_active=8
  options zfs zfs_vdev_sync_read_max_active=32
  options zfs zfs_vdev_async_read_min_active=8
  options zfs zfs_vdev_async_read_max_active=32
  options zfs zfs_top_maxinflight=320
  options zfs zfs_txg_timeout=30
  options zfs zfs_dirty_data_max_percent=40
  options zfs zfs_vdev_async_write_min_active=8
  options zfs zfs_vdev_async_write_max_active=32
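
  To make these persistent and to check them, a minimal sketch
  (assuming the standard modprobe.d layout; the file name is our own
  choice):

    # Place the "options zfs ..." lines above in e.g.
    # /etc/modprobe.d/zfs.conf, then verify after the module loads:
    cat /sys/module/zfs/parameters/zfs_prefetch_disable
    cat /sys/module/zfs/parameters/zfs_txg_timeout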

- The ZFS pools are configured with the following properties:
  zfs set atime=off <pool>
  zfs set redundant_metadata=most <pool>
  zfs set xattr=sa <pool>
  zfs set recordsize=1M <pool>
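
  A quick way to verify that these properties took effect (replace
  <pool> with the actual pool name):

    zfs get atime,redundant_metadata,xattr,recordsize <pool>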

- Setting the grant_shrink option to 0 on all Lustre clients:
  lctl set_param osc.*.grant_shrink=0
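
  Note that a plain set_param does not survive a client remount. As
  far as we understand, it can be made permanent with the -P flag
  (a sketch, to be run on the MGS node):

    lctl set_param -P osc.*.grant_shrink=0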

This is all we have learned so far. We are wondering whether there
is still something we have overlooked (e.g., would the iommu and
intel_iommu kernel parameters help)? We would be very grateful if
anyone could give us further suggestions.
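
For completeness, this is how we would try such kernel parameters (a
sketch assuming a GRUB-based Debian system; whether iommu=pt actually
helps here is exactly what we are unsure about):

  # In /etc/default/grub, append to the existing kernel command line:
  GRUB_CMDLINE_LINUX="... intel_iommu=on iommu=pt"
  # then regenerate the GRUB config and reboot:
  update-grub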


Best Regards,

T.H.Hsieh

