[lustre-discuss] bad performance with Lustre/ZFS on NVMe SSD
Alexander I Kulyavtsev
aik at fnal.gov
Tue Apr 10 11:43:35 PDT 2018
Ricardo,
It can be helpful to look at the output of these commands on the zfs pool host, both when you read files through the lustre client and when you read directly through zfs:
# zpool iostat -lq -y zpool_name 1
# zpool iostat -w -y zpool_name 5
# zpool iostat -r -y zpool_name 5
-q   queue statistics
-l   average latency statistics
-r   request size histogram
-w   latency histogram (undocumented)
I did see different behavior of zfs reads on the zfs pool for the same dd/fio command, depending on whether the file was read from the lustre mount on a different host or directly from zfs on the OSS. (For the direct test I created a separate zfs dataset with similar settings on the lustre zpool.)
Lustre IO shows up on the zfs pool as 128KB requests, while dd/fio run directly on the zfs dataset issues 1MB requests; the dd/fio command used 1MB IO in both cases.
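For reference, a rough sketch of the reads I used (the file paths and dataset name below are placeholders, not my actual ones):

on a lustre client, 1MB sequential reads through lustre:
# dd if=/mnt/lustre/testfile of=/dev/null bs=1M count=4096

on the OSS, the same reads directly from a test dataset on the lustre zpool:
# dd if=/zptevlfs6/testds/testfile of=/dev/null bs=1M count=4096

With "zpool iostat -r -y zptevlfs6 5" running on the OSS, the read through lustre produced the request size histogram below: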
zptevlfs6   sync_read  sync_write  async_read  async_write     scrub
req_size     ind   agg   ind   agg   ind   agg   ind   agg   ind   agg
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
512            0     0     0     0     0     0     0     0     0     0
1K             0     0     0     0     0     0     0     0     0     0
2K             0     0     0     0     0     0     0     0     0     0
4K             0     0     0     0     0     0     0     0     0     0
8K             0     0     0     0     0     0     0     0     0     0
16K            0     0     0     0     0     0     0     0     0     0
32K            0     0     0     0     0     0     0     0     0     0
64K            0     0     0     0     0     0     0     0     0     0
128K           0     0     0     0 2.00K     0     0     0     0     0  <====
256K           0     0     0     0     0     0     0     0     0     0
512K           0     0     0     0     0     0     0     0     0     0
1M             0     0     0     0   125     0     0     0     0     0  <====
2M             0     0     0     0     0     0     0     0     0     0
4M             0     0     0     0     0     0     0     0     0     0
8M             0     0     0     0     0     0     0     0     0     0
16M            0     0     0     0     0     0     0     0     0     0
--------------------------------------------------------------------------------
^C
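You can also cross-check the request sizes from the Lustre side. The parameter names below are the usual ones in 2.10 (adjust the wildcards to your fsname/OST); I have not re-verified them against the zfs osd:

on the OSS, per-OST histogram of disk I/O sizes:
# lctl get_param obdfilter.*.brw_stats

on the client, RPC size in 4KB pages (256 pages = 1MB RPCs) and the pages-per-RPC histogram:
# lctl get_param osc.*.max_pages_per_rpc
# lctl get_param osc.*.rpc_stats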
Alex.
On 4/9/18, 6:15 PM, "lustre-discuss on behalf of Dilger, Andreas" <lustre-discuss-bounces at lists.lustre.org on behalf of andreas.dilger at intel.com> wrote:
On Apr 6, 2018, at 23:04, Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it> wrote:
>
> So I've been struggling for months with this low performance on Lustre/ZFS.
>
> Looking for hints.
>
> 3 OSSes, RHEL 7.4, Lustre 2.10.3 and zfs 0.7.6
>
> each OSS has one OST raidz
>
> pool: drpffb-ost01
> state: ONLINE
> scan: none requested
> trim: completed on Fri Apr 6 21:53:04 2018 (after 0h3m)
> config:
>
> NAME STATE READ WRITE CKSUM
> drpffb-ost01 ONLINE 0 0 0
> raidz1-0 ONLINE 0 0 0
> nvme0n1 ONLINE 0 0 0
> nvme1n1 ONLINE 0 0 0
> nvme2n1 ONLINE 0 0 0
> nvme3n1 ONLINE 0 0 0
> nvme4n1 ONLINE 0 0 0
> nvme5n1 ONLINE 0 0 0
>
> while the raidz without Lustre performs well at 6GB/s (1GB/s per disk),
> with Lustre on top of it performance is really poor.
> Above all, it is not stable at all and goes up and down between
> 1.5GB/s and 6GB/s. I tested with obdfilter-survey.
> LNET is ok and working at 6GB/s (using infiniband FDR)
>
> What could be the cause of OST performance going up and down like a
> roller coaster ?
Riccardo,
to take a step back for a minute, have you tested all of the devices
individually, and also concurrently, with some low-level tool like
sgpdd or vdbench? Once that is known to be working, have you tested
with obdfilter-survey locally on the OSS, and then remotely from the
client(s), so that we can isolate where the bottleneck is being hit?
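For example (the device names and OST target name below are only placeholders for your setup; the block-level test is read-only, so it does not modify the pool):

read all six NVMe devices concurrently at the block level:
# fio --name=raw-read --rw=read --bs=1M --direct=1 --ioengine=libaio --iodepth=32 \
      --runtime=60 --time_based \
      --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1

then obdfilter-survey locally on the OSS:
# size=16384 nobjhi=2 thrhi=32 targets="drpffb-OST0000" obdfilter-survey

and afterwards the same survey driven from a client (case=netdisk) to include the network path.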
Cheers, Andreas
> for reference here are few considerations:
>
> filesystem parameters:
>
> zfs set mountpoint=none drpffb-ost01
> zfs set sync=disabled drpffb-ost01
> zfs set atime=off drpffb-ost01
> zfs set redundant_metadata=most drpffb-ost01
> zfs set xattr=sa drpffb-ost01
> zfs set recordsize=1M drpffb-ost01
>
> NVMe SSDs are 4KB/sector
>
> ashift=12
>
>
> ZFS module parameters
>
> options zfs zfs_prefetch_disable=1
> options zfs zfs_txg_history=120
> options zfs metaslab_debug_unload=1
> #
> options zfs zfs_vdev_scheduler=deadline
> options zfs zfs_vdev_async_write_active_min_dirty_percent=20
> #
> options zfs zfs_vdev_scrub_min_active=48
> options zfs zfs_vdev_scrub_max_active=128
> #options zfs zfs_vdev_sync_write_min_active=64
> #options zfs zfs_vdev_sync_write_max_active=128
> #
> options zfs zfs_vdev_sync_write_min_active=8
> options zfs zfs_vdev_sync_write_max_active=32
> options zfs zfs_vdev_sync_read_min_active=8
> options zfs zfs_vdev_sync_read_max_active=32
> options zfs zfs_vdev_async_read_min_active=8
> options zfs zfs_vdev_async_read_max_active=32
> options zfs zfs_top_maxinflight=320
> options zfs zfs_txg_timeout=30
> options zfs zfs_dirty_data_max_percent=40
> options zfs zfs_vdev_scheduler=deadline
> options zfs zfs_vdev_async_write_min_active=8
> options zfs zfs_vdev_async_write_max_active=32
>
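It would also be worth confirming which of those module options are actually in effect on the running OSS, since a typo or a stale initramfs can silently leave the defaults in place. For example (standard ZFS-on-Linux 0.7 parameter paths):

# grep . /sys/module/zfs/parameters/zfs_prefetch_disable \
         /sys/module/zfs/parameters/zfs_txg_timeout \
         /sys/module/zfs/parameters/zfs_vdev_*_active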
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation
_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org