[lustre-discuss] bad performance with Lustre/ZFS on NVMe SSD

Alexander I Kulyavtsev aik at fnal.gov
Tue Apr 10 11:43:35 PDT 2018


Ricardo,
It can be helpful to watch the output of the following commands on the zfs pool host, both while you read files through a lustre client and while you read directly through zfs:

# zpool iostat -lq -y zpool_name 1
# zpool iostat -w -y zpool_name 5
# zpool iostat -r -y zpool_name 5

-q  active queue statistics
-l  average latency statistics
-r  request size histograms
-w  (undocumented) latency histograms
-y  omit the first report (the statistics since boot)
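
For example (zpool_name, the lustre mount point and the file name below are placeholders), watch on the OSS:

# zpool iostat -r -y zpool_name 5

while on a lustre client you read a large file (not already cached on the client) with 1MB IO:

# dd if=/mnt/lustre/testfile of=/dev/null bs=1M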

I did see different behavior of zfs reads on the zfs pool for the same dd/fio command, depending on whether the file was read from a lustre mount on a different host or directly from zfs on the OSS. For the direct test I created a separate zfs dataset with similar zfs settings on the lustre zpool (a sketch follows the histogram below).
Lustre IO shows up on the zfs pool as 128KB requests, while dd/fio reading directly from zfs issues 1MB requests, even though the dd/fio command used 1MB IO in both cases.

zptevlfs6     sync_read    sync_write    async_read    async_write      scrub   
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0
4K              0      0      0      0      0      0      0      0      0      0
8K              0      0      0      0      0      0      0      0      0      0
16K             0      0      0      0      0      0      0      0      0      0
32K             0      0      0      0      0      0      0      0      0      0
64K             0      0      0      0      0      0      0      0      0      0
128K            0      0      0      0  2.00K      0      0      0      0      0     <====
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0    125      0      0      0      0      0		<====
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0
--------------------------------------------------------------------------------
^C
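
Something along these lines can be used to reproduce the direct-from-zfs side of the comparison (dataset name, mountpoint and sizes are only placeholders; the dataset inherits the pool/OST settings unless overridden):

# zfs create -o mountpoint=/mnt/iotest -o recordsize=1M zpool_name/iotest
# dd if=/dev/zero of=/mnt/iotest/testfile bs=1M count=16384
# dd if=/mnt/iotest/testfile of=/dev/null bs=1M

Make the test file larger than ARC (or export and re-import the pool in between) so the read really hits the disks, and keep zpool iostat -r running in another terminal while the dd runs.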

Alex.


On 4/9/18, 6:15 PM, "lustre-discuss on behalf of Dilger, Andreas" <lustre-discuss-bounces at lists.lustre.org on behalf of andreas.dilger at intel.com> wrote:

    On Apr 6, 2018, at 23:04, Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it> wrote:
    > 
    > So I have been struggling for months with low performance on Lustre/ZFS.
    > 
    > Looking for hints.
    > 
    > 3 OSSes, RHEL 7.4, Lustre 2.10.3 and zfs 0.7.6
    > 
    > each OSS has one raidz OST
    > 
    >   pool: drpffb-ost01
    >  state: ONLINE
    >   scan: none requested
    >   trim: completed on Fri Apr  6 21:53:04 2018 (after 0h3m)
    > config:
    > 
    >     NAME          STATE     READ WRITE CKSUM
    >     drpffb-ost01  ONLINE       0     0     0
    >       raidz1-0    ONLINE       0     0     0
    >         nvme0n1   ONLINE       0     0     0
    >         nvme1n1   ONLINE       0     0     0
    >         nvme2n1   ONLINE       0     0     0
    >         nvme3n1   ONLINE       0     0     0
    >         nvme4n1   ONLINE       0     0     0
    >         nvme5n1   ONLINE       0     0     0
    > 
    > while the raidz without Lustre performs well at 6GB/s (1GB/s per disk),
    > with Lustre on top of it performance is really poor.
    > Most of all, it is not stable at all and goes up and down between
    > 1.5GB/s and 6GB/s. I tested with obdfilter-survey.
    > LNET is ok and working at 6GB/s (using InfiniBand FDR)
    > 
    > What could be the cause of OST performance going up and down like a
    > roller coaster?
    
    Riccardo,
    to take a step back for a minute, have you tested all of the devices
    individually, and also concurrently with some low-level tool like
    sgpdd or vdbench?  After that is known to be working, have you tested
    with obdfilter-survey locally on the OSS, then remotely on the client(s)
    so that we can isolate where the bottleneck is being hit?
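    
    For example, a concurrent low-level read test with fio could look roughly
    like this (device names, sizes and runtimes are placeholders; it only
    reads, but double-check the device names before running it):
    
    # fio --ioengine=libaio --direct=1 --rw=read --bs=1M --iodepth=32 \
          --runtime=60 --time_based --group_reporting \
          --name=nvme0 --filename=/dev/nvme0n1 \
          --name=nvme1 --filename=/dev/nvme1n1
    
    and a local obdfilter-survey run on the OSS along the lines of (parameters
    are illustrative):
    
    # nobjhi=2 thrhi=16 size=8192 case=disk obdfilter-survey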
    
    Cheers, Andreas
    
    
    > for reference here are few considerations:
    > 
    > filesystem parameters:
    > 
    > zfs set mountpoint=none drpffb-ost01
    > zfs set sync=disabled drpffb-ost01
    > zfs set atime=off drpffb-ost01
    > zfs set redundant_metadata=most drpffb-ost01
    > zfs set xattr=sa drpffb-ost01
    > zfs set recordsize=1M drpffb-ost01
    > 
    > NVMe SSD are  4KB/sector
    > 
    > ashift=12
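    > 
    > (ashift is fixed at pool creation time and cannot be changed later;
    > for illustration, roughly:
    > 
    > # zpool create -o ashift=12 drpffb-ost01 raidz1 nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1
    > )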
    > 
    > 
    > ZFS module parameters
    > 
    > options zfs zfs_prefetch_disable=1
    > options zfs zfs_txg_history=120
    > options zfs metaslab_debug_unload=1
    > #
    > options zfs zfs_vdev_scheduler=deadline
    > options zfs zfs_vdev_async_write_active_min_dirty_percent=20
    > #
    > options zfs zfs_vdev_scrub_min_active=48
    > options zfs zfs_vdev_scrub_max_active=128
    > #options zfs zfs_vdev_sync_write_min_active=64
    > #options zfs zfs_vdev_sync_write_max_active=128
    > #
    > options zfs zfs_vdev_sync_write_min_active=8
    > options zfs zfs_vdev_sync_write_max_active=32
    > options zfs zfs_vdev_sync_read_min_active=8
    > options zfs zfs_vdev_sync_read_max_active=32
    > options zfs zfs_vdev_async_read_min_active=8
    > options zfs zfs_vdev_async_read_max_active=32
    > options zfs zfs_top_maxinflight=320
    > options zfs zfs_txg_timeout=30
    > options zfs zfs_dirty_data_max_percent=40
    > options zfs zfs_vdev_scheduler=deadline
    > options zfs zfs_vdev_async_write_min_active=8
    > options zfs zfs_vdev_async_write_max_active=32
    > 
    Cheers, Andreas
    --
    Andreas Dilger
    Lustre Principal Architect
    Intel Corporation
    
    
    
    
    
    
    
    



