[lustre-discuss] bad performance with Lustre/ZFS on NVMe SSD

Riccardo Veraldi Riccardo.Veraldi at cnaf.infn.it
Thu Apr 12 17:07:36 PDT 2018


Yes, I tested every single disk individually and also the disks together in
a raidz pool without Lustre.
The disks (1.2TB each) perform to spec, and the zpool reaches up to 6GB/s.
When using Lustre, the same zpool performs really badly: no more than 1.5GB/s.
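
(For what it's worth, a raw-zpool baseline of this kind could be reproduced
with an fio run along these lines; the mount point and sizes are just
placeholders, not my exact test. ZFS 0.7.x has no O_DIRECT support, so this
is buffered I/O with an fsync at the end.)

# 6 concurrent sequential writers, 1M blocks, on the mounted zpool
fio --name=seqwrite --directory=/drpffb-ost01/fio-test \
    --rw=write --bs=1M --size=8g --numjobs=6 \
    --end_fsync=1 --group_reporting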

I then configured one OST per disk without any raidz (6 OSTs total).
I can scale performance up by distributing processes across the OSTs this
way, but if I stripe files across all OSTs instead of manually binding each
process to a specific OST (see the lfs setstripe sketch below), performance
decreases.
Also, with a single process on a single OST I can never get more than
700MB/s, while I can reach 1.2GB/s with at least 4 processes on the same
OST.
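
(For anyone reproducing this: the two layouts can be selected per directory
with lfs setstripe; the paths and the OST index below are just examples,
not my exact setup.)

# stripe every new file in this directory across all available OSTs
lfs setstripe -c -1 /mnt/lustre/striped_dir

# pin new files in this directory to a single OST (index 3 here);
# this is how I "bind" a process to one OST
lfs setstripe -c 1 -i 3 /mnt/lustre/ost3_dir

# check the resulting layout
lfs getstripe /mnt/lustre/ost3_dir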

I tested with obdfilter-survey; this is what I got:

ost  1 sz 524288000K rsz 1024K obj    4 thr    4 write 4872.92 [1525.83, 6120.75]
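
(Roughly how such a survey can be invoked locally on the OSS; the size in
MB per OST and the object/thread ranges below are example values, not the
exact ones from the run above.)

nobjlo=1 nobjhi=4 thrlo=1 thrhi=4 size=8192 case=disk obdfilter-survey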

I also ran LNet selftest and got 6GB/s over FDR InfiniBand.
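
(An lnet_selftest session along these lines measures the same thing; the
NIDs are placeholders, and the lnet_selftest module has to be loaded on
both ends.)

export LST_SESSION=$$
lst new_session rw_test
lst add_group clients 192.168.1.10@o2ib
lst add_group servers 192.168.1.20@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from clients --to servers \
    brw write check=simple size=1M
lst run bulk_rw
lst stat clients servers    # bandwidth shows up here
lst end_session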

But when I write from the client side, performance drops dramatically,
especially when Lustre sits on top of raidz.

So I was wondering whether there are any RPC parameter settings I need to
tune to get better performance out of Lustre?
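
(In case it helps the discussion, these are the usual client-side
RPC/caching knobs I'm aware of, roughly as below; the values are only
examples, not a recommendation.)

# on a client: check the current settings
lctl get_param osc.*.max_rpcs_in_flight osc.*.max_dirty_mb \
    osc.*.max_pages_per_rpc

# example values: more concurrent RPCs and more dirty cache per OST
lctl set_param osc.*.max_rpcs_in_flight=16
lctl set_param osc.*.max_dirty_mb=512

# RPCs larger than 1MB (max_pages_per_rpc > 256) would also need brw_size
# raised on the OSTs, e.g. obdfilter.*.brw_size on the OSSes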

thank you

On 4/9/18 4:15 PM, Dilger, Andreas wrote:
> On Apr 6, 2018, at 23:04, Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it> wrote:
>> So I have been struggling for months with these low performances on Lustre/ZFS.
>>
>> Looking for hints.
>>
>> 3 OSSes, RHEL 7.4, Lustre 2.10.3 and ZFS 0.7.6
>>
>> each OSS has one raidz OST
>>
>>   pool: drpffb-ost01
>>  state: ONLINE
>>   scan: none requested
>>   trim: completed on Fri Apr  6 21:53:04 2018 (after 0h3m)
>> config:
>>
>>     NAME          STATE     READ WRITE CKSUM
>>     drpffb-ost01  ONLINE       0     0     0
>>       raidz1-0    ONLINE       0     0     0
>>         nvme0n1   ONLINE       0     0     0
>>         nvme1n1   ONLINE       0     0     0
>>         nvme2n1   ONLINE       0     0     0
>>         nvme3n1   ONLINE       0     0     0
>>         nvme4n1   ONLINE       0     0     0
>>         nvme5n1   ONLINE       0     0     0
>>
>> While the raidz without Lustre performs well at 6GB/s (1GB/s per disk),
>> with Lustre on top of it performance is really poor.
>> Worst of all, it is not stable at all and goes up and down between
>> 1.5GB/s and 6GB/s. I tested with obdfilter-survey.
>> LNET is OK and working at 6GB/s (using InfiniBand FDR).
>>
>> What could be the cause of OST performance going up and down like a
>> roller coaster?
> Riccardo,
> to take a step back for a minute, have you tested all of the devices
> individually, and also concurrently with some low-level tool like
> sgpdd or vdbench?  After that is known to be working, have you tested
> with obdfilter-survey locally on the OSS, then remotely on the client(s)
> so that we can isolate where the bottleneck is being hit?
>
> Cheers, Andreas
>
>
>> For reference, here are a few considerations:
>>
>> filesystem parameters:
>>
>> zfs set mountpoint=none drpffb-ost01
>> zfs set sync=disabled drpffb-ost01
>> zfs set atime=off drpffb-ost01
>> zfs set redundant_metadata=most drpffb-ost01
>> zfs set xattr=sa drpffb-ost01
>> zfs set recordsize=1M drpffb-ost01
>>
>> The NVMe SSDs have 4KB sectors
>>
>> ashift=12
>>
>>
>> ZFS module parameters
>>
>> options zfs zfs_prefetch_disable=1
>> options zfs zfs_txg_history=120
>> options zfs metaslab_debug_unload=1
>> #
>> options zfs zfs_vdev_scheduler=deadline
>> options zfs zfs_vdev_async_write_active_min_dirty_percent=20
>> #
>> options zfs zfs_vdev_scrub_min_active=48
>> options zfs zfs_vdev_scrub_max_active=128
>> #options zfs zfs_vdev_sync_write_min_active=64
>> #options zfs zfs_vdev_sync_write_max_active=128
>> #
>> options zfs zfs_vdev_sync_write_min_active=8
>> options zfs zfs_vdev_sync_write_max_active=32
>> options zfs zfs_vdev_sync_read_min_active=8
>> options zfs zfs_vdev_sync_read_max_active=32
>> options zfs zfs_vdev_async_read_min_active=8
>> options zfs zfs_vdev_async_read_max_active=32
>> options zfs zfs_top_maxinflight=320
>> options zfs zfs_txg_timeout=30
>> options zfs zfs_dirty_data_max_percent=40
>> options zfs zfs_vdev_scheduler=deadline
>> options zfs zfs_vdev_async_write_min_active=8
>> options zfs zfs_vdev_async_write_max_active=32
>>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation