[lustre-discuss] Lustre on ZFS poor direct I/O performance

Riccardo Veraldi Riccardo.Veraldi at cnaf.infn.it
Sat Oct 15 22:01:47 PDT 2016


On 14/10/16 14:38, Dilger, Andreas wrote:
>
> John, with newer Lustre clients it is possible for multiple threads to 
> submit non-overlapping writes concurrently (as long as they do not 
> conflict within a single page); see LU-1669 for details.
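
A minimal sketch of what that allows, assuming a client mount at
/mnt/lustre and a throwaway test file (both hypothetical): several
O_DIRECT writers each target their own non-overlapping region of one
file.

    # Four concurrent O_DIRECT writers, each to its own 1 GiB region
    for i in 0 1 2 3; do
        dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=1024 \
           seek=$((i * 1024)) oflag=direct conv=notrunc &
    done
    wait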
>
> Even so, O_DIRECT writes need to be synchronous to disk on the OSS, as 
> Patrick reports, because if the OSS fails before the write is on disk 
> there is no cached copy of the data on the client that can be used to 
> resend the RPC.
>
> The problem is that the ZFS OSD has very long transaction commit times 
> for synchronous writes because it does not yet have support for the 
> ZIL.  Using buffered writes avoids this; if you really want to use 
> O_DIRECT, very large O_DIRECT writes (e.g. 40MB or larger) together 
> with large RPCs (4MB, or up to 16MB in 2.9.0) help amortize the sync 
> overhead.
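
A rough sketch of that approach (the path and values below are only
examples, and max_pages_per_rpc assumes 4 KiB pages):

    # On the client: allow 4 MB RPCs to each OST (1024 x 4 KiB pages)
    lctl set_param osc.*.max_pages_per_rpc=1024

    # Use very large direct writes so each synchronous commit covers more data
    dd if=/dev/zero of=/mnt/lustre/bigfile bs=40M count=256 oflag=direct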
>
> Riccardo,
>
> The other potential issue is that you have 20 OSTs on a single OSS, 
> which isn't going to have very good performance.  Spreading the OSTs 
> across multiple OSS nodes is going to improve your performance 
> significantly when there are multiple clients writing, as there will 
> be N times the OSS network bandwidth, N times the CPU, N times the 
> RAM.  It only makes sense to have 20 OSTs/OSS if your workload is only 
> a single client and you want the maximum possible capacity for a given 
> cost.
>

Hello Andreas,
Each OST has a separate VDEV and a separate zpool.
Thank you

> Is each OST a separate VDEV and separate zpool, or are they a single 
> zpool?  Separate zpools have less overhead for maximum performance, 
> but only one VDEV per zpool means that metadata ditto blocks are 
> written twice per RAID-Z2 VDEV, which isn't very efficient.  Having at 
> least 3 VDEVs per zpool is better in this regard.
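
For example, a single pool built from three RAID-Z2 VDEVs might be
created like this (pool and disk names are purely illustrative):

    # One zpool backing one OST, made of three 8-disk RAID-Z2 VDEVs
    zpool create ost00pool \
        raidz2 sda sdb sdc sdd sde sdf sdg sdh \
        raidz2 sdi sdj sdk sdl sdm sdn sdo sdp \
        raidz2 sdq sdr sds sdt sdu sdv sdw sdx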
>
> Cheers, Andreas
>
> -- 
> Andreas Dilger
> Lustre Principal Architect
> Intel High Performance Data Division
>
> On 2016/10/14, 15:22, "John Bauer" <bauerj at iodoctors.com> wrote:
>
> Patrick
>
> I thought at one time an inode lock was held for the duration of a 
> direct I/O read or write, so that even if an application had multiple 
> threads writing direct, only one was "in flight" at a time. Has that 
> changed?
>
> John
>
> Sent from my iPhone
>
>
> On Oct 14, 2016, at 3:16 PM, Patrick Farrell <paf at cray.com> wrote:
>
>     Sorry, I phrased one thing wrong:
>     I said "transferring to the network", but it's actually until the
>     client has received confirmation that the data was received
>     successfully, I believe.
>
>     In any case, only one I/O (per thread) can be outstanding at a
>     time with direct I/O.
>
>     ------------------------------------------------------------------------
>
>     *From:* lustre-discuss <lustre-discuss-bounces at lists.lustre.org>
>     on behalf of Patrick Farrell <paf at cray.com>
>     *Sent:* Friday, October 14, 2016 3:12:22 PM
>     *To:* Riccardo Veraldi; lustre-discuss at lists.lustre.org
>     *Subject:* Re: [lustre-discuss] Lustre on ZFS poor direct I/O
>     performance
>
>     Riccardo,
>
>     While the difference is extreme, direct I/O write performance will
>     always be poor.  Direct I/O writes cannot be asynchronous, since
>     they don't use the page cache.  This means Lustre cannot return
>     from one write (and start the next) until it has finished
>     transferring the data to the network.
>
>     This means you can only have one I/O in flight at a time. Good
>     write performance from Lustre (or any network filesystem) depends
>     on keeping a lot of data in flight at once.
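
By contrast, buffered I/O can keep many RPCs in flight per OST. As a
sketch (the values here are illustrative, not recommendations):

    # On the client: more concurrent RPCs and more dirty cache per OST
    lctl set_param osc.*.max_rpcs_in_flight=16
    lctl set_param osc.*.max_dirty_mb=512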
>
>     What sort of direct write performance were you hoping for? It will
>     never match that 800 MB/s from one thread you see with buffered I/O.
>
>     - Patrick
>
>     ------------------------------------------------------------------------
>
>     *From:* lustre-discuss <lustre-discuss-bounces at lists.lustre.org>
>     on behalf of Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it>
>     *Sent:* Friday, October 14, 2016 2:22:32 PM
>     *To:* lustre-discuss at lists.lustre.org
>     *Subject:* [lustre-discuss] Lustre on ZFS poor direct I/O
>     performance
>
>     Hello,
>
>     I would like to know how I can improve the situation of my Lustre
>     cluster.
>
>     I have 1 MDS and 1 OSS with 20 OSTs defined.
>
>     Each OST is an 8-disk RAID-Z2.
>
>     Single-process write performance is around 800 MB/s.
>
>     However, if I force direct I/O, for example using oflag=direct in
>     dd, write performance drops as low as 8 MB/s with a 1 MB block
>     size, and each write has about 120 ms of latency.
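
For reference, those numbers are self-consistent: a direct-I/O dd along
these lines (the path is hypothetical) issues one synchronous 1 MB write
at a time, and 1 MB per ~120 ms works out to roughly 8 MB/s.

    # Hypothetical reconstruction of the test
    dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=1000 oflag=direct
    # 1 MB / 0.12 s  =~  8.3 MB/s per stream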
>
>     I used these ZFS settings
>
>     options zfs zfs_prefetch_disable=1
>     options zfs zfs_txg_history=120
>     options zfs metaslab_debug_unload=1
>
>     I am quite worried about the low performance.
>
>     Any hints or suggestions that may help me improve the situation?
>
>
>     thank you
>
>
>     Rick
>
>
>     _______________________________________________
>     lustre-discuss mailing list
>     lustre-discuss at lists.lustre.org
>     http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
