[Lustre-discuss] number of inodes in zfs MDT

Scott Nolin scott.nolin at ssec.wisc.edu
Thu Jun 12 18:43:59 PDT 2014


Just a note, I see zfs-0.6.3 has just been announced:

https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-announce/Lj7xHtRVOM4

I also see it is upgraded in the zfs/lustre repo.

The changelog notes that the default arc_meta_limit has 
changed to 3/4 of arc_c_max, along with a variety of other 
fixes, many focused on performance.
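
If it's useful, a quick way to confirm what is actually 
loaded after the upgrade, and how the metadata limit now 
relates to the ARC size (standard zfsonlinux paths; just a 
sketch):

    # installed module version
    cat /sys/module/zfs/version

    # current metadata limit vs. total ARC size, in bytes
    grep -E '^arc_meta_limit|^c_max' /proc/spl/kstat/zfs/arcstats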

So Anjana, this is probably worth testing, especially if 
you're considering drastic measures.

We upgraded our MDS, so this file create issue is harder 
for us to test now (we literally started testing writes 
this afternoon, and it hasn't degraded yet, so far at 20 
million writes). Since your problem still happens fairly 
quickly, I'm sure any information you have will be very 
helpful to add to LU-2476. And if it helps, it may save 
you some pain.

We will likely install the upgrade but may not be able to 
test millions of writes any time soon, as the filesystem 
is needed for production.

Regards,
Scott


On Thu, 12 Jun 2014 16:41:14 +0000
  "Dilger, Andreas" <andreas.dilger at intel.com> wrote:
> It looks like you've already increased arc_meta_limit 
>beyond the default, which is c_max / 4. That was critical 
>to performance in our testing.
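> 
> For reference, that limit is normally raised on 
>zfsonlinux through the zfs_arc_meta_limit module 
>parameter; the 48 GiB value below is only illustrative:
> 
>    # persist across reboots
>    echo "options zfs zfs_arc_meta_limit=51539607552" >> /etc/modprobe.d/zfs.conf
> 
>    # or change it on a running system
>    echo 51539607552 > /sys/module/zfs/parameters/zfs_arc_meta_limit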
> 
> There is also a patch from Brian that should help 
>performance in your case:
> http://review.whamcloud.com/10237
> 
> Cheers, Andreas
> 
> On Jun 11, 2014, at 12:53, "Scott Nolin" 
><scott.nolin at ssec.wisc.edu> wrote:
> 
> We tried a few arc tunables as noted here:
> 
> https://jira.hpdd.intel.com/browse/LU-2476
> 
> However, I didn't find any clear benefit in the long 
>term. We were just trying a few things without a lot of 
>insight.
> 
> Scott
> 
> On 6/9/2014 12:37 PM, Anjana Kar wrote:
> Thanks for all the input.
> 
> Before we move away from zfs MDT, I was wondering if we 
>can try setting zfs
> tunables to test the performance. Basically what's a 
>value we can use for
>arc_meta_limit for our system? Are there any other 
>settings that can be changed?
> 
> Generating small files on our current system, things 
>started off at 500 files/sec, then declined to about 
>1/20th of that after 2.45 million files.
> 
> -Anjana
> 
> On 06/09/2014 10:27 AM, Scott Nolin wrote:
> We ran some scrub performance tests, and even without 
>tunables set it
> wasn't too bad, for our specific configuration. The main 
>thing we did
> was verify it made sense to scrub all OSTs 
>simultaneously.
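> 
> Roughly, that just means kicking off the scrubs in 
>parallel and watching status (pool names here are 
>hypothetical):
> 
>    for pool in ost0pool ost1pool ost2pool; do
>        zpool scrub "$pool"    # start all OST scrubs at once
>    done
>    zpool status               # check progress and impact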
> 
> Anyway, indeed scrub and resilver aren't about defrag.
> 
>Further, the MDS performance issues aren't about 
>fragmentation.
> 
> As a side note, it's probably ideal to stay below 80% 
>usage for ldiskfs too, or performance degrades due to 
>fragmentation.
> 
> Sean, note I am dealing with specific issues for a very 
>create-intensive workload, and it is only on the MDS that 
>we may change. The data integrity features of ZFS make it 
>very attractive too. I fully expect things will improve 
>with ZFS as well.
> 
> If you want a lot of certainty in your choices, you may 
>want to consult various vendors of Lustre systems.
> 
> Scott
> 
> 
> 
> 
> On June 8, 2014 11:42:15 AM CDT, "Dilger, Andreas"
> <andreas.dilger at intel.com> wrote:
> 
>   Scrub and resilver have nothing to do with defrag.
> 
>   Scrub is a scan of all the data blocks in the pool to 
>verify their checksums and parity, detecting silent data 
>corruption and rewriting the bad blocks if necessary.
> 
>   Resilver is reconstructing a failed disk onto a new 
>disk using parity or mirror copies of all the blocks on 
>the failed disk. This is similar to scrub.
> 
>   Both scrub and resilver can be done online, though 
>resilver of course requires a spare disk to rebuild onto, 
>which may not be possible to add to a running system if 
>your hardware does not support it.
> 
>   Neither of them "improves" the performance or layout 
>of data on disk. They do impact performance because they 
>cause a lot of random I/O to the disks, though this 
>impact can be limited by tunables on the pool.
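> 
>   The usual knobs for that on zfsonlinux 0.6.x are the 
>scan throttle module parameters; a sketch, with 
>illustrative values:
> 
>      # delay (in ticks) injected between scrub / resilver I/Os
>      echo 4 > /sys/module/zfs/parameters/zfs_scrub_delay
>      echo 2 > /sys/module/zfs/parameters/zfs_resilver_delay
> 
>      # cap on in-flight scan I/Os per top-level vdev
>      echo 32 > /sys/module/zfs/parameters/zfs_top_maxinflight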
> 
>   Cheers, Andreas
> 
>   On Jun 8, 2014, at 4:21, "Sean Brisbane" 
><s.brisbane1 at physics.ox.ac.uk> wrote:
> 
>   Hi Scott,
> 
>   We are considering running zfs-backed lustre, and the 
>factor-of-10ish performance hit you see worries me. I 
>know zfs can scatter bits of files all over the place by 
>design. The oracle docs do recommend scrubbing the 
>volumes and keeping usage below 80% for maintenance and 
>performance reasons. I'm going to call it 'defrag', but 
>I'm sure someone who knows better will probably correct 
>me as to why it is not the same.
>   So: are these performance issues present even after 
>scrubbing, and is it possible to scrub online, i.e. is 
>some reasonable level of performance maintained while the 
>scrub happens?
>   Resilvering is also recommended. Not sure if that is 
>for performance reasons.
> 
>   http://docs.oracle.com/cd/E23824_01/html/821-1448/zfspools-4.html
> 
> 
> 
>   Sent from my HTC Desire C on Three
> 
>   ----- Reply message -----
>   From: "Scott Nolin" 
><scott.nolin at ssec.wisc.edu<mailto:scott.nolin at ssec.wisc.edu><mailto:scott.nolin at ssec.wisc.edu>>
>   To: "Anjana Kar" 
><kar at psc.edu<mailto:kar at psc.edu><mailto:kar at psc.edu>>, 
>"lustre-discuss at lists.lustre.org<mailto:lustre-discuss at lists.lustre.org><mailto:lustre-discuss at lists.lustre.org>" 
><lustre-discuss at lists.lustre.org<mailto:lustre-discuss at lists.lustre.org><mailto:lustre-discuss at lists.lustre.org>>
>   Subject: [Lustre-discuss] number of inodes in zfs MDT
>   Date: Fri, Jun 6, 2014 3:23 AM
> 
> 
> 
>   Looking at some of our existing zfs filesystems, we 
>have a couple with zfs MDTs.
> 
>   One has 103M inodes and uses 152G of MDT space, 
>another 12M and 19G. I'd plan for less than that, I 
>guess, as Mr. Dilger suggests; what will work depends 
>entirely on your expected average file size and number of 
>files.
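> 
>   Aside, on the original question: "-N" is an mke2fs 
>option, so it only applies to an ldiskfs MDT; a zfs MDT 
>allocates dnodes dynamically, and the inode count it 
>reports is an estimate based on remaining space. A rough 
>sketch of the two forms (hypothetical fsname and devices, 
>other options such as --mgsnode omitted):
> 
>      # ldiskfs MDT with an explicit inode count
>      mkfs.lustre --mdt --fsname=test --index=0 \
>          --mkfsoptions="-N 100000000" /dev/sda
> 
>      # zfs MDT on a mirrored pool - no "-N" here
>      mkfs.lustre --mdt --fsname=test --index=0 \
>          --backfstype=zfs mdt0pool/mdt0 mirror /dev/sdb /dev/sdc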
> 
>   We have run into some unpleasant surprises with zfs 
>for the MDT, I believe mostly documented in bug reports, 
>or at least hinted at.
> 
>   A serious issue we have is performance of the zfs arc 
>cache over time. This is something we didn’t see in early 
>testing, but with enough use it grinds things to a crawl. 
>I believe this may be addressed in the newer version of 
>ZFS, which we’re hopefully awaiting.
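> 
>   The way this tends to show up is arc_meta_used pinned 
>at arc_meta_limit; a quick check, just as a sketch:
> 
>      # print the arc_meta_* counters (used vs. limit is the interesting pair)
>      awk '/^arc_meta_/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats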
> 
>   Another thing we've seen, which is mysterious to me: 
>it appears that as the MDT begins to fill up, file create 
>rates go down. We don't really have a strong handle on 
>this (not enough for a bug report, I think), but we see 
>this:
> 
> 
>      1.
>   The aforementioned 104M inode / 152GB MDT system has 4 
>SAS drives in raid10. On initial testing, file creates 
>ran at about 2500 to 3000 per second. Follow-up testing 
>in its current state (about half full) shows them at 
>about 500 now, but with a few iterations of mdtest (see 
>the sketch after this list) those rates quickly plummet 
>to unbearable levels (like 30).
>      2.
>   We took a snapshot of the filesystem and sent it to 
>the backup MDS, this time with the MDT built on 4 SAS 
>drives in a raid0 - really not for performance so much as 
>"extra headroom", if that makes any sense. Testing this, 
>the create rates started higher, at maybe 800 or 1000 
>(this is from memory, I don't have my data in front of 
>me). That initial faster speed could just be writing to 4 
>spindles, I suppose, but surprisingly to me the 
>performance degraded at a slower rate. It took much 
>longer to get painfully slow, though it still got there. 
>In other words, the same number of writes on the 
>smaller/slower MDT degraded performance more quickly. My 
>guess is that had something to do with the total space 
>available. Who knows. I believe restarting lustre (and 
>certainly rebooting) 'resets the clock' on the file 
>create performance degradation.
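> 
>   For what it's worth, the create-rate testing here is 
>mdtest, with invocations along these lines (task counts, 
>file counts, and the target directory are only 
>illustrative):
> 
>      # 16 tasks, files only, 3 iterations, 62500 files per task
>      # (~1M files per iteration); the create phase rate is what matters
>      mpirun -np 16 mdtest -F -i 3 -n 62500 -u -d /mnt/lustre/mdtest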
> 
>   For that problem we're just going to try adding 4 
>SSDs, but it's an ugly problem. We are also once again 
>hopeful that the new zfs version addresses it.
> 
>   And finally, we've got a real concern with snapshot 
>backups of the MDT that my colleague posted about - the 
>problem we see manifests in an essentially read-only 
>recovered file system, so it's a concern but not quite 
>terrifying.
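> 
>   The backup in question is essentially a snapshot 
>streamed to the other MDS node, roughly like this 
>(pool/dataset names are hypothetical):
> 
>      zfs snapshot mdt0pool/mdt0@backup
>      zfs send mdt0pool/mdt0@backup | \
>          ssh backup-mds zfs receive backuppool/mdt0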
> 
>   All in all, for the next lustre file system we bring 
>up (in a couple of weeks), we are very strongly 
>considering going with ldiskfs for the MDT.
> 
>   Scott
> 
> 
> 
> 
> 
> 
> 
> 
>   From: Anjana Kar <kar at psc.edu>
>   Sent: Tuesday, June 3, 2014 7:38 PM
>   To: lustre-discuss at lists.lustre.org
> 
>   Is there a way to set the number of inodes for a zfs 
>MDT?
> 
>   I've tried using --mkfsoptions="-N value" as mentioned 
>in the lustre 2.0 manual, but it fails to accept it. We 
>are mirroring 2 80GB SSDs for the MDT, but the number of 
>inodes is getting set to 7 million, which is not enough 
>for a 100TB filesystem.
> 
>   Thanks in advance.
> 
>   -Anjana Kar
>      Pittsburgh Supercomputing Center
>      kar at psc.edu
>   ------------------------------------------------------------------------
> 
>   Lustre-discuss mailing list
>   Lustre-discuss at lists.lustre.org
>   http://lists.lustre.org/mailman/listinfo/lustre-discuss
>   ------------------------------------------------------------------------
> 
>   Lustre-discuss mailing list
>   Lustre-discuss at lists.lustre.org
>   http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 
> 
> 
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 
> 
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss



