[Lustre-discuss] improving metadata performance [was Re: question about size on MDS (MDT) for lustre-1.8]

Andreas Dilger adilger at whamcloud.com
Tue May 3 13:03:12 PDT 2011


Just to follow up on this issue. We landed a patch for 2.1 that limits the default OST read cache to objects 8MB and smaller.  This can still be tuned via /proc, but the new default is likely to provide better all-around performance by avoiding cache flushes for streaming read and write operations.
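
For anyone who wants to experiment ahead of 2.1, the tunable in question should be the obdfilter readcache_max_filesize parameter, e.g. on each OSS:

  lctl get_param obdfilter.*.readcache_max_filesize
  lctl set_param obdfilter.*.readcache_max_filesize=8388608   # 8MB, value in bytes

(adjust the value to taste for your workload.)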

Robin, it would be great to know if tuning this also solves your cache pressure woes, without having to resort to disabling VM cache pressure entirely (which isn't something we can do by default for all users).

Cheers, Andreas

On 2011-02-09, at 8:11 AM, Robin Humble <robin.humble+lustre at anu.edu.au> wrote:

> <rejoining this topic after a couple of weeks of experimentation>
> 
> Re: trying to improve metadata performance ->
> 
> we've been running with vfs_cache_pressure=0 on OSS's in production for
> over a week now and it's improved our metadata performance by a large factor.
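> 
> (for anyone wanting to try the same thing, it's just the standard VM
> sysctl on each OSS, eg.
> 
>   sysctl -w vm.vfs_cache_pressure=0
>   # or equivalently
>   echo 0 > /proc/sys/vm/vfs_cache_pressure
> 
> plus the matching line in /etc/sysctl.conf to survive reboots.)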
> 
> - filesystem scans that didn't finish in ~30hrs now complete in a little
>   over 3 hours. so >~10x speedup.
> 
> - a recursive ls -altrR of my home dir (on a random uncached client) now
>   runs at 2000 to 4000 files/s whereas before it could be <100 files/s.
>   so 20 to 40x speedup.
> 
> of course vfs_cache_pressure=0 can be a DANGEROUS setting because
> inodes/dentries will never be reclaimed, so OSS's could OOM.
> 
> however slabtop shows inodes are 0.89K and dentries 0.21K ie. small, so
> I expect many sites can (like us) easily cache everything. for a given
> number of inodes per OST it's easy to calculate whether there's enough
> OSS ram to safely set vfs_cache_pressure=0 and cache them all in slab.
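> 
> as a rough worked example (round numbers, not our exact counts): 10M
> cached inodes+dentries per OSS at ~1.1K each is only ~11g of slab,
> which fits easily in a 48g OSS.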
> 
> continued monitoring of the fs inode growth (== OSS slab size) over
> time is very important as fs's will inevitably accrue more files...
> 
> sadly a slightly less extreme vfs_cache_pressure=1 wasn't as successful
> at keeping stat rates high. sustained OSS cache memory pressure through
> the day dropped enough inodes that nightly scans weren't fast any more.
> 
> our current residual issue with vfs_cache_pressure=0 is unexpected.
> the number of OSS dentries appears to slowly grow over time :-/
> it appears that some/many dentries for deleted files are not reclaimed
> without some memory pressure.
> any idea why that might be?
> 
> anyway, I've now added a few lines of code to create a different
> (non-zero) vfs_cache_pressure knob for dentries. we'll see how that
> goes...
> an alternate (simpler) workaround would be to occasionally drop OSS
> inode/dentry caches, or to set vfs_cache_pressure=100 once in a while,
> and to just live with a day of slow stat's while the inode caches
> repopulate.
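> 
> (the occasional-drop variant would just be something like
> 
>   echo 2 > /proc/sys/vm/drop_caches
> 
> on each OSS from cron, which throws away dentries and inodes but
> leaves the page cache alone, or briefly setting vfs_cache_pressure
> back to 100.)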
> 
> hopefully vfs_cache_pressure=0 also has a net small positive impact on
> regular i/o due to reduced iops to OSTs, but I haven't tried to measure
> that.
> slab didn't steal much ram from our read and write_through caches (we
> have 48g ram on OSS's and slab only grew by about 1.6g, to 3.3g, with
> the additional cached inodes/dentries) so OSS file caching should be
> almost unaffected.
> 
> On Fri, Jan 28, 2011 at 09:45:10AM -0800, Jason Rappleye wrote:
>> On Jan 27, 2011, at 11:34 PM, Robin Humble wrote:
>>> limiting the total amount of OSS cache used in order to leave room for
>>> inodes/dentries might be more useful. the data cache will always fill
>>> up and push out inodes otherwise.
> 
> I disagree with myself now. I think mm/vmscan.c would probably still
> call shrink_slab, so shrinkers would get called and some cached inodes
> would get dropped.
> 
>> The inode and dentry objects in the slab cache aren't so much of an issue as having the disk blocks that each are generated from available in the buffer cache. Constructing the in-memory inode and dentry objects is cheap as long as the corresponding disk blocks are available. Doing the disk reads, depending on your hardware and some other factors, is not.
> 
> on a test cluster (with read and write_through caches still active and
> synthetic i/o load) I didn't see a big change in stat rate from
> dropping OSS page/buffer cache - at most a slowdown for a client
> 'ls -lR' of ~2x, and usually no slowdown at all. I suspect this is
> because there is almost zero persistent buffer cache due to the OSS
> buffer and page caches being punished by file i/o.
> in the same testing, dropping OSS inode/dentry caches had a much larger
> effect (up to 60x slowdown with synthetic i/o) - which is why the
> vfs_cache_pressure setting works.
> the synthetic i/o wasn't crazily intensive, but did have a working
> set >>OSS mem which is likely true of our production machine.
> 
> however for your setup with OSS caches off, and from doing tests on our
> MDS, I agree that buffer caches can have a big effect.
> 
> dropping our MDS buffer cache slows down a client 'lfs find' by ~4x,
> but dropping inode/dentry caches doesn't slow it down at all, so
> buffers are definitely important there.
> happily we're not under any memory pressure on our MDS's at the
> moment.
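> 
> (for completeness, the drops above are just the usual /proc knob on
> the MDS, ie.
> 
>   echo 1 > /proc/sys/vm/drop_caches   # buffer/page cache
>   echo 2 > /proc/sys/vm/drop_caches   # dentries and inodes
> 
> run between client sweeps.)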
> 
>> We went the extreme and disabled the OSS read cache (+ writethrough cache). In addition, on the OSSes we pre-read all of the inode blocks that contain at least one used inode, along with all of the directory blocks. 
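>> 
>> For the curious, a simplified sketch of the idea - not our exact script,
>> which also skips groups with no used inodes and pulls in the directory
>> blocks - would be roughly:
>> 
>>   # read every inode table on the ldiskfs OST device into the buffer cache
>>   DEV=/dev/sdX    # placeholder device name
>>   BS=$(dumpe2fs -h $DEV 2>/dev/null | awk '/Block size:/ {print $3}')
>>   dumpe2fs $DEV 2>/dev/null | awk '/Inode table at/ {print $4}' |
>>   while IFS=- read START END; do
>>       dd if=$DEV of=/dev/null bs=$BS skip=$START count=$((END - START + 1))
>>   done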
>> 
>> The results have been promising so far. Firing off a du on an entire filesystem, 3000-6000 stats/second is typical. I've noted a few causes of slowdowns so far; there may be more.
> 
> we see about 2k files/s on the nightly sweeps now. that's with one
> lfs find running and piping to parallel stat's. I think we can do
> better with more parallelism in the finds, but 2k is so much better
> than what it used to be we're fairly happy for now.
> 
> 2k isn't as good as your stat rates, but we still have OSS caches on,
> so the rest of our i/o should be benefiting from that.
> 
>> When memory runs low on a client, kswapd kicks in to try and free up pages. On the client I'm currently testing on, almost all of the memory used is in the slab. It looks like kswapd has a difficult time clearing things up, and the client can go several seconds before the current stat call is completed. Dropping caches will (temporarily) get the performance back to expected rates. I haven't dug into this one too much yet.
> 
> the last para of my prev email might help you.
> we found client slab is hard to reclaim without limiting ldlm locks.
> I haven't noticed a performance change from limiting ldlm lock counts.
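> 
> (the ldlm limiting is the usual client-side lru_size tunable, eg.
> something like
> 
>   lctl set_param ldlm.namespaces.*osc*.lru_size=600
> 
> where 600 is just an illustrative number - setting a non-zero value
> caps the per-namespace lock LRU instead of letting it grow dynamically.)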
> 
> cheers,
> robin
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss


