[Lustre-discuss] improving metadata performance [was Re: question about size on MDS (MDT) for lustre-1.8]

Robin Humble robin.humble+lustre at anu.edu.au
Wed Feb 9 07:11:20 PST 2011


<rejoining this topic after a couple of weeks of experimentation>

Re: trying to improve metadata performance:

we've been running with vfs_cache_pressure=0 on OSS's in production for
over a week now and it's improved our metadata performance by a large factor.

 - filesystem scans that didn't finish in ~30hrs now complete in a little
   over 3 hours. so >~10x speedup.

 - a recursive ls -altrR of my home dir (on a random uncached client) now
   runs at 2000 to 4000 files/s whereas before it could be <100 files/s.
   so 20 to 40x speedup.

of course vfs_cache_pressure=0 can be a DANGEROUS setting because
inodes/dentries will never be reclaimed, so OSS's could OOM.

however slabtop shows inodes are 0.89K and dentries 0.21K, i.e. small, so
I expect many sites can (like us) easily cache everything. for a given
number of inodes per OST it's easily calculable whether there's enough
OSS ram to safely set vfs_cache_pressure=0 and cache them all in slab.
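
for reference, a back-of-envelope version of that calculation plus the
sysctl itself - the inode count and mount point below are made up for
illustration (df -i on your own OST mounts gives the real numbers):

  # on each OSS: how many inodes could end up cached? (IUsed column)
  df -i /mnt/ost*

  # hypothetical example: 2M used inodes per OSS
  #   2,000,000 x (0.89K + 0.21K) ~= 2.2g of slab
  # if that fits comfortably in OSS ram alongside the normal caches:
  sysctl -w vm.vfs_cache_pressure=0
  # and to survive a reboot
  echo 'vm.vfs_cache_pressure = 0' >> /etc/sysctl.conf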

continued monitoring of the fs inode growth (== OSS slab size) over
time is very important as fs's will inevitably accrue more files...
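
watching /proc/slabinfo is enough for that - something along these
lines (slab cache names vary a bit between kernels, e.g. dentry vs
dentry_cache):

  # object counts and rough memory use of the inode/dentry slabs
  egrep 'inode_cache|dentry' /proc/slabinfo | \
    awk '{printf "%-28s %10d objs %8.1f MB\n", $1, $2, $2*$4/1048576}'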

sadly a slightly less extreme vfs_cache_pressure=1 wasn't as successful
at keeping stat rates high. sustained OSS cache memory pressure through
the day dropped enough inodes that nightly scans weren't fast any more.

our current residual issue with vfs_cache_pressure=0 is unexpected.
the number of OSS dentries appears to slowly grow over time :-/
it appears that some/many dentries for deleted files are not reclaimed
without some memory pressure.
any idea why that might be?

anyway, I've now added a few lines of code to create a different
(non-zero) vfs_cache_pressure knob for dentries. we'll see how that
goes...
an alternate (simpler) workaround would be to occasionally drop OSS
inode/dentry caches, or to set vfs_cache_pressure=100 once in a while,
and to just live with a day of slow stat's while the inode caches
repopulate.
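
i.e. something as dumb as this from cron, assuming there's enough
background memory pressure during the window for reclaim to actually
run (the 10 minutes is a guess, not a tested value):

  # let the VM reclaim inodes/dentries normally for a while...
  sysctl -w vm.vfs_cache_pressure=100
  sleep 600
  # ...then go back to pinning them
  sysctl -w vm.vfs_cache_pressure=0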

hopefully vfs_cache_pressure=0 also has a net small positive impact on
regular i/o due to reduced iops to OSTs, but I haven't tried to measure
that.
slab didn't steal much ram from our read and write_through caches (we
have 48g ram on OSS's and slab went up by about 1.6g, to 3.3g, with the
additional cached inodes/dentries) so OSS file caching should be
almost unaffected.

On Fri, Jan 28, 2011 at 09:45:10AM -0800, Jason Rappleye wrote:
>On Jan 27, 2011, at 11:34 PM, Robin Humble wrote:
>> limiting the total amount of OSS cache used in order to leave room for
>> inodes/dentries might be more useful. the data cache will always fill
>> up and push out inodes otherwise.

I disagree with myself now. I think mm/vmscan.c would probably still
call shrink_slab, so shrinkers would get called and some cached inodes
would get dropped.

>The inode and dentry objects in the slab cache aren't so much of an issue as having the disk blocks that each are generated from available in the buffer cache. Constructing the in-memory inode and dentry objects is cheap as long as the corresponding disk blocks are available. Doing the disk reads, depending on your hardware and some other factors, is not.

on a test cluster (with read and write_through caches still active and
synthetic i/o load) I didn't see a big change in stat rate from
dropping OSS page/buffer cache - at most a slowdown for a client
'ls -lR' of ~2x, and usually no slowdown at all. I suspect this is
because there is almost zero persistent buffer cache due to the OSS
buffer and page caches being punished by file i/o.
in the same testing, dropping OSS inode/dentry caches had a much larger
effect (up to 60x slowdown with synthetic i/o) - which is why the
vfs_cache_pressure setting works.
the synthetic i/o wasn't crazily intensive, but did have a working
set >>OSS mem which is likely true of our production machine.
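
(for completeness, by "dropping" I just mean something like the
standard drop_caches knobs, run on the OSS's between client passes:)

  # drop page/buffer cache only:
  sync; echo 1 > /proc/sys/vm/drop_caches
  # or drop dentries + inodes (slab) only:
  echo 2 > /proc/sys/vm/drop_caches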

however for your setup with OSS caches off, and from doing tests on our
MDS, I agree that buffer caches can have a big effect.

dropping our MDS buffer cache slows down a client 'lfs find' by ~4x,
but dropping inode/dentry caches doesn't slow it down at all, so
buffers are definitely important there.
happily we're not under any memory pressure on our MDS's at the
moment.

>We went the extreme and disabled the OSS read cache (+ writethrough cache). In addition, on the OSSes we pre-read all of the inode blocks that contain at least one used inode, along with all of the directory blocks. 
>
>The results have been promising so far. Firing off a du on an entire filesystem, 3000-6000 stats/second is typical. I've noted a few causes of slowdowns so far; there may be more.

we see about 2k files/s on the nightly sweeps now. that's with one
lfs find running and piping to parallel stat's. I think we can do
better with more parallelism in the finds, but 2k is so much better
than what it used to be that we're fairly happy for now.
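
roughly what the sweep looks like - the path and the parallelism here
are illustrative rather than our real config:

  # one find feeding a pool of stat workers
  # (glosses over filenames with whitespace, which xargs would mangle)
  lfs find /lustre -type f | xargs -P 8 -n 128 stat > /dev/null

  # more parallelism: one find per top-level directory
  for d in /lustre/*/ ; do
      lfs find "$d" -type f | xargs -P 4 -n 128 stat > /dev/null &
  done
  wait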

2k isn't as good as your stat rates, but we still have OSS caches on,
so the rest of our i/o should be benefiting from that.

>When memory runs low on a client, kswapd kicks in to try and free up pages. On the client I'm currently testing on, almost all of the memory used is in the slab. It looks like kswapd has a difficult time clearing things up, and the client can go several seconds before the current stat call is completed. Dropping caches will (temporarily) get the performance back to expected rates. I haven't dug into this one too much yet.

the last para of my prev email might help you.
we found client slab is hard to reclaim without limiting ldlm locks.
I haven't noticed a performance change from limiting ldlm lock counts.
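
(the ldlm limiting is just the usual client lru_size tuning - the 2000
below is an arbitrary example, not a recommendation:)

  # on a client: cap the number of cached ldlm locks per namespace
  # (a non-zero value pins the lru size; 0 restores the dynamic lru)
  lctl set_param ldlm.namespaces.*.lru_size=2000

  # see how many locks are currently held
  lctl get_param ldlm.namespaces.*.lock_count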

cheers,
robin


