[Lustre-discuss] Can't increase effective client read cache

Dilger, Andreas andreas.dilger at intel.com
Wed Sep 25 10:02:49 PDT 2013


On 2013-09-24, at 14:15, "Carlson, Timothy S" <Timothy.Carlson at pnnl.gov> wrote:

> I've got an odd situation that I can't seem to fix. 
> 
> My setup is Lustre 1.8.8-wc1 clients on RHEL 6 talking to 1.8.6 servers on RHEL 5.
> 
> My compute nodes have 64 GB of memory and I have a use case where an application has very low memory usage and needs to access a few thousand files in Lustre that range from 10 to 50 MB.  The files are subject to some reuse, so it would be advantageous to cache as much of the data as possible.  The default cache for this configuration would be 48GB on the client, as that is 75% of memory.  However, the client never caches more than about 40GB of data according to /proc/meminfo.
> 
> Even if I tune the cached memory up to 64GB, the amount of cache in use never goes past 40GB. My current setting is as follows:
> 
> # lctl get_param llite.*.max_cached_mb
> llite.olympus-ffff8804069da800.max_cached_mb=64000
> 
> I've also played with some of the VM tunable settings, like turning vfs_cache_pressure down to 10:
> 
> # vm.vfs_cache_pressure = 10
> 
> In no case do I see more than about 35GB of cache being used.  To do some more testing on this, I created a bunch (40) of 2G files in Lustre and then copied them to /dev/null on the client. While doing this I ran the fincore tool from http://code.google.com/p/linux-ftools/ to see whether each file was still in cache. Once about 40GB of cache was used, the kernel started to drop files from the cache even though there was no memory pressure on the system. 
> 
> If I do the same test with files local to the system, I can fill all the cache to about 61GB before files start getting dropped. 
> 
> Is there some other Lustre tunable on the client that I can twiddle with to make more use of the local memory cache?

This might relate to the number of DLM locks cached on the client. If the locks get cancelled for some reason (e.g. memory pressure on the server, old age) then the pages covered by those locks will also be dropped.
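
One way to check whether that is what's happening (a rough sketch; these are the standard lock-namespace parameters, adjust the glob for your filesystem) is to watch the client's cached lock count while the page cache shrinks:

    # Number of DLM locks currently held by the client, per server namespace
    lctl get_param ldlm.namespaces.*.lock_count

    # Current lock LRU size limit (0 means the size is managed dynamically)
    lctl get_param ldlm.namespaces.*.lru_size

If lock_count drops at the same time the cached data does, the pages are being released because their locks were cancelled, not because of VM memory pressure.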

You could try disabling the lock LRU and specifying some large static number of locks (for testing only; I wouldn't leave this set on production systems with large numbers of clients):

    lctl set_param ldlm.namespaces.*.lru_size=10000

To reset it to dynamic DLM LRU size management, set a value of "0".
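
For example (a minimal sketch; limiting the change to the OSC namespaces via a "*osc*" glob is just one way to target the data locks):

    # Pin a large static lock LRU on the OST connections (testing only)
    lctl set_param ldlm.namespaces.*osc*.lru_size=10000

    # Confirm the new limit took effect
    lctl get_param ldlm.namespaces.*osc*.lru_size

    # When done testing, return to dynamic LRU sizing
    lctl set_param ldlm.namespaces.*.lru_size=0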

Cheers, Andreas
