[lustre-discuss] lustre OSC and system cache

Dilger, Andreas andreas.dilger at intel.com
Wed Dec 14 16:08:55 PST 2016


On Dec 12, 2016, at 19:28, John Bauer <bauerj at iodoctors.com> wrote:
> 
> Andreas
> 
> The file system has lru_max_age=9000000.  I have been googling around to find out what this controls, but haven't found much.

This caps how long Lustre DLM locks are kept on a client.  Locks
may be cancelled earlier due to lock contention from other clients,
or if the client holds a large number of DLM locks.  With
lru_max_age=9000000 the maximum age should be 9000s = 2.5h, but if
your kernel is configured with HZ=100 it may be 25h...  Yes, that
is confusing, and it is being fixed.
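
A quick way to check which interpretation applies on your client (the
/boot/config-* path is distro-dependent, so treat this as a sketch):

    # current DLM lock LRU age for each namespace (in jiffies on
    # releases where the units have not yet been converted to seconds)
    lctl get_param ldlm.namespaces.*.lru_max_age

    # the kernel tick rate (HZ) is fixed at compile time; one way to
    # check it on distros that install the kernel config under /boot:
    grep 'CONFIG_HZ=' /boot/config-$(uname -r)

    # 9000000 jiffies / HZ=1000 = 9000s  (2.5h)
    # 9000000 jiffies / HZ=100  = 90000s (25h)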

> Is there documentation on how the memory management works with Lustre?  I wonder what the LRU actually covers.  How is it that two files on the same node are not controlled by the same LRU mechanism, as SCR300's pages are being LRU'ed out when they were clearly used more recently than any in SCRATCH?

There are different LRUs for different types of objects.  The DLM LRU
(controlled with ldlm.namespaces.*.{lru_max_age,lru_size}) governs
Lustre DLM lock expiration, while the kernel pagecache LRU controls
page aging.
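
To see both sides at once, something like the following shows the DLM
lock counts next to the client data cache usage (parameter names here
are taken from recent clients and may vary slightly by release):

    # number of DLM locks currently held in each namespace's LRU
    lctl get_param ldlm.namespaces.*.lock_count

    # per-mount limit and usage for cached file data on the client
    lctl get_param llite.*.max_cached_mb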

Depending on your access patterns, it may be that the initial file's
pages are staying in cache because they were accessed multiple times,
while the new pages are evicted before they can be accessed multiple
times, and there aren't enough other locks in the DLM LRU to force
the old ones out.  Reducing the maximum DLM lock age would help to
avoid this kind of problem.
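
As a rough sketch, the first command below uses an illustrative 600s
(10 minute) age, written as 600000 on the assumption that the units
are jiffies with HZ=1000; the second is the documented way to drop
all unused locks immediately:

    # cancel idle DLM locks (releasing the pages they pin) after 10min
    lctl set_param ldlm.namespaces.*.lru_max_age=600000

    # flush unused locks (and the cached pages under them) right away
    lctl set_param ldlm.namespaces.*.lru_size=clear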

Cheers, Andreas

> Thanks
> 
> John
> 
> 
> On 12/12/2016 6:59 PM, Dilger, Andreas wrote:
>> On Dec 12, 2016, at 15:50, John Bauer <bauerj at iodoctors.com> wrote:
>>> I'm observing some undesirable caching of OSC data in the system buffers.  This is a single-node, single-process application.  There are two files of interest, SCRATCH and SCR300, both scratch files with stripeCount=4.  The system has 128GB of memory.  Lustre maxes out at about 59GB of memory used for caching.
>>> SCRATCH:  About 22GB is written/read during the first 300 seconds of the run.  There is no further activity on the file (though it remains open) until about 18,700 seconds into the run, when another 22GB is written/read.  This is illustrated in the top frame of the first plot below.  The bottom frame of the first plot shows the amount of system cache used by each of the 4 OSCs associated with the file over the course of the run (nearly identical, as would be expected).  Note that each of the OSCs retains its 5.5GB of memory even though nothing is happening to the file.
>>> SCR300:  A 110GB file, written and repeatedly read between the times of the above SCRATCH file's I/O.
>>> 
>>> What is of interest is that while SCR300 is doing all its I/O, and its associated OSCs are fighting each other for cache memory, the 4 OSCs for the inactive file (SCRATCH) retain their 22GB of memory.  Why are the 4 OSCs for the inactive file exempt from giving up their memory?  It is very reproducible.
>> You don't mention what Lustre version you are using, which makes it hard
>> to comment specifically.  That said, you could try reducing the lock LRU
>> age, which was changed by default in the 2.8 or 2.9 release to 3900s
>> (65 minutes) instead of 36000s (10h) via:
>> 
>>         lctl set_param ldlm.namespaces.*.lru_max_age=3900000
>> 
>> (though check what your current setting is, since the units are in
>> "jiffies" (HZ) and that may differ depending on kernel compile options).
>> 
>> Cheers, Andreas
>> 
>>> The application is MSC.Nastran, which can place the SCR300 data inside SCRATCH (increasing its size to 132GB).  When run in this mode, the caching behavior is much better behaved and the job runs in 11,500 seconds versus 19,000.  This is illustrated in the 3rd plot below.  While this is a solution for this case, it is not a general solution.
>>> 
>>> Thanks
>>> 
>>> John
>>> Plots for SCRATCH  [attachment: bfoimgfaenjmgmii.png]
>>>
>>> Plots for SCR300  [attachment: mncccijbfkiekmmn.png]
>>>
>>> Plots for SCR300 inside of SCRATCH  [attachment: adnondhpelpohhjf.png]
>>> -- 
>>> I/O Doctors, LLC
>>> 507-766-0378
>>> 
>>> bauerj at iodoctors.com


