[lustre-discuss] [EXTERNAL] Limiting Lustre memory use?

bill broadley bill at broadley.org
Tue Feb 22 09:54:11 PST 2022

On 2/22/22 06:33, Ellis Wilson wrote:
> Hi Bill,
> I just ran into a similar issue.  See:
> https://jira.whamcloud.com/browse/LU-15468

Ah, very interesting; that definitely adds to my knowledge in this area.  
I'm surprised the 5-second vs. 30-second commit interval made such a big 
difference; I'd have expected a high-water mark to give more priority to 
flushing writes as cache use increased.

> Lustre definitely caches data in the pagecache, and as far as I have seen

I found some older discussions on this.  I believe Lustre stores metadata 
(inodes, file names, permissions, directories, timestamps, etc.) in the page 
cache, but not data (the file contents).

> metadata in slab.  I'd start by running slabtop on a client machine if you can stably reproduce the OOM situation, or creating a cronjob to cat /proc/meminfo and /proc/vmstat into a file at minutely intervals to try to save state of the machine before it goes belly up.  If you see a tremendous amount consumed by Lustre slabs then it's likely on the inode caching side (the slab name should be indicative though), and you might try a client build with this recent change to see if it mitigates the issue:
> https://review.whamcloud.com/#/c/39973

Ah, thanks, will take a look at that URL as well.
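In case it's useful to anyone else on the list, here's roughly what I'm 
planning for the minutely snapshot cronjob you describe; the log location 
is just a placeholder, and this is only a sketch:

```shell
#!/bin/sh
# Snapshot /proc/meminfo, /proc/vmstat, and slab state once a minute so
# there is evidence on disk if the node goes belly up.  SNAPDIR is just
# an example location; anything that survives a crash/reboot works.
SNAPDIR="${SNAPDIR:-/tmp/mem-snapshots}"
mkdir -p "$SNAPDIR"
LOG="$SNAPDIR/$(date +%Y%m%d-%H%M%S).log"
{
  date
  echo '--- /proc/meminfo ---'
  cat /proc/meminfo
  echo '--- /proc/vmstat ---'
  cat /proc/vmstat
  echo '--- slab (top entries, needs root) ---'
  head -n 30 /proc/slabinfo 2>/dev/null || echo '(slabinfo unreadable)'
} > "$LOG" 2>&1
```

Dropped into cron as something like `* * * * * /usr/local/sbin/mem-snapshot.sh` 
(path is hypothetical), that should capture the machine's state in the 
minute before an OOM.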

> Note that disabling the Lustre inode cache like this will inherently apply significantly more pressure on your MDTs, but if it keeps you out of OOM territory, it's probably a win.

Well, I'm happy to give any fixed amount of memory to Lustre, and since we want 
to offload our MDTs as much as possible, caching zero metadata isn't an option.  
I just want to be able to say something like: 8GB for Lustre (metadata cache, 
data cache, and any write caching/queueing) and 248GB for user jobs.  So far 
I've found a parameter for the percentage of RAM allowed for dirty pages, but 
no control for the rest.  Anything in the page cache should be flushed under 
memory pressure, but from what I can tell, data (as opposed to metadata) 
caching isn't controlled, or at least I haven't found the control.
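For what it's worth, the candidate knobs I'm planning to experiment with on 
the clients are below; I haven't yet confirmed they bound the data cache the 
way I want, so treat this as a sketch to verify against your Lustre version:

```shell
# Client-side cap on cached file data, in MB (separate from metadata/slab).
lctl get_param llite.*.max_cached_mb

# Limit cached file data to ~8 GB per client.
lctl set_param llite.*.max_cached_mb=8192

# Per-OSC limit on dirty (not-yet-written) data, also in MB.
lctl get_param osc.*.max_dirty_mb
```

Note these are `lctl set_param` settings, so as I understand it they don't 
persist across a remount unless set with `-P` on the MGS or reapplied at 
mount time.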
> In my case it wasn't metadata that was forcing my clients to OOM, but PTLRPC holding onto references to pages the rest of Lustre thought it was done with until my OSTs committed their transactions.  Revising my OST mount options to use an explicit commit=5 fixed my problem.

Sounds promising, and that certainly helps, but it's not quite the cap on 
maximum caching I was looking for.  It reminds me of the ZFS ZIL/slog, which 
I believe has a similar default of buffering writes for 5 seconds.
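For the archive, my understanding of your fix expressed as mount options 
(the device and mount point here are examples, and this assumes an 
ldiskfs-backed OST, since `commit` is the ext4 journal commit interval):

```shell
# On the OSS: mount the OST with an explicit 5-second journal commit
# interval so the backend commits (and PTLRPC can release page
# references) promptly rather than on the default schedule.
mount -t lustre -o commit=5 /dev/mapper/ost0 /mnt/lustre/ost0
```

The ZFS analogue, if I have it right, is the `zfs_txg_timeout` module 
parameter, which also defaults to 5 seconds.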

Much appreciated, very helpful, and I'll keep digging.
