[lustre-devel] Memory issues with bulk rpcs and buffered I/O; ptlrpc bug?
elliswilson at microsoft.com
Fri Feb 11 12:37:00 PST 2022
I'm still trying to get to the bottom of recurring OOM issues relating to https://jira.whamcloud.com/browse/LU-15468. Any help is greatly appreciated, even if it's just "here are docs on bulk rpcs / ptlrpc / etc".
The TL;DR is that iozone or dd can trivially drive used memory up against system limits. I/O then grinds to a halt because new pages cannot be allocated, eventually all of the outstanding pages are marked clean, and the process repeats. On longer runs the oom-killer pops in and out.
I've been able to reproduce it on the following combinations of (kernel) + (lustre) on Ubuntu 18.04 (client-side):
5.14.11 + 2.14.0
4.15.17 + 2.14.0
4.15.17 + 2.12.8
One of the things I noticed was that when I hit the high-memory-pressure condition, kswapd0 kicks in aggressively, and on-cpu graphs show that shrink_inactive_list is cycling wildly (up to 1M calls per second). Digging further into this, shrink_page_list is called from there, and every time it scans 32 pages it finds none available for reclaim. This is because each page no longer has a mapping (it was synced out to the OST long ago), so the vmscan code treats it almost like an anonymous page (though it's not) -- it's waiting for the last reference to be dropped. This behavior is NOT reproduced if you specify O_DIRECT.
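One way to quantify the "scans 32 pages, reclaims none" pattern without ftrace is to sample the kswapd counters in /proc/vmstat twice and compare how many pages were scanned against how many were actually reclaimed. This is only a rough sketch: the helper names are mine, and it assumes the kernel exposes the `pgscan_kswapd` and `pgsteal_kswapd` counters under those names, which may not hold on every kernel version.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_vmstat(path="/proc/vmstat"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_vmstat(f.read())


def kswapd_reclaim_efficiency(before, after):
    """Fraction of pages kswapd scanned between two samples that it reclaimed.

    Assumes the counter names pgscan_kswapd / pgsteal_kswapd; returns None
    when nothing was scanned in the interval.
    """
    scanned = after.get("pgscan_kswapd", 0) - before.get("pgscan_kswapd", 0)
    stolen = after.get("pgsteal_kswapd", 0) - before.get("pgsteal_kswapd", 0)
    if scanned <= 0:
        return None
    return stolen / scanned
```

Calling `read_vmstat()` once, sleeping a second, calling it again, and passing both results to `kswapd_reclaim_efficiency` should give a value near 0 while the buffered-I/O stall is in progress, if the description above is right.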
Tracing this with ftrace, I find:
If using direct I/O, kernel functions and counts per second matching 'get_page' or 'put_page' show up as (two seconds provided for each example):
But if you use buffered I/O, the result is:
Not only are the volumes significantly higher, but the ratio of gets to puts is way off. Note: the get_page_from_freelist count is made up of the same ~55K gets per second seen for DIO, PLUS one get per pagecache_get_page call.
So, this isn't 700K newly gotten pages per second, but more like 400K newly gotten pages per second (which matches the line-rate of around 1.6GB/s), 350K of which are referenced twice and end up gunking up the inactive page lru list for vmscan. Some time later you will see a mass put of these pages for non-DIO workloads:
I've tracked these puts to be caused by:
ptlrpcd_00_03-3531  .... 82284.061407: __put_page: (__put_page+0x0/0x80)
ptlrpcd_00_00-3527  .... 82284.061423: __put_page: (__put_page+0x0/0x80)
ptlrpcd_00_00-3527  .... 82284.061434: <stack trace>
The ptlrpcd_check loop basically processes around 2M+ of these in a second and suddenly we go from 15/16GB in Used to 1/16GB in Used, the rest having returned to Free.
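The page-rate arithmetic in this message can be sanity-checked with a few lines. This is just a sketch under assumptions: it takes a 4 KiB page size and decimal gigabytes, neither of which is stated explicitly above, and the function names are mine.

```python
PAGE_SIZE = 4096  # bytes per page; assumed, not stated in the message


def pages_per_sec_to_gb_per_sec(pages_per_sec, page_size=PAGE_SIZE):
    """Convert a page rate into throughput in decimal GB/s."""
    return pages_per_sec * page_size / 1e9


def gb_to_pages(gigabytes, page_size=PAGE_SIZE):
    """Number of whole pages needed to hold the given decimal gigabytes."""
    return int(gigabytes * 1e9 // page_size)


def seconds_to_pin(gigabytes, pinned_pages_per_sec, page_size=PAGE_SIZE):
    """Seconds until that much memory is held, if every pinned page stays pinned."""
    return gigabytes * 1e9 / (pinned_pages_per_sec * page_size)
```

With these assumptions, `pages_per_sec_to_gb_per_sec(400_000)` comes out at about 1.64 GB/s, in line with the ~1.6GB/s line rate, and `gb_to_pages(14)` is roughly 3.4 million pages, which fits the "2M+" puts seen when Used drops from 15GB to 1GB. `seconds_to_pin(14, 350_000)` suggests that pinning 350K pages per second would consume that 14GB in around ten seconds.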
Anybody who can shed light on possible causes of the above behavior, provide suggested experiments or tunables to revise, or point me at useful docs on PTLRPC behavior would be greatly appreciated.
I'm baffled that PTLRPC can seemingly hold onto references to an indefinite number of pages -- despite plenty of tunables existing at higher levels to control sizes of various caches, no tunable seems to govern how many refs ptlrpc can hold onto.