<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

</head>

<body>

Neil,<br>

<br>

My understanding is that marking the inode cache as reclaimable would make Lustre unusual, perhaps unique, among Linux file systems.  Is that incorrect?<br>

<br>

- Patrick

<hr style="display:inline-block;width:98%" tabindex="-1">

<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> lustre-discuss <lustre-discuss-bounces@lists.lustre.org> on behalf of NeilBrown <neilb@suse.com><br>

<b>Sent:</b> Monday, April 29, 2019 8:53:43 PM<br>

<b>To:</b> Jacek Tomaka<br>

<b>Cc:</b> lustre-discuss@lists.lustre.org<br>

<b>Subject:</b> Re: [lustre-discuss] Lustre client memory and MemoryAvailable</font>

<div> </div>

</div>

<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">

<div class="PlainText">On Mon, Apr 29 2019, Jacek Tomaka wrote:<br>

<br>

>> so lustre_inode_cache is the real culprit when signal_cache appears to<br>

>>  be large.<br>

>> This cache is slaved on the common inode cache, so there should be one<br>

>> entry for each lustre inode that is in memory.<br>

>> These inodes should get pruned when they've been inactive for a while.<br>

><br>

> What triggers the pruning?<br>

><br>

<br>

Memory pressure.<br>

The approximate approach is to try to free some unused pages and about 1/2000th of<br>

the entries in each slab.  Then if that hasn't made enough space<br>

available, try again.<br>
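<br>
As a rough illustration of that retry loop, here is a toy model in Python.<br>
It is not kernel code: the function name reclaim_passes, the dict of slab<br>
counts, and the fixed 1/2000 divisor are assumptions made for the sketch,<br>
and it models only the slab trimming, not the freeing of unused pages.<br>

```python
def reclaim_passes(slab_entries, target, divisor=2000, max_passes=100000):
    """Toy model of reclaim: trim about 1/divisor of every slab per pass.

    slab_entries maps a (hypothetical) slab name to its entry count.
    Returns (passes_used, entries_freed). The input dict is not modified.
    """
    entries = dict(slab_entries)
    freed = 0
    passes = 0
    while freed < target and passes < max_passes:
        progress = 0
        for name, count in entries.items():
            # Free at least one entry from any non-empty slab.
            take = min(count, max(1, count // divisor)) if count else 0
            entries[name] = count - take
            progress += take
        if progress == 0:
            break  # nothing left to free, give up
        freed += progress
        passes += 1
    return passes, freed
```

For example, reclaim_passes({"a": 4000, "b": 2000}, 5) needs two passes in<br>
this model, because the first pass frees only three entries.<br>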

<br>

>>If you look in /proc/sys/fs/inode-nr  there should be two numbers:<br>

>>  The first is the total number of in-memory inodes for all filesystems.<br>

>>  The second is the number of "unused" inodes.<br>

>><br>

>>  When you write "3" to drop_caches, the second number should drop down to<br>

>> nearly zero (I get 95 on my desktop, down from 6524).<br>

><br>

> Ok, that is useful to know, but echoing 3 to drop_caches or generating memory<br>

> pressure clears most of the signal_cache (inode) as well as other lustre<br>

> objects, so this is working fine.<br>

<br>

Oh good, I hadn't remembered clearly what the issue was.<br>
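<br>
For anyone repeating the inode-nr check quoted above, it can be scripted.<br>
A minimal sketch, assuming Python 3 on a Linux client; the helper names<br>
parse_inode_nr and read_inode_nr are made up for illustration:<br>

```python
def parse_inode_nr(text):
    """Parse the two counters found in /proc/sys/fs/inode-nr.

    Returns (total_inodes, unused_inodes).
    """
    total, unused = (int(tok) for tok in text.split()[:2])
    return total, unused


def read_inode_nr(path="/proc/sys/fs/inode-nr"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_inode_nr(f.read())
```

Calling read_inode_nr() before and after writing 3 to drop_caches (as root)<br>
shows whether the second, "unused" number falls.<br>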

<br>

><br>

> The issue that remains is that they are marked as SUnreclaim vs<br>

> SReclaimable.<br>

<br>

Yes, I think lustre_inode_cache should certainly be flagged as<br>

SLAB_RECLAIM_ACCOUNT.<br>

If the SReclaimable value is too small (and there aren't many<br>

reclaimable pagecache pages), vmscan can decide not to bother.  This is<br>

probably a fairly small risk but it is possible that the missing<br>

SLAB_RECLAIM_ACCOUNT flag can result in memory not being reclaimed when<br>

it could be.<br>
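<br>
To see how slab memory is currently split between the two counters, the<br>
SReclaimable and SUnreclaim fields of /proc/meminfo can be compared.  A<br>
minimal sketch in Python 3; the function names are hypothetical and the<br>
values are taken as the kB figures that /proc/meminfo reports:<br>

```python
def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo field names to integer values."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def slab_split(fields):
    """Return (reclaimable_kb, unreclaimable_kb, reclaimable_fraction)."""
    rec = fields.get("SReclaimable", 0)
    unrec = fields.get("SUnreclaim", 0)
    total = rec + unrec
    return rec, unrec, (rec / total if total else 0.0)


def read_slab_split(path="/proc/meminfo"):
    """Read the live values (Linux only)."""
    with open(path) as f:
        return slab_split(parse_meminfo(f.read()))
```

If lustre_inode_cache is accounted as unreclaimable, a large inode cache<br>
would show up in the second value returned by read_slab_split().<br>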

<br>

Thanks,<br>

NeilBrown<br>

<br>

<br>

> So I do not think there is a memory leak per se.<br>

><br>

> Regards.<br>

> Jacek Tomaka<br>

><br>

> On Mon, Apr 29, 2019 at 1:39 PM NeilBrown <neilb@suse.com> wrote:<br>

><br>

>><br>

>> Thanks Jacek,<br>

>>  so lustre_inode_cache is the real culprit when signal_cache appears to<br>

>>  be large.<br>

>>  This cache is slaved on the common inode cache, so there should be one<br>

>>  entry for each lustre inode that is in memory.<br>

>>  These inodes should get pruned when they've been inactive for a while.<br>

>><br>

>>  If you look in /proc/sys/fs/inode-nr  there should be two numbers:<br>

>>   The first is the total number of in-memory inodes for all filesystems.<br>

>>   The second is the number of "unused" inodes.<br>

>><br>

>>  When you write "3" to drop_caches, the second number should drop down to<br>

>>  nearly zero (I get 95 on my desktop, down from 6524).<br>

>><br>

>>  When signal_cache stays large even after the drop_caches, it suggests<br>

>>  that there are lots of lustre inodes that are thought to be still<br>

>>  active.   I'd have to do a bit of digging to understand what that means,<br>

>>  and a lot more to work out why lustre is holding on to inodes longer<br>

>>  than you would expect (if that actually is the case).<br>

>><br>

>>  If an inode still has cached data pages attached that cannot easily be<br>

>>  removed, it will not be purged even if it is unused.<br>

>>  So if you see the "unused" number remaining high even after a<br>

>>  "drop_caches", that might mean that lustre isn't letting go of cache<br>

>>  pages for some reason.<br>

>><br>

>> NeilBrown<br>

>><br>

>><br>

>><br>

>> On Mon, Apr 29 2019, Jacek Tomaka wrote:<br>

>><br>

>> > Wow, Thanks Nathan and NeilBrown.<br>

>> > It is great to learn about slub merging. It is awesome to have a<br>

>> > reproducer.<br>

>> > I am yet to trigger my original problem with slub_nomerge, but the<br>

>> > slabinfo tool (in kernel sources) can actually show merged caches:<br>

>> > kernel/3.10.0-693.5.2.el7/tools/slabinfo  -a<br>

>> ><br>

>> > :t-0000112   <- sysfs_dir_cache kernfs_node_cache blkdev_integrity task_delay_info<br>

>> > :t-0000144   <- flow_cache cl_env_kmem<br>

>> > :t-0000160   <- sigqueue lov_object_kmem<br>

>> > :t-0000168   <- lovsub_object_kmem osc_extent_kmem<br>

>> > :t-0000176   <- vvp_object_kmem nfsd4_stateids<br>

>> > :t-0000192   <- ldlm_resources kiocb cred_jar inet_peer_cache key_jar file_lock_cache kmalloc-192 dmaengine-unmap-16 bio_integrity_payload<br>

>> > :t-0000216   <- vvp_session_kmem vm_area_struct<br>

>> > :t-0000256   <- biovec-16 ip_dst_cache bio-0 ll_file_data kmalloc-256 sgpool-8 filp request_sock_TCP rpc_tasks request_sock_TCPv6 skbuff_head_cache pool_workqueue lov_thread_kmem<br>

>> > :t-0000264   <- osc_lock_kmem numa_policy<br>

>> > :t-0000328   <- osc_session_kmem taskstats<br>

>> > :t-0000576   <- kioctx xfrm_dst_cache vvp_thread_kmem<br>

>> > :t-0001152   <- signal_cache lustre_inode_cache<br>

>> ><br>

>> > It is not on a machine that had the problem I described before, but the<br>

>> > kernel version is the same, so I am assuming the cache merges are the same.<br>

>> ><br>

>> > Looks like signal_cache points to lustre_inode_cache.<br>

>> > Regards.<br>

>> > Jacek Tomaka<br>

>> ><br>

>> ><br>

>> > On Thu, Apr 25, 2019 at 7:42 AM NeilBrown <neilb@suse.com> wrote:<br>

>> ><br>

>> >><br>

>> >> Hi,<br>

>> >>  you seem to be able to reproduce this fairly easily.<br>

>> >>  If so, could you please boot with the "slub_nomerge" kernel parameter<br>

>> >>  and then reproduce the (apparent) memory leak.<br>

>> >>  I'm hoping that this will show some other slab that is actually using<br>

>> >>  the memory - a slab with very similar object-size to signal_cache that<br>

>> >>  is, by default, being merged with signal_cache.<br>

>> >><br>

>> >> Thanks,<br>

>> >> NeilBrown<br>

>> >><br>

>> >><br>

>> >> On Wed, Apr 24 2019, Nathan Dauchy - NOAA Affiliate wrote:<br>

>> >><br>

>> >> > On Mon, Apr 15, 2019 at 9:18 PM Jacek Tomaka <jacekt@dug.com> wrote:<br>

>> >> ><br>

>> >> >><br>

>> >> >> >signal_cache should have one entry for each process (or thread-group).<br>

>> >> >><br>

>> >> >> That is what I thought as well; looking at the kernel source, allocations<br>

>> >> >> from signal_cache happen only during fork.<br>

>> >> >><br>

>> >> >><br>

>> >> > I was recently chasing an issue with clients suffering from low memory and<br>

>> >> > saw that "signal_cache" was a major player.  But the workload on those<br>

>> >> > clients was not doing a lot of forking (and I don't *think* threading<br>

>> >> > either).  Rather, it was a LOT of metadata read operations.<br>

>> >> ><br>

>> >> > You can see the symptoms by a simple "du" on a Lustre file system:<br>

>> >> ><br>

>> >> > # grep signal_cache /proc/slabinfo<br>

>> >> > signal_cache         967   1092   1152   28    8 : tunables    0    0    0 : slabdata     39     39      0<br>

>> >> ><br>

>> >> > # du -s /mnt/lfs1/projects/foo<br>

>> >> > 339744908 /mnt/lfs1/projects/foo<br>

>> >> ><br>

>> >> > # grep signal_cache /proc/slabinfo<br>

>> >> > signal_cache      164724 164724   1152   28    8 : tunables    0    0    0 : slabdata   5883   5883      0<br>

>> >> ><br>

>> >> > # slabtop -s c -o | head -n 20<br>

>> >> >  Active / Total Objects (% used)    : 3660791 / 3662863 (99.9%)<br>

>> >> >  Active / Total Slabs (% used)      : 93019 / 93019 (100.0%)<br>

>> >> >  Active / Total Caches (% used)     : 72 / 107 (67.3%)<br>

>> >> >  Active / Total Size (% used)       : 836474.91K / 837502.16K (99.9%)<br>

>> >> >  Minimum / Average / Maximum Object : 0.01K / 0.23K / 12.75K<br>

>> >> ><br>

>> >> >   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME<br>

>> >> ><br>

>> >> > 164724 164724 100%    1.12K   5883       28    188256K signal_cache<br>

>> >> ><br>

>> >> > 331712 331712 100%    0.50K  10366       32    165856K ldlm_locks<br>

>> >> ><br>

>> >> > 656896 656896 100%    0.12K  20528       32     82112K kmalloc-128<br>

>> >> ><br>

>> >> > 340200 339971  99%    0.19K   8100       42     64800K kmalloc-192<br>

>> >> ><br>

>> >> > 162838 162838 100%    0.30K   6263       26     50104K osc_object_kmem<br>

>> >> ><br>

>> >> > 744192 744192 100%    0.06K  11628       64     46512K kmalloc-64<br>

>> >> ><br>

>> >> > 205128 205128 100%    0.19K   4884       42     39072K dentry<br>

>> >> ><br>

>> >> >   4268   4256  99%    8.00K   1067        4     34144K kmalloc-8192<br>

>> >> ><br>

>> >> > 162978 162978 100%    0.17K   3543       46     28344K vvp_object_kmem<br>

>> >> ><br>

>> >> > 162792 162792 100%    0.16K   6783       24     27132K kvm_mmu_page_header<br>

>> >> ><br>

>> >> > 162825 162825 100%    0.16K   6513       25     26052K sigqueue<br>

>> >> ><br>

>> >> >  16368  16368 100%    1.02K    528       31     16896K nfs_inode_cache<br>

>> >> ><br>

>> >> >  20385  20385 100%    0.58K    755       27     12080K inode_cache<br>

>> >> ><br>

>> >> ><br>

>> >> > Repeat that for more (and bigger) directories and slab cache added up to<br>

>> >> > more than half the memory on this 24GB node.<br>

>> >> ><br>

>> >> > This is with CentOS-7.6 and lustre-2.10.5_ddn6.<br>

>> >> ><br>

>> >> > I worked around the problem by tackling the "ldlm_locks" memory usage with:<br>

>> >> > # lctl set_param ldlm.namespaces.lfs*.lru_max_age=10000<br>

>> >> ><br>

>> >> > ...but I did not find a way to reduce the "signal_cache".<br>

>> >> ><br>

>> >> > Regards,<br>

>> >> > Nathan<br>

>> >><br>

>> ><br>

>> ><br>

>> > --<br>

>> > *Jacek Tomaka*<br>

>> > Geophysical Software Developer<br>

>> ><br>

>> ><br>

>> ><br>

>> ><br>

>> ><br>

>> ><br>

>> > *DownUnder GeoSolutions*<br>

>> > 76 Kings Park Road<br>

>> > West Perth 6005 WA, Australia<br>

>> > *tel *+61 8 9287 4143<br>

>> > jacekt@dug.com<br>

>> > *www.dug.com <<a href="http://www.dug.com">http://www.dug.com</a>>*<br>

>><br>

><br>

><br>

> -- <br>

> *Jacek Tomaka*<br>

> Geophysical Software Developer<br>

><br>

><br>

><br>

><br>

><br>

><br>

> *DownUnder GeoSolutions*<br>

> 76 Kings Park Road<br>

> West Perth 6005 WA, Australia<br>

> *tel *+61 8 9287 4143<br>

> jacekt@dug.com<br>

> *www.dug.com <<a href="http://www.dug.com">http://www.dug.com</a>>*<br>

</div>

</span></font></div>

</body>

</html>