<div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div>I did : <br></div><div>for i in {1..100}; do cat /proc/36960/stack >$i; sleep 1; done</div><div>in one bash and on the other one(36960): <br></div><div>time -p echo 3 >/proc/sys/vm/drop_caches</div><div><br></div><div>It took about two minutes, unfortunately most of the time it claims that it was not doing anything kernel side: <br></div><div>[<ffffffffffffffff>] 0xffffffffffffffff</div><div>with the exception of two files, at 32 sec and 73 sec: <br></div><div>root@xxx xxx]# cat 32<br>[<ffffffffc11231db>] cl_sync_file_range+0x2db/0x380 [lustre]<br>[<ffffffffffffffff>] 0xffffffffffffffff<br>[root@xxx xxx]# cat 73<br>[<ffffffffc11231db>] cl_sync_file_range+0x2db/0x380 [lustre]<br>[<ffffffffc11330f6>] ll_delete_inode+0xa6/0x1c0 [lustre]<br>[<ffffffff8121d729>] evict+0xa9/0x180<br>[<ffffffff8121d83e>] dispose_list+0x3e/0x50<br>[<ffffffff8121e834>] prune_icache_sb+0x174/0x340<br>[<ffffffff81203863>] prune_super+0x143/0x170<br>[<ffffffff81195443>] shrink_slab+0x163/0x330<br>[<ffffffff812655f3>] drop_caches_sysctl_handler+0xc3/0x120<br>[<ffffffff8127c203>] proc_sys_call_handler+0xd3/0xf0<br>[<ffffffff8127c234>] proc_sys_write+0x14/0x20<br>[<ffffffff81200cad>] vfs_write+0xbd/0x1e0<br>[<ffffffff81201abf>] SyS_write+0x7f/0xe0<br>[<ffffffff816b5292>] tracesys+0xdd/0xe2<br>[<ffffffffffffffff>] 0xffffffffffffffff</div><div><br></div><div>also after unmounting lustre fs and removing all modules i could relate to lustre i could still see some vvp_object_kmem, is it expected?</div><div>[root@xxx xxx]# rmmod obdclass ptlrpc ksocklnd libcfs lnet lustre fid mdc osc cnetmgc fld lmv lov; <br>rmmod: ERROR: Module obdclass is not currently loaded<br>rmmod: ERROR: Module ptlrpc is not currently loaded<br>rmmod: ERROR: Module ksocklnd is not currently loaded<br>rmmod: ERROR: Module libcfs is not currently loaded<br>rmmod: ERROR: Module lnet is not currently loaded<br>rmmod: ERROR: Module lustre is not currently loaded<br>rmmod: ERROR: Module fid is not currently loaded<br>rmmod: ERROR: Module mdc is not currently loaded<br>rmmod: ERROR: Module osc is not currently loaded<br>rmmod: ERROR: Module cnetmgc is not currently loaded<br>rmmod: ERROR: Module fld is not currently loaded<br>rmmod: ERROR: Module lmv is not currently loaded<br>rmmod: ERROR: Module lov is not currently loaded<br>[root@xxx xxx]# cat /proc/slabinfo |grep vvp<br>vvp_object_kmem 32982 33212 176 46 2 : tunables 0 0 0 : slabdata 722 722 0<br></div><div><br></div><div><br></div><div>Regards.</div><div>Jacek Tomaka<br></div><div><br></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Apr 16, 2019 at 11:18 AM Jacek Tomaka <<a href="mailto:jacekt@dug.com" target="_blank">jacekt@dug.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"></div><div dir="ltr">>That would be interesting. About a dozen copies of<br>> cat /proc/$PID/stack<br>>taken in quick succession would be best, where $PID is the pid of<br>>the shell process which wrote to drop_caches.</div><div dir="ltr"><br></div><div>Will do later today. I have found a candidate node with the problem, just <br></div><div>need to wait for the current task to finish. 

Also, after unmounting the Lustre fs and removing all the modules I could relate to Lustre, I could still see some vvp_object_kmem. Is that expected?

[root@xxx xxx]# rmmod obdclass ptlrpc ksocklnd libcfs lnet lustre fid mdc osc cnetmgc fld lmv lov;
rmmod: ERROR: Module obdclass is not currently loaded
rmmod: ERROR: Module ptlrpc is not currently loaded
rmmod: ERROR: Module ksocklnd is not currently loaded
rmmod: ERROR: Module libcfs is not currently loaded
rmmod: ERROR: Module lnet is not currently loaded
rmmod: ERROR: Module lustre is not currently loaded
rmmod: ERROR: Module fid is not currently loaded
rmmod: ERROR: Module mdc is not currently loaded
rmmod: ERROR: Module osc is not currently loaded
rmmod: ERROR: Module cnetmgc is not currently loaded
rmmod: ERROR: Module fld is not currently loaded
rmmod: ERROR: Module lmv is not currently loaded
rmmod: ERROR: Module lov is not currently loaded
[root@xxx xxx]# cat /proc/slabinfo |grep vvp
vvp_object_kmem 32982 33212 176 46 2 : tunables 0 0 0 : slabdata 722 722 0

Regards.
Jacek Tomaka

On Tue, Apr 16, 2019 at 11:18 AM Jacek Tomaka <jacekt@dug.com> wrote:

> That would be interesting. About a dozen copies of
>   cat /proc/$PID/stack
> taken in quick succession would be best, where $PID is the pid of
> the shell process which wrote to drop_caches.

Will do later today. I have found a candidate node with the problem; I just
need to wait for the current task to finish.

> signal_cache should have one entry for each process (or thread-group).

That is what I thought as well; looking at the kernel source, allocations from
signal_cache happen only during fork.
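
As a rough check of how much fork churn a node actually sees (relevant to the backlog theory further down), I suppose one can sample the cumulative fork counter; the "processes" line in /proc/stat is the total number of forks since boot, e.g.:

awk '/^processes/ {print "forks since boot:", $2}' /proc/stat; sleep 10; awk '/^processes/ {print "forks since boot:", $2}' /proc/stat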

> It holds the signal_struct structure that is shared among the threads
> in a group.
> So 3.7 million signal_structs suggests there are 3.7 million processes
> on the system. I don't think Linux supports more than 4 million, so
> that is one very busy system.

Not nearly that many. Top shows:

Tasks: 3048 total, 273 running, 2775 sleeping, 0 stopped, 0 zombie

Here is slabinfo (note that this is a different node than the one in my original email):

slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nfs_direct_cache 0 0 352 46 4 : tunables 0 0 0 : slabdata 0 0 0
nfs_commit_data 46 46 704 46 8 : tunables 0 0 0 : slabdata 1 1 0
nfs_inode_cache 25110 25110 1048 31 8 : tunables 0 0 0 : slabdata 810 810 0
fscache_cookie_jar 552 552 88 46 1 : tunables 0 0 0 : slabdata 12 12 0
iser_descriptors 0 0 832 39 8 : tunables 0 0 0 : slabdata 0 0 0
t10_alua_lu_gp_cache 40 40 200 40 2 : tunables 0 0 0 : slabdata 1 1 0
t10_pr_reg_cache 0 0 696 47 8 : tunables 0 0 0 : slabdata 0 0 0
se_sess_cache 10728 10728 896 36 8 : tunables 0 0 0 : slabdata 298 298 0
kcopyd_job 0 0 3312 9 8 : tunables 0 0 0 : slabdata 0 0 0
dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
dm_rq_target_io 0 0 136 60 2 : tunables 0 0 0 : slabdata 0 0 0
nfs4_layout_stateid 0 0 296 55 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_delegations 0 0 240 68 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_files 0 0 288 56 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_lockowners 0 0 400 40 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_openowners 0 0 440 74 8 : tunables 0 0 0 : slabdata 0 0 0
rpc_inode_cache 1122 1122 640 51 8 : tunables 0 0 0 : slabdata 22 22 0
vvp_object_kmem 5805496 5819230 176 46 2 : tunables 0 0 0 : slabdata 126505 126505 0
ll_thread_kmem 28341 28341 344 47 4 : tunables 0 0 0 : slabdata 603 603 0
lov_session_kmem 28636 29370 592 55 8 : tunables 0 0 0 : slabdata 534 534 0
osc_extent_kmem 6410367 6423408 168 48 2 : tunables 0 0 0 : slabdata 133821 133821 0
osc_thread_kmem 13409 13453 2832 11 8 : tunables 0 0 0 : slabdata 1223 1223 0
osc_object_kmem 6401946 6417982 304 53 4 : tunables 0 0 0 : slabdata 121094 121094 0
ldlm_locks 120640 120960 512 64 8 : tunables 0 0 0 : slabdata 1890 1890 0
ptlrpc_cache 86142 86142 768 42 8 : tunables 0 0 0 : slabdata 2051 2051 0
ll_import_cache 0 0 1480 22 8 : tunables 0 0 0 : slabdata 0 0 0
ll_obdo_cache 21216 21216 208 78 4 : tunables 0 0 0 : slabdata 272 272 0
ll_obd_dev_cache 72 72 3960 8 8 : tunables 0 0 0 : slabdata 9 9 0
ext4_groupinfo_4k 240 240 136 60 2 : tunables 0 0 0 : slabdata 4 4 0
ext4_inode_cache 74776 78275 1032 31 8 : tunables 0 0 0 : slabdata 2525 2525 0
ext4_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_free_data 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_allocation_context 17408 17408 128 64 2 : tunables 0 0 0 : slabdata 272 272 0
ext4_io_end 15232 15232 72 56 1 : tunables 0 0 0 : slabdata 272 272 0
ext4_extent_status 254554 256938 40 102 1 : tunables 0 0 0 : slabdata 2519 2519 0
jbd2_journal_handle 0 0 48 85 1 : tunables 0 0 0 : slabdata 0 0 0
jbd2_journal_head 0 0 112 73 2 : tunables 0 0 0 : slabdata 0 0 0
jbd2_revoke_table_s 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
jbd2_revoke_record_s 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
ip6_dst_cache 2701 2701 448 73 8 : tunables 0 0 0 : slabdata 37 37 0
RAWv6 286 286 1216 26 8 : tunables 0 0 0 : slabdata 11 11 0
UDPLITEv6 0 0 1216 26 8 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 4550 4550 1216 26 8 : tunables 0 0 0 : slabdata 175 175 0
tw_sock_TCPv6 64 64 256 64 4 : tunables 0 0 0 : slabdata 1 1 0
TCPv6 4050 4050 2176 15 8 : tunables 0 0 0 : slabdata 270 270 0
cfq_io_cq 0 0 120 68 2 : tunables 0 0 0 : slabdata 0 0 0
cfq_queue 0 0 232 70 4 : tunables 0 0 0 : slabdata 0 0 0
bsg_cmd 0 0 312 52 4 : tunables 0 0 0 : slabdata 0 0 0
mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
hugetlbfs_inode_cache 71992 79288 608 53 8 : tunables 0 0 0 : slabdata 1496 1496 0
dquot 0 0 256 64 4 : tunables 0 0 0 : slabdata 0 0 0
userfaultfd_ctx_cache 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0
fanotify_event_info 7957 7957 56 73 1 : tunables 0 0 0 : slabdata 109 109 0
pid_namespace 0 0 2200 14 8 : tunables 0 0 0 : slabdata 0 0 0
posix_timers_cache 17952 17952 248 66 4 : tunables 0 0 0 : slabdata 272 272 0
UDP-Lite 0 0 1088 30 8 : tunables 0 0 0 : slabdata 0 0 0
flow_cache 33488 33488 144 56 2 : tunables 0 0 0 : slabdata 598 598 0
xfrm_dst_cache 29624 29624 576 56 8 : tunables 0 0 0 : slabdata 529 529 0
UDP 8190 8190 1088 30 8 : tunables 0 0 0 : slabdata 273 273 0
tw_sock_TCP 14656 14656 256 64 4 : tunables 0 0 0 : slabdata 229 229 0
TCP 4478 4544 1984 16 8 : tunables 0 0 0 : slabdata 284 284 0
inotify_inode_mark 7176 7176 88 46 1 : tunables 0 0 0 : slabdata 156 156 0
scsi_data_buffer 0 0 24 170 1 : tunables 0 0 0 : slabdata 0 0 0
blkdev_queue 14 14 2256 14 8 : tunables 0 0 0 : slabdata 1 1 0
blkdev_ioc 21216 21216 104 78 2 : tunables 0 0 0 : slabdata 272 272 0
user_namespace 0 0 480 68 8 : tunables 0 0 0 : slabdata 0 0 0
dmaengine-unmap-128 30 30 1088 30 8 : tunables 0 0 0 : slabdata 1 1 0
sock_inode_cache 15708 15708 640 51 8 : tunables 0 0 0 : slabdata 308 308 0
net_namespace 0 0 5184 6 8 : tunables 0 0 0 : slabdata 0 0 0
Acpi-ParseExt 26600 26600 72 56 1 : tunables 0 0 0 : slabdata 475 475 0
Acpi-State 510 510 80 51 1 : tunables 0 0 0 : slabdata 10 10 0

> Unless... the final "put" of a task_struct happens via call_rcu - so it
> can be delayed a while, normally 10s of milliseconds, but it can take
> seconds to clear a large backlog.
> So if you have lots of processes being created and destroyed very
> quickly, then you might get a backlog of task_struct, and the associated
> signal_struct, waiting to be destroyed.

The node from my original mail had been idle for days before I ran the test
described.

> However, if the task_struct slab were particularly big, I suspect you
> would have included it in the list of large slabs - but you didn't.
> If signal_cache has more active entries than task_struct, then something
> has gone seriously wrong somewhere.

Indeed, this is the case. The number of tasks and task_structs is far smaller
than the number of signal_cache entries.
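
Roughly, the comparison can be made with something like this (cache names as they appear in /proc/slabinfo; counting pid directories is only a proxy for the number of processes):

awk '$1 == "task_struct" || $1 == "signal_cache" {print $1, $2, $3}' /proc/slabinfo   # name, active_objs, num_objs
ls -d /proc/[0-9]* | wc -l   # rough process count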

> I doubt this problem is related to lustre.

Hmm. Interesting. It looks like __put_task_struct calls into put_signal_struct,
which will not free the signal_struct while something still holds a reference to it.

I wonder if this could be related to the log entries we see:

_slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_1772/job_33959278/step_batch: Device or resource busy

We are also running with nohz_full, so this is going to be an interesting problem to diagnose...
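
One thing I may check (just a guess, reusing the path from the log message) is whether such a step cgroup still has tasks or child cgroups attached at the point the delete fails, since either would explain the "Device or resource busy":

CG=/cgroup/freezer/slurm/uid_1772/job_33959278/step_batch
wc -l <"$CG"/cgroup.procs                 # pids still attached to the cgroup
find "$CG" -mindepth 1 -type d | wc -l    # child cgroups still present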

But this seems to be going off on a tangent. Still, thank you for the useful hints and analysis.

Jacek Tomaka

On Tue, Apr 16, 2019 at 7:17 AM NeilBrown <neilb@suse.com> wrote:

On Mon, Apr 15 2019, Jacek Tomaka wrote:

> Thanks Patrick for getting the ball rolling!
>
>> 1/ w.r.t drop_caches, "2" is *not* "inode and dentry". The '2' bit
>> causes all registered shrinkers to be run, until they report there is
>> nothing left that can be discarded. If this is taking 10 minutes,
>> then it seems likely that some shrinker is either very inefficient, or
>> is reporting that there is more work to be done, when really there
>> isn't.
>
> This is a pretty common problem on this hardware. KNL's CPU is running
> at ~1.3GHz so anything that is not multi-threaded can take a few times more
> than on "normal" XEON. While it would be nice to improve this (by running
> it in multiple threads),
> this is not the problem here. However I can provide you with kernel call
> stack
> next time I see it if you are interested.

That would be interesting. About a dozen copies of
    cat /proc/$PID/stack
taken in quick succession would be best, where $PID is the pid of
the shell process which wrote to drop_caches.

>
>
>> 1a/ "echo 3 > drop_caches" does the easy part of memory reclaim: it
>> reclaims anything that can be reclaimed immediately.
>
> Awesome. I would just like to know how much easily available memory
> there is on the system without actually reclaiming it and seeing, ideally
> using
> normal kernel mechanisms, but if lustre provides a procfs entry where I can
> get it, it will solve my immediate problem.
>
>> 4/ Patrick is right that accounting is best-effort. But we do want it
>> to improve.
>
> Accounting looks better when Lustre is not involved ;) Seriously, how
> can I help? Should I raise a bug? Try to provide a patch?
>
>> Just last week there was a report
>>     https://lwn.net/SubscriberLink/784964/9ddad7d7050729e1/
>> about making slab-allocated objects movable. If/when that gets off
>> the ground, it should help the fragmentation problem, so more of the
>> pages listed as reclaimable should actually be so.
>
> This is a very interesting article. While memory fragmentation makes it
> more
> difficult to use huge pages, it is not directly related to the problem of
> lustre kernel
> memory allocation accounting. It will be good to see movable slabs, though.
>
> Also I am not sure how the high signal_cache can be explained and if
> anything can be
> done on the Lustre level?

signal_cache should have one entry for each process (or thread-group).
It holds the signal_struct structure that is shared among the threads
in a group.
So 3.7 million signal_structs suggests there are 3.7 million processes
on the system. I don't think Linux supports more than 4 million, so
that is one very busy system.
Unless... the final "put" of a task_struct happens via call_rcu - so it
can be delayed a while, normally 10s of milliseconds, but it can take
seconds to clear a large backlog.
So if you have lots of processes being created and destroyed very
quickly, then you might get a backlog of task_struct, and the associated
signal_struct, waiting to be destroyed.
However, if the task_struct slab were particularly big, I suspect you
would have included it in the list of large slabs - but you didn't.
If signal_cache has more active entries than task_struct, then something
has gone seriously wrong somewhere.

I doubt this problem is related to lustre.

NeilBrown

-- 
Jacek Tomaka
Geophysical Software Developer

DownUnder GeoSolutions
76 Kings Park Road
West Perth 6005 WA, Australia
tel +61 8 9287 4143
jacekt@dug.com
www.dug.com