[lustre-discuss] Lustre client memory and MemoryAvailable
Jacek Tomaka
jacekt at dug.com
Mon Apr 15 23:48:06 PDT 2019
I did:
for i in {1..100}; do cat /proc/36960/stack >$i; sleep 1; done
in one bash shell, and in the other one (PID 36960):
time -p echo 3 >/proc/sys/vm/drop_caches
It took about two minutes. Unfortunately, for most of that time the stack claims nothing was happening on the kernel side:
[<ffffffffffffffff>] 0xffffffffffffffff
with the exception of two samples, in files 32 and 73 (i.e. at ~32 s and ~73 s):
[root@xxx xxx]# cat 32
[<ffffffffc11231db>] cl_sync_file_range+0x2db/0x380 [lustre]
[<ffffffffffffffff>] 0xffffffffffffffff
[root@xxx xxx]# cat 73
[<ffffffffc11231db>] cl_sync_file_range+0x2db/0x380 [lustre]
[<ffffffffc11330f6>] ll_delete_inode+0xa6/0x1c0 [lustre]
[<ffffffff8121d729>] evict+0xa9/0x180
[<ffffffff8121d83e>] dispose_list+0x3e/0x50
[<ffffffff8121e834>] prune_icache_sb+0x174/0x340
[<ffffffff81203863>] prune_super+0x143/0x170
[<ffffffff81195443>] shrink_slab+0x163/0x330
[<ffffffff812655f3>] drop_caches_sysctl_handler+0xc3/0x120
[<ffffffff8127c203>] proc_sys_call_handler+0xd3/0xf0
[<ffffffff8127c234>] proc_sys_write+0x14/0x20
[<ffffffff81200cad>] vfs_write+0xbd/0x1e0
[<ffffffff81201abf>] SyS_write+0x7f/0xe0
[<ffffffff816b5292>] tracesys+0xdd/0xe2
[<ffffffffffffffff>] 0xffffffffffffffff
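
For what it's worth, a slightly more convenient variant of the sampling loop (just a sketch; the stacks.log name and the 120-sample count are arbitrary). It tags each sample with elapsed seconds, so the interesting ones line up without needing one file per second:

# hypothetical variant of the loop above: timestamp each stack sample
start=$SECONDS
for i in {1..120}; do
    echo "=== t=$((SECONDS - start))s ==="
    cat /proc/36960/stack
    sleep 1
done > stacks.log
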
Also, after unmounting the Lustre fs and removing all the modules I could
relate to Lustre, I could still see some vvp_object_kmem entries. Is that expected?
[root@xxx xxx]# rmmod obdclass ptlrpc ksocklnd libcfs lnet lustre fid mdc osc cnetmgc fld lmv lov;
rmmod: ERROR: Module obdclass is not currently loaded
rmmod: ERROR: Module ptlrpc is not currently loaded
rmmod: ERROR: Module ksocklnd is not currently loaded
rmmod: ERROR: Module libcfs is not currently loaded
rmmod: ERROR: Module lnet is not currently loaded
rmmod: ERROR: Module lustre is not currently loaded
rmmod: ERROR: Module fid is not currently loaded
rmmod: ERROR: Module mdc is not currently loaded
rmmod: ERROR: Module osc is not currently loaded
rmmod: ERROR: Module cnetmgc is not currently loaded
rmmod: ERROR: Module fld is not currently loaded
rmmod: ERROR: Module lmv is not currently loaded
rmmod: ERROR: Module lov is not currently loaded
[root@xxx xxx]# cat /proc/slabinfo | grep vvp
vvp_object_kmem 32982 33212 176 46 2 : tunables 0 0 0 : slabdata 722 722 0
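
Two quick checks that may help rule things out here (the second assumes a SLUB kernel, where merged caches appear as symlinks under /sys/kernel/slab):

# confirm nothing Lustre-related is still loaded, then see whether the cache is a merged alias
lsmod | grep -Ei 'lustre|lnet|obd|osc|mdc|lov|lmv|fid|fld' || echo "no lustre modules loaded"
ls -l /sys/kernel/slab/ 2>/dev/null | grep vvp
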
Regards.
Jacek Tomaka
On Tue, Apr 16, 2019 at 11:18 AM Jacek Tomaka <jacekt at dug.com> wrote:
> >That would be interesting. About a dozen copies of
> > cat /proc/$PID/stack
> >taken in quick succession would be best, where $PID is the pid of
> >the shell process which wrote to drop_caches.
>
> Will do later today. I have found a candidate node with the problem, just
> need to wait for the current task to finish.
>
> >signal_cache should have one entry for each process (or thread-group).
>
> That is what I thought as well; looking at the kernel source, allocations
> from signal_cache happen only during fork.
>
> >It holds the signal_struct structure that is shared among the threads
> >in a group.
> >So 3.7 million signal_structs suggests there are 3.7 million processes
> >on the system. I don't think Linux supports more than 4 million, so
> >that is one very busy system.
>
> Not nearly that many.
> Top shows:
> Tasks: 3048 total, 273 running, 2775 sleeping, 0 stopped, 0 zombie
> Here is slabinfo (note that this is a different node than the one in my original email):
>
> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> nfs_direct_cache 0 0 352 46 4 : tunables 0 0 0 : slabdata 0 0 0
> nfs_commit_data 46 46 704 46 8 : tunables 0 0 0 : slabdata 1 1 0
> nfs_inode_cache 25110 25110 1048 31 8 : tunables 0 0 0 : slabdata 810 810 0
> fscache_cookie_jar 552 552 88 46 1 : tunables 0 0 0 : slabdata 12 12 0
> iser_descriptors 0 0 832 39 8 : tunables 0 0 0 : slabdata 0 0 0
> t10_alua_lu_gp_cache 40 40 200 40 2 : tunables 0 0 0 : slabdata 1 1 0
> t10_pr_reg_cache 0 0 696 47 8 : tunables 0 0 0 : slabdata 0 0 0
> se_sess_cache 10728 10728 896 36 8 : tunables 0 0 0 : slabdata 298 298 0
> kcopyd_job 0 0 3312 9 8 : tunables 0 0 0 : slabdata 0 0 0
> dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
> dm_rq_target_io 0 0 136 60 2 : tunables 0 0 0 : slabdata 0 0 0
> nfs4_layout_stateid 0 0 296 55 4 : tunables 0 0 0 : slabdata 0 0 0
> nfsd4_delegations 0 0 240 68 4 : tunables 0 0 0 : slabdata 0 0 0
> nfsd4_files 0 0 288 56 4 : tunables 0 0 0 : slabdata 0 0 0
> nfsd4_lockowners 0 0 400 40 4 : tunables 0 0 0 : slabdata 0 0 0
> nfsd4_openowners 0 0 440 74 8 : tunables 0 0 0 : slabdata 0 0 0
> rpc_inode_cache 1122 1122 640 51 8 : tunables 0 0 0 : slabdata 22 22 0
> vvp_object_kmem 5805496 5819230 176 46 2 : tunables 0 0 0 : slabdata 126505 126505 0
> ll_thread_kmem 28341 28341 344 47 4 : tunables 0 0 0 : slabdata 603 603 0
> lov_session_kmem 28636 29370 592 55 8 : tunables 0 0 0 : slabdata 534 534 0
> osc_extent_kmem 6410367 6423408 168 48 2 : tunables 0 0 0 : slabdata 133821 133821 0
> osc_thread_kmem 13409 13453 2832 11 8 : tunables 0 0 0 : slabdata 1223 1223 0
> osc_object_kmem 6401946 6417982 304 53 4 : tunables 0 0 0 : slabdata 121094 121094 0
> ldlm_locks 120640 120960 512 64 8 : tunables 0 0 0 : slabdata 1890 1890 0
> ptlrpc_cache 86142 86142 768 42 8 : tunables 0 0 0 : slabdata 2051 2051 0
> ll_import_cache 0 0 1480 22 8 : tunables 0 0 0 : slabdata 0 0 0
> ll_obdo_cache 21216 21216 208 78 4 : tunables 0 0 0 : slabdata 272 272 0
> ll_obd_dev_cache 72 72 3960 8 8 : tunables 0 0 0 : slabdata 9 9 0
> ext4_groupinfo_4k 240 240 136 60 2 : tunables 0 0 0 : slabdata 4 4 0
> ext4_inode_cache 74776 78275 1032 31 8 : tunables 0 0 0 : slabdata 2525 2525 0
> ext4_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
> ext4_free_data 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
> ext4_allocation_context 17408 17408 128 64 2 : tunables 0 0 0 : slabdata 272 272 0
> ext4_io_end 15232 15232 72 56 1 : tunables 0 0 0 : slabdata 272 272 0
> ext4_extent_status 254554 256938 40 102 1 : tunables 0 0 0 : slabdata 2519 2519 0
> jbd2_journal_handle 0 0 48 85 1 : tunables 0 0 0 : slabdata 0 0 0
> jbd2_journal_head 0 0 112 73 2 : tunables 0 0 0 : slabdata 0 0 0
> jbd2_revoke_table_s 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
> jbd2_revoke_record_s 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> ip6_dst_cache 2701 2701 448 73 8 : tunables 0 0 0 : slabdata 37 37 0
> RAWv6 286 286 1216 26 8 : tunables 0 0 0 : slabdata 11 11 0
> UDPLITEv6 0 0 1216 26 8 : tunables 0 0 0 : slabdata 0 0 0
> UDPv6 4550 4550 1216 26 8 : tunables 0 0 0 : slabdata 175 175 0
> tw_sock_TCPv6 64 64 256 64 4 : tunables 0 0 0 : slabdata 1 1 0
> TCPv6 4050 4050 2176 15 8 : tunables 0 0 0 : slabdata 270 270 0
> cfq_io_cq 0 0 120 68 2 : tunables 0 0 0 : slabdata 0 0 0
> cfq_queue 0 0 232 70 4 : tunables 0 0 0 : slabdata 0 0 0
> bsg_cmd 0 0 312 52 4 : tunables 0 0 0 : slabdata 0 0 0
> mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
> hugetlbfs_inode_cache 71992 79288 608 53 8 : tunables 0 0 0 : slabdata 1496 1496 0
> dquot 0 0 256 64 4 : tunables 0 0 0 : slabdata 0 0 0
> userfaultfd_ctx_cache 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0
> fanotify_event_info 7957 7957 56 73 1 : tunables 0 0 0 : slabdata 109 109 0
> pid_namespace 0 0 2200 14 8 : tunables 0 0 0 : slabdata 0 0 0
> posix_timers_cache 17952 17952 248 66 4 : tunables 0 0 0 : slabdata 272 272 0
> UDP-Lite 0 0 1088 30 8 : tunables 0 0 0 : slabdata 0 0 0
> flow_cache 33488 33488 144 56 2 : tunables 0 0 0 : slabdata 598 598 0
> xfrm_dst_cache 29624 29624 576 56 8 : tunables 0 0 0 : slabdata 529 529 0
> UDP 8190 8190 1088 30 8 : tunables 0 0 0 : slabdata 273 273 0
> tw_sock_TCP 14656 14656 256 64 4 : tunables 0 0 0 : slabdata 229 229 0
> TCP 4478 4544 1984 16 8 : tunables 0 0 0 : slabdata 284 284 0
> inotify_inode_mark 7176 7176 88 46 1 : tunables 0 0 0 : slabdata 156 156 0
> scsi_data_buffer 0 0 24 170 1 : tunables 0 0 0 : slabdata 0 0 0
> blkdev_queue 14 14 2256 14 8 : tunables 0 0 0 : slabdata 1 1 0
> blkdev_ioc 21216 21216 104 78 2 : tunables 0 0 0 : slabdata 272 272 0
> user_namespace 0 0 480 68 8 : tunables 0 0 0 : slabdata 0 0 0
> dmaengine-unmap-128 30 30 1088 30 8 : tunables 0 0 0 : slabdata 1 1 0
> sock_inode_cache 15708 15708 640 51 8 : tunables 0 0 0 : slabdata 308 308 0
> net_namespace 0 0 5184 6 8 : tunables 0 0 0 : slabdata 0 0 0
> Acpi-ParseExt 26600 26600 72 56 1 : tunables 0 0 0 : slabdata 475 475 0
> Acpi-State 510 510 80 51 1 : tunables 0 0 0 : slabdata 10 10 0
>
> > Unless... the final "put" of a task_struct happens via call_rcu - so it
> > can be delayed a while, normally 10s of milliseconds, but it can take
> > seconds to clear a large backlog.
> > So if you have lots of processes being created and destroyed very
> > quickly, then you might get a backlog of task_struct, and the associated
> > signal_struct, waiting to be destroyed.
>
> The node from my original mail had been idle for days before I ran the
> test described above.
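>
> Out of curiosity, a cheap way to check whether a fork/exit storm could even
> be a factor (the "processes" line in /proc/stat counts forks since boot):
>
> # sample the cumulative fork counter twice to estimate forks per second
> p=$(awk '/^processes/ {print $2}' /proc/stat); sleep 10
> q=$(awk '/^processes/ {print $2}' /proc/stat)
> echo "$(( (q - p) / 10 )) forks/sec"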
>
> >However, if the task_struct slab were particularly big, I suspect you
> >would have included it in the list of large slabs - but you didn't.
> >If signal_cache has more active entries than task_struct, then something
> >has gone seriously wrong somewhere.
>
> Indeed this is the case. The number of tasks and task_structs is far
> smaller than the number of signal_cache entries.
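>
> For reference, both counts come straight out of /proc/slabinfo; a one-liner
> to put them side by side:
>
> # active_objs is the second column; compare task_struct vs signal_cache
> awk '$1 == "task_struct" || $1 == "signal_cache" {print $1, $2}' /proc/slabinfo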
>
> >I doubt this problem is related to lustre.
>
> Hmm. Interesting. It looks like __put_task_struct calls into
> put_signal_struct, which will not free the signal_struct while something
> still holds a reference to it.
>
> I wonder if this could be related to the log entries we see:
> _slurm_cgroup_destroy: problem deleting step cgroup path
> /cgroup/freezer/slurm/uid_1772/job_33959278/step_batch: Device or resource busy
> And we are running with nohz_full, so this is going to be an interesting
> problem to diagnose...
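>
> For reference, the tickless configuration can be read back from the boot
> command line:
>
> # confirm which isolation parameters this kernel was booted with
> tr ' ' '\n' </proc/cmdline | grep -E 'nohz_full|rcu_nocbs|isolcpus'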
>
> But this seems to be going off on a tangent. Still, thank you for the
> useful hints and analysis.
>
> Jacek Tomaka
>
> On Tue, Apr 16, 2019 at 7:17 AM NeilBrown <neilb at suse.com> wrote:
>
>> On Mon, Apr 15 2019, Jacek Tomaka wrote:
>>
>> > Thanks Patrick for getting the ball rolling!
>> >
>> >>1/ w.r.t drop_caches, "2" is *not* "inode and dentry". The '2' bit
>> >> causes all registered shrinkers to be run, until they report there is
>> >> nothing left that can be discarded. If this is taking 10 minutes,
>> >> then it seems likely that some shrinker is either very inefficient, or
>> >> is reporting that there is more work to be done, when really there
>> >> isn't.
>> >
>> > This is a pretty common problem on this hardware. KNL's CPU runs at
>> > ~1.3GHz, so anything that is not multithreaded can take a few times
>> > longer than on a "normal" Xeon. While it would be nice to improve this
>> > (by running it in multiple threads), this is not the problem here.
>> > However, I can provide you with a kernel call stack next time I see it,
>> > if you are interested.
>>
>> That would be interesting. About a dozen copies of
>> cat /proc/$PID/stack
>> taken in quick succession would be best, where $PID is the pid of
>> the shell process which wrote to drop_caches.
>>
>> >
>> >
>> >> 1a/ "echo 3 > drop_caches" does the easy part of memory reclaim: it
>> >> reclaims anything that can be reclaimed immediately.
>> >
>> > Awesome. I would just like to know how much easily available memory
>> > there is on the system without actually having to reclaim it and see,
>> > ideally using normal kernel mechanisms; but if Lustre provides a procfs
>> > entry where I can get it, that will solve my immediate problem.
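>> >
>> > For concreteness, the counters in question are the ones in /proc/meminfo:
>> >
>> > # MemAvailable is the kernel's estimate of memory available without swapping
>> > grep -E 'MemFree|MemAvailable|SReclaimable|SUnreclaim' /proc/meminfo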
>> >
>> >>4/ Patrick is right that accounting is best-effort. But we do want it
>> >> to improve.
>> >
>> > Accounting looks better when Lustre is not involved ;) Seriously, how
>> > can I help? Should I raise a bug? Try to provide a patch?
>> >
>> >>Just last week there was a report
>> >> https://lwn.net/SubscriberLink/784964/9ddad7d7050729e1/
>> >> about making slab-allocated objects movable. If/when that gets off
>> >> the ground, it should help the fragmentation problem, so more of the
>> >> pages listed as reclaimable should actually be so.
>> >
>> > This is a very interesting article. While memory fragmentation makes it
>> > harder to use huge pages, it is not directly related to the problem of
>> > Lustre kernel memory allocation accounting. It will be good to see
>> > movable slabs, though.
>> >
>> > Also, I am not sure how the high signal_cache count can be explained,
>> > and whether anything can be done about it at the Lustre level?
>>
>> signal_cache should have one entry for each process (or thread-group).
>> It holds the signal_struct structure that is shared among the threads
>> in a group.
>> So 3.7 million signal_structs suggests there are 3.7 million processes
>> on the system. I don't think Linux supports more than 4 million, so
>> that is one very busy system.
>> Unless... the final "put" of a task_struct happens via call_rcu - so it
>> can be delayed a while, normally 10s of milliseconds, but it can take
>> seconds to clear a large backlog.
>> So if you have lots of processes being created and destroyed very
>> quickly, then you might get a backlog of task_struct, and the associated
>> signal_struct, waiting to be destroyed.
>> However, if the task_struct slab were particularly big, I suspect you
>> would have included it in the list of large slabs - but you didn't.
>> If signal_cache has more active entries than task_struct, then something
>> has gone seriously wrong somewhere.
>>
>> I doubt this problem is related to lustre.
>>
>> NeilBrown
>>
>
--
Jacek Tomaka
Geophysical Software Developer

DownUnder GeoSolutions
76 Kings Park Road
West Perth 6005 WA, Australia
tel +61 8 9287 4143
jacekt at dug.com
www.dug.com