[lustre-discuss] Lustre client memory and MemoryAvailable
Jacek Tomaka
jacekt at dug.com
Mon Apr 15 20:18:17 PDT 2019
>That would be interesting. About a dozen copies of
> cat /proc/$PID/stack
>taken in quick succession would be best, where $PID is the pid of
>the shell process which wrote to drop_caches.
Will do later today. I have found a candidate node with the problem, just
need to wait for the current task to finish.
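For reference, a minimal sketch of how I plan to capture them (assuming the
write to drop_caches is done from a dedicated child shell so its pid is easy
to grab; needs root):

    # Kick off the slow drop_caches write in a child shell and note its pid.
    sh -c 'echo 2 > /proc/sys/vm/drop_caches' &
    PID=$!
    # Snapshot its kernel stack about a dozen times in quick succession.
    for i in $(seq 1 12); do
        cat /proc/$PID/stack
        echo '---'
        sleep 0.5
    done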
>signal_cache should have one entry for each process (or thread-group).
That is what I thought as well; looking at the kernel source, allocations
from signal_cache happen only during fork.
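A quick way to confirm that on a live system would be to watch the slab
counters while a fork-heavy workload runs (a minimal sketch; reading
/proc/slabinfo generally requires root):

    # Print active and total signal_cache objects once per second;
    # active should rise with forks and fall as processes exit.
    while sleep 1; do
        awk '$1 == "signal_cache" {print $1, "active:", $2, "total:", $3}' /proc/slabinfo
    done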
>It holds the signal_struct structure that is shared among the threads
>in a group.
>So 3.7 million signal_structs suggests there are 3.7 million processes
>on the system. I don't think Linux supports more than 4 million, so
>that is one very busy system.
Not nearly that many.
Top shows:
Tasks: 3048 total, 273 running, 2775 sleeping, 0 stopped, 0 zombie
slabinfo (note that this is a different node than the one in my original email):
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nfs_direct_cache 0 0 352 46 4 : tunables 0 0 0 : slabdata 0 0 0
nfs_commit_data 46 46 704 46 8 : tunables 0 0 0 : slabdata 1 1 0
nfs_inode_cache 25110 25110 1048 31 8 : tunables 0 0 0 : slabdata 810 810 0
fscache_cookie_jar 552 552 88 46 1 : tunables 0 0 0 : slabdata 12 12 0
iser_descriptors 0 0 832 39 8 : tunables 0 0 0 : slabdata 0 0 0
t10_alua_lu_gp_cache 40 40 200 40 2 : tunables 0 0 0 : slabdata 1 1 0
t10_pr_reg_cache 0 0 696 47 8 : tunables 0 0 0 : slabdata 0 0 0
se_sess_cache 10728 10728 896 36 8 : tunables 0 0 0 : slabdata 298 298 0
kcopyd_job 0 0 3312 9 8 : tunables 0 0 0 : slabdata 0 0 0
dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
dm_rq_target_io 0 0 136 60 2 : tunables 0 0 0 : slabdata 0 0 0
nfs4_layout_stateid 0 0 296 55 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_delegations 0 0 240 68 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_files 0 0 288 56 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_lockowners 0 0 400 40 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_openowners 0 0 440 74 8 : tunables 0 0 0 : slabdata 0 0 0
rpc_inode_cache 1122 1122 640 51 8 : tunables 0 0 0 : slabdata 22 22 0
vvp_object_kmem 5805496 5819230 176 46 2 : tunables 0 0 0 : slabdata 126505 126505 0
ll_thread_kmem 28341 28341 344 47 4 : tunables 0 0 0 : slabdata 603 603 0
lov_session_kmem 28636 29370 592 55 8 : tunables 0 0 0 : slabdata 534 534 0
osc_extent_kmem 6410367 6423408 168 48 2 : tunables 0 0 0 : slabdata 133821 133821 0
osc_thread_kmem 13409 13453 2832 11 8 : tunables 0 0 0 : slabdata 1223 1223 0
osc_object_kmem 6401946 6417982 304 53 4 : tunables 0 0 0 : slabdata 121094 121094 0
ldlm_locks 120640 120960 512 64 8 : tunables 0 0 0 : slabdata 1890 1890 0
ptlrpc_cache 86142 86142 768 42 8 : tunables 0 0 0 : slabdata 2051 2051 0
ll_import_cache 0 0 1480 22 8 : tunables 0 0 0 : slabdata 0 0 0
ll_obdo_cache 21216 21216 208 78 4 : tunables 0 0 0 : slabdata 272 272 0
ll_obd_dev_cache 72 72 3960 8 8 : tunables 0 0 0 : slabdata 9 9 0
ext4_groupinfo_4k 240 240 136 60 2 : tunables 0 0 0 : slabdata 4 4 0
ext4_inode_cache 74776 78275 1032 31 8 : tunables 0 0 0 : slabdata 2525 2525 0
ext4_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_free_data 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_allocation_context 17408 17408 128 64 2 : tunables 0 0 0 : slabdata 272 272 0
ext4_io_end 15232 15232 72 56 1 : tunables 0 0 0 : slabdata 272 272 0
ext4_extent_status 254554 256938 40 102 1 : tunables 0 0 0 : slabdata 2519 2519 0
jbd2_journal_handle 0 0 48 85 1 : tunables 0 0 0 : slabdata 0 0 0
jbd2_journal_head 0 0 112 73 2 : tunables 0 0 0 : slabdata 0 0 0
jbd2_revoke_table_s 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
jbd2_revoke_record_s 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
ip6_dst_cache 2701 2701 448 73 8 : tunables 0 0 0 : slabdata 37 37 0
RAWv6 286 286 1216 26 8 : tunables 0 0 0 : slabdata 11 11 0
UDPLITEv6 0 0 1216 26 8 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 4550 4550 1216 26 8 : tunables 0 0 0 : slabdata 175 175 0
tw_sock_TCPv6 64 64 256 64 4 : tunables 0 0 0 : slabdata 1 1 0
TCPv6 4050 4050 2176 15 8 : tunables 0 0 0 : slabdata 270 270 0
cfq_io_cq 0 0 120 68 2 : tunables 0 0 0 : slabdata 0 0 0
cfq_queue 0 0 232 70 4 : tunables 0 0 0 : slabdata 0 0 0
bsg_cmd 0 0 312 52 4 : tunables 0 0 0 : slabdata 0 0 0
mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
hugetlbfs_inode_cache 71992 79288 608 53 8 : tunables 0 0 0 : slabdata 1496 1496 0
dquot 0 0 256 64 4 : tunables 0 0 0 : slabdata 0 0 0
userfaultfd_ctx_cache 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0
fanotify_event_info 7957 7957 56 73 1 : tunables 0 0 0 : slabdata 109 109 0
pid_namespace 0 0 2200 14 8 : tunables 0 0 0 : slabdata 0 0 0
posix_timers_cache 17952 17952 248 66 4 : tunables 0 0 0 : slabdata 272 272 0
UDP-Lite 0 0 1088 30 8 : tunables 0 0 0 : slabdata 0 0 0
flow_cache 33488 33488 144 56 2 : tunables 0 0 0 : slabdata 598 598 0
xfrm_dst_cache 29624 29624 576 56 8 : tunables 0 0 0 : slabdata 529 529 0
UDP 8190 8190 1088 30 8 : tunables 0 0 0 : slabdata 273 273 0
tw_sock_TCP 14656 14656 256 64 4 : tunables 0 0 0 : slabdata 229 229 0
TCP 4478 4544 1984 16 8 : tunables 0 0 0 : slabdata 284 284 0
inotify_inode_mark 7176 7176 88 46 1 : tunables 0 0 0 : slabdata 156 156 0
scsi_data_buffer 0 0 24 170 1 : tunables 0 0 0 : slabdata 0 0 0
blkdev_queue 14 14 2256 14 8 : tunables 0 0 0 : slabdata 1 1 0
blkdev_ioc 21216 21216 104 78 2 : tunables 0 0 0 : slabdata 272 272 0
user_namespace 0 0 480 68 8 : tunables 0 0 0 : slabdata 0 0 0
dmaengine-unmap-128 30 30 1088 30 8 : tunables 0 0 0 : slabdata 1 1 0
sock_inode_cache 15708 15708 640 51 8 : tunables 0 0 0 : slabdata 308 308 0
net_namespace 0 0 5184 6 8 : tunables 0 0 0 : slabdata 0 0 0
Acpi-ParseExt 26600 26600 72 56 1 : tunables 0 0 0 : slabdata 475 475 0
Acpi-State 510 510 80 51 1 : tunables 0 0 0 : slabdata 10 10 0
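(Incidentally, the large Lustre slabs stand out immediately if you sort the
listing by total footprint; a rough sketch:

    # Estimate per-slab memory as num_objs * objsize and print the top 10.
    awk 'NR > 2 {print $3 * $4, $1}' /proc/slabinfo | sort -rn | head

The estimate ignores per-slab padding, but it is close enough to rank them.)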
> Unless... the final "put" of a task_struct happens via call_rcu - so it
> can be delayed a while, normally 10s of milliseconds, but it can take
> seconds to clear a large backlog.
> So if you have lots of processes being created and destroyed very
> quickly, then you might get a backlog of task_struct, and the associated
> signal_struct, waiting to be destroyed.
The node from my original mail had been idle for days before I ran the test
described.
>However, if the task_struct slab were particularly big, I suspect you
>would have included it in the list of large slabs - but you didn't.
>If signal_cache has more active entries than task_struct, then something
>has gone seriously wrong somewhere.
Indeed, this is the case. The number of tasks and task_structs is far smaller
than the number of signal_cache entries.
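Something like the following makes the comparison easy (a sketch; again
needs root for /proc/slabinfo):

    # Active vs total objects for the two slabs in question.
    grep -E '^(task_struct|signal_cache) ' /proc/slabinfo |
        awk '{printf "%-14s active=%s total=%s\n", $1, $2, $3}'

Since every thread-group owns exactly one signal_struct, active signal_cache
entries should never greatly exceed active task_structs on a healthy node.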
>I doubt this problem is related to lustre.
Hmm, interesting. Looking at the kernel source, __put_task_struct calls
put_signal_struct, which will not free a signal_struct that something else
still holds a reference to.
I wonder if this could be related to the log entries we see:
_slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_1772/job_33959278/step_batch: Device or resource busy
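One thing worth checking the next time that message appears is whether
lingering tasks are what keeps the cgroup busy (a sketch; the path is the
one from the log above):

    CG=/cgroup/freezer/slurm/uid_1772/job_33959278/step_batch
    # rmdir on a cgroup fails with EBUSY while its tasks file is non-empty
    # (or while child cgroups still exist).
    cat $CG/tasks
    rmdir $CG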
And we are running with nohz_full, so it is going to be an interesting
problem to diagnose...
But this seems to be going off on a tangent. Still, thank you for the
useful hints and analysis.
Jacek Tomaka
On Tue, Apr 16, 2019 at 7:17 AM NeilBrown <neilb at suse.com> wrote:
> On Mon, Apr 15 2019, Jacek Tomaka wrote:
>
> > Thanks Patrick for getting the ball rolling!
> >
> >>1/ w.r.t drop_caches, "2" is *not* "inode and dentry". The '2' bit
> >> causes all registered shrinkers to be run, until they report there is
> >> nothing left that can be discarded. If this is taking 10 minutes,
> >> then it seems likely that some shrinker is either very inefficient, or
> >> is reporting that there is more work to be done, when really there
> >> isn't.
> >
> > This is a pretty common problem on this hardware. KNL's CPU is running
> > at ~1.3GHz, so anything that is not multi-threaded can take a few times
> > longer than on a "normal" Xeon. While it would be nice to improve this
> > (by running it in multiple threads), this is not the problem here.
> > However, I can provide you with a kernel call stack next time I see it,
> > if you are interested.
>
> That would be interesting. About a dozen copies of
> cat /proc/$PID/stack
> taken in quick succession would be best, where $PID is the pid of
> the shell process which wrote to drop_caches.
>
> >
> >
> >> 1a/ "echo 3 > drop_caches" does the easy part of memory reclaim: it
> >> reclaims anything that can be reclaimed immediately.
> >
> > Awesome. I would just like to know how much easily-available memory
> > there is on the system without actually reclaiming it, ideally using
> > normal kernel mechanisms; but if Lustre provides a procfs entry where I
> > can get it, that will solve my immediate problem.
> >
> >>4/ Patrick is right that accounting is best-effort. But we do want it
> >> to improve.
> >
> > Accounting looks better when Lustre is not involved ;) Seriously, how
> > can I help? Should I raise a bug? Try to provide a patch?
> >
> >>Just last week there was a report
> >> https://lwn.net/SubscriberLink/784964/9ddad7d7050729e1/
> >> about making slab-allocated objects movable. If/when that gets off
> >> the ground, it should help the fragmentation problem, so more of the
> >> pages listed as reclaimable should actually be so.
> >
> > This is a very interesting article. While memory fragmentation makes it
> > more difficult to use huge pages, it is not directly related to the
> > problem of Lustre kernel memory allocation accounting. It will be good
> > to see movable slabs, though.
> >
> > Also, I am not sure how the high signal_cache count can be explained,
> > or whether anything can be done at the Lustre level?
>
> signal_cache should have one entry for each process (or thread-group).
> It holds the signal_struct structure that is shared among the threads
> in a group.
> So 3.7 million signal_structs suggests there are 3.7 million processes
> on the system. I don't think Linux supports more than 4 million, so
> that is one very busy system.
> Unless... the final "put" of a task_struct happens via call_rcu - so it
> can be delayed a while, normally 10s of milliseconds, but it can take
> seconds to clear a large backlog.
> So if you have lots of processes being created and destroyed very
> quickly, then you might get a backlog of task_struct, and the associated
> signal_struct, waiting to be destroyed.
> However, if the task_struct slab were particularly big, I suspect you
> would have included it in the list of large slabs - but you didn't.
> If signal_cache has more active entries than task_struct, then something
> has gone seriously wrong somewhere.
>
> I doubt this problem is related to lustre.
>
> NeilBrown
>
--
Jacek Tomaka
Geophysical Software Developer
DownUnder GeoSolutions
76 Kings Park Road
West Perth 6005 WA, Australia
tel +61 8 9287 4143
jacekt at dug.com
www.dug.com