[lustre-discuss] Lustre client memory and MemoryAvailable
Jacek Tomaka
jacekt at dug.com
Mon Apr 15 20:18:17 PDT 2019
>That would be interesting. About a dozen copies of
> cat /proc/$PID/stack
>taken in quick succession would be best, where $PID is the pid of
>the shell process which wrote to drop_caches.
Will do later today. I have found a candidate node with the problem, just
need to wait for the current task to finish.
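For reference, a minimal sketch of how I plan to capture them (assuming the
write to drop_caches is done from a dedicated child shell so its pid is easy
to grab; needs root):

    # Kick off the slow drop_caches write in a child shell and note its pid.
    sh -c 'echo 2 > /proc/sys/vm/drop_caches' &
    PID=$!
    # Snapshot its kernel stack about a dozen times in quick succession.
    for i in $(seq 1 12); do
        cat /proc/$PID/stack
        echo '---'
        sleep 0.5
    done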
>signal_cache should have one entry for each process (or thread-group).
That is what I thought as well; looking at the kernel source, allocations
from signal_cache happen only during fork.
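A quick way to confirm that on a live system would be to watch the slab
counters while a fork-heavy workload runs (a minimal sketch; reading
/proc/slabinfo generally requires root):

    # Print active and total signal_cache objects once per second;
    # active should rise with forks and fall as processes exit.
    while sleep 1; do
        awk '$1 == "signal_cache" {print $1, "active:", $2, "total:", $3}' /proc/slabinfo
    done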
>It holds the signal_struct structure that is shared among the threads
>in a group.
>So 3.7 million signal_structs suggests there are 3.7 million processes
>on the system. I don't think Linux supports more than 4 million, so
>that is one very busy system.
Not nearly that many.
Top shows:
Tasks: 3048 total, 273 running, 2775 sleeping, 0 stopped, 0 zombie
slabinfo (note that this is a different node than the one in my original email):
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nfs_direct_cache 0 0 352 46 4 : tunables 0 0 0 : slabdata 0 0 0
nfs_commit_data 46 46 704 46 8 : tunables 0 0 0 : slabdata 1 1 0
nfs_inode_cache 25110 25110 1048 31 8 : tunables 0 0 0 : slabdata 810 810 0
fscache_cookie_jar 552 552 88 46 1 : tunables 0 0 0 : slabdata 12 12 0
iser_descriptors 0 0 832 39 8 : tunables 0 0 0 : slabdata 0 0 0
t10_alua_lu_gp_cache 40 40 200 40 2 : tunables 0 0 0 : slabdata 1 1 0
t10_pr_reg_cache 0 0 696 47 8 : tunables 0 0 0 : slabdata 0 0 0
se_sess_cache 10728 10728 896 36 8 : tunables 0 0 0 : slabdata 298 298 0
kcopyd_job 0 0 3312 9 8 : tunables 0 0 0 : slabdata 0 0 0
dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
dm_rq_target_io 0 0 136 60 2 : tunables 0 0 0 : slabdata 0 0 0
nfs4_layout_stateid 0 0 296 55 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_delegations 0 0 240 68 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_files 0 0 288 56 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_lockowners 0 0 400 40 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_openowners 0 0 440 74 8 : tunables 0 0 0 : slabdata 0 0 0
rpc_inode_cache 1122 1122 640 51 8 : tunables 0 0 0 : slabdata 22 22 0
vvp_object_kmem 5805496 5819230 176 46 2 : tunables 0 0 0 : slabdata 126505 126505 0
ll_thread_kmem 28341 28341 344 47 4 : tunables 0 0 0 : slabdata 603 603 0
lov_session_kmem 28636 29370 592 55 8 : tunables 0 0 0 : slabdata 534 534 0
osc_extent_kmem 6410367 6423408 168 48 2 : tunables 0 0 0 : slabdata 133821 133821 0
osc_thread_kmem 13409 13453 2832 11 8 : tunables 0 0 0 : slabdata 1223 1223 0
osc_object_kmem 6401946 6417982 304 53 4 : tunables 0 0 0 : slabdata 121094 121094 0
ldlm_locks 120640 120960 512 64 8 : tunables 0 0 0 : slabdata 1890 1890 0
ptlrpc_cache 86142 86142 768 42 8 : tunables 0 0 0 : slabdata 2051 2051 0
ll_import_cache 0 0 1480 22 8 : tunables 0 0 0 : slabdata 0 0 0
ll_obdo_cache 21216 21216 208 78 4 : tunables 0 0 0 : slabdata 272 272 0
ll_obd_dev_cache 72 72 3960 8 8 : tunables 0 0 0 : slabdata 9 9 0
ext4_groupinfo_4k 240 240 136 60 2 : tunables 0 0 0 : slabdata 4 4 0
ext4_inode_cache 74776 78275 1032 31 8 : tunables 0 0 0 : slabdata 2525 2525 0
ext4_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_free_data 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_allocation_context 17408 17408 128 64 2 : tunables 0 0 0 : slabdata 272 272 0
ext4_io_end 15232 15232 72 56 1 : tunables 0 0 0 : slabdata 272 272 0
ext4_extent_status 254554 256938 40 102 1 : tunables 0 0 0 : slabdata 2519 2519 0
jbd2_journal_handle 0 0 48 85 1 : tunables 0 0 0 : slabdata 0 0 0
jbd2_journal_head 0 0 112 73 2 : tunables 0 0 0 : slabdata 0 0 0
jbd2_revoke_table_s 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
jbd2_revoke_record_s 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
ip6_dst_cache 2701 2701 448 73 8 : tunables 0 0 0 : slabdata 37 37 0
RAWv6 286 286 1216 26 8 : tunables 0 0 0 : slabdata 11 11 0
UDPLITEv6 0 0 1216 26 8 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 4550 4550 1216 26 8 : tunables 0 0 0 : slabdata 175 175 0
tw_sock_TCPv6 64 64 256 64 4 : tunables 0 0 0 : slabdata 1 1 0
TCPv6 4050 4050 2176 15 8 : tunables 0 0 0 : slabdata 270 270 0
cfq_io_cq 0 0 120 68 2 : tunables 0 0 0 : slabdata 0 0 0
cfq_queue 0 0 232 70 4 : tunables 0 0 0 : slabdata 0 0 0
bsg_cmd 0 0 312 52 4 : tunables 0 0 0 : slabdata 0 0 0
mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
hugetlbfs_inode_cache 71992 79288 608 53 8 : tunables 0 0 0 : slabdata 1496 1496 0
dquot 0 0 256 64 4 : tunables 0 0 0 : slabdata 0 0 0
userfaultfd_ctx_cache 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0
fanotify_event_info 7957 7957 56 73 1 : tunables 0 0 0 : slabdata 109 109 0
pid_namespace 0 0 2200 14 8 : tunables 0 0 0 : slabdata 0 0 0
posix_timers_cache 17952 17952 248 66 4 : tunables 0 0 0 : slabdata 272 272 0
UDP-Lite 0 0 1088 30 8 : tunables 0 0 0 : slabdata 0 0 0
flow_cache 33488 33488 144 56 2 : tunables 0 0 0 : slabdata 598 598 0
xfrm_dst_cache 29624 29624 576 56 8 : tunables 0 0 0 : slabdata 529 529 0
UDP 8190 8190 1088 30 8 : tunables 0 0 0 : slabdata 273 273 0
tw_sock_TCP 14656 14656 256 64 4 : tunables 0 0 0 : slabdata 229 229 0
TCP 4478 4544 1984 16 8 : tunables 0 0 0 : slabdata 284 284 0
inotify_inode_mark 7176 7176 88 46 1 : tunables 0 0 0 : slabdata 156 156 0
scsi_data_buffer 0 0 24 170 1 : tunables 0 0 0 : slabdata 0 0 0
blkdev_queue 14 14 2256 14 8 : tunables 0 0 0 : slabdata 1 1 0
blkdev_ioc 21216 21216 104 78 2 : tunables 0 0 0 : slabdata 272 272 0
user_namespace 0 0 480 68 8 : tunables 0 0 0 : slabdata 0 0 0
dmaengine-unmap-128 30 30 1088 30 8 : tunables 0 0 0 : slabdata 1 1 0
sock_inode_cache 15708 15708 640 51 8 : tunables 0 0 0 : slabdata 308 308 0
net_namespace 0 0 5184 6 8 : tunables 0 0 0 : slabdata 0 0 0
Acpi-ParseExt 26600 26600 72 56 1 : tunables 0 0 0 : slabdata 475 475 0
Acpi-State 510 510 80 51 1 : tunables 0 0 0 : slabdata 10 10 0
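(Incidentally, the large Lustre slabs stand out immediately if you sort the
listing by total footprint; a rough sketch:

    # Estimate per-slab memory as num_objs * objsize and print the top 10.
    awk 'NR > 2 {print $3 * $4, $1}' /proc/slabinfo | sort -rn | head

The estimate ignores per-slab padding, but it is close enough to rank them.)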
> Unless... the final "put" of a task_struct happens via call_rcu - so it
> can be delayed a while, normally 10s of milliseconds, but it can take
> seconds to clear a large backlog.
> So if you have lots of processes being created and destroyed very
> quickly, then you might get a backlog of task_struct, and the associated
> signal_struct, waiting to be destroyed.
The node from my original mail had been idle for days before I ran the test
described.
>However, if the task_struct slab were particularly big, I suspect you
>would have included it in the list of large slabs - but you didn't.
>If signal_cache has more active entries than task_struct, then something
>has gone seriously wrong somewhere.
Indeed, this is the case. The number of tasks and task_structs is far smaller
than the number of signal_cache entries.
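Something like the following makes the comparison easy (a sketch; again
needs root for /proc/slabinfo):

    # Active vs total objects for the two slabs in question.
    grep -E '^(task_struct|signal_cache) ' /proc/slabinfo |
        awk '{printf "%-14s active=%s total=%s\n", $1, $2, $3}'

Since every thread-group owns exactly one signal_struct, active signal_cache
entries should never greatly exceed active task_structs on a healthy node.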
>I doubt this problem is related to lustre.
Hmm, interesting. Looking at the kernel source, __put_task_struct calls
put_signal_struct, which will not free a signal_struct that something else
still holds a reference to.
I wonder if this could be related to the log entries we see:
_slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_1772/job_33959278/step_batch: Device or resource busy
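One thing worth checking the next time that message appears is whether
lingering tasks are what keeps the cgroup busy (a sketch; the path is the
one from the log above):

    CG=/cgroup/freezer/slurm/uid_1772/job_33959278/step_batch
    # rmdir on a cgroup fails with EBUSY while its tasks file is non-empty
    # (or while child cgroups still exist).
    cat $CG/tasks
    rmdir $CG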
And we are running with nohz_full, so it is going to be an interesting
problem to diagnose...
But this seems to be going off on a tangent. Still, thank you for the
useful hints and analysis.
Jacek Tomaka
On Tue, Apr 16, 2019 at 7:17 AM NeilBrown <neilb at suse.com> wrote:
> On Mon, Apr 15 2019, Jacek Tomaka wrote:
>
> > Thanks Patrick for getting the ball rolling!
> >
> >>1/ w.r.t drop_caches, "2" is *not* "inode and dentry". The '2' bit
> >> causes all registered shrinkers to be run, until they report there is
> >> nothing left that can be discarded. If this is taking 10 minutes,
> >> then it seems likely that some shrinker is either very inefficient, or
> >> is reporting that there is more work to be done, when really there
> >> isn't.
> >
> > This is a pretty common problem on this hardware. KNL's CPU is running
> > at ~1.3GHz, so anything that is not multi-threaded can take a few times
> > longer than on a "normal" Xeon. While it would be nice to improve this
> > (by running it in multiple threads), this is not the problem here.
> > However, I can provide you with a kernel call stack next time I see it,
> > if you are interested.
>
> That would be interesting. About a dozen copies of
> cat /proc/$PID/stack
> taken in quick succession would be best, where $PID is the pid of
> the shell process which wrote to drop_caches.
>
> >
> >
> >> 1a/ "echo 3 > drop_caches" does the easy part of memory reclaim: it
> >> reclaims anything that can be reclaimed immediately.
> >
> > Awesome. I would just like to know how much easily-available memory
> > there is on the system without actually reclaiming it, ideally using
> > normal kernel mechanisms; but if Lustre provides a procfs entry where I
> > can get it, that will solve my immediate problem.
> >
> >>4/ Patrick is right that accounting is best-effort. But we do want it
> >> to improve.
> >
> > Accounting looks better when Lustre is not involved ;) Seriously, how
> > can I help? Should I raise a bug? Try to provide a patch?
> >
> >>Just last week there was a report
> >> https://lwn.net/SubscriberLink/784964/9ddad7d7050729e1/
> >> about making slab-allocated objects movable. If/when that gets off
> >> the ground, it should help the fragmentation problem, so more of the
> >> pages listed as reclaimable should actually be so.
> >
> > This is a very interesting article. While memory fragmentation makes it
> > more difficult to use huge pages, it is not directly related to the
> > problem of Lustre kernel memory allocation accounting. It will be good
> > to see movable slabs, though.
> >
> > Also, I am not sure how the high signal_cache count can be explained,
> > or whether anything can be done at the Lustre level?
>
> signal_cache should have one entry for each process (or thread-group).
> It holds the signal_struct structure that is shared among the threads
> in a group.
> So 3.7 million signal_structs suggests there are 3.7 million processes
> on the system. I don't think Linux supports more than 4 million, so
> that is one very busy system.
> Unless... the final "put" of a task_struct happens via call_rcu - so it
> can be delayed a while, normally 10s of milliseconds, but it can take
> seconds to clear a large backlog.
> So if you have lots of processes being created and destroyed very
> quickly, then you might get a backlog of task_struct, and the associated
> signal_struct, waiting to be destroyed.
> However, if the task_struct slab were particularly big, I suspect you
> would have included it in the list of large slabs - but you didn't.
> If signal_cache has more active entries than task_struct, then something
> has gone seriously wrong somewhere.
>
> I doubt this problem is related to lustre.
>
> NeilBrown
>
--
Jacek Tomaka
Geophysical Software Developer
DownUnder GeoSolutions
76 Kings Park Road
West Perth 6005 WA, Australia
tel +61 8 9287 4143
jacekt at dug.com
www.dug.com