[lustre-discuss] Lustre client memory and MemoryAvailable

Jacek Tomaka jacekt at dug.com
Mon Apr 15 23:48:06 PDT 2019


I did:
for i in {1..100}; do cat /proc/36960/stack >$i; sleep 1; done
in one shell, and in the other one (PID 36960):
time -p echo 3 >/proc/sys/vm/drop_caches

It took about two minutes. Unfortunately, for most of that time the samples
claim nothing was happening on the kernel side:
[<ffffffffffffffff>] 0xffffffffffffffff
with the exception of two samples, at 32 s and 73 s:
[root@xxx xxx]# cat 32
[<ffffffffc11231db>] cl_sync_file_range+0x2db/0x380 [lustre]
[<ffffffffffffffff>] 0xffffffffffffffff
[root@xxx xxx]# cat 73
[<ffffffffc11231db>] cl_sync_file_range+0x2db/0x380 [lustre]
[<ffffffffc11330f6>] ll_delete_inode+0xa6/0x1c0 [lustre]
[<ffffffff8121d729>] evict+0xa9/0x180
[<ffffffff8121d83e>] dispose_list+0x3e/0x50
[<ffffffff8121e834>] prune_icache_sb+0x174/0x340
[<ffffffff81203863>] prune_super+0x143/0x170
[<ffffffff81195443>] shrink_slab+0x163/0x330
[<ffffffff812655f3>] drop_caches_sysctl_handler+0xc3/0x120
[<ffffffff8127c203>] proc_sys_call_handler+0xd3/0xf0
[<ffffffff8127c234>] proc_sys_write+0x14/0x20
[<ffffffff81200cad>] vfs_write+0xbd/0x1e0
[<ffffffff81201abf>] SyS_write+0x7f/0xe0
[<ffffffff816b5292>] tracesys+0xdd/0xe2
[<ffffffffffffffff>] 0xffffffffffffffff
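
(For anyone who wants to repeat this from a single shell, a rough sketch of
the same experiment is below; the sampling interval and file names are
arbitrary.)

  # run the drop_caches write in the background, then sample its stack
  echo 3 >/proc/sys/vm/drop_caches &
  dropper=$!
  while kill -0 $dropper 2>/dev/null; do
    cat /proc/$dropper/stack >sample.$(date +%s)
    sleep 1
  done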

Also, after unmounting the Lustre filesystem and removing every module I
could relate to Lustre, I could still see some vvp_object_kmem objects in
slabinfo. Is that expected?
[root@xxx xxx]# rmmod obdclass ptlrpc ksocklnd libcfs lnet lustre fid mdc osc cnetmgc fld lmv lov;
rmmod: ERROR: Module obdclass is not currently loaded
rmmod: ERROR: Module ptlrpc is not currently loaded
rmmod: ERROR: Module ksocklnd is not currently loaded
rmmod: ERROR: Module libcfs is not currently loaded
rmmod: ERROR: Module lnet is not currently loaded
rmmod: ERROR: Module lustre is not currently loaded
rmmod: ERROR: Module fid is not currently loaded
rmmod: ERROR: Module mdc is not currently loaded
rmmod: ERROR: Module osc is not currently loaded
rmmod: ERROR: Module cnetmgc is not currently loaded
rmmod: ERROR: Module fld is not currently loaded
rmmod: ERROR: Module lmv is not currently loaded
rmmod: ERROR: Module lov is not currently loaded
[root@xxx xxx]# cat /proc/slabinfo | grep vvp
vvp_object_kmem    32982  33212    176   46    2 : tunables    0    0    0 : slabdata    722    722      0
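
(As a sanity check I may also confirm that nothing Lustre-related is still
loaded, and see whether another shrinker pass changes the leftover cache at
all; just a sketch:)

  # any remaining Lustre/LNet modules?
  lsmod | grep -Ei 'lustre|lnet|lov|osc|mdc|obdclass'
  # does the leftover cache move at all after another shrinker pass?
  grep vvp_object_kmem /proc/slabinfo
  echo 2 >/proc/sys/vm/drop_caches
  grep vvp_object_kmem /proc/slabinfo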


Regards.
Jacek Tomaka

On Tue, Apr 16, 2019 at 11:18 AM Jacek Tomaka <jacekt at dug.com> wrote:

> >That would be interesting. About a dozen copies of
> >  cat /proc/$PID/stack
> >taken in quick succession would be best, where $PID is the pid of
> >the shell process which wrote to drop_caches.
>
> Will do later today. I have found a candidate node with the problem, just
> need to wait for the current task to finish.
>
> >signal_cache should have one entry for each process (or thread-group).
>
> That is what I thought as well: looking at the kernel source, allocations
> from signal_cache happen only during fork.
>
> >It holds the signal_struct structure that is shared among the threads
> >in a group.
> >So 3.7 million signal_structs suggests there are 3.7 million processes
> >on the system.  I don't think Linux supports more than 4 million, so
> >that is one very busy system.
>
> Not nearly that many.
> Top shows:
> Tasks: 3048 total, 273 running, 2775 sleeping,   0 stopped,   0 zombie
> slabinfo (note that this is a different node than the one in my original email):
>
> slabinfo - version: 2.1
> # name            <active_objs> <num_objs> <objsize> <objperslab>
> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata
> <active_slabs> <num_slabs> <sharedavail>
> nfs_direct_cache       0      0    352   46    4 : tunables    0    0    0
> : slabdata      0      0      0
> nfs_commit_data       46     46    704   46    8 : tunables    0    0    0
> : slabdata      1      1      0
> nfs_inode_cache    25110  25110   1048   31    8 : tunables    0    0    0
> : slabdata    810    810      0
> fscache_cookie_jar    552    552     88   46    1 : tunables    0    0
> 0 : slabdata     12     12      0
> iser_descriptors       0      0    832   39    8 : tunables    0    0    0
> : slabdata      0      0      0
> t10_alua_lu_gp_cache     40     40    200   40    2 : tunables    0
> 0    0 : slabdata      1      1      0
> t10_pr_reg_cache       0      0    696   47    8 : tunables    0    0    0
> : slabdata      0      0      0
> se_sess_cache      10728  10728    896   36    8 : tunables    0    0    0
> : slabdata    298    298      0
> kcopyd_job             0      0   3312    9    8 : tunables    0    0    0
> : slabdata      0      0      0
> dm_uevent              0      0   2608   12    8 : tunables    0    0    0
> : slabdata      0      0      0
> dm_rq_target_io        0      0    136   60    2 : tunables    0    0    0
> : slabdata      0      0      0
> nfs4_layout_stateid      0      0    296   55    4 : tunables    0    0
> 0 : slabdata      0      0      0
> nfsd4_delegations      0      0    240   68    4 : tunables    0    0    0
> : slabdata      0      0      0
> nfsd4_files            0      0    288   56    4 : tunables    0    0    0
> : slabdata      0      0      0
> nfsd4_lockowners       0      0    400   40    4 : tunables    0    0    0
> : slabdata      0      0      0
> nfsd4_openowners       0      0    440   74    8 : tunables    0    0    0
> : slabdata      0      0      0
> rpc_inode_cache     1122   1122    640   51    8 : tunables    0    0    0
> : slabdata     22     22      0
> vvp_object_kmem   5805496 5819230    176   46    2 : tunables    0    0
> 0 : slabdata 126505 126505      0
> ll_thread_kmem     28341  28341    344   47    4 : tunables    0    0    0
> : slabdata    603    603      0
> lov_session_kmem   28636  29370    592   55    8 : tunables    0    0    0
> : slabdata    534    534      0
> osc_extent_kmem   6410367 6423408    168   48    2 : tunables    0    0
> 0 : slabdata 133821 133821      0
> osc_thread_kmem    13409  13453   2832   11    8 : tunables    0    0    0
> : slabdata   1223   1223      0
> osc_object_kmem   6401946 6417982    304   53    4 : tunables    0    0
> 0 : slabdata 121094 121094      0
> ldlm_locks        120640 120960    512   64    8 : tunables    0    0    0
> : slabdata   1890   1890      0
> ptlrpc_cache       86142  86142    768   42    8 : tunables    0    0    0
> : slabdata   2051   2051      0
> ll_import_cache        0      0   1480   22    8 : tunables    0    0    0
> : slabdata      0      0      0
> ll_obdo_cache      21216  21216    208   78    4 : tunables    0    0    0
> : slabdata    272    272      0
> ll_obd_dev_cache      72     72   3960    8    8 : tunables    0    0    0
> : slabdata      9      9      0
> ext4_groupinfo_4k    240    240    136   60    2 : tunables    0    0    0
> : slabdata      4      4      0
> ext4_inode_cache   74776  78275   1032   31    8 : tunables    0    0    0
> : slabdata   2525   2525      0
> ext4_xattr             0      0     88   46    1 : tunables    0    0    0
> : slabdata      0      0      0
> ext4_free_data         0      0     64   64    1 : tunables    0    0    0
> : slabdata      0      0      0
> ext4_allocation_context  17408  17408    128   64    2 : tunables    0
> 0    0 : slabdata    272    272      0
> ext4_io_end        15232  15232     72   56    1 : tunables    0    0    0
> : slabdata    272    272      0
> ext4_extent_status 254554 256938     40  102    1 : tunables    0    0
> 0 : slabdata   2519   2519      0
> jbd2_journal_handle      0      0     48   85    1 : tunables    0    0
> 0 : slabdata      0      0      0
> jbd2_journal_head      0      0    112   73    2 : tunables    0    0    0
> : slabdata      0      0      0
> jbd2_revoke_table_s      0      0     16  256    1 : tunables    0    0
> 0 : slabdata      0      0      0
> jbd2_revoke_record_s      0      0     32  128    1 : tunables    0
> 0    0 : slabdata      0      0      0
> ip6_dst_cache       2701   2701    448   73    8 : tunables    0    0    0
> : slabdata     37     37      0
> RAWv6                286    286   1216   26    8 : tunables    0    0    0
> : slabdata     11     11      0
> UDPLITEv6              0      0   1216   26    8 : tunables    0    0    0
> : slabdata      0      0      0
> UDPv6               4550   4550   1216   26    8 : tunables    0    0    0
> : slabdata    175    175      0
> tw_sock_TCPv6         64     64    256   64    4 : tunables    0    0    0
> : slabdata      1      1      0
> TCPv6               4050   4050   2176   15    8 : tunables    0    0    0
> : slabdata    270    270      0
> cfq_io_cq              0      0    120   68    2 : tunables    0    0    0
> : slabdata      0      0      0
> cfq_queue              0      0    232   70    4 : tunables    0    0    0
> : slabdata      0      0      0
> bsg_cmd                0      0    312   52    4 : tunables    0    0    0
> : slabdata      0      0      0
> mqueue_inode_cache     36     36    896   36    8 : tunables    0    0
> 0 : slabdata      1      1      0
> hugetlbfs_inode_cache  71992  79288    608   53    8 : tunables    0
> 0    0 : slabdata   1496   1496      0
> dquot                  0      0    256   64    4 : tunables    0    0    0
> : slabdata      0      0      0
> userfaultfd_ctx_cache      0      0    192   42    2 : tunables    0
> 0    0 : slabdata      0      0      0
> fanotify_event_info   7957   7957     56   73    1 : tunables    0    0
> 0 : slabdata    109    109      0
> pid_namespace          0      0   2200   14    8 : tunables    0    0    0
> : slabdata      0      0      0
> posix_timers_cache  17952  17952    248   66    4 : tunables    0    0
> 0 : slabdata    272    272      0
> UDP-Lite               0      0   1088   30    8 : tunables    0    0    0
> : slabdata      0      0      0
> flow_cache         33488  33488    144   56    2 : tunables    0    0    0
> : slabdata    598    598      0
> xfrm_dst_cache     29624  29624    576   56    8 : tunables    0    0    0
> : slabdata    529    529      0
> UDP                 8190   8190   1088   30    8 : tunables    0    0    0
> : slabdata    273    273      0
> tw_sock_TCP        14656  14656    256   64    4 : tunables    0    0    0
> : slabdata    229    229      0
> TCP                 4478   4544   1984   16    8 : tunables    0    0    0
> : slabdata    284    284      0
> inotify_inode_mark   7176   7176     88   46    1 : tunables    0    0
> 0 : slabdata    156    156      0
> scsi_data_buffer       0      0     24  170    1 : tunables    0    0    0
> : slabdata      0      0      0
> blkdev_queue          14     14   2256   14    8 : tunables    0    0    0
> : slabdata      1      1      0
> blkdev_ioc         21216  21216    104   78    2 : tunables    0    0    0
> : slabdata    272    272      0
> user_namespace         0      0    480   68    8 : tunables    0    0    0
> : slabdata      0      0      0
> dmaengine-unmap-128     30     30   1088   30    8 : tunables    0    0
> 0 : slabdata      1      1      0
> sock_inode_cache   15708  15708    640   51    8 : tunables    0    0    0
> : slabdata    308    308      0
> net_namespace          0      0   5184    6    8 : tunables    0    0    0
> : slabdata      0      0      0
> Acpi-ParseExt      26600  26600     72   56    1 : tunables    0    0    0
> : slabdata    475    475      0
> Acpi-State           510    510     80   51    1 : tunables    0    0    0
> : slabdata     10     10      0
>
> > Unless... the final "put" of a task_struct happens via call_rcu - so it
> > can be delayed a while, normally 10s of milliseconds, but it can take
> > seconds to clear a large backlog.
> > So if you have lots of processes being created and destroyed very
> > quickly, then you might get a backlog of task_struct, and the associated
> > signal_struct, waiting to be destroyed.
>
> The node from my original mail had been idle for days before I ran the
> test described above.
>
> >However, if the task_struct slab were particularly big, I suspect you
> >would have included it in the list of large slabs - but you didn't.
> >If signal_cache has more active entries than task_struct, then something
> >has gone seriously wrong somewhere.
>
> Indeed, this is the case: the number of tasks and task_structs is much
> smaller than the number of signal_cache entries.
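>
> (A quick way to see the two side by side, for what it's worth:
>
>   grep -E '^(signal_cache|task_struct) ' /proc/slabinfo
>
> active_objs for signal_cache should normally be close to the number of
> thread groups, i.e. roughly the task_struct count or less.)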
>
> >I doubt this problem is related to lustre.
>
> Hmm. Interesting. It looks like __put_task_struct calls into
> put_signal_struct, which will not free the signal_struct while something
> still holds a reference to it.
>
> I wonder if this could be related to the log entries we see:
> _slurm_cgroup_destroy: problem deleting step cgroup path
> /cgroup/freezer/slurm/uid_1772/job_33959278/step_batch: Device or resource
> busy
> We are also running with nohz_full, so this is going to be an interesting
> problem to diagnose...
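>
> (If it is the cgroup removal that is failing, one crude check, using the
> path straight from that log message, would be:
>
>   cat /cgroup/freezer/slurm/uid_1772/job_33959278/step_batch/tasks
>
> since a non-empty tasks file would mean something is still attached to the
> cgroup and would explain the EBUSY.)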
>
> But this seems to be going off on a tangent. Still, thank you for the
> useful hints and analysis.
>
> Jacek Tomaka
>
> On Tue, Apr 16, 2019 at 7:17 AM NeilBrown <neilb at suse.com> wrote:
>
>> On Mon, Apr 15 2019, Jacek Tomaka wrote:
>>
>> > Thanks Patrick for getting the ball rolling!
>> >
>> >>1/ w.r.t drop_caches, "2" is *not* "inode and dentry".  The '2' bit
>> >>  causes all registered shrinkers to be run, until they report there is
>> >>  nothing left that can be discarded.  If this is taking 10 minutes,
>> >>  then it seems likely that some shrinker is either very inefficient, or
>> >>  is reporting that there is more work to be done, when really there
>> >>  isn't.
>> >
>> > This is a pretty common problem on this hardware. The KNL CPU runs at
>> > ~1.3GHz, so anything that is not multi-threaded can take a few times
>> > longer than on a "normal" Xeon. While it would be nice to improve this
>> > (by running it in multiple threads), that is not the problem here.
>> > However, I can provide you with the kernel call stack next time I see
>> > it, if you are interested.
>>
>> That would be interesting. About a dozen copies of
>>   cat /proc/$PID/stack
>> taken in quick succession would be best, where $PID is the pid of
>> the shell process which wrote to drop_caches.
>>
>> >
>> >
>> >> 1a/ "echo 3 > drop_caches" does the easy part of memory reclaim: it
>> >>   reclaims anything that can be reclaimed immediately.
>> >
>> > Awesome. I would just like to know how much easily available memory
>> > there is on the system without actually reclaiming it, ideally using
>> > normal kernel mechanisms; but if Lustre provides a procfs entry where I
>> > can get it, that will solve my immediate problem.
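>> >
>> > (For the record, the closest thing I know of without triggering reclaim
>> > is what the kernel already estimates, e.g.:
>> >
>> >   grep -E 'MemAvailable|SReclaimable|Slab' /proc/meminfo
>> >
>> > though, as discussed, MemAvailable seems to under-count what Lustre
>> > could actually give back.)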
>> >
>> >>4/ Patrick is right that accounting is best-effort.  But we do want it
>> >>  to improve.
>> >
>> > Accounting looks better when Lustre is not involved ;) Seriously, how
>> > can I help? Should I raise a bug? Try to provide a patch?
>> >
>> >>Just last week there was a report
>> >>  https://lwn.net/SubscriberLink/784964/9ddad7d7050729e1/
>> >>  about making slab-allocated objects movable.  If/when that gets off
>> >>  the ground, it should help the fragmentation problem, so more of the
>> >>  pages listed as reclaimable should actually be so.
>> >
>> > This is a very interesting article. While memory fragmentation makes it
>> > more difficult to use huge pages, it is not directly related to the
>> > problem of accounting for Lustre's kernel memory allocations. It will be
>> > good to see movable slabs, though.
>> >
>> > Also, I am not sure how the high signal_cache count can be explained,
>> > and whether anything can be done at the Lustre level.
>>
>> signal_cache should have one entry for each process (or thread-group).
>> It holds the signal_struct structure that is shared among the threads
>> in a group.
>> So 3.7 million signal_structs suggests there are 3.7 million processes
>> on the system.  I don't think Linux supports more than 4 million, so
>> that is one very busy system.
>> Unless... the final "put" of a task_struct happens via call_rcu - so it
>> can be delayed a while, normally 10s of milliseconds, but it can take
>> seconds to clear a large backlog.
>> So if you have lots of processes being created and destroyed very
>> quickly, then you might get a backlog of task_struct, and the associated
>> signal_struct, waiting to be destroyed.
>> However, if the task_struct slab were particularly big, I suspect you
>> would have included it in the list of large slabs - but you didn't.
>> If signal_cache has more active entries than task_struct, then something
>> has gone seriously wrong somewhere.
>>
>> I doubt this problem is related to lustre.
>>
>> NeilBrown
>>
>
>
> --
> Jacek Tomaka
> Geophysical Software Developer
>
> DownUnder GeoSolutions
> 76 Kings Park Road
> West Perth 6005 WA, Australia
> tel +61 8 9287 4143
> jacekt at dug.com
> www.dug.com
>


-- 
Jacek Tomaka
Geophysical Software Developer

DownUnder GeoSolutions
76 Kings Park Road
West Perth 6005 WA, Australia
tel +61 8 9287 4143
jacekt at dug.com
www.dug.com