[lustre-discuss] Lustre client memory and MemoryAvailable

NeilBrown neilb at suse.com
Mon Apr 15 16:17:33 PDT 2019


On Mon, Apr 15 2019, Jacek Tomaka wrote:

> Thanks Patrick for getting the ball rolling!
>
>>1/ w.r.t drop_caches, "2" is *not* "inode and dentry".  The '2' bit
>>  causes all registered shrinkers to be run, until they report there is
>>  nothing left that can be discarded.  If this is taking 10 minutes,
>>  then it seems likely that some shrinker is either very inefficient, or
>>  is reporting that there is more work to be done, when really there
>>  isn't.
>
> This is a pretty common problem on this hardware. KNL's CPU runs at
> ~1.3GHz, so anything that is not multi-threaded can take several times
> longer than on a "normal" Xeon. While it would be nice to improve this
> (by running it in multiple threads), that is not the problem here.
> However, I can provide you with a kernel call stack next time I see it,
> if you are interested.

That would be interesting. About a dozen copies of
  cat /proc/$PID/stack
taken in quick succession would be best, where $PID is the pid of
the shell process which wrote to drop_caches.
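
Something like this untested sketch (substitute the actual pid, and tune
the delay to taste):

    PID=12345    # pid of the shell that wrote to drop_caches
    for i in $(seq 1 12); do
        cat /proc/$PID/stack
        echo ----
        sleep 0.2
    done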

>
>
>> 1a/ "echo 3 > drop_caches" does the easy part of memory reclaim: it
>>   reclaims anything that can be reclaimed immediately.
>
> Awesome. I would just like to know how much easily available memory
> there is on the system without actually reclaiming it to find out,
> ideally using normal kernel mechanisms; but if Lustre provides a procfs
> entry where I can get it, that will solve my immediate problem.
>
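
The closest thing the stock kernel offers for "how much, without
actually reclaiming it" is the MemAvailable line in /proc/meminfo - the
kernel's estimate of how much memory is available for new workloads
without swapping.  As this thread shows, that estimate is only as good
as the reclaimable-slab accounting feeding it, but it is the normal
mechanism to start from, e.g.:

    grep -E 'MemAvailable|SReclaimable' /proc/meminfo
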
>>4/ Patrick is right that accounting is best-effort.  But we do want it
>>  to improve.
>
> Accounting looks better when Lustre is not involved ;) Seriously, how
> can I help? Should I raise a bug? Try to provide a patch?
>
>>Just last week there was a report
>>  https://lwn.net/SubscriberLink/784964/9ddad7d7050729e1/
>>  about making slab-allocated objects movable.  If/when that gets off
>>  the ground, it should help the fragmentation problem, so more of the
>>  pages listed as reclaimable should actually be so.
>
> This is a very interesting article. While memory fragmentation makes it
> more difficult to use huge pages, it is not directly related to the
> problem of Lustre kernel memory allocation accounting. It will be good
> to see movable slabs, though.
>
> Also, I am not sure how the high signal_cache count can be explained,
> or whether anything can be done at the Lustre level?

signal_cache should have one entry for each process (or thread-group).
It holds the signal_struct structure that is shared among the threads
in a group.
So 3.7 million signal_structs suggests there are 3.7 million processes
on the system.  I don't think Linux supports more than 4 million, so
that is one very busy system.
Unless... the final "put" of a task_struct happens via call_rcu - so it
can be delayed a while, normally 10s of milliseconds, but it can take
seconds to clear a large backlog.
So if you have lots of processes being created and destroyed very
quickly, then you might get a backlog of task_struct, and the associated
signal_struct, waiting to be destroyed.
However, if the task_struct slab were particularly big, I suspect you
would have included it in the list of large slabs - but you didn't.
If signal_cache has more active entries than task_struct, then something
has gone seriously wrong somewhere.
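
A quick way to compare the two counts is the active-objects column of
/proc/slabinfo (needs root; note that SLUB may merge similar caches, so
one or both names can be hidden behind an alias):

    grep -E '^(signal_cache|task_struct) ' /proc/slabinfo
    cat /proc/sys/kernel/pid_max    # ceiling on the number of pids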

I doubt this problem is related to Lustre.

NeilBrown