[lustre-discuss] Lustre client memory and MemoryAvailable
NeilBrown
neilb at suse.com
Mon Apr 15 16:17:33 PDT 2019
On Mon, Apr 15 2019, Jacek Tomaka wrote:
> Thanks Patrick for getting the ball rolling!
>
>>1/ w.r.t drop_caches, "2" is *not* "inode and dentry". The '2' bit
>> causes all registered shrinkers to be run, until they report there is
>> nothing left that can be discarded. If this is taking 10 minutes,
>> then it seems likely that some shrinker is either very inefficient, or
>> is reporting that there is more work to be done, when really there
>> isn't.
>
> This is a pretty common problem on this hardware. KNL's CPU runs at
> ~1.3GHz, so anything that is not multi-threaded can take a few times
> longer than on a "normal" Xeon. While it would be nice to improve this
> (by running it in multiple threads), that is not the problem here.
> However, I can provide you with a kernel call stack next time I see it,
> if you are interested.
That would be interesting. About a dozen copies of
cat /proc/$PID/stack
taken in quick succession would be best, where $PID is the pid of
the shell process which wrote to drop_caches.
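A minimal sketch of such a sampling loop (the sample count and interval are arbitrary choices, and 12345 is a placeholder pid to be replaced by hand):

```shell
#!/bin/sh
# Sample the kernel call stack of the process stuck in drop_caches.
# PID is a placeholder; set it to the pid of the shell that wrote
# to drop_caches. Reading /proc/$PID/stack generally requires root.
PID=12345
for i in $(seq 1 12); do
    echo "=== sample $i ==="
    cat /proc/"$PID"/stack
    sleep 0.1
done
```

Comparing the dozen samples shows whether the writer is stuck in one shrinker or cycling through many.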
>
>
>> 1a/ "echo 3 > drop_caches" does the easy part of memory reclaim: it
>> reclaims anything that can be reclaimed immediately.
>
> Awesome. I would just like to know how much easily available memory
> there is on the system without actually reclaiming it, ideally using
> normal kernel mechanisms; but if Lustre provides a procfs entry where I
> can get it, that will solve my immediate problem.
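For what it's worth, the kernel's own best estimate of "reclaimable without actually reclaiming" is the MemAvailable field of /proc/meminfo (the point of this thread being that shrinker-held Lustre memory is not fully reflected there). A quick way to eyeball the relevant fields:

```shell
# Print the kernel's estimates of free and reclaimable memory.
# MemAvailable ~= memory usable without swapping; SReclaimable is the
# reclaimable portion of slab, which is where the accounting can drift.
grep -E '^(MemFree|MemAvailable|SReclaimable|SUnreclaim):' /proc/meminfo
```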
>
>>4/ Patrick is right that accounting is best-effort. But we do want it
>> to improve.
>
> Accounting looks better when Lustre is not involved ;) Seriously, how
> can I help? Should I raise a bug? Try to provide a patch?
>
>>Just last week there was a report
>> https://lwn.net/SubscriberLink/784964/9ddad7d7050729e1/
>> about making slab-allocated objects movable. If/when that gets off
>> the ground, it should help the fragmentation problem, so more of the
>> pages listed as reclaimable should actually be so.
>
> This is a very interesting article. While memory fragmentation makes it
> more difficult to use huge pages, it is not directly related to the
> problem of Lustre kernel memory allocation accounting. It will be good
> to see movable slabs, though.
>
> Also, I am not sure how the high signal_cache can be explained, and
> whether anything can be done at the Lustre level?
signal_cache should have one entry for each process (or thread-group).
It holds the signal_struct structure that is shared among the threads
in a group.
So 3.7 million signal_structs suggests there are 3.7 million processes
on the system. I don't think Linux supports more than 4 million, so
that is one very busy system.
Unless... the final "put" of a task_struct happens via call_rcu - so it
can be delayed a while, normally 10s of milliseconds, but it can take
seconds to clear a large backlog.
So if you have lots of processes being created and destroyed very
quickly, then you might get a backlog of task_struct, and the associated
signal_struct, waiting to be destroyed.
However, if the task_struct slab were particularly big, I suspect you
would have included it in the list of large slabs - but you didn't.
If signal_cache has more active entries than task_struct, then something
has gone seriously wrong somewhere.
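One way to make that comparison directly (a sketch; reading /proc/slabinfo needs root, and columns 2 and 3 are the standard active_objs/num_objs pair):

```shell
# Compare active object counts of signal_cache and task_struct.
# In /proc/slabinfo, column 2 is active_objs and column 3 is num_objs.
# signal_cache active counts far above task_struct would indicate the
# "seriously wrong somewhere" case rather than an RCU backlog.
awk '$1 == "signal_cache" || $1 == "task_struct" \
     { printf "%-14s active=%s total=%s\n", $1, $2, $3 }' /proc/slabinfo
```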
I doubt this problem is related to lustre.
NeilBrown