[lustre-discuss] Lustre client memory and MemoryAvailable

NeilBrown neilb at suse.com
Wed Apr 24 16:42:17 PDT 2019


Hi,
 you seem to be able to reproduce this fairly easily.
 If so, could you please boot with the "slub_nomerge" kernel parameter
 and then reproduce the (apparent) memory leak?
 I'm hoping that this will show some other slab that is actually using
 the memory: a slab with an object size very similar to signal_cache's
 that is, by default, being merged with signal_cache.
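
 For example (a rough sketch, untested here; grubby is the RHEL/CentOS
 route, adjust for other bootloaders):

   # If signal_cache has been merged, its sysfs entry is a symlink to a
   # shared entry with a generated name (something like ":A-0001152");
   # every other symlink to the same target is a cache merged with it.
   readlink /sys/kernel/slab/signal_cache
   ls -l /sys/kernel/slab | grep -- "$(readlink /sys/kernel/slab/signal_cache)"

   # Make the parameter persistent for the next boot, then reboot.
   grubby --update-kernel=ALL --args="slub_nomerge"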

Thanks,
NeilBrown


On Wed, Apr 24 2019, Nathan Dauchy - NOAA Affiliate wrote:

> On Mon, Apr 15, 2019 at 9:18 PM Jacek Tomaka <jacekt at dug.com> wrote:
>
>>
>> >signal_cache should have one entry for each process (or thread-group).
>>
>> That is what I thought as well; looking at the kernel source,
>> allocations from signal_cache happen only during fork.
>>
>>
> I was recently chasing an issue with clients suffering from low memory
> and saw that "signal_cache" was a major player.  But the workload on
> those clients was not doing a lot of forking (and I don't *think* a lot
> of threading either).  Rather, it was a LOT of metadata read operations.
>
> You can see the symptoms with a simple "du" on a Lustre file system:
>
> # grep signal_cache /proc/slabinfo
> signal_cache         967   1092   1152   28    8 : tunables    0    0    0 : slabdata     39     39      0
>
> # du -s /mnt/lfs1/projects/foo
> 339744908 /mnt/lfs1/projects/foo
>
> # grep signal_cache /proc/slabinfo
> signal_cache      164724 164724   1152   28    8 : tunables    0    0    0 : slabdata   5883   5883      0
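>
> As a rough sanity check, num_objs * objsize from those slabinfo columns
> (fields 3 and 4) works out to about 181 MiB for signal_cache alone,
> consistent with the slabtop CACHE SIZE below.  A throwaway one-liner,
> assuming the standard slabinfo column order:
>
> # awk '/^signal_cache / {printf "%.1f MiB\n", $3*$4/1048576}' /proc/slabinfo
> 181.0 MiB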
>
> # slabtop -s c -o | head -n 20
>  Active / Total Objects (% used)    : 3660791 / 3662863 (99.9%)
>  Active / Total Slabs (% used)      : 93019 / 93019 (100.0%)
>  Active / Total Caches (% used)     : 72 / 107 (67.3%)
>  Active / Total Size (% used)       : 836474.91K / 837502.16K (99.9%)
>  Minimum / Average / Maximum Object : 0.01K / 0.23K / 12.75K
>
>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> 164724 164724 100%    1.12K   5883       28    188256K signal_cache
> 331712 331712 100%    0.50K  10366       32    165856K ldlm_locks
> 656896 656896 100%    0.12K  20528       32     82112K kmalloc-128
> 340200 339971  99%    0.19K   8100       42     64800K kmalloc-192
> 162838 162838 100%    0.30K   6263       26     50104K osc_object_kmem
> 744192 744192 100%    0.06K  11628       64     46512K kmalloc-64
> 205128 205128 100%    0.19K   4884       42     39072K dentry
>   4268   4256  99%    8.00K   1067        4     34144K kmalloc-8192
> 162978 162978 100%    0.17K   3543       46     28344K vvp_object_kmem
> 162792 162792 100%    0.16K   6783       24     27132K kvm_mmu_page_header
> 162825 162825 100%    0.16K   6513       25     26052K sigqueue
>  16368  16368 100%    1.02K    528       31     16896K nfs_inode_cache
>  20385  20385 100%    0.58K    755       27     12080K inode_cache
>
>
> Repeating that for more (and bigger) directories, the slab caches added
> up to more than half the memory on this 24 GB node.
>
> This is with CentOS-7.6 and lustre-2.10.5_ddn6.
>
> I worked around the problem by tackling the "ldlm_locks" memory usage with:
> # lctl set_param ldlm.namespaces.lfs*.lru_max_age=10000
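>
> (For an immediate effect, the standard lctl way to drop the lock cache
> outright, rather than just aging it out faster, would be along these
> lines:
> # lctl set_param ldlm.namespaces.lfs*.lru_size=clear
> ...though that discards the whole client-side lock cache, so it is a
> blunt instrument.)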
>
> ...but I did not find a way to reduce the "signal_cache".
>
> Regards,
> Nathan

