[Lustre-discuss] Lustre Client - Memory Issue
Jagga Soorma
jagga13 at gmail.com
Tue Apr 20 09:08:25 PDT 2010
Hi Andreas,
Thanks for your response. I will try to run the leak-finder script and
hopefully it will point us in the right direction. This only seems to be
happening on some of my clients:
--
client112: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client108: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client110: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client107: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client111: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client109: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client102: ll_obdo_cache 5 38 208 19 1 : tunables 120 60 8 : slabdata 2 2 1
client114: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client105: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client103: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client104: ll_obdo_cache 0 433506280 208 19 1 : tunables 120 60 8 : slabdata 0 22816120 0
client116: ll_obdo_cache 0 457366746 208 19 1 : tunables 120 60 8 : slabdata 0 24071934 0
client113: ll_obdo_cache 0 456778867 208 19 1 : tunables 120 60 8 : slabdata 0 24040993 0
client106: ll_obdo_cache 0 456372267 208 19 1 : tunables 120 60 8 : slabdata 0 24019593 0
client115: ll_obdo_cache 0 449929310 208 19 1 : tunables 120 60 8 : slabdata 0 23680490 0
client101: ll_obdo_cache 0 454318101 208 19 1 : tunables 120 60 8 : slabdata 0 23911479 0
--
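The pattern in the table is easy to pick out mechanically: the six bad clients all show active_objs at 0 while num_objs is in the hundreds of millions, i.e. objects freed in bulk but never reclaimed by the slab allocator. A quick awk filter over the captured output (a sketch; the threshold of one million is an arbitrary cutoff, not anything from Lustre) flags exactly those hosts:

```shell
# Flag clients whose ll_obdo_cache line shows freed-but-unreclaimed
# objects: active_objs (field 3) is 0 while num_objs (field 4) is huge.
# The data below is a subset of the per-client output captured above.
data='client102: ll_obdo_cache 5 38 208 19 1
client104: ll_obdo_cache 0 433506280 208 19 1
client116: ll_obdo_cache 0 457366746 208 19 1'
echo "$data" | awk '$3 == 0 && $4 > 1000000 { print $1, $4 }'
# prints:
# client104: 433506280
# client116: 457366746
```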
Hopefully this helps. I'm not sure which application might be causing the
leaks; currently R is the only app that users are running heavily on
these clients. I will let you know what I find.
Thanks again,
-J
On Mon, Apr 19, 2010 at 9:04 PM, Andreas Dilger
<andreas.dilger at oracle.com> wrote:
> On 2010-04-19, at 11:16, Jagga Soorma wrote:
>
>> What is the known problem with the DLM LRU size?
>>
>
> It is mostly a problem on the server, actually.
>
> Here is what my slabinfo/meminfo look like on one of the clients. I
>> don't see anything out of the ordinary:
>>
>>
>> (then again there are no jobs currently running on this system)
>>
>> slabinfo - version: 2.1
>> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
>>
>
> ll_async_page 326589 328572 320 12 1 : tunables 54 27 8 : slabdata 27381 27381 0
>>
>
> This shows you have 326589 pages in the lustre filesystem cache, or about
> 1275MB of data. That shouldn't be too much for a system with 192GB of
> RAM...
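As a sanity check, the 1275MB figure follows directly from the standard x86 page size of 4 KiB (an assumption; Lustre caches client data in ordinary page-cache pages):

```shell
# 326589 cached pages at 4 KiB each, expressed in MiB:
echo "$((326589 * 4096 / 1024 / 1024)) MiB"
# prints: 1275 MiB
```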
>
> lustre_inode_cache 769 772 896 4 1 : tunables 54 27 8 : slabdata 193 193 0
>> ldlm_locks 2624 3688 512 8 1 : tunables 54 27 8 : slabdata 461 461 0
>> ldlm_resources 2002 3340 384 10 1 : tunables 54 27 8 : slabdata 334 334 0
>>
>>
>
> Only about 2600 locks on 770 files is fine (this is what the DLM LRU size
> would affect, if it were out of control, which it isn't).
>
> ll_obdo_cache 0 452282156 208 19 1 : tunables 120 60 8 : slabdata 0 23804324 0
>>
>
> This is really out of whack. The "obdo" struct should normally only be
> allocated for a short time and then freed again, but here you have 452M of
> them using over 90GB of RAM. It looks like a leak of some kind, which is a
> bit surprising since we have fairly tight checking for memory leaks in the
> Lustre code.
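The "over 90GB" figure checks out from the slabinfo columns themselves (num_objs times the 208-byte objsize shown above):

```shell
# 452282156 obdo objects at 208 bytes each:
echo "$((452282156 * 208)) bytes"               # prints: 94074688448 bytes
echo "$((452282156 * 208 / 1024 / 1024 / 1024)) GiB"   # prints: 87 GiB
# i.e. ~94 GB (decimal) / ~87.6 GiB -- half the RAM of a 192GB client.
```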
>
> Are you running some unusual workload that is maybe walking an unusual code
> path? What you can do to track down memory leaks is enable Lustre memory
> tracing, increase the size of the debug buffer to catch enough tracing to be
> useful, and then run your job to see what is causing the leak, dump the
> kernel debug log, and then run leak-finder.pl (attached, and also in
> Lustre sources):
>
> client# lctl set_param debug=+malloc
> client# lctl set_param debug_mb=256
> client$ {run job}
> client# sync
> client# lctl dk /tmp/debug
> client# perl leak-finder.pl < /tmp/debug 2>&1 | grep "Leak.*oa"
> client# lctl set_param debug=-malloc
> client# lctl set_param debug_mb=32
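Independently of the debug log, the leak can be confirmed by sampling the slab line before and after the job: if num_objs keeps growing while active_objs stays at 0, the leak is live. A minimal sketch of pulling those two fields apart (using a line captured above; on a live client you would read `grep ll_obdo_cache /proc/slabinfo` instead of the canned sample):

```shell
# Split a slabinfo line into its fields with the shell's own word
# splitting; field 2 is active_objs, field 3 is num_objs.
sample='ll_obdo_cache 0 452282156 208 19 1 : tunables 120 60 8'
set -- $sample
echo "active_objs=$2 num_objs=$3"
# prints: active_objs=0 num_objs=452282156
```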
>
> Since this is a running system, it will report spurious leaks for some
> kinds of allocations that remain in memory for some time (e.g. cached pages,
> inodes, etc), but with the exception of uncommitted RPCs (of which there
> should be none after the sync) there should not be any leaked obdo.
>
> On 2010-04-19, at 10:43, Jagga Soorma <jagga13 at gmail.com> wrote:
>>
>>> My users are reporting some issues with memory on our Lustre 1.8.1
>>> clients. When they submit a single job at a time, the run time is about
>>> 4.5 minutes. However, when they ran multiple jobs (10 or fewer) on a
>>> single client with 192GB of memory, the run time for each job exceeded
>>> 3-4X the run time of the single process. They also noticed that swap
>>> usage kept climbing even though there was plenty of free memory on the
>>> system. Could this possibly be related to the Lustre client? Does it
>>> reserve any memory that is not accessible to any other process even
>>> when it is not in use?
>>>
>>
>>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Engineer, Lustre Group
> Oracle Corporation Canada Inc.
>