[Lustre-discuss] Lustre Client - Memory Issue

Jagga Soorma jagga13 at gmail.com
Tue Apr 20 09:08:25 PDT 2010


Hi Andreas,

Thanks for your response.  I will try to run the leak-finder script and
hopefully it will point us in the right direction.  This only seems to be
happening on some of my clients:

--
client112: ll_obdo_cache          0      0    208   19    1 : tunables  120   60    8 : slabdata      0      0      0
client108: ll_obdo_cache          0      0    208   19    1 : tunables  120   60    8 : slabdata      0      0      0
client110: ll_obdo_cache          0      0    208   19    1 : tunables  120   60    8 : slabdata      0      0      0
client107: ll_obdo_cache          0      0    208   19    1 : tunables  120   60    8 : slabdata      0      0      0
client111: ll_obdo_cache          0      0    208   19    1 : tunables  120   60    8 : slabdata      0      0      0
client109: ll_obdo_cache          0      0    208   19    1 : tunables  120   60    8 : slabdata      0      0      0
client102: ll_obdo_cache          5     38    208   19    1 : tunables  120   60    8 : slabdata      2      2      1
client114: ll_obdo_cache          0      0    208   19    1 : tunables  120   60    8 : slabdata      0      0      0
client105: ll_obdo_cache          0      0    208   19    1 : tunables  120   60    8 : slabdata      0      0      0
client103: ll_obdo_cache          0      0    208   19    1 : tunables  120   60    8 : slabdata      0      0      0
client104: ll_obdo_cache          0 433506280    208   19    1 : tunables  120   60    8 : slabdata      0 22816120      0
client116: ll_obdo_cache          0 457366746    208   19    1 : tunables  120   60    8 : slabdata      0 24071934      0
client113: ll_obdo_cache          0 456778867    208   19    1 : tunables  120   60    8 : slabdata      0 24040993      0
client106: ll_obdo_cache          0 456372267    208   19    1 : tunables  120   60    8 : slabdata      0 24019593      0
client115: ll_obdo_cache          0 449929310    208   19    1 : tunables  120   60    8 : slabdata      0 23680490      0
client101: ll_obdo_cache          0 454318101    208   19    1 : tunables  120   60    8 : slabdata      0 23911479      0
--
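
(In case it is useful, the same survey can be reproduced with a loop along
these lines, assuming passwordless ssh to the clientNNN hosts; adjust the
host range to match your own clients:)

--
# poll the ll_obdo_cache slab line on each client
for h in $(seq -f "client%g" 101 116); do
    printf '%s: ' "$h"                            # prefix each line with the host name
    ssh "$h" grep ll_obdo_cache /proc/slabinfo    # pull that host's slab entry
done
--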

Hopefully this helps.  I am not sure which application might be causing the
leak; currently R is the only app that users seem to be running heavily on
these clients.  I will let you know what I find.

Thanks again,
-J

On Mon, Apr 19, 2010 at 9:04 PM, Andreas Dilger
<andreas.dilger at oracle.com> wrote:

> On 2010-04-19, at 11:16, Jagga Soorma wrote:
>
>> What is the known problem with the DLM LRU size?
>>
>
> It is mostly a problem on the server, actually.
>
>> Here is what my slabinfo/meminfo look like on one of the clients.  I
>> don't see anything out of the ordinary:
>>
>> (then again there are no jobs currently running on this system)
>>
>> slabinfo - version: 2.1
>> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
>>
>
>> ll_async_page     326589 328572    320   12    1 : tunables   54   27    8 : slabdata  27381  27381      0
>>
>
> This shows you have 326589 pages in the lustre filesystem cache, or about
> 1275MB of data.  That shouldn't be too much for a system with 192GB of
> RAM...
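> (That is 326589 pages x 4 KB per page, roughly 1275 MB, assuming the usual
> 4 KB page size.)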
>
>> lustre_inode_cache    769    772    896    4    1 : tunables   54   27    8 : slabdata    193    193      0
>> ldlm_locks          2624   3688    512    8    1 : tunables   54   27    8 : slabdata    461    461      0
>> ldlm_resources      2002   3340    384   10    1 : tunables   54   27    8 : slabdata    334    334      0
>>
>
> Only about 2600 locks on 770 files is fine (this is what the DLM LRU size
> would affect, if it were out of control, which it isn't).
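> (If you ever do need to inspect or cap it, the lock LRU is exposed through
> "lctl get_param ldlm.namespaces.*.lru_size" and can be capped with
> "lctl set_param ldlm.namespaces.*.lru_size=<count>"; the exact parameter
> path can vary a little between releases, so check it on your 1.8.1 clients.)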
>
>> ll_obdo_cache          0 452282156    208   19    1 : tunables  120   60    8 : slabdata      0 23804324      0
>>
>
> This is really out of whack.  The "obdo" struct should normally only be
> allocated for a short time and then freed again, but here you have 452M of
> them using over 90GB of RAM.  It looks like a leak of some kind, which is a
> bit surprising since we have fairly tight checking for memory leaks in the
> Lustre code.
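> (The arithmetic: 452282156 obdo structs x 208 bytes each is roughly 94 GB.)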
>
> Are you running some unusual workload that is maybe walking an unusual code
> path?  To track down memory leaks, you can enable Lustre memory tracing,
> increase the size of the debug buffer so it captures enough of the trace to
> be useful, run your job to reproduce the leak, dump the kernel debug log,
> and then run leak-finder.pl on it (attached, and also in the Lustre
> sources):
>
> client# lctl set_param debug=+malloc     # enable allocation/free tracing
> client# lctl set_param debug_mb=256      # enlarge the kernel debug buffer
> client$ {run job}
> client# sync                             # flush dirty pages so outstanding RPCs complete
> client# lctl dk /tmp/debug               # dump the kernel debug log
> client# perl leak-finder.pl < /tmp/debug 2>&1 | grep "Leak.*oa"
> client# lctl set_param debug=-malloc     # turn allocation tracing back off
> client# lctl set_param debug_mb=32       # shrink the debug buffer again
>
> Since this is a running system, it will report spurious leaks for some
> kinds of allocations that remain in memory for some time (e.g. cached pages,
> inodes, etc), but with the exception of uncommitted RPCs (of which there
> should be none after the sync) there should not be any leaked obdo.
>
>> On 2010-04-19, at 10:43, Jagga Soorma <jagga13 at gmail.com> wrote:
>>
>>> My users are reporting some issues with memory on our Lustre 1.8.1
>>> clients.  When they submit a single job at a time, the run time is about
>>> 4.5 minutes.  However, when they ran multiple jobs (10 or fewer) on a
>>> single client with 192GB of memory, the run time for each job exceeded
>>> 3-4X the run time of the single process.  They also noticed that swap
>>> usage kept climbing even though there was plenty of free memory on the
>>> system.  Could this possibly be related to the Lustre client?  Does it
>>> reserve any memory that is not accessible to any other process even when
>>> it is not in use?
>>>
>>
>>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Engineer, Lustre Group
> Oracle Corporation Canada Inc.
>