Hi Andreas,

Thanks for your response. I will try to run the leak-finder script and hopefully it will point us in the right direction. This only seems to be happening on some of my clients:

--
client112: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client108: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client110: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client107: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client111: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client109: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client102: ll_obdo_cache 5 38 208 19 1 : tunables 120 60 8 : slabdata 2 2 1
client114: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client105: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client103: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client104: ll_obdo_cache 0 433506280 208 19 1 : tunables 120 60 8 : slabdata 0 22816120 0
client116: ll_obdo_cache 0 457366746 208 19 1 : tunables 120 60 8 : slabdata 0 24071934 0
client113: ll_obdo_cache 0 456778867 208 19 1 : tunables 120 60 8 : slabdata 0 24040993 0
client106: ll_obdo_cache 0 456372267 208 19 1 : tunables 120 60 8 : slabdata 0 24019593 0
client115: ll_obdo_cache 0 449929310 208 19 1 : tunables 120 60 8 : slabdata 0 23680490 0
client101: ll_obdo_cache 0 454318101 208 19 1 : tunables 120 60 8 : slabdata 0 23911479 0
--

Hopefully this helps narrow it down. I am not sure which application might be causing the leaks; currently R is the only application that users are running heavily on these clients. I will let you know what I find.
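For reference, I collected the per-client numbers above with a quick loop along these lines (the hostname range simply matches how our client nodes are named, so treat it as a sketch rather than anything polished):

for n in $(seq 101 116); do
    host="client$n"
    # print each client's ll_obdo_cache line from /proc/slabinfo,
    # prefixed with the hostname to match the listing above
    ssh "$host" "grep ^ll_obdo_cache /proc/slabinfo" | sed "s/^/$host: /"
done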

Thanks again,
-J

On Mon, Apr 19, 2010 at 9:04 PM, Andreas Dilger <andreas.dilger@oracle.com> wrote:
> On 2010-04-19, at 11:16, Jagga Soorma wrote:
>> What is the known problem with the DLM LRU size?
>
> It is mostly a problem on the server, actually.
>
>> Here is what my slabinfo/meminfo look like on one of the clients. I don't see anything out of the ordinary:
>>
>> (then again, there are no jobs currently running on this system)
>>
>> slabinfo - version: 2.1
>> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
>>
>> ll_async_page 326589 328572 320 12 1 : tunables 54 27 8 : slabdata 27381 27381 0
>
> This shows you have 326589 pages in the Lustre filesystem cache, or about 1275MB of data. That shouldn't be too much for a system with 192GB of RAM...
>
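Just to make sure I am reading that line correctly: assuming 4KB pages, 326589 pages x 4096 bytes = 1,337,708,544 bytes, or roughly 1275 MB, so that matches your figure and is indeed a small fraction of the 192GB on these nodes.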
>> lustre_inode_cache 769 772 896 4 1 : tunables 54 27 8 : slabdata 193 193 0
>> ldlm_locks 2624 3688 512 8 1 : tunables 54 27 8 : slabdata 461 461 0
>> ldlm_resources 2002 3340 384 10 1 : tunables 54 27 8 : slabdata 334 334 0
>
> Only about 2600 locks on 770 files is fine (this is what the DLM LRU size would affect, if it were out of control, which it isn't).
>
>> ll_obdo_cache 0 452282156 208 19 1 : tunables 120 60 8 : slabdata 0 23804324 0
>
> This is really out of whack. The "obdo" struct should normally only be allocated for a short time and then freed again, but here you have 452M of them using over 90GB of RAM. It looks like a leak of some kind, which is a bit surprising since we have fairly tight checking for memory leaks in the Lustre code.
>
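That is consistent with what I get when I multiply it out, and my worst clients in the listing above are in the same range:

    452282156 objects x 208 bytes/object ~= 94 GB
    client116: 457366746 x 208           ~= 95 GB

so roughly half of the 192GB on each of those nodes is tied up in this one slab cache.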
> Are you running some unusual workload that might be walking an unusual code path? To track down memory leaks, enable Lustre memory tracing, increase the size of the debug buffer so it catches enough tracing to be useful, run your job to reproduce the leak, dump the kernel debug log, and then run leak-finder.pl (attached, and also in the Lustre sources):
>
> client# lctl set_param debug=+malloc
> client# lctl set_param debug_mb=256
> client$ {run job}
> client# sync
> client# lctl dk /tmp/debug
> client# perl leak-finder.pl < /tmp/debug 2>&1 | grep "Leak.*oa"
> client# lctl set_param debug=-malloc
> client# lctl set_param debug_mb=32
>
> Since this is a running system, it will report spurious leaks for some kinds of allocations that remain in memory for some time (e.g. cached pages, inodes, etc.), but with the exception of uncommitted RPCs (of which there should be none after the sync) there should not be any leaked obdo.
>
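I will run this on one of the affected clients the next time a representative R job comes through. Here is the small wrapper I put together from your steps; the script name, the output paths, and the idea of passing the job as arguments are just my choices:

#!/bin/bash
# Wrapper around the tracing steps quoted above; run as root on one client.
# Usage: ./trace-obdo-leak.sh <command that reproduces the workload>

lctl set_param debug=+malloc          # enable Lustre memory allocation tracing
lctl set_param debug_mb=256           # grow the debug buffer so traces are not lost

"$@"                                  # run the user's job
sync                                  # let uncommitted RPCs drain

lctl dk /tmp/debug                    # dump the kernel debug log
perl leak-finder.pl < /tmp/debug 2>&1 | grep "Leak.*oa" > /tmp/obdo-leaks.txt || true

lctl set_param debug=-malloc          # restore the default debug settings
lctl set_param debug_mb=32

wc -l /tmp/obdo-leaks.txt             # rough count of suspected obdo leaks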
>> On 2010-04-19, at 10:43, Jagga Soorma <jagga13@gmail.com> wrote:
>>> My users are reporting some issues with memory on our Lustre 1.8.1 clients. It looks like when they submit a single job at a time, the run time is about 4.5 minutes. However, when they ran multiple jobs (10 or fewer) on a single node with 192GB of memory, the run time for each job exceeded 3-4X the run time of the single process. They also noticed that swap usage kept climbing even though there was plenty of free memory on the system. Could this possibly be related to the Lustre client? Does it reserve any memory that is not accessible to other processes, even though it might not be in use?
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Engineer, Lustre Group
> Oracle Corporation Canada Inc.