Hi Andreas,

Thanks for your response. I will try to run the leak-finder script and hopefully it will point us in the right direction. This only seems to be happening on some of my clients:

--
client112: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client108: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client110: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client107: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client111: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client109: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client102: ll_obdo_cache 5 38 208 19 1 : tunables 120 60 8 : slabdata 2 2 1
client114: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client105: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client103: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client104: ll_obdo_cache 0 433506280 208 19 1 : tunables 120 60 8 : slabdata 0 22816120 0
client116: ll_obdo_cache 0 457366746 208 19 1 : tunables 120 60 8 : slabdata 0 24071934 0
client113: ll_obdo_cache 0 456778867 208 19 1 : tunables 120 60 8 : slabdata 0 24040993 0
client106: ll_obdo_cache 0 456372267 208 19 1 : tunables 120 60 8 : slabdata 0 24019593 0
client115: ll_obdo_cache 0 449929310 208 19 1 : tunables 120 60 8 : slabdata 0 23680490 0
client101: ll_obdo_cache 0 454318101 208 19 1 : tunables 120 60 8 : slabdata 0 23911479 0
--

Hopefully this helps narrow it down. I am not sure which application might be causing the leaks; currently R is the only application that users are running heavily on these clients. I will let you know what I find.
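For reference, I collected the per-client numbers above with a quick loop along these lines (the hostname range simply matches how our client nodes are named, so treat it as a sketch rather than anything polished):

for n in $(seq 101 116); do
    host="client$n"
    # print each client's ll_obdo_cache line from /proc/slabinfo,
    # prefixed with the hostname to match the listing above
    ssh "$host" "grep ^ll_obdo_cache /proc/slabinfo" | sed "s/^/$host: /"
done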

Thanks again,
-J

On Mon, Apr 19, 2010 at 9:04 PM, Andreas Dilger <andreas.dilger@oracle.com> wrote:
> On 2010-04-19, at 11:16, Jagga Soorma wrote:
>> What is the known problem with the DLM LRU size?
>
> It is mostly a problem on the server, actually.
>
>> Here is what my slabinfo/meminfo look like on one of the clients. I don't see anything out of the ordinary:
>>
>> (then again, there are no jobs currently running on this system)
>>
>> slabinfo - version: 2.1
>> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
>>
>> ll_async_page 326589 328572 320 12 1 : tunables 54 27 8 : slabdata 27381 27381 0
>
> This shows you have 326589 pages in the Lustre filesystem cache, or about 1275MB of data. That shouldn't be too much for a system with 192GB of RAM...
>
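Just to make sure I am reading that line correctly: assuming 4KB pages, 326589 pages x 4096 bytes = 1,337,708,544 bytes, or roughly 1275 MB, so that matches your figure and is indeed a small fraction of the 192GB on these nodes.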
>> lustre_inode_cache 769 772 896 4 1 : tunables 54 27 8 : slabdata 193 193 0
>> ldlm_locks 2624 3688 512 8 1 : tunables 54 27 8 : slabdata 461 461 0
>> ldlm_resources 2002 3340 384 10 1 : tunables 54 27 8 : slabdata 334 334 0
>
> Only about 2600 locks on 770 files is fine (this is what the DLM LRU size would affect, if it were out of control, which it isn't).
>
>> ll_obdo_cache 0 452282156 208 19 1 : tunables 120 60 8 : slabdata 0 23804324 0
>
> This is really out of whack. The "obdo" struct should normally only be allocated for a short time and then freed again, but here you have 452M of them using over 90GB of RAM. It looks like a leak of some kind, which is a bit surprising since we have fairly tight checking for memory leaks in the Lustre code.
>
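That is consistent with what I get when I multiply it out, and my worst clients in the listing above are in the same range:

    452282156 objects x 208 bytes/object ~= 94 GB
    client116: 457366746 x 208           ~= 95 GB

so roughly half of the 192GB on each of those nodes is tied up in this one slab cache.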
> Are you running some unusual workload that might be walking an unusual code path? To track down memory leaks, enable Lustre memory tracing, increase the size of the debug buffer so it catches enough tracing to be useful, run your job to reproduce the leak, dump the kernel debug log, and then run leak-finder.pl (attached, and also in the Lustre sources):
>
> client# lctl set_param debug=+malloc
> client# lctl set_param debug_mb=256
> client$ {run job}
> client# sync
> client# lctl dk /tmp/debug
> client# perl leak-finder.pl < /tmp/debug 2>&1 | grep "Leak.*oa"
> client# lctl set_param debug=-malloc
> client# lctl set_param debug_mb=32
>
> Since this is a running system, it will report spurious leaks for some kinds of allocations that remain in memory for some time (e.g. cached pages, inodes, etc.), but with the exception of uncommitted RPCs (of which there should be none after the sync) there should not be any leaked obdo.
>
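I will run this on one of the affected clients the next time a representative R job comes through. Here is the small wrapper I put together from your steps; the script name, the output paths, and the idea of passing the job as arguments are just my choices:

#!/bin/bash
# Wrapper around the tracing steps quoted above; run as root on one client.
# Usage: ./trace-obdo-leak.sh <command that reproduces the workload>

lctl set_param debug=+malloc          # enable Lustre memory allocation tracing
lctl set_param debug_mb=256           # grow the debug buffer so traces are not lost

"$@"                                  # run the user's job
sync                                  # let uncommitted RPCs drain

lctl dk /tmp/debug                    # dump the kernel debug log
perl leak-finder.pl < /tmp/debug 2>&1 | grep "Leak.*oa" > /tmp/obdo-leaks.txt || true

lctl set_param debug=-malloc          # restore the default debug settings
lctl set_param debug_mb=32

wc -l /tmp/obdo-leaks.txt             # rough count of suspected obdo leaks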
>> On 2010-04-19, at 10:43, Jagga Soorma <jagga13@gmail.com> wrote:
>>> My users are reporting some issues with memory on our Lustre 1.8.1 clients. It looks like when they submit a single job at a time, the run time is about 4.5 minutes. However, when they ran multiple jobs (10 or fewer) on a single node with 192GB of memory, the run time for each job exceeded 3-4X the run time of the single process. They also noticed that swap usage kept climbing even though there was plenty of free memory on the system. Could this possibly be related to the Lustre client? Does it reserve any memory that is not accessible to other processes, even though it might not be in use?
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Engineer, Lustre Group
> Oracle Corporation Canada Inc.