Hi Dmitry,

I am still running into this issue on some nodes:

client109: ll_obdo_cache 0 152914489 208 19 1 : tunables 120 60 8 : slabdata 0 8048131 0
client102: ll_obdo_cache 0 308526883 208 19 1 : tunables 120 60 8 : slabdata 0 16238257 0

How can I calculate how much memory this is holding on to? My system shows a lot of memory in use, but none of the jobs are using that much. Also, these clients are running an SMP SLES 11 kernel, but I can't find any /sys/kernel/slab directory.
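If I am reading the /proc/slabinfo columns right (name, active_objs, num_objs, objsize, objperslab, pagesperslab, then the slabdata columns active_slabs, num_slabs, sharedavail), a rough estimate would be something like this quick sketch - please correct me if I am misreading the columns:

#!/usr/bin/env python
# Rough estimate of how much memory a slab cache is pinning, based on the
# /proc/slabinfo columns (slabinfo v2.x layout assumed):
#   name active_objs num_objs objsize objperslab pagesperslab : tunables ... : slabdata active_slabs num_slabs sharedavail
# Example line from client109:
line = "ll_obdo_cache 0 152914489 208 19 1 : tunables 120 60 8 : slabdata 0 8048131 0"

fields = line.split()
num_objs     = int(fields[2])   # total allocated objects (active + unused)
objsize      = int(fields[3])   # bytes per object
pagesperslab = int(fields[5])   # pages per slab
num_slabs    = int(fields[-2])  # total slabs held by the cache
PAGE_SIZE    = 4096             # x86_64 default

by_objects = num_objs * objsize                     # lower bound, ignores per-slab overhead
by_slabs   = num_slabs * pagesperslab * PAGE_SIZE   # pages the cache is actually sitting on

print("by objects: %.1f GiB" % (by_objects / 2.0**30))
print("by slabs:   %.1f GiB" % (by_slabs   / 2.0**30))
# Both come out around 30 GiB for client109 if the column assumptions above are right.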
Linux client102 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200 x86_64 x86_64 x86_64 GNU/Linux

What makes you say that this does not look like a Lustre memory leak? I thought all the ll_* objects in slabinfo are Lustre related? To me it looks like Lustre is holding on to this memory, but I don't know much about Lustre internals.

Also, memused on these systems is:

client102: 2353666940
client109: 2421645924

Any help would be greatly appreciated.

Thanks,
-J

On Wed, May 19, 2010 at 8:08 AM, Dmitry Zogin <dmitry.zoguine@oracle.com> wrote:
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div bgcolor="#ffffff" text="#000000">
Hello Jagga,<br>
<br>
I checked the data, and indeed this does not look like a Lustre memory
leak, but rather like slab fragmentation, which suggests there might be a
kernel issue here. From the slabinfo (I only keep the first columns here):

name            <active_objs> <num_objs>  <objsize>
ll_obdo_cache   0             452282156   208

this means that there are no active objects, but the memory pages are not
released back from the slab allocator to the free pool (the num_objs value
is huge). That looks like slab fragmentation - you can find more
background at
http://kerneltrap.org/Linux/Slab_Defragmentation

Checking your mails, I wonder if this only happens on clients which
have SLES11 installed? As the RAM size is around 192GB, I assume they
are NUMA systems?
If so, SLES11 has the defrag_ratio tunable in /sys/kernel/slab/xxx.
From the source of get_any_partial():

#ifdef CONFIG_NUMA

/*
 * The defrag ratio allows a configuration of the tradeoffs between
 * inter node defragmentation and node local allocations. A lower
 * defrag_ratio increases the tendency to do local allocations
 * instead of attempting to obtain partial slabs from other nodes.
 *
 * If the defrag_ratio is set to 0 then kmalloc() always
 * returns node local objects. If the ratio is higher then kmalloc()
 * may return off node objects because partial slabs are obtained
 * from other nodes and filled up.
 *
 * If /sys/kernel/slab/xx/defrag_ratio is set to 100 (which makes
 * defrag_ratio = 1000) then every (well almost) allocation will
 * first attempt to defrag slab caches on other nodes. This means
 * scanning over all nodes to look for partial slabs which may be
 * expensive if we do it every time we are trying to find a slab
 * with available objects.
 */

Could you please verify that your clients have the defrag_ratio tunable and
try various values?
It looks like the value of 100 should be best, unless there is a
bug, in which case maybe even 0 gets the desired result?
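
A minimal sketch of the kind of check I mean (my own illustration, untested on SLES11; depending on the kernel version the attribute is named either defrag_ratio or remote_node_defrag_ratio, and ll_obdo_cache may appear as a merged/aliased cache):

#!/usr/bin/env python
# Sketch only: report (and optionally set) the slab defrag ratio for
# ll_obdo_cache through the SLUB sysfs interface.  The attribute name
# varies between kernel versions, so both candidates are tried.
import os

CACHE = "ll_obdo_cache"
NEW_VALUE = None          # e.g. "100" or "0" to experiment; None = only report

for attr in ("defrag_ratio", "remote_node_defrag_ratio"):
    path = os.path.join("/sys/kernel/slab", CACHE, attr)
    if not os.path.exists(path):
        continue
    with open(path) as f:
        print("%s = %s" % (path, f.read().strip()))
    if NEW_VALUE is not None:
        with open(path, "w") as f:      # requires root
            f.write(NEW_VALUE)

If /sys/kernel/slab does not exist at all, that is worth knowing too - it would suggest the kernel is using SLAB rather than SLUB, in which case these tunables do not apply.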

Best regards,
Dmitry


Jagga Soorma wrote:
<blockquote type="cite"><div><div></div><div class="h5">Hi Johann,<br>
<br>
I am actually using 1.8.1 and not 1.8.2:<br>
<br>
# rpm -qa | grep -i lustre<br>
lustre-client-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default<br>
lustre-client-modules-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default<br>
<br>
My kernel version on the SLES 11 clients is:<br>
# uname -r<br>
2.6.27.29-0.1-default<br>
<br>
My kernel version on the RHEL 5.3 mds/oss servers is:<br>
# uname -r<br>
2.6.18-128.7.1.el5_lustre.1.8.1.1<br>
<br>
Please let me know if you need any further information. I am still
trying to get the user to help me run his app so that I can run the
leak finder script to capture more information.<br>
<br>
Regards,<br>
-Simran<br>
<br>
<div class="gmail_quote">On Tue, Apr 27, 2010 at 7:20 AM, Johann
Lombardi <span dir="ltr"><<a href="mailto:johann@sun.com" target="_blank">johann@sun.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left:1px solid rgb(204, 204, 204);margin:0pt 0pt 0pt 0.8ex;padding-left:1ex">
<div>Hi,<br>
<br>
On Tue, Apr 20, 2010 at 09:08:25AM -0700, Jagga Soorma wrote:<br>
</div>
> Thanks for your response.  I will try to run the leak-finder script and
> hopefully it will point us in the right direction.  This only seems to be
> happening on some of my clients:

Could you please tell us what kernel you use on the client side?

> client104: ll_obdo_cache 0 433506280 208 19 1 : tunables 120 60 8 : slabdata 0 22816120 0
> client116: ll_obdo_cache 0 457366746 208 19 1 : tunables 120 60 8 : slabdata 0 24071934 0
> client113: ll_obdo_cache 0 456778867 208 19 1 : tunables 120 60 8 : slabdata 0 24040993 0
> client106: ll_obdo_cache 0 456372267 208 19 1 : tunables 120 60 8 : slabdata 0 24019593 0
> client115: ll_obdo_cache 0 449929310 208 19 1 : tunables 120 60 8 : slabdata 0 23680490 0
> client101: ll_obdo_cache 0 454318101 208 19 1 : tunables 120 60 8 : slabdata 0 23911479 0
> --
>
> Hopefully this should help.  Not sure which application might be causing
> the leaks.  Currently R is the only app that users seem to be using
> heavily on these clients.  Will let you know what I find.

Tommi Tervo has filed a bugzilla ticket for this issue, see
https://bugzilla.lustre.org/show_bug.cgi?id=22701

Could you please add a comment to this ticket to describe the
behavior of the application "R" (fork many threads, write to
many files, use direct i/o, ...)?

Cheers,
Johann
