Hi Dmitry,

I am still running into this issue on some nodes:

client109: ll_obdo_cache 0 152914489 208 19 1 : tunables 120 60 8 : slabdata 0 8048131 0
client102: ll_obdo_cache 0 308526883 208 19 1 : tunables 120 60 8 : slabdata 0 16238257 0

How can I calculate how much memory this is holding on to? My system shows a lot of memory in use, but none of the jobs are using that much. Also, these clients are running an SMP SLES 11 kernel, but I can't find any /sys/kernel/slab directory.
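If I am reading the /proc/slabinfo columns right (name, active_objs, num_objs, objsize, objperslab, pagesperslab, then the slabdata columns active_slabs, num_slabs, sharedavail), a rough estimate would be something like this quick sketch - please correct me if I am misreading the columns:

#!/usr/bin/env python
# Rough estimate of how much memory a slab cache is pinning, based on the
# /proc/slabinfo columns (slabinfo v2.x layout assumed):
#   name active_objs num_objs objsize objperslab pagesperslab : tunables ... : slabdata active_slabs num_slabs sharedavail
# Example line from client109:
line = "ll_obdo_cache 0 152914489 208 19 1 : tunables 120 60 8 : slabdata 0 8048131 0"

fields = line.split()
num_objs     = int(fields[2])   # total allocated objects (active + unused)
objsize      = int(fields[3])   # bytes per object
pagesperslab = int(fields[5])   # pages per slab
num_slabs    = int(fields[-2])  # total slabs held by the cache
PAGE_SIZE    = 4096             # x86_64 default

by_objects = num_objs * objsize                     # lower bound, ignores per-slab overhead
by_slabs   = num_slabs * pagesperslab * PAGE_SIZE   # pages the cache is actually sitting on

print("by objects: %.1f GiB" % (by_objects / 2.0**30))
print("by slabs:   %.1f GiB" % (by_slabs   / 2.0**30))
# Both come out around 30 GiB for client109 if the column assumptions above are right.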
Linux client102 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200 x86_64 x86_64 x86_64 GNU/Linux

What makes you say that this does not look like a Lustre memory leak? I thought all the ll_* objects in slabinfo are Lustre related? To me it looks like Lustre is holding on to this memory, but I don't know much about Lustre internals.

Also, memused on these systems is:

client102: 2353666940
client109: 2421645924

Any help would be greatly appreciated.

Thanks,
-J

On Wed, May 19, 2010 at 8:08 AM, Dmitry Zogin <dmitry.zoguine@oracle.com> wrote:
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div bgcolor="#ffffff" text="#000000">
Hello Jagga,<br>
<br>
I checked the data, and indeed this does not look like a Lustre memory
leak, but rather like slab fragmentation, which suggests there might be a
kernel issue here. From the slabinfo (I only keep the first columns here):

name            <active_objs> <num_objs>  <objsize>
ll_obdo_cache   0             452282156   208

this means that there are no active objects, but the memory pages are not
released back from the slab allocator to the free pool (the num_objs value
is huge). That looks like slab fragmentation - you can find more
background at
http://kerneltrap.org/Linux/Slab_Defragmentation

Checking your mails, I wonder if this only happens on clients which
have SLES11 installed? As the RAM size is around 192GB, I assume they
are NUMA systems?
If so, SLES11 has the defrag_ratio tunable in /sys/kernel/slab/xxx.
From the source of get_any_partial():

#ifdef CONFIG_NUMA

/*
 * The defrag ratio allows a configuration of the tradeoffs between
 * inter node defragmentation and node local allocations. A lower
 * defrag_ratio increases the tendency to do local allocations
 * instead of attempting to obtain partial slabs from other nodes.
 *
 * If the defrag_ratio is set to 0 then kmalloc() always
 * returns node local objects. If the ratio is higher then kmalloc()
 * may return off node objects because partial slabs are obtained
 * from other nodes and filled up.
 *
 * If /sys/kernel/slab/xx/defrag_ratio is set to 100 (which makes
 * defrag_ratio = 1000) then every (well almost) allocation will
 * first attempt to defrag slab caches on other nodes. This means
 * scanning over all nodes to look for partial slabs which may be
 * expensive if we do it every time we are trying to find a slab
 * with available objects.
 */

Could you please verify that your clients have the defrag_ratio tunable and
try various values?
It looks like the value of 100 should be best, unless there is a
bug, in which case maybe even 0 gets the desired result?
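
A minimal sketch of the kind of check I mean (my own illustration, untested on SLES11; depending on the kernel version the attribute is named either defrag_ratio or remote_node_defrag_ratio, and ll_obdo_cache may appear as a merged/aliased cache):

#!/usr/bin/env python
# Sketch only: report (and optionally set) the slab defrag ratio for
# ll_obdo_cache through the SLUB sysfs interface.  The attribute name
# varies between kernel versions, so both candidates are tried.
import os

CACHE = "ll_obdo_cache"
NEW_VALUE = None          # e.g. "100" or "0" to experiment; None = only report

for attr in ("defrag_ratio", "remote_node_defrag_ratio"):
    path = os.path.join("/sys/kernel/slab", CACHE, attr)
    if not os.path.exists(path):
        continue
    with open(path) as f:
        print("%s = %s" % (path, f.read().strip()))
    if NEW_VALUE is not None:
        with open(path, "w") as f:      # requires root
            f.write(NEW_VALUE)

If /sys/kernel/slab does not exist at all, that is worth knowing too - it would suggest the kernel is using SLAB rather than SLUB, in which case these tunables do not apply.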

Best regards,
Dmitry


Jagga Soorma wrote:
<blockquote type="cite"><div><div></div><div class="h5">Hi Johann,<br>
<br>
I am actually using 1.8.1 and not 1.8.2:<br>
<br>
# rpm -qa | grep -i lustre<br>
lustre-client-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default<br>
lustre-client-modules-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default<br>
<br>
My kernel version on the SLES 11 clients is:<br>
# uname -r<br>
2.6.27.29-0.1-default<br>
<br>
My kernel version on the RHEL 5.3 mds/oss servers is:<br>
# uname -r<br>
2.6.18-128.7.1.el5_lustre.1.8.1.1<br>
<br>
Please let me know if you need any further information. I am still
trying to get the user to help me run his app so that I can run the
leak finder script to capture more information.<br>
<br>
Regards,<br>
-Simran<br>
<br>
<div class="gmail_quote">On Tue, Apr 27, 2010 at 7:20 AM, Johann
Lombardi <span dir="ltr"><<a href="mailto:johann@sun.com" target="_blank">johann@sun.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left:1px solid rgb(204, 204, 204);margin:0pt 0pt 0pt 0.8ex;padding-left:1ex">
<div>Hi,<br>
<br>
On Tue, Apr 20, 2010 at 09:08:25AM -0700, Jagga Soorma wrote:<br>
</div>
> Thanks for your response.  I will try to run the leak-finder script and
> hopefully it will point us in the right direction.  This only seems to be
> happening on some of my clients:

Could you please tell us what kernel you use on the client side?

> client104: ll_obdo_cache 0 433506280 208 19 1 : tunables 120 60 8 : slabdata 0 22816120 0
> client116: ll_obdo_cache 0 457366746 208 19 1 : tunables 120 60 8 : slabdata 0 24071934 0
> client113: ll_obdo_cache 0 456778867 208 19 1 : tunables 120 60 8 : slabdata 0 24040993 0
> client106: ll_obdo_cache 0 456372267 208 19 1 : tunables 120 60 8 : slabdata 0 24019593 0
> client115: ll_obdo_cache 0 449929310 208 19 1 : tunables 120 60 8 : slabdata 0 23680490 0
> client101: ll_obdo_cache 0 454318101 208 19 1 : tunables 120 60 8 : slabdata 0 23911479 0
> --
>
> Hopefully this should help.  Not sure which application might be causing
> the leaks.  Currently R is the only app that users seem to be using
> heavily on these clients.  Will let you know what I find.

Tommi Tervo has filed a bugzilla ticket for this issue, see
https://bugzilla.lustre.org/show_bug.cgi?id=22701

Could you please add a comment to this ticket to describe the
behavior of the application "R" (fork many threads, write to
many files, use direct i/o, ...)?

Cheers,
Johann
