[Lustre-discuss] ldlm_locks memory usage crashes OSS

Jérémie Dubois-Lacoste jeremie.dl at gmail.com
Tue Sep 25 08:12:19 PDT 2012


Hi All,

We have a problem with one of our OSSes, which crashes out of memory, on a
system we recently reinstalled.  The system uses two OSSes with two
OSTs each, running Lustre 2.1.3 on CentOS 6.3 with kernel
2.6.32-220.17.1.el6_lustre.x86_64 (so, 64-bit).

One of the OSSes runs low on memory until it triggers a kernel
panic. Checking with 'slabtop', the culprit is the memory usage of
"ldlm_locks", which keeps growing until the crash.  The growth
rate is fairly quick: close to 1 MB per second, so it exhausts all
memory in about an hour.
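For anyone who wants to reproduce the check: the figures slabtop shows come from /proc/slabinfo, so a small helper can track just the ldlm_locks line. This is a sketch, not part of Lustre; the function name and the sample field layout (slabinfo v2.1: name, active_objs, num_objs, objsize, ...) are my own assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print the memory consumed by the ldlm_locks slab
# by parsing a slabinfo-format file (the same data slabtop displays).
# Columns assumed (slabinfo v2.1): name active_objs num_objs objsize ...
report_ldlm_usage() {
    awk '$1 == "ldlm_locks" {
        # total bytes = active objects * per-object size
        printf "%s: %d objects x %d bytes = %.1f MB\n",
               $1, $2, $4, ($2 * $4) / (1024 * 1024)
    }' "$1"
}

# On a live OSS you would run it against the real file, e.g. in a
# watch(1) loop to see the ~1 MB/s growth:
#   report_ldlm_usage /proc/slabinfo
```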

It may be related to the following bug:
https://bugzilla.lustre.org/show_bug.cgi?id=19950
However, that report was against Lustre 1.6, so I'm not sure.

I tried rebooting and resynchronizing with the MDS afterwards, but the
same thing happens again.  Now that I check the other OSS (the healthy
one) carefully, the same seems to happen there too, but at a much
slower rate. Not sure yet.

This may be a consequence, or otherwise related:
We are using Lustre on a computing cluster with Sun Grid Engine 6.2u5,
and every job we submit takes a *HUGE* amount of memory compared to
what it needed before our upgrade (and to what it takes if we run it
directly, not through SGE). If the measurements we get from SGE are
correct, the difference can be up to 1000x, and many jobs then get
killed.
Sorry if this is not the proper place to post it, but I have the
intuition that it could be related, and some people here may be
familiar with the Lustre+SGE combination.

Any suggestion welcome!

Thanks,

    Jérémie
