[lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)

Mohr Jr, Richard Frank (Rick Mohr) rmohr at utk.edu
Wed Apr 13 12:22:02 PDT 2016


> On Apr 13, 2016, at 2:53 PM, Mark Hahn <hahn at mcmaster.ca> wrote:
> thanks, we'll be trying the LU-5726 patch and cpu_npartitions things.
> it's quite a long thread - do I understand correctly that periodic
> vm.drop_caches=1 can postpone the issue?

Not really.  I was periodically dropping the caches as a way to monitor how fast the memory was leaking, and to distinguish normal cache usage (which allows memory to be reclaimed) from some other, non-reclaimable usage.  The rate at which the leak grows depends on how many file unlinks are happening (and, based on my testing, sometimes on the pattern in which they are unlinked).  But in general, the memory usage will just continue to grow.
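
In case it helps, the check I was running amounted to something like this (a rough sketch rather than my exact script; the fields and the interval are illustrative):

    # Drop the reclaimable caches, then look at what is left over.
    # If SUnreclaim keeps climbing after every drop, the growth is
    # not normal cache usage.
    while true; do
        date
        sync
        sysctl -w vm.drop_caches=1   # 1 = pagecache; 3 would drop slab too
        grep -E 'MemFree|SReclaimable|SUnreclaim' /proc/meminfo
        sleep 600
    done

Watching how SUnreclaim (and the per-slab counts in /proc/slabinfo) behaves across drops is one way to separate a real leak from ordinary cache growth.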

> It seems odd that if this is purely a memory balance problem,
> it manifests as a 0xdeadbeef panic, rather than OOM.  While I understand
> that the oom-killer path itself needs memory to operate, does this also imply that some allocation in the kernel or filesystem is not checking a return value?

Not sure about that one, but in my experience, when nodes start to run out of memory they can fail in all sorts of new and exciting ways.  IIRC, SDSC also had an issue with LU-5726, but the symptoms they saw were not identical to mine.  So maybe you are seeing the same problem manifest itself in a different way.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu


