[lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)
Mark Hahn
hahn at mcmaster.ca
Wed Apr 13 11:53:26 PDT 2016
>> We had to use lustre-2.5.3.90 on the MDS servers because of a memory leak.
>>
>> https://jira.hpdd.intel.com/browse/LU-5726
>
> Mark,
>
> If you don't have the patch for LU-5726, then you should definitely try to get that one. If nothing else, reading through the bug report might be useful. It details some of the MDS OOM problems I had and mentions setting vm.zone_reclaim_mode=0. It also has Robin Humble's suggestion of setting "options libcfs cpu_npartitions=1" (which is something that I started doing as well).
thanks, we'll be trying the LU-5726 patch and cpu_npartitions things.
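For anyone following along, a rough sketch of how the two settings from the
thread would typically be applied on an MDS node (file paths here are the
usual conventions, not something specified in the thread; check against your
distro and Lustre version):

```shell
# Disable NUMA zone reclaim, per the LU-5726 discussion:
sysctl -w vm.zone_reclaim_mode=0

# Persist across reboots (path is the conventional location, adjust as needed):
echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf

# Robin Humble's libcfs suggestion; takes effect the next time the
# module is loaded, so it needs a Lustre restart or reboot:
echo 'options libcfs cpu_npartitions=1' > /etc/modprobe.d/libcfs.conf
```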
it's quite a long thread - do I understand correctly that periodic
vm.drop_caches=1 can postpone the issue? I can replicate the warning
signs mentioned in the thread (growth of Inactive(file) to dizzying
heights when doing a lot of unlinks).
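If periodic cache dropping does help, one way to automate it is a cron
fragment like the following (the 15-minute interval is an arbitrary example,
not a value from the thread; drop_caches=1 frees only the clean page cache,
not dentries/inodes):

```shell
# Hypothetical cron.d entry to drop the page cache every 15 minutes:
echo '*/15 * * * * root echo 1 > /proc/sys/vm/drop_caches' \
    > /etc/cron.d/drop-page-cache
```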
It seems odd that, if this is purely a memory-balance problem,
it manifests as a 0xdeadbeef panic rather than an OOM. While I understand
that the oom-killer path itself needs memory to operate, does this
also imply that some allocation in the kernel or filesystem is not
checking a return value?
thanks, mark hahn.