[lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)
Mark Hahn
hahn at mcmaster.ca
Wed Apr 13 11:53:26 PDT 2016
>> We had to use lustre-2.5.3.90 on the MDS servers because of a memory leak.
>>
>> https://jira.hpdd.intel.com/browse/LU-5726
>
> Mark,
>
> If you don't have the patch for LU-5726, then you should definitely try to get that one. If nothing else, reading through the bug report might be useful. It details some of the MDS OOM problems I had and mentions setting vm.zone_reclaim_mode=0. It also has Robin Humble's suggestion of setting "options libcfs cpu_npartitions=1" (which is something that I started doing as well).
thanks, we'll be trying the LU-5726 patch and cpu_npartitions things.
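For anyone following along, a rough sketch of how the two settings from the
thread would typically be applied on an MDS node (file paths here are the
usual conventions, not something specified in the thread; check against your
distro and Lustre version):

```shell
# Disable NUMA zone reclaim, per the LU-5726 discussion:
sysctl -w vm.zone_reclaim_mode=0

# Persist across reboots (path is the conventional location, adjust as needed):
echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf

# Robin Humble's libcfs suggestion; takes effect the next time the
# module is loaded, so it needs a Lustre restart or reboot:
echo 'options libcfs cpu_npartitions=1' > /etc/modprobe.d/libcfs.conf
```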
it's quite a long thread - do I understand correctly that periodic
vm.drop_caches=1 can postpone the issue? I can replicate the warning
signs mentioned in the thread (growth of Inactive(file) to dizzying
heights when doing a lot of unlinks).
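If periodic cache dropping does help, one way to automate it is a cron
fragment like the following (the 15-minute interval is an arbitrary example,
not a value from the thread; drop_caches=1 frees only the clean page cache,
not dentries/inodes):

```shell
# Hypothetical cron.d entry to drop the page cache every 15 minutes:
echo '*/15 * * * * root echo 1 > /proc/sys/vm/drop_caches' \
    > /etc/cron.d/drop-page-cache
```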
It seems odd that, if this is purely a memory-balance problem,
it manifests as a 0xdeadbeef panic rather than an OOM. While I understand
that the oom-killer path itself needs memory to operate, does this
also imply that some allocation in the kernel or filesystem is not
checking a return value?
thanks, mark hahn.