[lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)

Wed Apr 13 06:58:37 PDT 2016

Hi Mark,

On Tue, Apr 12, 2016 at 04:49:10PM -0400, Mark Hahn wrote:
>One of our MDSs is crashing with the following:
>
>BUG: unable to handle kernel paging request at 00000000deadbeef
>IP: [<ffffffffa0ce0328>] iam_container_init+0x18/0x70 [osd_ldiskfs]
>PGD 0
>Oops: 0002 [#1] SMP
>
>The MDS is running 2.5.3-RC1--PRISTINE-2.6.32-431.23.3.el6_lustre.x86_64
>with about 2k clients ranging from 1.8.8 to 2.6.0

I saw an identical crash in Sep 2014 when the MDS was put under memory
pressure.

>to be related to vm.zone_reclaim_mode=1.  We also enabled quotas

zone_reclaim_mode should always be 0; 1 is broken. Hung processes
perpetually 'scanning' in one zone in /proc/zoneinfo, whilst plenty of
pages are free in another zone, are a sure sign of this issue.
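
A rough way to look for that symptom is sketched below. It is only an
illustration, not a tested diagnostic: it assumes the older /proc/zoneinfo
layout with a "pages free" line and a "scanned" line per zone (newer
kernels lay this file out differently, in which case the values come back
as None), and the function names are made up for this example.

```python
from pathlib import Path
import re


def read_zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Return the current vm.zone_reclaim_mode setting as an int."""
    return int(Path(path).read_text().strip())


def parse_zoneinfo(text):
    """Map (node, zone) -> {'free': pages or None, 'scanned': pages or None}.

    Assumes the per-zone 'pages free' and 'scanned' lines of older
    kernels; a field that is not present is left as None.
    """
    zones = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"Node\s+(\d+),\s+zone\s+(\S+)", line)
        if header:
            current = (int(header.group(1)), header.group(2))
            zones[current] = {"free": None, "scanned": None}
            continue
        if current is None:
            continue
        free = re.match(r"\s*pages free\s+(\d+)", line)
        if free:
            zones[current]["free"] = int(free.group(1))
            continue
        scanned = re.match(r"\s*scanned\s+(\d+)", line)
        if scanned:
            zones[current]["scanned"] = int(scanned.group(1))
    return zones
```

Calling parse_zoneinfo(Path("/proc/zoneinfo").read_text()) a few times and
comparing the results should show whether one zone keeps a non-zero
scanned count while another zone of the same type has plenty of free
pages. The setting itself can be changed at runtime with
"sysctl -w vm.zone_reclaim_mode=0".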

However, if you have vm.zone_reclaim_mode=0 now and are still seeing the
issue, then I would suspect that Lustre's overly aggressive memory
affinity code is partially to blame. At the very least it is most
likely stopping you from making use of half your MDS RAM.

see
  https://jira.hpdd.intel.com/browse/LU-5050

set
  options libcfs cpu_npartitions=1
to fix it. that's what I use on OSS and MDS for all my clusters.
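
To confirm the option is actually present on a node, something like the
sketch below could be used. It only scans modprobe configuration text for
an "options <module> <param>=<value>" line. The /etc/modprobe.d/*.conf
glob is an assumption about where the option was placed, and both function
names are hypothetical. Module options are read when the module is loaded,
so the setting only takes effect after libcfs is reloaded.

```python
import glob


def find_module_option(text, module, param):
    """Return the value given to `param` for `module` in modprobe-style
    config text, or None if it is not set. If the option appears more
    than once, this sketch simply keeps the last one it sees."""
    value = None
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        if len(fields) < 3 or fields[0] != "options" or fields[1] != module:
            continue
        for setting in fields[2:]:
            name, sep, val = setting.partition("=")
            if sep and name == param:
                value = val
    return value


def scan_modprobe_dir(module="libcfs", param="cpu_npartitions",
                      pattern="/etc/modprobe.d/*.conf"):
    """Return {path: value} for every config file that sets the option.
    The default pattern is an assumed location, not a Lustre requirement."""
    found = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            value = find_module_option(fh.read(), module, param)
        if value is not None:
            found[path] = value
    return found
```

An empty result from scan_modprobe_dir() would mean no file matching the
pattern sets cpu_npartitions for libcfs.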

cheers,
robin