[lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)
paf at cray.com
Tue Apr 12 17:05:59 PDT 2016
Ew. Well, that doesn't help. :)
Can you configure kdump on the node? That would get you both a dmesg and a dump. The dmesg would include the rest of the stack trace (I'm hoping that gives a better idea of whether or not the quota code is involved). A dump would let a developer-type dig deeper as well, though it would also contain private info from your server.
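[Editor's note: a minimal sketch of setting up kdump on an EL6-era kernel like the one in this thread. Paths, the crashkernel size, and the dump target are assumptions; adjust for your distribution and memory configuration.]

```shell
# Hedged sketch: enabling kdump on an EL6 (2.6.32-based) MDS.
# Sizes and paths below are assumptions, not values from this thread.

# 1. Reserve memory for the crash kernel at boot: append to the kernel
#    line in /boot/grub/grub.conf, e.g.
#      kernel /vmlinuz-2.6.32-431.23.3.el6_lustre.x86_64 ... crashkernel=256M

# 2. Choose a dump target and collector in /etc/kdump.conf, e.g.
#      path /var/crash
#      core_collector makedumpfile -c --message-level 1 -d 31

# 3. Enable the service, then reboot so the reservation takes effect:
chkconfig kdump on
service kdump start

# 4. After the next crash, the vmcore and the full kernel log land under
#    /var/crash/<host-timestamp>/ (vmcore, vmcore-dmesg.txt).
```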
From: Mark Hahn [hahn at mcmaster.ca]
Sent: Tuesday, April 12, 2016 4:39 PM
To: Patrick Farrell
Cc: lustre-discuss at lists.lustre.org
Subject: RE: [lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)
> Giving the rest of the back trace of the crash would help for developers
> looking at it.
> It's a lot easier to tell what code is involved with the whole trace.
thanks. I'm sure that's the case, but these oopsen are truncated.
well, one was slightly longer:
BUG: unable to handle kernel paging request at 00000000deadbeef
IP: [<ffffffffa0cde328>] iam_container_init+0x18/0x70 [osd_ldiskfs]
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/system/cpu/online
Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 iTCO_wdt iTCO_vendor_support serio_raw raid10 i2c_i801 lpc_ich mfd_core ipmi_devintf mlx4_core sg acpi_pad igb dca i2c_algo_bit i2c_core ptp pps_core shpchp ext4 jbd2 mbcache raid1 sr_mod cdrom sd_mod crc_t10dif isci libsas mpt2sas scsi_transport_sas raid_class ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 7768, comm: mdt00_039 Not tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1 Supermicro SYS-2027R-WRF/X9DRW
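[Editor's note: once a vmcore has been captured via kdump, a truncated trace like the one above can be recovered offline with the crash(8) utility. This is a hedged sketch; it assumes the matching kernel-debuginfo packages are installed, and the vmcore path shown is illustrative only.]

```shell
# Hedged sketch: inspecting a captured vmcore with crash(8).
# Requires the kernel-debuginfo RPMs matching the running kernel;
# both paths below are assumptions.
crash /usr/lib/debug/lib/modules/2.6.32-431.23.3.el6_lustre.x86_64/vmlinux \
      /var/crash/example-host-2016-04-12/vmcore

# Inside the crash prompt:
#   log        # full kernel ring buffer, including the truncated oops tail
#   bt         # complete back trace of the crashed task (e.g. PID 7768)
#   mod -S     # load symbols for out-of-tree modules such as osd_ldiskfs
```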
by way of straw-grasping, I'll mention two other very frequent messages
we're seeing on the MDS in question:
Lustre: 17673:0:(mdt_xattr.c:465:mdt_reint_setxattr()) covework-MDT0000: client miss to set OBD_MD_FLCTIME when setxattr system.posix_acl_access: [object [0x200031f84:0x1cad0:0x0]] [valid 68719476736]
(which seems to be https://jira.hpdd.intel.com/browse/LU-532 and a
consequence of some of our very old clients, but presumably not
something that can crash the MDS.)
LustreError: 22970:0:(tgt_lastrcvd.c:813:tgt_last_rcvd_update()) covework-MDT0000: trying to overwrite bigger transno:on-disk: 197587694105, new: 197587694104 replay: 0. see LU-617.
perplexing because the MDS is 2.5.3 and
https://jira.hpdd.intel.com/browse/LU-617 shows fixed circa 2.2.0/2.1.2.
(and our problem isn't with recovery, AFAICT.)
Mark Hahn | SHARCnet Sysadmin | hahn at sharcnet.ca | http://www.sharcnet.ca
| McMaster RHPCS | hahn at mcmaster.ca | 905 525 9140 x24687
| Compute/Calcul Canada | http://www.computecanada.ca