[lustre-discuss] Kernel panic on mounting MGS
Sumit Mookerjee
sumit at iuac.res.in
Thu Jun 25 22:03:11 PDT 2015
Hi!
Sorry; I forgot to append the syslog messages related to the kernel panic,
in case they help. Here they are:
---------- Syslog messages when the MGS is mounted ----------
---------- Mount command: "mount -t lustre /dev/mapper/mpatha /mnt/mgs" ----------
Jun 25 13:00:26 nas-0-0 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
Jun 25 13:00:26 nas-0-0 kernel: IP: [<ffffffffa03cb30c>] lustre_fill_super+0xfdc/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: PGD 276664067 PUD 27420b067 PMD 0
Jun 25 13:00:26 nas-0-0 kernel: Oops: 0000 [#1] SMP
Jun 25 13:00:26 nas-0-0 kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:1f:00.0/host1/port-1:0/end_device-1:0/target1:0:0/1:0:0:0/block/sdd/queue/max_sectors_kb
Jun 25 13:00:26 nas-0-0 kernel: CPU 0
Jun 25 13:00:26 nas-0-0 kernel: Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) ib_ipoib nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ipmi_devintf ipmi_si ipmi_msghandler cpufreq_ondemand acpi_cpufreq freq_table mperf ldiskfs(U) ko2iblnd(U) rdma_cm ib_cm iw_cm ib_sa ib_addr ipv6 obdclass(U) lnet(U) lvfs(U) libcfs(U) ib_qib ib_mad ib_core bnx2 microcode cdc_ether usbnet mii serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support sg ioatdma dca i7core_edac edac_core shpchp ext4 mbcache jbd2 dm_round_robin scsi_dh_rdac sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptsas mptscsih mptbase mpt2sas scsi_transport_sas raid_class dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jun 25 13:00:26 nas-0-0 kernel:
Jun 25 13:00:26 nas-0-0 kernel: Pid: 30426, comm: mount.lustre Not tainted 2.6.32-279.14.1.el6_lustre.x86_64 #1 IBM System x3650 M3 -[7945FT1]-/00J6159
Jun 25 13:00:26 nas-0-0 kernel: RIP: 0010:[<ffffffffa03cb30c>] [<ffffffffa03cb30c>] lustre_fill_super+0xfdc/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: RSP: 0018:ffff88025b87fd08 EFLAGS: 00010282
Jun 25 13:00:26 nas-0-0 kernel: RAX: 0000000000000000 RBX: ffff880275682400 RCX: 0000000000000009
Jun 25 13:00:26 nas-0-0 kernel: RDX: 000000000000015d RSI: ffffffffa03f8860 RDI: ffffffffa04473e0
Jun 25 13:00:26 nas-0-0 kernel: RBP: ffff88025b87fd98 R08: 0000000000000073 R09: 0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: R10: 0000000000000001 R11: 0000000000000001 R12: ffff880271586cc0
Jun 25 13:00:26 nas-0-0 kernel: R13: ffff880276088cc0 R14: ffff880276720000 R15: ffff880271586cc0
Jun 25 13:00:26 nas-0-0 kernel: FS: 00007f10fea95700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 25 13:00:26 nas-0-0 kernel: CR2: 0000000000000018 CR3: 0000000277b82000 CR4: 00000000000006f0
Jun 25 13:00:26 nas-0-0 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 25 13:00:26 nas-0-0 kernel: Process mount.lustre (pid: 30426, threadinfo ffff88025b87e000, task ffff880274496040)
Jun 25 13:00:26 nas-0-0 kernel: Stack:
Jun 25 13:00:26 nas-0-0 kernel: ffff88025b87fd38 ffffffff8127a3fa ffff88025b87fd38 0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: <d> 0000000000000000 ffffffffa041c190 ffff88025b87fd98 ffffffff8117e123
Jun 25 13:00:26 nas-0-0 kernel: <d> ffff880275682470 ffffffff8117d200 ffff880271586c88 00000000cf124357
Jun 25 13:00:26 nas-0-0 kernel: Call Trace:
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8127a3fa>] ? strlcpy+0x4a/0x60
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e123>] ? sget+0x3e3/0x480
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117d200>] ? set_anon_super+0x0/0x100
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffffa03ca330>] ? lustre_fill_super+0x0/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e66f>] get_sb_nodev+0x5f/0xa0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffffa03bba65>] lustre_get_sb+0x25/0x30 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e2cb>] vfs_kern_mount+0x7b/0x1b0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e472>] do_kern_mount+0x52/0x130
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8119cb52>] do_mount+0x2d2/0x8d0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8119d1e0>] sys_mount+0x90/0xe0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Jun 25 13:00:26 nas-0-0 kernel: Code: a0 48 c7 05 fb c0 07 00 70 f2 3e a0 c7 05 fd c0 07 00 1c 02 00 00 48 c7 05 fe c0 07 00 10 74 44 a0 c7 05 ec c0 07 00 00 00 02 02 <4c> 8b 40 18 31 c0 49 83 c0 60 e8 95 3b ed ff f6 05 e2 a2 ee ff
Jun 25 13:00:26 nas-0-0 kernel: RIP [<ffffffffa03cb30c>] lustre_fill_super+0xfdc/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: RSP <ffff88025b87fd08>
Jun 25 13:00:26 nas-0-0 kernel: CR2: 0000000000000018
Jun 25 13:00:26 nas-0-0 kernel: ---[ end trace 5f2e504657a55b57 ]---
Jun 25 13:00:26 nas-0-0 kernel: Kernel panic - not syncing: Fatal exception
Jun 25 13:00:26 nas-0-0 kernel: Pid: 30426, comm: mount.lustre Tainted: G D --------------- 2.6.32-279.14.1.el6_lustre.x86_64 #1
Thanks!
Sumit
On 06/26/2015 10:27 AM, Sumit Mookerjee wrote:
> Hi!
>
> We run a 55 TB Lustre file system for our HPC users, with an MGS and
> an MDT on one node (nas-0-0), and four OSTs, two partitions on each of
> two nodes. After a year of stable operations, we had a major cooling
> system failure, and all the servers and clients crashed.
>
> Since then, we have not been able to mount the MGS partition; the server
> simply crashes. I can mount the MDT and the OSTs, but that does not
> help without the MGS running. I can mount the MGS as ldiskfs. An
> e2fsck on the MGS partition (and also on the MDT and OST partitions)
> shows no issues.
>
> Is there any way I can recover the MGS? I read that just doing a
> writeconf on the MDTs and the OSTs would regenerate the MGS config,
> but that does not seem to help (perhaps because the MGS cannot be
> mounted as lustre in the first place?).
>
> We have also tried creating a new MGS (mkfs.lustre --reformat --mgs) on a
> spare partition we had on nas-0-0. The mkfs seems to complete without
> errors, but the system crashes again when I try to mount this new
> partition as lustre.
>
> Is there any way to fix the problem without deleting all data from the
> MDT/OSTs (in short, starting afresh)?
> I am at my wit's end, and clearly do not know enough to understand what
> is going on. Any help would be much appreciated!
>
> Thank you.
>
> Sumit Mookerjee
>
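[Editor's note: for readers following along, the writeconf procedure mentioned above is normally run along these lines. This is a sketch only; the device paths are placeholders for this site's actual MGS/MDT and OST devices, and all Lustre targets must be unmounted on every server before the configuration logs are regenerated.]

```shell
# Sketch of regenerating the Lustre configuration logs (writeconf).
# Device paths below are placeholders; unmount all targets first.

# Erase and mark the config logs for regeneration on each target:
tunefs.lustre --writeconf /dev/mapper/mpatha   # combined MGS/MDT device
tunefs.lustre --writeconf /dev/sdX             # repeat for each OST device

# Remount in order: MGS first, then MDT, then OSTs, then clients.
# The new configuration logs are written as each target registers.
mount -t lustre /dev/mapper/mpatha /mnt/mgs
```

Note that writeconf only regenerates the configuration logs; it depends on the MGS target itself mounting successfully, which is exactly what fails here.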
--
-----------------------------------------------------------------------------
Sumit Mookerjee
Inter University Accelerator Centre
Aruna Asaf Ali Marg
New Delhi 110067
India
Phones: + 91 11 26893955, 26899232 ext. 8252
Fax: +91 11 26893666
E-mail: sumit at iuac.res.in
-----------------------------------------------------------------------------