[lustre-discuss] Kernel panic on mounting MGS
Sumit Mookerjee
sumit at iuac.res.in
Thu Jun 25 22:03:11 PDT 2015
Hi!
Sorry; I forgot to append the syslog messages related to the kernel panic,
in case they help. Here they are:
---------- Syslog messages when the MGS is mounted ----------
---------- Mount command: "mount -t lustre /dev/mapper/mpatha /mnt/mgs" ----------
Jun 25 13:00:26 nas-0-0 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
Jun 25 13:00:26 nas-0-0 kernel: IP: [<ffffffffa03cb30c>] lustre_fill_super+0xfdc/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: PGD 276664067 PUD 27420b067 PMD 0
Jun 25 13:00:26 nas-0-0 kernel: Oops: 0000 [#1] SMP
Jun 25 13:00:26 nas-0-0 kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:1f:00.0/host1/port-1:0/end_device-1:0/target1:0:0/1:0:0:0/block/sdd/queue/max_sectors_kb
Jun 25 13:00:26 nas-0-0 kernel: CPU 0
Jun 25 13:00:26 nas-0-0 kernel: Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) ib_ipoib nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ipmi_devintf ipmi_si ipmi_msghandler cpufreq_ondemand acpi_cpufreq freq_table mperf ldiskfs(U) ko2iblnd(U) rdma_cm ib_cm iw_cm ib_sa ib_addr ipv6 obdclass(U) lnet(U) lvfs(U) libcfs(U) ib_qib ib_mad ib_core bnx2 microcode cdc_ether usbnet mii serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support sg ioatdma dca i7core_edac edac_core shpchp ext4 mbcache jbd2 dm_round_robin scsi_dh_rdac sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptsas mptscsih mptbase mpt2sas scsi_transport_sas raid_class dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jun 25 13:00:26 nas-0-0 kernel:
Jun 25 13:00:26 nas-0-0 kernel: Pid: 30426, comm: mount.lustre Not tainted 2.6.32-279.14.1.el6_lustre.x86_64 #1 IBM System x3650 M3 -[7945FT1]-/00J6159
Jun 25 13:00:26 nas-0-0 kernel: RIP: 0010:[<ffffffffa03cb30c>] [<ffffffffa03cb30c>] lustre_fill_super+0xfdc/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: RSP: 0018:ffff88025b87fd08 EFLAGS: 00010282
Jun 25 13:00:26 nas-0-0 kernel: RAX: 0000000000000000 RBX: ffff880275682400 RCX: 0000000000000009
Jun 25 13:00:26 nas-0-0 kernel: RDX: 000000000000015d RSI: ffffffffa03f8860 RDI: ffffffffa04473e0
Jun 25 13:00:26 nas-0-0 kernel: RBP: ffff88025b87fd98 R08: 0000000000000073 R09: 0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: R10: 0000000000000001 R11: 0000000000000001 R12: ffff880271586cc0
Jun 25 13:00:26 nas-0-0 kernel: R13: ffff880276088cc0 R14: ffff880276720000 R15: ffff880271586cc0
Jun 25 13:00:26 nas-0-0 kernel: FS: 00007f10fea95700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 25 13:00:26 nas-0-0 kernel: CR2: 0000000000000018 CR3: 0000000277b82000 CR4: 00000000000006f0
Jun 25 13:00:26 nas-0-0 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 25 13:00:26 nas-0-0 kernel: Process mount.lustre (pid: 30426, threadinfo ffff88025b87e000, task ffff880274496040)
Jun 25 13:00:26 nas-0-0 kernel: Stack:
Jun 25 13:00:26 nas-0-0 kernel: ffff88025b87fd38 ffffffff8127a3fa ffff88025b87fd38 0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: <d> 0000000000000000 ffffffffa041c190 ffff88025b87fd98 ffffffff8117e123
Jun 25 13:00:26 nas-0-0 kernel: <d> ffff880275682470 ffffffff8117d200 ffff880271586c88 00000000cf124357
Jun 25 13:00:26 nas-0-0 kernel: Call Trace:
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8127a3fa>] ? strlcpy+0x4a/0x60
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e123>] ? sget+0x3e3/0x480
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117d200>] ? set_anon_super+0x0/0x100
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffffa03ca330>] ? lustre_fill_super+0x0/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e66f>] get_sb_nodev+0x5f/0xa0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffffa03bba65>] lustre_get_sb+0x25/0x30 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e2cb>] vfs_kern_mount+0x7b/0x1b0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e472>] do_kern_mount+0x52/0x130
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8119cb52>] do_mount+0x2d2/0x8d0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8119d1e0>] sys_mount+0x90/0xe0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Jun 25 13:00:26 nas-0-0 kernel: Code: a0 48 c7 05 fb c0 07 00 70 f2 3e a0 c7 05 fd c0 07 00 1c 02 00 00 48 c7 05 fe c0 07 00 10 74 44 a0 c7 05 ec c0 07 00 00 00 02 02 <4c> 8b 40 18 31 c0 49 83 c0 60 e8 95 3b ed ff f6 05 e2 a2 ee ff
Jun 25 13:00:26 nas-0-0 kernel: RIP [<ffffffffa03cb30c>] lustre_fill_super+0xfdc/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: RSP <ffff88025b87fd08>
Jun 25 13:00:26 nas-0-0 kernel: CR2: 0000000000000018
Jun 25 13:00:26 nas-0-0 kernel: ---[ end trace 5f2e504657a55b57 ]---
Jun 25 13:00:26 nas-0-0 kernel: Kernel panic - not syncing: Fatal exception
Jun 25 13:00:26 nas-0-0 kernel: Pid: 30426, comm: mount.lustre Tainted: G D --------------- 2.6.32-279.14.1.el6_lustre.x86_64 #1
Thanks!
Sumit
On 06/26/2015 10:27 AM, Sumit Mookerjee wrote:
> Hi!
>
> We run a 55 TB Lustre file system for our HPC users, with an MGS and
> an MDT on one node (nas-0-0), and four OSTs, two partitions on each of
> two nodes. After a year of stable operations, we had a major cooling
> system failure, and all the servers and clients crashed.
>
> Since then, we have not been able to mount the MGS partition; the server
> simply crashes. I can mount the MDT and the OSTs, but that does not
> help without the MGS running. I can mount the MGS as ldiskfs. An
> e2fsck on the MGS partition (and also on the MDT and OST partitions)
> shows no issues.
>
> Is there any way I can recover the MGS? I read that just doing a
> writeconf on the MDTs and the OSTs would regenerate the MGS config,
> but that does not seem to help (perhaps because the MGS cannot be
> mounted as lustre in the first place?).
>
> We have also tried creating a new MGS (mkfs.lustre --reformat --mgs) on a
> spare partition we had on nas-0-0. The mkfs seems to complete without
> errors, but the system crashes again when I try to mount this new
> partition as lustre.
>
> Is there any way to fix the problem without deleting all data from the
> MDT/OSTs (in short, starting afresh)?
> I am at my wit's end, and clearly do not know enough to understand what
> is going on. Any help would be much appreciated!
>
> Thank you.
>
> Sumit Mookerjee
>
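[Editor's note: for readers following along, the writeconf procedure mentioned above is normally run along these lines. This is a sketch only; the device paths are placeholders for this site's actual MGS/MDT and OST devices, and all Lustre targets must be unmounted on every server before the configuration logs are regenerated.]

```shell
# Sketch of regenerating the Lustre configuration logs (writeconf).
# Device paths below are placeholders; unmount all targets first.

# Erase and mark the config logs for regeneration on each target:
tunefs.lustre --writeconf /dev/mapper/mpatha   # combined MGS/MDT device
tunefs.lustre --writeconf /dev/sdX             # repeat for each OST device

# Remount in order: MGS first, then MDT, then OSTs, then clients.
# The new configuration logs are written as each target registers.
mount -t lustre /dev/mapper/mpatha /mnt/mgs
```

Note that writeconf only regenerates the configuration logs; it depends on the MGS target itself mounting successfully, which is exactly what fails here.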
--
-----------------------------------------------------------------------------
Sumit Mookerjee
Inter University Accelerator Centre
Aruna Asaf Ali Marg
New Delhi 110067
India
Phones: + 91 11 26893955, 26899232 ext. 8252
Fax: +91 11 26893666
E-mail: sumit at iuac.res.in
-----------------------------------------------------------------------------