[lustre-discuss] Kernel panic on mounting MGS

Sumit Mookerjee sumit at iuac.res.in
Thu Jun 25 21:57:21 PDT 2015


We run a 55 TB Lustre file system for our HPC users, with an MGS and an 
MDT on one node (nas-0-0), and four OSTs, two partitions on each of two 
nodes. After a year of stable operations, we had a major cooling system 
failure, and all the servers and clients crashed.

Since then, I have not been able to mount the MGS partition; the server 
simply crashes with a kernel panic. I can mount the MDT and the OSTs, 
but that does not help without the MGS running. I can mount the MGS 
partition as ldiskfs, and an e2fsck on the MGS partition (and on the 
MDT and OST partitions) shows no issues.

Is there any way I can recover the MGS? I read that just doing a 
writeconf on the MDTs and the OSTs would regenerate the MGS config, but 
that does not seem to help (perhaps because the MGS cannot be mounted as 
lustre in the first place?).

I have also tried creating a new MGS (mkfs.lustre --reformat --mgs) on a 
spare partition we had on nas-0-0. The mkfs seems to complete without 
errors, but the system crashes again when I try to mount this new 
partition as lustre.

Is there any way to fix the problem without deleting all data from the 
MDT/OSTs (in short, starting afresh)?
I am at my wit's end, and clearly do not know enough to understand what 
is going on. Any help would be much appreciated!

Thank you.

Sumit Mookerjee

Inter University Accelerator Centre
Aruna Asaf Ali Marg
New Delhi 110067

E-mail: sumit at iuac.res.in
