[lustre-discuss] Kernel panic on mounting MGS

Patrick Farrell paf at cray.com
Fri Jun 26 08:14:37 PDT 2015


To be clear, are you saying you *did* run a writeconf on the MDT and OSTs?

In this situation I would suggest the following, which you may have 
done already:

Run writeconf on all of the targets (the MDT, the OSTs, and your MGS), 
then try to start the file system.
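
As a rough sketch of that first step, the commands would look something like the following. The device paths are hypothetical placeholders (the original post does not name them), and the script only prints the commands rather than running them. Each tunefs.lustre command has to be run on the server that hosts that device, with all clients and targets unmounted first.

```shell
#!/bin/sh
# Sketch only: prints the writeconf commands instead of running them.
# Every device path here is a hypothetical placeholder.
MGS_DEV=/dev/mgs_dev
MDT_DEV=/dev/mdt_dev
OST_DEVS="/dev/ost0_dev /dev/ost1_dev /dev/ost2_dev /dev/ost3_dev"

plan_writeconf() {
    # --writeconf erases the configuration logs so that they are
    # regenerated when the targets next register with the MGS.
    for dev in "$MGS_DEV" "$MDT_DEV" $OST_DEVS; do
        echo "tunefs.lustre --writeconf $dev"
    done
}

plan_writeconf
```

To start afterwards, mount the MGS first, then the MDT, then the OSTs, and the clients last.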

If that fails, then:

Create a new MGS, run writeconf on the MDT and all of the OSTs, then 
try to start. The writeconf can be done before or after you make the 
MGS; just be sure to do it again, since you have attempted a start 
since the last one.
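
A sketch of this second approach, again with placeholder device paths and mount points, and printing the commands rather than running them. It assumes the new MGS lives on the same node as the old one, so the other targets can still reach it at the same NID.

```shell
#!/bin/sh
# Sketch only: prints the commands instead of running them.
# Device paths and mount points are hypothetical placeholders.
NEW_MGS_DEV=/dev/new_mgs_dev
MDT_DEV=/dev/mdt_dev
OST_DEVS="/dev/ost0_dev /dev/ost1_dev /dev/ost2_dev /dev/ost3_dev"

plan_new_mgs() {
    # 1. Format the spare partition as a fresh, empty MGS.
    echo "mkfs.lustre --reformat --mgs $NEW_MGS_DEV"

    # 2. Erase the config logs on the MDT and OSTs so they
    #    re-register with the new MGS on their next mount.
    for dev in "$MDT_DEV" $OST_DEVS; do
        echo "tunefs.lustre --writeconf $dev"
    done

    # 3. Start in order: MGS, then MDT, then OSTs.
    echo "mount -t lustre $NEW_MGS_DEV /mnt/mgs"
    echo "mount -t lustre $MDT_DEV /mnt/mdt"
    n=0
    for dev in $OST_DEVS; do
        echo "mount -t lustre $dev /mnt/ost$n"
        n=$((n + 1))
    done
}

plan_new_mgs
```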

This sort of crash is almost certainly caused by invalid data on the 
MGS. It might also be possible to identify the particular data and 
remove it, depending on where the crash is. That's a bit trickier.
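
If you do want to look at the MGS data, one possible route is to mount the partition as ldiskfs (which you say works) and dump a configuration log with llog_reader. This is only a sketch: the device, the mount point, and the log name FSNAME-client are placeholders, it assumes the logs sit under CONFIGS/ on your version, and it prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch only: prints the commands instead of running them.
# Device path, mount point and log name are hypothetical placeholders.
MGS_DEV=/dev/mgs_dev
MNT=/mnt/mgs_ldiskfs

plan_inspect_mgs() {
    # Optional first argument: name of the config log to dump.
    log=${1:-FSNAME-client}
    # Mount the MGS backing file system read-only as ldiskfs.
    echo "mount -t ldiskfs -o ro $MGS_DEV $MNT"
    # List the configuration logs stored on the MGS.
    echo "ls -l $MNT/CONFIGS"
    # Dump the records of one log for inspection.
    echo "llog_reader $MNT/CONFIGS/$log"
    echo "umount $MNT"
}

plan_inspect_mgs
```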

- Patrick

On 06/25/2015 11:57 PM, Sumit Mookerjee wrote:
> Hi!
> We run a 55 TB Lustre file system for our HPC users, with an MGS and 
> an MDT on one node (nas-0-0), and four OSTs, two partitions on each of 
> two nodes. After a year of stable operations, we had a major cooling 
> system failure, and all the servers and clients crashed.
> Since then, have not been able to mount the MGS partition; the server 
> simply crashes. I can mount the MDT, and the OSTs, but that does not 
> help without the MGS running. I can mount the MGS with ldiskfs. An 
> e2fsck on the MGS partition (also on the MDT and OST partitions) shows 
> up no issues.
> Is there any way I can recover the MGS? I read that just doing a 
> writeconf on the MDTs and the OSTs would regenerate the MGS config, 
> but that does not seem to help (perhaps because the MGS cannot be 
> mounted as lustre in the first place?).
> Have also tried creating a new MGS (mkfs.lustre --reformat --mgs) on a 
> spare partition we had on nas-0-0. The mkfs seems to complete without 
> errors, but the system crashes again when I try to mount this new 
> partition as lustre.
> Is there any way to fix the problem without deleting all data from the 
> MDT/OSTs (in short, starting afresh)?
> Am at my wit's end, and clearly do not know enough to understand what 
> is going on. Any help much appreciated!
> Thank you.
> Sumit Mookerjee

