[lustre-discuss] proper procedure after MDT kernel panic

Thu Aug 11 19:43:12 PDT 2016

> On Aug 11, 2016, at 5:42 AM, E.S. Rosenberg <esr+lustre at mail.hebrew.edu> wrote:
> 
> Our MDT suffered a kernel panic (which I will post separately). The OSSs stayed alive, but the MDT was out for some time while nodes still tried to interact with Lustre.
> 
> So I have several questions:
> a. What happens to processes that are reading/writing during such an event (does it make a difference if they already have handles on the OSS, for instance)? I noticed several of our compute nodes ended up filling their swap/RAM, so I assume some level of caching is happening until the MDT returns...

In theory, the processes should just hang until the client can contact the server again.  In my experience, this works a large fraction of the time (I have occasionally done server reboots on a production file system that was in use in order to fix some problems), but I wouldn’t say it is 100% guaranteed.
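
Since anything that touches the file system during an outage may block, a monitoring script should avoid touching the mount directly. Here is a minimal sketch of one way to probe a mount point from a child process with a timeout. The function name mount_responds, the timeout_s parameter, and the /mnt/lustre path in the usage note are my own placeholders, not anything provided by Lustre, and I am assuming a plain stat() is a good enough liveness check.

```python
import subprocess
import sys


def mount_responds(path, timeout_s=10.0):
    """Return True if a stat() of `path` completes within timeout_s seconds.

    Hypothetical helper for illustration only; not part of Lustre.
    """
    # Do the stat in a child process so that a hung mount blocks the child
    # and not the calling script.
    cmd = [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path]
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        # Exit code 0 means the stat succeeded; nonzero means it raised.
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in uninterruptible I/O may linger
        # until the server comes back, so we do not wait for it again.
        proc.kill()
        return False
```

For example, mount_responds("/mnt/lustre", timeout_s=5) would return False if the stat did not finish within five seconds, or if the path could not be stat'ed at all.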

> b. what is the best/proper procedure now to ensure filesystem integrity?
> Should I take the filesystem offline and run an lfsck first on the MDT then on the OSS?

If the MDS crashed, then you may want to check the MDT.  But if the OSSs stayed up, I don’t think there should be any problem with the OSTs that would require an fsck.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu