[lustre-discuss] MGS failover problem

Vicker, Darby (JSC-EG311) darby.vicker-1 at nasa.gov
Wed Jan 11 09:30:54 PST 2017

> The question I have in this is how long are you waiting, and how are you
> determining that lnet has hung?

In the example I sent today, I waited about 10 minutes.  The other day, it looks like I waited about 20 minutes before rebooting, since I couldn't kill lnet.  I'm calling it hung because even 10 minutes seems excessive, and also because of the stack trace.

> How are you specifying --failnode for your configuration?  If you could
> run tunefs.lustre on the MDT/MGS and an OST, that would be very helpful.

We are not using --failnode; we are using --servicenode, since the admin manual indicates --failnode has some disadvantages compared to --servicenode.  See the original post for how we formatted the MDT and OSTs (not optimal).


But the service node options were then corrected with a tunefs.lustre command – see this:


> Finally, how are you specifying the mount string on your various clients?

mount -t lustre at tcp: at tcp:/hpfs-fsl /tmp/lustre_test/

But the clients seem to be just fine – it's the OSSs that don't seem to be picking up the new MGS.
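
For anyone comparing mount strings, here is a minimal sketch of how a client mount spec with two MGS failover nodes is put together, assuming the usual Lustre form of colon-separated MGS NIDs followed by :/fsname.  The two NIDs are hypothetical placeholders from the documentation address range, not our real servers; only the file system name and mount point are taken from the command above.

```shell
#!/bin/sh
# Hypothetical MGS NIDs (placeholders, not the real servers).
MGS_PRIMARY="192.0.2.10@tcp"
MGS_FAILOVER="192.0.2.11@tcp"
FSNAME="hpfs-fsl"
MOUNT_POINT="/tmp/lustre_test/"

# Failover MGS nodes are separated by ':'; the file system name follows ':/'.
MOUNT_SPEC="${MGS_PRIMARY}:${MGS_FAILOVER}:/${FSNAME}"

# Print the command instead of running it, so the sketch is safe to try anywhere.
echo "mount -t lustre ${MOUNT_SPEC} ${MOUNT_POINT}"
```

The script only prints the resulting mount command, so it can be checked on a machine without Lustre installed.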
