[lustre-discuss] MGS failover problem

Wed Jan 11 09:30:54 PST 2017

> The question I have in this is how long are you waiting, and how are you
> determining that lnet has hung?

In the example I sent today I waited about 10 minutes.  The other day it looks like I waited about 20 minutes before rebooting, since I couldn't kill lnet.  I'm calling it hung because even 10 minutes seems excessive, and because of the stack trace.

> How are you specifying --failnode for your configuration?  If you could
> run tunefs.lustre on the MDT/MGS and an OST, that would be very helpful.

We are not using --failnode; we are using --servicenode, since the admin manual indicates --failnode has some disadvantages compared to --servicenode.  See the original post for how we formatted the MDT and OSTs (not optimal).

http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2017-January/014125.html

But service node options were corrected with a tunefs.lustre command – see this:

http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2017-January/014129.html

> Finally, how are you specifying the mount string on your various clients?

mount -t lustre 192.52.98.30@tcp:192.52.98.31@tcp:/hpfs-fsl /tmp/lustre_test/

But the clients seem to be just fine – it's the OSSs that don't seem to be picking up the new MGS.
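
For reference, here is a minimal sketch of how the client mount string above is put together, written with the usual `@` NID separator.  The two MGS NIDs are joined with a colon so the client treats them as alternative MGS nodes, followed by `:/` and the filesystem name.  The variable names are only for illustration, and the script prints the command instead of running it:

```shell
#!/bin/sh
# Values taken from the mount command in this message.
mgs_primary="192.52.98.30@tcp"    # NID of the first MGS node
mgs_failover="192.52.98.31@tcp"   # NID of the failover MGS node
fsname="hpfs-fsl"
mountpoint="/tmp/lustre_test/"

# Colon-separated NIDs name alternative MGS nodes; ":/<fsname>" selects the filesystem.
mount_spec="${mgs_primary}:${mgs_failover}:/${fsname}"

# Print the command rather than executing it, so this is safe on a host without Lustre.
echo "mount -t lustre ${mount_spec} ${mountpoint}"
```

That only covers the client side.  The servers get their MGS NIDs from the target configuration, not from a mount string like this.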