[lustre-discuss] MGS failover problem
Vicker, Darby (JSC-EG311)
darby.vicker-1 at nasa.gov
Wed Jan 11 09:30:54 PST 2017
> The question I have in this is how long are you waiting, and how are you
> determining that lnet has hung?
In the example I sent today, I waited about 10 minutes. The other day, though, it looks like I waited about 20 minutes before rebooting because I couldn't kill lnet. I'm calling it hung because even 10 minutes seems excessive, and also because of the stack trace.
> How are you specifying --failnode for your configuration? If you could
> run tunefs.lustre on the MDT/MGS and an OST, that would be very helpful.
We are not using --failnode; we are using --servicenode, since the admin manual indicates that --failnode has some disadvantages compared with --servicenode. See the original post for how we formatted the MDT and OSTs (not optimal).
But the service node options were later corrected with a tunefs.lustre command – see this:
> Finally, how are you specifying the mount string on your various clients?
mount -t lustre 126.96.36.199@tcp:188.8.131.52@tcp:/hpfs-fsl /tmp/lustre_test/
But the clients seem to be just fine – it's the OSSs that don't seem to be picking up the new MGS.
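For reference, the client mount string above lists two MGS NIDs separated by a colon, followed by ":/" and the filesystem name. A minimal sketch of how that string breaks down is below; split_mount_spec is a hypothetical helper written only for illustration, not a Lustre tool, and it assumes the simple colon-separated form used in this post.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): split a client mount spec of the
# form "<mgsnid>[:<mgsnid>...]:/<fsname>" into its parts.
split_mount_spec() {
    spec=$1
    fsname=${spec##*:/}   # everything after the last ":/"
    nids=${spec%:/*}      # everything before the last ":/"
    echo "fsname=$fsname"
    # The MGS NIDs are colon-separated; print one per line, in the order given.
    old_ifs=$IFS
    IFS=:
    for nid in $nids; do
        echo "mgsnid=$nid"
    done
    IFS=$old_ifs
}

split_mount_spec '126.96.36.199@tcp:188.8.131.52@tcp:/hpfs-fsl'
```

With the string from this post, this prints the filesystem name hpfs-fsl and then the two MGS NIDs, which makes it easy to confirm that both the primary and the failover MGS are actually listed on the client side.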