[lustre-discuss] MGS failover problem

Michael Di Domenico mdidomenico4 at gmail.com
Wed Jan 11 05:17:37 PST 2017

On Tue, Jan 10, 2017 at 11:32 AM, Vicker, Darby (JSC-EG311)
<darby.vicker-1 at nasa.gov> wrote:
> One other thought comes to mind.  We are using the init.d scripts (i.e. /etc/init.d/{lustre,lnet}) and /etc/ldev.conf.  We have lnet chkconfig’ed on, so lnet starts on boot on all servers.  But ‘lustre’ is chkconfig’ed off so that if a server reboots for whatever reason we don’t get into a situation where we multi-mount.  On a clean boot we have to manually mount the MDTs/OSTs (i.e. do a “service lustre start”).  To do the failover we run “/etc/init.d/lustre stop local” on the primary and “/etc/init.d/lustre start foreign” on the secondary.  What is the right thing to do with lnet on failover?  Should it be stopped on the primary node before doing a failover to the secondary node?  This is the state of the pro

I'm certainly no Lustre expert, but I would suspect you want lnet to
be stopped on the primary node if you fail over to the secondary.
Historically, Lustre is a STONITH-based failover system, so I would
expect that when you fail over from one node to another, the primary
node is effectively powered off. I can certainly believe that there's
some code in Lustre that checks lnet and, if it's up, tries to do
something, which could be the source of the error messages you're
seeing.

But I'm not an expert, so I could be way off base.
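
For what it's worth, here is a rough sketch of the ordering I have in
mind. It only prints the commands (a dry run) rather than running
them. The two lustre commands are the ones from your mail; the
"/etc/init.d/lnet stop" step is my assumption about what a STONITH-like
manual failover would want, not something I have confirmed is required.
The hostnames and the failover_plan function name are made up for
illustration.

```shell
#!/bin/sh
# Dry-run sketch of a manual failover order. Nothing is executed on the
# servers; the function only prints the commands one would run.
# Hostnames passed in are hypothetical placeholders.

failover_plan() {
    primary=$1
    secondary=$2

    # 1. Unmount the locally configured targets on the primary
    #    (command taken from the quoted mail).
    echo "ssh $primary /etc/init.d/lustre stop local"

    # 2. ASSUMPTION: also stop lnet on the primary so it looks
    #    "powered off" to the rest of the cluster. Unconfirmed.
    echo "ssh $primary /etc/init.d/lnet stop"

    # 3. Mount the primary's targets on the secondary
    #    (command taken from the quoted mail).
    echo "ssh $secondary /etc/init.d/lustre start foreign"
}

# Example invocation with made-up hostnames.
failover_plan mds-primary mds-secondary
```

If the lnet step turns out to be unnecessary or harmful, dropping line
2 leaves exactly the procedure you already use.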

