[lustre-discuss] MGS is not working in HA

Vicker, Darby (JSC-EG311) darby.vicker-1 at nasa.gov
Thu Oct 26 09:40:13 PDT 2017


You will have to recompile lustre with the patch in LU-8397.  The key for us was to look at the contents of /proc/fs/lustre/mgc/*/import.  Before the patch, failover_nids from that file was only showing one NID, despite mkfs.lustre/tunefs.lustre showing multiple service nodes configured.  See the mailing list thread and the LU for more details.
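
As a rough sketch of the check described above: the helper name show_failover_nids is hypothetical, and only the /proc path and the failover_nids field come from this message. Something like the following prints that field from each MGC import file:

```shell
# show_failover_nids [FILE...]
# Print the failover_nids line from each MGC import file.
# With no arguments, fall back to the path mentioned in this thread.
# (Hypothetical helper name; it only wraps grep.)
show_failover_nids() {
    if [ "$#" -eq 0 ]; then
        set -- /proc/fs/lustre/mgc/*/import
    fi
    # -H prefixes each match with the file it came from
    grep -H 'failover_nids' "$@"
}

# Usage on a node with Lustre mounted:
#   show_failover_nids
```

If only one NID appears there even though several service nodes were configured, that matches the behavior we saw before the patch.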

Looking back at this, our problems were related to multirail (using both IB and TCP).  Based on the mkfs.lustre commands you sent in your original email, that probably isn’t your issue.  Just for reference, this is what the mkfs.lustre command looks like for us.

     mkfs.lustre \
         --mgsnode=192.52.98.30@tcp0,10.148.0.30@o2ib0 \
         --mgsnode=192.52.98.31@tcp0,10.148.0.31@o2ib0 \
         --fsname=testfs \
         --backfstype=zfs \
         --reformat \
         --verbose \
         --mdt --index=0 \
         --servicenode=${LUSTRE_LOCAL_TCP_IP}@tcp0,${LUSTRE_LOCAL_IB_IP}@o2ib0 \
         --servicenode=${LUSTRE_PEER_TCP_IP}@tcp0,${LUSTRE_PEER_IB_IP}@o2ib0 \
         metadata/meta-test

Looking at this, you used a single --failnode on the MGS instead of multiple --servicenode options.  The admin manual indicates --servicenode is preferred.  You might try that.  I still think looking at the import file I pointed you to above would be instructive regardless.
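
A minimal sketch of what that might look like for the standalone MGS, assuming the device and NIDs from the original post. The helper mgs_mkfs_cmd is hypothetical and only prints the command rather than running it, since mkfs.lustre is destructive:

```shell
# mgs_mkfs_cmd DEVICE NID [NID...]
# Print (do not run) an mkfs.lustre command for a standalone MGS that
# names every HA node with --servicenode. Hypothetical helper.
mgs_mkfs_cmd() {
    dev=$1
    shift
    cmd="mkfs.lustre --mgs --backfstype=ldiskfs"
    for nid in "$@"; do
        cmd="$cmd --servicenode=$nid"
    done
    printf '%s %s\n' "$cmd" "$dev"
}

# Device and NIDs taken from the original post in this thread.
mgs_mkfs_cmd /dev/mapper/mpathd 192.168.0.50@o2ib 192.168.0.51@o2ib
```

Review the printed command before running it by hand on the MGS node.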



From: Ravi Konila <ravibhatk at gmail.com>
Reply-To: Ravi Konila <ravibhatk at gmail.com>
Date: Thursday, October 26, 2017 at 1:31 AM
To: Darby Vicker <darby.vicker-1 at nasa.gov>, "Mannthey, Keith" <keith.mannthey at intel.com>, Lustre Discuss <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] MGS is not working in HA

Hi

I am using Lustre 2.8 on RHEL 6.7.
As my application requires RHEL 6.7, I had to use Lustre 2.8.
Any suggestions?

Regards
Ravi Konila


From: Vicker, Darby (JSC-EG311)
Sent: Wednesday, October 25, 2017 11:51 PM
To: Mannthey, Keith ; Ravi Konila ; Lustre Discuss
Subject: Re: [lustre-discuss] MGS is not working in HA

Sorry – I also meant to say that the resolution went off the mailing list and was continued in LU-8397.  You can find the patch there.

From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Darby Vicker <darby.vicker-1 at nasa.gov>
Date: Wednesday, October 25, 2017 at 1:17 PM
To: "Mannthey, Keith" <keith.mannthey at intel.com>, Ravi Konila <ravibhatk at gmail.com>, Lustre Discuss <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] MGS is not working in HA

Which version of Lustre are you using?  We initially had problems with this too when using failover with Lustre 2.8 and 2.9.  We got a patch that fixed it, and recent versions work fine for us.  We have a combined MGS/MDS, so our scenario is a little different, but this sounds very similar to our issue.

http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2017-January/014125.html



From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of "Mannthey, Keith" <keith.mannthey at intel.com>
Date: Wednesday, October 25, 2017 at 11:30 AM
To: Ravi Konila <ravibhatk at gmail.com>, Lustre Discuss <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] MGS is not working in HA

Ravi,
  You may want to open a Jira ticket with this error.  It looks like the mount command is trying only the first NID given on the command line.

Jira is https://jira.hpdd.intel.com “LU” project.

I have seen a Lustre server's first mount behave like this, but not client mounts.  It should try the first server, time out, and then try the second server.

Thanks,
Keith

From: lustre-discuss [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Ravi Konila
Sent: Wednesday, October 25, 2017 5:07 AM
To: Lustre Discuss <lustre-discuss at lists.lustre.org>
Subject: [lustre-discuss] MGS is not working in HA

Hi
I have two servers for MGS/MDS and have configured them with Pacemaker for HA.
The command I ran on the first MGS/MDS (mds01) is

mkfs.lustre --mgs --failnode 192.168.0.51@o2ib --backfstype=ldiskfs /dev/mapper/mpathd

Next, I created the Lustre file system for the MDT:
mkfs.lustre --mdt --fsname lhome --index 0 --mgsnode 192.168.0.50@o2ib --mgsnode 192.168.0.51@o2ib --servicenode 192.168.0.50@o2ib --servicenode 192.168.0.51@o2ib --backfstype=ldiskfs /dev/mapper/mpathb

Now, on my client, if I run
mount -t lustre 192.168.0.50@o2ib:192.168.0.51@o2ib:/lhome /home
it does not work and asks whether the MGS is running.
But if I run
mount -t lustre 192.168.0.50@o2ib:/lhome /home
it works fine.

Also, when my first MDS (mds01) is down, my client does not mount Lustre from the second MGS.
It again asks me to check whether the MGS is running.

Any help will be highly appreciated.

Regards
Ravi Konila