[Lustre-discuss] MDT Failover not functioning properly with Lustre FS
Chadha, Narjit
Narjit.Chadha at necam.com
Thu Feb 21 13:14:11 PST 2008
Thanks,
From what I found out, a single dummy address must be used in
/etc/ha.d/haresources. That got rid of the IP conflict. I seem to be
able to fail over the MDS now. The only thing left is to be able to mount
the failover MDS configuration on the OSS. The syntax:
mkfs.lustre --ost -fsname=mylustre -mgsnid=lustre0[1-2] /dev/sdb1
gives an error with parsing node names, yet this is what the Lustre
manual indicates to do. A ',' separation of node names also does not
work and will yield this type of error upon mounting:
mount.lustre: mount /dev/sdb1 at /mnt/lustrefs failed: Input/output error
Is the MGS running?
I have seen a number of people having the same problem, but have not
seen a resolution posted yet.
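For what it's worth, one way to name two MGS nodes is to repeat the option once per NID instead of using a bracket range or a comma list. The sketch below only builds and prints such a command line; it does not run mkfs.lustre. It assumes the option is spelled --mgsnode and that both NIDs are on the tcp network, and build_ost_mkfs_cmd is a hypothetical helper name. Please check the result against the mkfs.lustre man page for your release before formatting anything.

```shell
# Print (do not run) an mkfs.lustre command line for an OST that lists
# every MGS host as its own --mgsnode option.
# Assumptions: the option name --mgsnode and the @tcp network suffix.
# Usage: build_ost_mkfs_cmd FSNAME DEVICE MGSHOST [MGSHOST...]
build_ost_mkfs_cmd() {
    fsname=$1
    device=$2
    shift 2
    cmd="mkfs.lustre --ost --fsname=$fsname"
    for host in "$@"; do
        cmd="$cmd --mgsnode=${host}@tcp"
    done
    printf '%s %s\n' "$cmd" "$device"
}

build_ost_mkfs_cmd mylustre /dev/sdb1 lustre01 lustre02
```

Printing the command first makes it possible to review the exact option spelling before anything touches /dev/sdb1.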
Regards,
N.
________________________________
From: Aaron Knister [mailto:aaron at iges.org]
Sent: Wednesday, February 20, 2008 6:52 PM
To: Chadha, Narjit
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] MDT Failover not functioning properly with
Lustre FS
I've never used heartbeat before but just from reading what you wrote I
see a couple things that could be wrong. The first is in
/etc/ha.d/haresources (pasted lines below)
lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
lustre02 192.168.100.2 Filesystem::/dev/sdc::/lustremds::lustre
If I understand this correctly, you're telling it to look for the
filesystem on /dev/sdc...but your mkfs command created the filesystem on
/dev/sdc1. I don't think that's the main problem here, however. I
believe the problem is that you have 192.168.100.1 assigned to eth1
already. It's trying to re-assign it to eth1:0 which will cause it to
fail. Try giving another IP to eth1.
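A quick way to confirm that kind of conflict is to check whether the address named in haresources is already configured on the node before heartbeat tries to add it as an alias. This is only a sketch: it assumes the iproute2 `ip` command is available, ip_already_assigned is a hypothetical helper name, and the sample line stands in for real `ip -o -4 addr show` output.

```shell
# Succeed (exit 0) if the IPv4 address given as $1 appears in
# `ip -o -4 addr show`-style text read from stdin.
ip_already_assigned() {
    awk -v ip="$1" '
        { split($4, a, "/"); if (a[1] == ip) found = 1 }
        END { exit !found }
    '
}

# Sample line in the format printed by `ip -o -4 addr show` (illustrative);
# on a live node, pipe the real command output in instead.
sample='2: eth1    inet 192.168.100.1/24 brd 192.168.100.255 scope global eth1'

if printf '%s\n' "$sample" | ip_already_assigned 192.168.100.1; then
    echo "192.168.100.1 is already configured; pick a different address for haresources"
fi
```

On the real machine the call would look like `ip -o -4 addr show | ip_already_assigned 192.168.100.1`.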
On Feb 20, 2008, at 5:53 PM, Chadha, Narjit wrote:
In short, I am working to failover the MDT to another node. I have
activated heartbeat and it appears to be running properly. However, even
if other resources failover, the Lustre filesystem does not appear to.
The mount point on lustre01 (head mdt), does not transfer to lustre02
(slave mdt) given a failure or a simple '/usr/lib/heartbeat/hb_takeover
foreign' from the backup mdt server.
I am working with 2 nodes, both of which can see the same device,
/dev/sdc1. I ensured that the device could be mounted by either server.
The storage is Fibre Channel, if anybody is curious. Heartbeat was
configured and set up as below.
/etc/ha.d/authkeys was set up (simple and the same on both servers). In
/usr/lib/ocf/resource.d/heartbeat/Filesystem, I included the lustre
filesystem as follows:
if [ $blockdevice = "yes" ]; then
    if [ "$DEVICE" != "/dev/null" -a ! -b "$DEVICE" ] ; then
        ocf_log err "Couldn't find device [$DEVICE]. Expected /dev/??? to exist"
        exit $OCF_ERR_ARGS
    fi
    if
        case $FSTYPE in
            ext3|reiserfs|reiser4|lustre|nss|xfs|jfs|vfat|fat|nfs|cifs|smbfs|ocfs2)
                false;;
            *)
                true;;
        esac
    then
        ocf_log info "Starting filesystem check on $DEVICE"
        if [ -z "$FSTYPE" ]; then
            $FSCK -a $DEVICE
---etc
(this was the same on both servers)
Nothing was changed in /etc/ha.d/resource.d/Filesystem, as
/usr/lib/ocf/resource.d/heartbeat/Filesystem was used instead.
/etc/ha.d/haresources contains the names and filesystems of the two
servers. lustre01 is the primary mds server and lustre02 is the backup
(same on both servers)
lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
lustre02 192.168.100.2 Filesystem::/dev/sdc::/lustremds::lustre
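To make each entry easier to read, a line like the ones above can be split into its parts. This is a minimal sketch that assumes every line has exactly the form "node ip Filesystem::device::mountpoint::fstype"; parse_haresources_line is a hypothetical helper name, not part of heartbeat. Having the device field on its own also makes it easy to compare against the device that was actually formatted (/dev/sdc here versus /dev/sdc1).

```shell
# Split one haresources line of the form
#   node ip Filesystem::device::mountpoint::fstype
# into labelled fields printed on a single line.
parse_haresources_line() {
    printf '%s\n' "$1" | awk '{
        split($3, f, "::")
        printf "node=%s ip=%s device=%s mountpoint=%s fstype=%s\n", $1, $2, f[2], f[3], f[4]
    }'
}

parse_haresources_line 'lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre'
```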
The /etc/ha.d/ha.cf file on both servers is:
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 15
initdead 60
udpport 694
bcast eth1
auto_failback off
node lustre01
node lustre02
I have tried various orderings of starting heartbeat, but generally, I
first format the lustre01 node using 'mkfs.lustre --mdt --mgs --fsname
mylustre --failnode=lustre02@tcp --reformat /dev/sdc1'. This works
fine. Following this step, I mount the primary node (as shown on p.76 of
the Lustre 1.6 manual), 'mount -t lustre /dev/sdc1 /lustremds'.
/lustremds exists on both nodes. After this the 'service heartbeat
start' command is issued on both nodes. The results are as follows:
Lustre01 (primary mdt)
heartbeat[3727]: 2008/02/20_16:30:34 info: **************************
heartbeat[3727]: 2008/02/20_16:30:34 info: Configuration validated.
Starting heartbeat 2.1.2
heartbeat[3728]: 2008/02/20_16:30:34 info: heartbeat: version 2.1.2
heartbeat[3728]: 2008/02/20_16:30:34 info: Heartbeat generation:
1200690464
heartbeat[3728]: 2008/02/20_16:30:34 info: G_main_add_TriggerHandler:
Added signal manual handler
heartbeat[3728]: 2008/02/20_16:30:34 info: G_main_add_TriggerHandler:
Added signal manual handler
heartbeat[3728]: 2008/02/20_16:30:34 info: Removing
/var/run/heartbeat/rsctmp failed, recreating.
heartbeat[3728]: 2008/02/20_16:30:34 info: glib: UDP Broadcast heartbeat
started on port 694 (694) interface eth1
heartbeat[3728]: 2008/02/20_16:30:34 info: glib: UDP Broadcast heartbeat
closed on port 694 interface eth1 - Status: 1
heartbeat[3728]: 2008/02/20_16:30:34 info: G_main_add_SignalHandler:
Added signal handler for signal 17
heartbeat[3728]: 2008/02/20_16:30:34 info: Local status now set to: 'up'
heartbeat[3728]: 2008/02/20_16:30:35 info: Link lustre01:eth1 up.
heartbeat[3728]: 2008/02/20_16:30:40 info: Link lustre02:eth1 up.
heartbeat[3728]: 2008/02/20_16:30:40 info: Status update for node
lustre02: status up
harc[3735]: 2008/02/20_16:30:40 info: Running /etc/ha.d/rc.d/status
status
heartbeat[3728]: 2008/02/20_16:30:41 info: Comm_now_up(): updating
status to active
heartbeat[3728]: 2008/02/20_16:30:41 info: Local status now set to:
'active'
heartbeat[3728]: 2008/02/20_16:30:41 WARN: G_CH_dispatch_int: Dispatch
function for read child took too long to execute: 210 ms (> 50 ms)
(GSource: 0x8432df8)
heartbeat[3728]: 2008/02/20_16:30:41 info: Status update for node
lustre02: status active
harc[3752]: 2008/02/20_16:30:41 info: Running /etc/ha.d/rc.d/status
status
heartbeat[3728]: 2008/02/20_16:30:52 info: remote resource transition
completed.
heartbeat[3728]: 2008/02/20_16:30:52 info: remote resource transition
completed.
heartbeat[3728]: 2008/02/20_16:30:52 info: Initial resource acquisition
complete (T_RESOURCES(us))
IPaddr[3804]: 2008/02/20_16:30:52 INFO: Resource is stopped
heartbeat[3768]: 2008/02/20_16:30:52 info: Local Resource acquisition
completed.
harc[3843]: 2008/02/20_16:30:52 info: Running
/etc/ha.d/rc.d/ip-request-resp ip-request-resp
ip-request-resp[3843]: 2008/02/20_16:30:52 received ip-request-resp
192.168.100.1 OK yes
ResourceManager[3864]: 2008/02/20_16:30:52 info: Acquiring resource
group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
IPaddr[3891]: 2008/02/20_16:30:52 INFO: Resource is stopped
ResourceManager[3864]: 2008/02/20_16:30:52 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.100.1 start
IPaddr[3967]: 2008/02/20_16:30:52 INFO: Using calculated nic for
192.168.100.1: eth1
IPaddr[3967]: 2008/02/20_16:30:52 INFO: Using calculated netmask for
192.168.100.1: 255.255.255.0
IPaddr[3967]: 2008/02/20_16:30:52 INFO: eval ifconfig eth1:0
192.168.100.1 netmask 255.255.255.0 broadcast 192.168.100.255
IPaddr[3967]: 2008/02/20_16:30:52 ERROR: Could not add 192.168.100.1
to eth1: 255
IPaddr[3950]: 2008/02/20_16:30:52 ERROR: Unknown error: 255
ResourceManager[3864]: 2008/02/20_16:30:52 ERROR: Return code 1 from
/etc/ha.d/resource.d/IPaddr
ResourceManager[3864]: 2008/02/20_16:30:52 CRIT: Giving up resources
due to failure of 192.168.100.1
ResourceManager[3864]: 2008/02/20_16:30:52 info: Releasing resource
group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
ResourceManager[3864]: 2008/02/20_16:30:52 info: Running
/etc/ha.d/resource.d/Filesystem /dev/sdc /lustremds lustre stop
Filesystem[4103]: 2008/02/20_16:30:52 INFO: Running stop for
/dev/sdc on /lustremds
Filesystem[4092]: 2008/02/20_16:30:52 INFO: Success
ResourceManager[3864]: 2008/02/20_16:30:52 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.100.1 stop
IPaddr[4176]: 2008/02/20_16:30:52 INFO: Success
heartbeat[3728]: 2008/02/20_16:31:22 info: lustre02 wants to go standby
[foreign]
hb_standby[4227]: 2008/02/20_16:31:23 Going standby [foreign].
heartbeat[3728]: 2008/02/20_16:31:23 WARN: Standby in progress- new
request from lustre01 ignored [3600 seconds left]
heartbeat[3728]: 2008/02/20_16:31:23 info: standby: acquire [foreign]
resources from lustre02
heartbeat[4241]: 2008/02/20_16:31:23 info: acquire local HA resources
(standby).
ResourceManager[4254]: 2008/02/20_16:31:23 info: Acquiring resource
group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
IPaddr[4281]: 2008/02/20_16:31:23 INFO: Resource is stopped
ResourceManager[4254]: 2008/02/20_16:31:23 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.100.1 start
IPaddr[4357]: 2008/02/20_16:31:23 INFO: Using calculated nic for
192.168.100.1: eth1
IPaddr[4357]: 2008/02/20_16:31:23 INFO: Using calculated netmask for
192.168.100.1: 255.255.255.0
IPaddr[4357]: 2008/02/20_16:31:23 INFO: eval ifconfig eth1:1
192.168.100.1 netmask 255.255.255.0 broadcast 192.168.100.255
IPaddr[4357]: 2008/02/20_16:31:23 ERROR: Could not add 192.168.100.1
to eth1: 255
IPaddr[4340]: 2008/02/20_16:31:23 ERROR: Unknown error: 255
ResourceManager[4254]: 2008/02/20_16:31:24 ERROR: Return code 1 from
/etc/ha.d/resource.d/IPaddr
ResourceManager[4254]: 2008/02/20_16:31:24 CRIT: Giving up resources
due to failure of 192.168.100.1
ResourceManager[4254]: 2008/02/20_16:31:24 info: Releasing resource
group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
ResourceManager[4254]: 2008/02/20_16:31:24 info: Running
/etc/ha.d/resource.d/Filesystem /dev/sdc /lustremds lustre stop
Filesystem[4491]: 2008/02/20_16:31:24 INFO: Running stop for
/dev/sdc on /lustremds
Filesystem[4480]: 2008/02/20_16:31:24 INFO: Success
ResourceManager[4254]: 2008/02/20_16:31:24 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.100.1 stop
IPaddr[4564]: 2008/02/20_16:31:24 INFO: Success
heartbeat[4241]: 2008/02/20_16:31:24 info: local HA resource acquisition
completed (standby).
heartbeat[3728]: 2008/02/20_16:31:24 info: Standby resource acquisition
done [foreign].
heartbeat[3728]: 2008/02/20_16:31:24 info: remote resource transition
completed.
hb_standby[4621]: 2008/02/20_16:31:54 Going standby [foreign].
heartbeat[3728]: 2008/02/20_16:31:54 info: lustre01 wants to go standby
[foreign]
heartbeat[3728]: 2008/02/20_16:31:54 info: standby: lustre02 can take
our foreign resources
heartbeat[4635]: 2008/02/20_16:31:54 info: give up foreign HA resources
(standby).
ResourceManager[4648]: 2008/02/20_16:31:55 info: Releasing resource
group: lustre02 192.168.100.2 Filesystem::/dev/sdc::/lustremds::lustre
ResourceManager[4648]: 2008/02/20_16:31:55 info: Running
/etc/ha.d/resource.d/Filesystem /dev/sdc /lustremds lustre stop
Filesystem[4697]: 2008/02/20_16:31:55 INFO: Running stop for
/dev/sdc on /lustremds
Filesystem[4686]: 2008/02/20_16:31:55 INFO: Success
ResourceManager[4648]: 2008/02/20_16:31:55 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.100.2 stop
IPaddr[4770]: 2008/02/20_16:31:55 INFO: Success
heartbeat[4635]: 2008/02/20_16:31:55 info: foreign HA resource release
completed (standby).
heartbeat[3728]: 2008/02/20_16:31:55 info: Local standby process
completed [foreign].
heartbeat[3728]: 2008/02/20_16:31:56 WARN: 1 lost packet(s) for
[lustre02] [58:60]
heartbeat[3728]: 2008/02/20_16:31:56 info: remote resource transition
completed.
heartbeat[3728]: 2008/02/20_16:31:56 info: No pkts missing from
lustre02!
heartbeat[3728]: 2008/02/20_16:31:56 info: Other node completed standby
takeover of foreign resources.
heartbeat[3728]: 2008/02/20_16:32:26 info: lustre02 wants to go standby
[foreign]
heartbeat[3728]: 2008/02/20_16:32:27 info: standby: acquire [foreign]
resources from lustre02
heartbeat[4811]: 2008/02/20_16:32:27 info: acquire local HA resources
(standby).
ResourceManager[4824]: 2008/02/20_16:32:27 info: Acquiring resource
group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
IPaddr[4851]: 2008/02/20_16:32:27 INFO: Resource is stopped
ResourceManager[4824]: 2008/02/20_16:32:27 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.100.1 start
IPaddr[4927]: 2008/02/20_16:32:27 INFO: Using calculated nic for
192.168.100.1: eth1
IPaddr[4927]: 2008/02/20_16:32:27 INFO: Using calculated netmask for
192.168.100.1: 255.255.255.0
IPaddr[4927]: 2008/02/20_16:32:27 INFO: eval ifconfig eth1:2
192.168.100.1 netmask 255.255.255.0 broadcast 192.168.100.255
IPaddr[4927]: 2008/02/20_16:32:27 ERROR: Could not add 192.168.100.1
to eth1: 255
IPaddr[4910]: 2008/02/20_16:32:27 ERROR: Unknown error: 255
ResourceManager[4824]: 2008/02/20_16:32:27 ERROR: Return code 1 from
/etc/ha.d/resource.d/IPaddr
ResourceManager[4824]: 2008/02/20_16:32:27 CRIT: Giving up resources
due to failure of 192.168.100.1
ResourceManager[4824]: 2008/02/20_16:32:27 info: Releasing resource
group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
ResourceManager[4824]: 2008/02/20_16:32:27 info: Running
/etc/ha.d/resource.d/Filesystem /dev/sdc /lustremds lustre stop
Filesystem[5061]: 2008/02/20_16:32:27 INFO: Running stop for
/dev/sdc on /lustremds
Filesystem[5050]: 2008/02/20_16:32:27 INFO: Success
ResourceManager[4824]: 2008/02/20_16:32:27 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.100.1 stop
IPaddr[5134]: 2008/02/20_16:32:27 INFO: Success
heartbeat[4811]: 2008/02/20_16:32:27 info: local HA resource acquisition
completed (standby).
heartbeat[3728]: 2008/02/20_16:32:27 info: Standby resource acquisition
done [foreign].
heartbeat[3728]: 2008/02/20_16:32:28 info: remote resource transition
completed.
Lustre02 (secondary mdt)
heartbeat[4833]: 2008/02/20_16:39:24 info: lustre01 wants to go standby
[foreign]
heartbeat[4833]: 2008/02/20_16:39:25 info: standby: acquire [foreign]
resources from lustre01
heartbeat[6658]: 2008/02/20_16:39:25 info: acquire local HA resources
(standby).
ResourceManager[6671]: 2008/02/20_16:39:25 info: Acquiring resource
group: lustre02 192.168.100.2 Filesystem::/dev/sdc::/lustremds::lustre
However, I cannot see Lustre mounted on either device. Does anybody know
what is the issue here? This statement concerns me:
IPaddr[4357]: 2008/02/20_16:31:23 ERROR: Could not add 192.168.100.1
to eth1: 255
IPaddr[4340]: 2008/02/20_16:31:23 ERROR: Unknown error: 255
BTW, 192.168.100.1 is the eth1 address on lustre01 (main mdt)
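For reference, this is roughly how the mount state can be checked on each node after a takeover attempt. It is a sketch only: lustre_mounted_at is a hypothetical helper name, the sample line is made up to show the /proc/mounts layout, and it assumes the target shows up there with filesystem type "lustre".

```shell
# Succeed if /proc/mounts-style text on stdin lists a filesystem of type
# "lustre" mounted at the mount point given as $1.
lustre_mounted_at() {
    awk -v mp="$1" '
        $2 == mp && $3 == "lustre" { found = 1 }
        END { exit !found }
    '
}

# Sample /proc/mounts line (hypothetical). On a live node use:
#   lustre_mounted_at /lustremds < /proc/mounts
sample='/dev/sdc1 /lustremds lustre rw 0 0'

if printf '%s\n' "$sample" | lustre_mounted_at /lustremds; then
    echo "lustre target is mounted at /lustremds"
else
    echo "no lustre mount at /lustremds"
fi
```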
Thanks
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies
(301) 595-7000
aaron at iges.org