[Lustre-discuss] MDT Failover not functioning properly with Lustre FS

Wed Feb 20 16:52:01 PST 2008

I've never used heartbeat before, but just from reading what you wrote
I see a couple of things that could be wrong. The first is in
/etc/ha.d/haresources (pasted lines below):

> lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
> lustre02 192.168.100.2 Filesystem::/dev/sdc::/lustremds::lustre

If I understand this correctly, you're telling it to look for the
filesystem on /dev/sdc, but your mkfs command created the filesystem
on /dev/sdc1. I don't think that's the main problem here, however. I
believe the problem is that you already have 192.168.100.1 assigned to
eth1. Heartbeat is trying to assign it again as an alias (eth1:0), which
fails, and your log shows it then giving up the whole resource group,
Filesystem included. Try giving another IP to eth1.
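
For what it's worth, here is a rough shell sketch of how both points
could be checked before starting heartbeat. The function names are made
up for illustration (they are not part of heartbeat), and it assumes the
iproute2 `ip` tool and awk are available on the nodes.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical, not heartbeat commands.

# Succeeds if the IPv4 address in $1 appears in `ip -o -4 addr show`
# style output read from stdin, i.e. it is already configured locally.
ip_already_assigned() {
    awk -v ip="$1" '
        {
            for (i = 1; i < NF; i++)
                if ($i == "inet") {
                    split($(i + 1), a, "/")   # strip the /prefix length
                    if (a[1] == ip) found = 1
                }
        }
        END { exit (found ? 0 : 1) }'
}

# Prints the device named in each Filesystem:: entry of a haresources
# file read from stdin, so it can be compared with the mkfs target.
haresources_devices() {
    awk '
        /^[ \t]*#/ { next }                   # skip comment lines
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^Filesystem::/) {
                    split($i, f, "::")
                    print f[2]
                }
        }'
}

# Example use on a node (not run here):
#   ip -o -4 addr show | ip_already_assigned 192.168.100.1 \
#       && echo "192.168.100.1 is already on a local interface"
#   haresources_devices < /etc/ha.d/haresources
```

If the first check succeeds for the address listed in haresources, or
the second prints /dev/sdc while the filesystem lives on /dev/sdc1, that
would match the two problems described above.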

On Feb 20, 2008, at 5:53 PM, Chadha, Narjit wrote:

> In short, I am working to fail over the MDT to another node. I have
> activated heartbeat and it appears to be running properly. However,
> even if other resources fail over, the Lustre filesystem does not
> appear to. The mount point on lustre01 (head mdt) does not transfer
> to lustre02 (slave mdt) given a failure or a simple
> ‘/usr/lib/heartbeat/hb_takeover foreign’ from the backup mdt server.
>
> I am working with 2 nodes, both of which can see the same device,
> /dev/sdc1. I ensured that the device could be mounted by either
> server. The storage is Fibre Channel, if anybody is curious.
> Heartbeat was configured and set up as below.
>
> /etc/ha.d/authkeys was set up (simple and the same on both servers).  
> In /usr/lib/ocf/resource.d/heartbeat/Filesystem, I included the  
> lustre filesystem as follows:
>
> if [ $blockdevice = "yes" ]; then
>                 if [ "$DEVICE" != "/dev/null" -a ! -b "$DEVICE" ] ; then
>                         ocf_log err "Couldn't find device [$DEVICE]. Expected /dev/??? to exist"
>                         exit $OCF_ERR_ARGS
>                 fi
>
>                 if
>                   case $FSTYPE in
>                     ext3|reiserfs|reiser4|lustre|nss|xfs|jfs|vfat|fat|nfs|cifs|smbfs|ocfs2)     false;;
>                     *)     true;;
>                   esac
>                 then
>                         ocf_log info  "Starting filesystem check on $DEVICE"
>                         if [ -z "$FSTYPE" ]; then
>                                 $FSCK -a $DEVICE
> ---etc
> (this was the same on both servers)
>
> Nothing was changed in /etc/ha.d/resource.d/Filesystem, as
> /usr/lib/ocf/resource.d/heartbeat/Filesystem was used instead.
>
>
> /etc/ha.d/haresources contains the names and filesystems of the two
> servers. lustre01 is the primary mds server and lustre02 is the backup
> (same on both servers):
>
> lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
> lustre02 192.168.100.2 Filesystem::/dev/sdc::/lustremds::lustre
>
>
>
> The /etc/ha.d/ha.cf file on both servers is:
>
> debugfile       /var/log/ha-debug
> logfile         /var/log/ha-log
> logfacility     local0
> keepalive       2
> deadtime        15
> initdead        60
> udpport         694
> bcast           eth1
> auto_failback   off
> node lustre01
> node lustre02
>
> I have tried various orderings of starting heartbeat, but generally
> I first format the lustre01 node using
> ‘mkfs.lustre --mdt --mgs --fsname mylustre --failnode=lustre02@tcp --reformat /dev/sdc1’.
> This works fine. Following this step, I mount the primary node (as shown
> on p.76 of the Lustre 1.6 manual) with ‘mount -t lustre /dev/sdc1 /lustremds’.
> /lustremds exists on both nodes. After this, the ‘service heartbeat start’
> command is issued on both nodes. The results are as follows:
>
> Lustre01 (primary mdt)
> heartbeat[3727]: 2008/02/20_16:30:34 info: **************************
> heartbeat[3727]: 2008/02/20_16:30:34 info: Configuration validated.  
> Starting heartbeat 2.1.2
> heartbeat[3728]: 2008/02/20_16:30:34 info: heartbeat: version 2.1.2
> heartbeat[3728]: 2008/02/20_16:30:34 info: Heartbeat generation:  
> 1200690464
> heartbeat[3728]: 2008/02/20_16:30:34 info:  
> G_main_add_TriggerHandler: Added signal manual handler
> heartbeat[3728]: 2008/02/20_16:30:34 info:  
> G_main_add_TriggerHandler: Added signal manual handler
> heartbeat[3728]: 2008/02/20_16:30:34 info: Removing /var/run/ 
> heartbeat/rsctmp failed, recreating.
> heartbeat[3728]: 2008/02/20_16:30:34 info: glib: UDP Broadcast  
> heartbeat started on port 694 (694) interface eth1
> heartbeat[3728]: 2008/02/20_16:30:34 info: glib: UDP Broadcast  
> heartbeat closed on port 694 interface eth1 - Status: 1
> heartbeat[3728]: 2008/02/20_16:30:34 info: G_main_add_SignalHandler:  
> Added signal handler for signal 17
> heartbeat[3728]: 2008/02/20_16:30:34 info: Local status now set to:  
> 'up'
> heartbeat[3728]: 2008/02/20_16:30:35 info: Link lustre01:eth1 up.
> heartbeat[3728]: 2008/02/20_16:30:40 info: Link lustre02:eth1 up.
> heartbeat[3728]: 2008/02/20_16:30:40 info: Status update for node  
> lustre02: status up
> harc[3735]:     2008/02/20_16:30:40 info: Running /etc/ha.d/rc.d/ 
> status status
> heartbeat[3728]: 2008/02/20_16:30:41 info: Comm_now_up(): updating  
> status to active
> heartbeat[3728]: 2008/02/20_16:30:41 info: Local status now set to:  
> 'active'
> heartbeat[3728]: 2008/02/20_16:30:41 WARN: G_CH_dispatch_int:  
> Dispatch function for read child took too long to execute: 210 ms (>  
> 50 ms) (GSource: 0x8432df8)
> heartbeat[3728]: 2008/02/20_16:30:41 info: Status update for node  
> lustre02: status active
> harc[3752]:     2008/02/20_16:30:41 info: Running /etc/ha.d/rc.d/ 
> status status
> heartbeat[3728]: 2008/02/20_16:30:52 info: remote resource  
> transition completed.
> heartbeat[3728]: 2008/02/20_16:30:52 info: remote resource  
> transition completed.
> heartbeat[3728]: 2008/02/20_16:30:52 info: Initial resource  
> acquisition complete (T_RESOURCES(us))
> IPaddr[3804]:   2008/02/20_16:30:52 INFO:  Resource is stopped
> heartbeat[3768]: 2008/02/20_16:30:52 info: Local Resource  
> acquisition completed.
> harc[3843]:     2008/02/20_16:30:52 info: Running /etc/ha.d/rc.d/ip- 
> request-resp ip-request-resp
> ip-request-resp[3843]:  2008/02/20_16:30:52 received ip-request-resp  
> 192.168.100.1 OK yes
> ResourceManager[3864]:  2008/02/20_16:30:52 info: Acquiring resource  
> group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
> IPaddr[3891]:   2008/02/20_16:30:52 INFO:  Resource is stopped
> ResourceManager[3864]:  2008/02/20_16:30:52 info: Running /etc/ha.d/ 
> resource.d/IPaddr 192.168.100.1 start
> IPaddr[3967]:   2008/02/20_16:30:52 INFO: Using calculated nic for  
> 192.168.100.1: eth1
> IPaddr[3967]:   2008/02/20_16:30:52 INFO: Using calculated netmask  
> for 192.168.100.1: 255.255.255.0
> IPaddr[3967]:   2008/02/20_16:30:52 INFO: eval ifconfig eth1:0  
> 192.168.100.1 netmask 255.255.255.0 broadcast 192.168.100.255
> IPaddr[3967]:   2008/02/20_16:30:52 ERROR: Could not add  
> 192.168.100.1 to eth1: 255
> IPaddr[3950]:   2008/02/20_16:30:52 ERROR:  Unknown error: 255
> ResourceManager[3864]:  2008/02/20_16:30:52 ERROR: Return code 1  
> from /etc/ha.d/resource.d/IPaddr
> ResourceManager[3864]:  2008/02/20_16:30:52 CRIT: Giving up  
> resources due to failure of 192.168.100.1
> ResourceManager[3864]:  2008/02/20_16:30:52 info: Releasing resource  
> group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
> ResourceManager[3864]:  2008/02/20_16:30:52 info: Running /etc/ha.d/ 
> resource.d/Filesystem /dev/sdc /lustremds lustre stop
> Filesystem[4103]:       2008/02/20_16:30:52 INFO: Running stop for / 
> dev/sdc on /lustremds
> Filesystem[4092]:       2008/02/20_16:30:52 INFO:  Success
> ResourceManager[3864]:  2008/02/20_16:30:52 info: Running /etc/ha.d/ 
> resource.d/IPaddr 192.168.100.1 stop
> IPaddr[4176]:   2008/02/20_16:30:52 INFO:  Success
> heartbeat[3728]: 2008/02/20_16:31:22 info: lustre02 wants to go  
> standby [foreign]
> hb_standby[4227]:       2008/02/20_16:31:23 Going standby [foreign].
> heartbeat[3728]: 2008/02/20_16:31:23 WARN: Standby in progress- new  
> request from lustre01 ignored [3600 seconds left]
> heartbeat[3728]: 2008/02/20_16:31:23 info: standby: acquire  
> [foreign] resources from lustre02
> heartbeat[4241]: 2008/02/20_16:31:23 info: acquire local HA  
> resources (standby).
> ResourceManager[4254]:  2008/02/20_16:31:23 info: Acquiring resource  
> group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
> IPaddr[4281]:   2008/02/20_16:31:23 INFO:  Resource is stopped
> ResourceManager[4254]:  2008/02/20_16:31:23 info: Running /etc/ha.d/ 
> resource.d/IPaddr 192.168.100.1 start
> IPaddr[4357]:   2008/02/20_16:31:23 INFO: Using calculated nic for  
> 192.168.100.1: eth1
> IPaddr[4357]:   2008/02/20_16:31:23 INFO: Using calculated netmask  
> for 192.168.100.1: 255.255.255.0
> IPaddr[4357]:   2008/02/20_16:31:23 INFO: eval ifconfig eth1:1  
> 192.168.100.1 netmask 255.255.255.0 broadcast 192.168.100.255
> IPaddr[4357]:   2008/02/20_16:31:23 ERROR: Could not add  
> 192.168.100.1 to eth1: 255
> IPaddr[4340]:   2008/02/20_16:31:23 ERROR:  Unknown error: 255
> ResourceManager[4254]:  2008/02/20_16:31:24 ERROR: Return code 1  
> from /etc/ha.d/resource.d/IPaddr
> ResourceManager[4254]:  2008/02/20_16:31:24 CRIT: Giving up  
> resources due to failure of 192.168.100.1
> ResourceManager[4254]:  2008/02/20_16:31:24 info: Releasing resource  
> group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
> ResourceManager[4254]:  2008/02/20_16:31:24 info: Running /etc/ha.d/ 
> resource.d/Filesystem /dev/sdc /lustremds lustre stop
> Filesystem[4491]:       2008/02/20_16:31:24 INFO: Running stop for / 
> dev/sdc on /lustremds
> Filesystem[4480]:       2008/02/20_16:31:24 INFO:  Success
> ResourceManager[4254]:  2008/02/20_16:31:24 info: Running /etc/ha.d/ 
> resource.d/IPaddr 192.168.100.1 stop
> IPaddr[4564]:   2008/02/20_16:31:24 INFO:  Success
> heartbeat[4241]: 2008/02/20_16:31:24 info: local HA resource  
> acquisition completed (standby).
> heartbeat[3728]: 2008/02/20_16:31:24 info: Standby resource  
> acquisition done [foreign].
> heartbeat[3728]: 2008/02/20_16:31:24 info: remote resource  
> transition completed.
> hb_standby[4621]:       2008/02/20_16:31:54 Going standby [foreign].
> heartbeat[3728]: 2008/02/20_16:31:54 info: lustre01 wants to go  
> standby [foreign]
> heartbeat[3728]: 2008/02/20_16:31:54 info: standby: lustre02 can  
> take our foreign resources
> heartbeat[4635]: 2008/02/20_16:31:54 info: give up foreign HA  
> resources (standby).
> ResourceManager[4648]:  2008/02/20_16:31:55 info: Releasing resource  
> group: lustre02 192.168.100.2 Filesystem::/dev/sdc::/lustremds::lustre
> ResourceManager[4648]:  2008/02/20_16:31:55 info: Running /etc/ha.d/ 
> resource.d/Filesystem /dev/sdc /lustremds lustre stop
> Filesystem[4697]:       2008/02/20_16:31:55 INFO: Running stop for / 
> dev/sdc on /lustremds
> Filesystem[4686]:       2008/02/20_16:31:55 INFO:  Success
> ResourceManager[4648]:  2008/02/20_16:31:55 info: Running /etc/ha.d/ 
> resource.d/IPaddr 192.168.100.2 stop
> IPaddr[4770]:   2008/02/20_16:31:55 INFO:  Success
> heartbeat[4635]: 2008/02/20_16:31:55 info: foreign HA resource  
> release completed (standby).
> heartbeat[3728]: 2008/02/20_16:31:55 info: Local standby process  
> completed [foreign].
> heartbeat[3728]: 2008/02/20_16:31:56 WARN: 1 lost packet(s) for  
> [lustre02] [58:60]
> heartbeat[3728]: 2008/02/20_16:31:56 info: remote resource  
> transition completed.
> heartbeat[3728]: 2008/02/20_16:31:56 info: No pkts missing from  
> lustre02!
> heartbeat[3728]: 2008/02/20_16:31:56 info: Other node completed  
> standby takeover of foreign resources.
> heartbeat[3728]: 2008/02/20_16:32:26 info: lustre02 wants to go  
> standby [foreign]
> heartbeat[3728]: 2008/02/20_16:32:27 info: standby: acquire  
> [foreign] resources from lustre02
> heartbeat[4811]: 2008/02/20_16:32:27 info: acquire local HA  
> resources (standby).
> ResourceManager[4824]:  2008/02/20_16:32:27 info: Acquiring resource  
> group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
> IPaddr[4851]:   2008/02/20_16:32:27 INFO:  Resource is stopped
> ResourceManager[4824]:  2008/02/20_16:32:27 info: Running /etc/ha.d/ 
> resource.d/IPaddr 192.168.100.1 start
> IPaddr[4927]:   2008/02/20_16:32:27 INFO: Using calculated nic for  
> 192.168.100.1: eth1
> IPaddr[4927]:   2008/02/20_16:32:27 INFO: Using calculated netmask  
> for 192.168.100.1: 255.255.255.0
> IPaddr[4927]:   2008/02/20_16:32:27 INFO: eval ifconfig eth1:2  
> 192.168.100.1 netmask 255.255.255.0 broadcast 192.168.100.255
> IPaddr[4927]:   2008/02/20_16:32:27 ERROR: Could not add  
> 192.168.100.1 to eth1: 255
> IPaddr[4910]:   2008/02/20_16:32:27 ERROR:  Unknown error: 255
> ResourceManager[4824]:  2008/02/20_16:32:27 ERROR: Return code 1  
> from /etc/ha.d/resource.d/IPaddr
> ResourceManager[4824]:  2008/02/20_16:32:27 CRIT: Giving up  
> resources due to failure of 192.168.100.1
> ResourceManager[4824]:  2008/02/20_16:32:27 info: Releasing resource  
> group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
> ResourceManager[4824]:  2008/02/20_16:32:27 info: Running /etc/ha.d/ 
> resource.d/Filesystem /dev/sdc /lustremds lustre stop
> Filesystem[5061]:       2008/02/20_16:32:27 INFO: Running stop for / 
> dev/sdc on /lustremds
> Filesystem[5050]:       2008/02/20_16:32:27 INFO:  Success
> ResourceManager[4824]:  2008/02/20_16:32:27 info: Running /etc/ha.d/ 
> resource.d/IPaddr 192.168.100.1 stop
> IPaddr[5134]:   2008/02/20_16:32:27 INFO:  Success
> heartbeat[4811]: 2008/02/20_16:32:27 info: local HA resource  
> acquisition completed (standby).
> heartbeat[3728]: 2008/02/20_16:32:27 info: Standby resource  
> acquisition done [foreign].
> heartbeat[3728]: 2008/02/20_16:32:28 info: remote resource  
> transition completed.
>
> Lustre02 (secondary mdt)
> heartbeat[4833]: 2008/02/20_16:39:24 info: lustre01 wants to go  
> standby [foreign]
> heartbeat[4833]: 2008/02/20_16:39:25 info: standby: acquire  
> [foreign] resources from lustre01
> heartbeat[6658]: 2008/02/20_16:39:25 info: acquire local HA  
> resources (standby).
> ResourceManager[6671]:  2008/02/20_16:39:25 info: Acquiring resource  
> group: lustre02 192.168.100.2 Filesystem::/dev/sdc::/lustremds::lustre
>
> However, I cannot see Lustre mounted on either device. Does anybody  
> know what is the issue here? This statement concerns me:
> IPaddr[4357]:   2008/02/20_16:31:23 ERROR: Could not add  
> 192.168.100.1 to eth1: 255
> IPaddr[4340]:   2008/02/20_16:31:23 ERROR:  Unknown error: 255
>
>
> BTW, 192.168.100.1 is the eth1 address on lustre01 (main mdt)
>
> Thanks

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
aaron at iges.org
