[Lustre-discuss] Heartbeat problem

Patrice Hamelin patrice.hamelin at ec.gc.ca
Thu Dec 22 07:38:21 PST 2011


Hi,

   I have a Heartbeat problem when trying automatic failover.  Manual 
failover works fine: unmounting a partition from one OSS and 
remounting it on another lets the clients recover.  With automatic 
failover, it all starts with this error:

Filesystem[7650]:       2011/12/22_14:36:05 ERROR: Couldn't mount filesystem /dev/mpath/colosse4-lun60-sata on /mnt/data/clun60
Filesystem[7639]:       2011/12/22_14:36:05 ERROR:  Generic error

   As a result, the OSS that takes over is the wrong one, and the clients 
stay in this state forever:

sata-OST0000_UUID   : Resource temporarily unavailable

   Here is my heartbeat config:

[root@ib3-st02 ~]# cat /etc/ha.d/ha.cf
# log file settings
# write debug output to /var/log/ha-debug
debugfile /var/log/ha-debug
# write log messages to /var/log/ha-log
logfile /var/log/ha-log
# use syslog to write to logfiles
logfacility local0
# set some time-outs. these values are only recommendations, which
# depend e.g. on the OSS load
# send keep-alive packages every 2 seconds
keepalive 2
# wait 90 seconds before declaring a node dead
deadtime 90
# write a warning to the logfile after 30 seconds without an answer
# from the failover node
warntime 30
# wait for 120 seconds before declaring a node dead after heartbeat
# is brought up
initdead 120
# define communication channels
# use port 12345 to communicate with fail-over node
udpport 12345
# use network interfaces eth0 and ib0 to detect a failed node
bcast eth0 bond0
# Use manual failback
auto_failback off
# node names in this failover-pair. These names must match the
# output of `hostname`
node ib3-st01
node ib3-st02
node ib3-st03
node ib3-st04
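
The timing directives in the ha.cf above can be sanity-checked with a short script. This is only a sketch, assuming plain integer seconds (no "ms" suffix) and two rules of thumb: keepalive < warntime < deadtime, and the sample ha.cf's advice that initdead be at least twice deadtime. The helper names parse_timings and check_timings are hypothetical, and a warning here is not necessarily related to the mount error.

```python
# Timing directives copied from the ha.cf shown above.
HA_CF = """\
keepalive 2
deadtime 90
warntime 30
initdead 120
"""

def parse_timings(text):
    """Return {directive: seconds} for the four timing directives."""
    wanted = {"keepalive", "deadtime", "warntime", "initdead"}
    timings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        parts = line.split()
        if len(parts) == 2 and parts[0] in wanted:
            timings[parts[0]] = int(parts[1])  # assumes integer seconds
    return timings

def check_timings(t):
    """Return a list of warnings; an empty list means both rules hold."""
    warnings = []
    if not t["keepalive"] < t["warntime"] < t["deadtime"]:
        warnings.append("expected keepalive < warntime < deadtime")
    if t["initdead"] < 2 * t["deadtime"]:
        warnings.append("initdead is less than twice deadtime")
    return warnings

timings = parse_timings(HA_CF)
warnings = check_timings(timings)
print(timings)
print(warnings)
```

With the values above, the ordering rule holds, but initdead (120) is below twice deadtime (180), so the second rule produces a warning.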

[root@ib3-st02 ~]# cat /etc/ha.d/haresources
ib3-st01 Filesystem::/dev/emcssd-1/mdt-sata::/mnt/mdt-colosse::lustre
ib3-st01 Filesystem::/dev/mpath/colosse4-lun53-sata::/mnt/data/clun53::lustre
ib3-st02 Filesystem::/dev/mpath/colosse4-lun54-sata::/mnt/data/clun54::lustre
ib3-st03 Filesystem::/dev/mpath/colosse4-lun55-sata::/mnt/data/clun55::lustre
ib3-st04 Filesystem::/dev/mpath/colosse4-lun56-sata::/mnt/data/clun56::lustre
ib3-st01 Filesystem::/dev/mpath/colosse4-lun57-sata::/mnt/data/clun57::lustre
ib3-st02 Filesystem::/dev/mpath/colosse4-lun58-sata::/mnt/data/clun58::lustre
ib3-st03 Filesystem::/dev/mpath/colosse4-lun59-sata::/mnt/data/clun59::lustre
ib3-st04 Filesystem::/dev/mpath/colosse4-lun60-sata::/mnt/data/clun60::lustre
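
To see at a glance which node is the preferred owner of each target in the haresources file above, the entries can be grouped by node. A minimal sketch in Python, assuming each entry sits on a single line (the mail-wrapped lines joined) in the form "node Filesystem::device::mountpoint::fstype"; parse_haresources is a hypothetical helper, not part of Heartbeat.

```python
from collections import defaultdict

# Entries copied from the haresources file shown above, one per line.
HARESOURCES = """\
ib3-st01 Filesystem::/dev/emcssd-1/mdt-sata::/mnt/mdt-colosse::lustre
ib3-st01 Filesystem::/dev/mpath/colosse4-lun53-sata::/mnt/data/clun53::lustre
ib3-st02 Filesystem::/dev/mpath/colosse4-lun54-sata::/mnt/data/clun54::lustre
ib3-st03 Filesystem::/dev/mpath/colosse4-lun55-sata::/mnt/data/clun55::lustre
ib3-st04 Filesystem::/dev/mpath/colosse4-lun56-sata::/mnt/data/clun56::lustre
ib3-st01 Filesystem::/dev/mpath/colosse4-lun57-sata::/mnt/data/clun57::lustre
ib3-st02 Filesystem::/dev/mpath/colosse4-lun58-sata::/mnt/data/clun58::lustre
ib3-st03 Filesystem::/dev/mpath/colosse4-lun59-sata::/mnt/data/clun59::lustre
ib3-st04 Filesystem::/dev/mpath/colosse4-lun60-sata::/mnt/data/clun60::lustre
"""

def parse_haresources(text):
    """Map each node name to its list of (device, mountpoint) pairs."""
    by_node = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        node, *resources = line.split()
        for res in resources:
            agent, *args = res.split("::")
            if agent == "Filesystem" and len(args) >= 2:
                by_node[node].append((args[0], args[1]))
    return dict(by_node)

owners = parse_haresources(HARESOURCES)
for node in sorted(owners):
    for device, mountpoint in owners[node]:
        print(f"{node}: {device} -> {mountpoint}")
```

In this listing, /dev/mpath/colosse4-lun60-sata (the device from the mount error) has ib3-st04 as its preferred node, and ib3-st01 carries the MDT in addition to two OSTs.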


   The configuration is identical on all OSSs.

Has anybody encountered this problem before?
Thanks for any help.




-- 
Patrice Hamelin
Specialiste sénior en systèmes d'exploitation | Senior OS specialist
Environnement Canada | Environment Canada
2121, route Transcanadienne | 2121 Transcanada Highway
Dorval, QC H9P 1J3
Téléphone | Telephone 514-421-5303
Télécopieur | Facsimile 514-421-7231
Gouvernement du Canada | Government of Canada



