[Lustre-discuss] Heartbeat problem

Fri Dec 23 05:18:57 PST 2011

Hi,

we had the same problem. We 'fixed' it by increasing the timeout of the
start action in the Linux-HA resource agent script
/usr/lib/ocf/resource.d/heartbeat/Filesystem:

        ...
        <action name="start" timeout="300" />
        ...

If you use Pacemaker or RH Cluster Suite (although your config directory
looks like Linux-HA), there is probably a similar parameter.
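
For illustration, the edit described above could be scripted roughly as
follows. This is only a sketch under assumptions: the helper name
set_start_timeout is hypothetical, and it assumes the script contains a
line of the form <action name="start" timeout="..." /> as in the excerpt
above. A package update may overwrite the script, so the change may need
to be reapplied afterwards.

```shell
#!/bin/sh
# Hypothetical helper (not part of Linux-HA): set the timeout attribute of
# the "start" action in a resource agent script.
#   $1 = path to the resource agent script
#   $2 = new timeout value
set_start_timeout() {
    ra_file=$1
    new_timeout=$2

    # Keep a copy of the original so the change can be undone.
    cp -p "$ra_file" "$ra_file.orig" || return 1

    # Rewrite only the timeout on the <action name="start" ...> line;
    # all other actions (stop, monitor, ...) are left untouched.
    sed 's|\(<action name="start" timeout="\)[^"]*"|\1'"$new_timeout"'"|' \
        "$ra_file.orig" > "$ra_file"
}

# Example call (run as root on every node of the failover group):
#   set_start_timeout /usr/lib/ocf/resource.d/heartbeat/Filesystem 300
```

The value 300 is the one from the excerpt above. A suitable value for
another site probably depends on how long a Lustre mount takes there under
load.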

Cheers

-Frank

On Thu, 2011-12-22 at 16:38 +0100, Patrice Hamelin wrote:
> Hi,
>
>    I have a heartbeat problem with automatic failover.  Manual
> failover works great: unmounting a partition from an OSS and
> remounting it on another one lets the clients recover.  It all starts
> with this error:
>
> Filesystem[7650]:       2011/12/22_14:36:05 ERROR: Couldn't mount filesystem /dev/mpath/colosse4-lun60-sata on /mnt/data/clun60
> Filesystem[7639]:       2011/12/22_14:36:05 ERROR:  Generic error
>
>    As a result, the failover OSS is the wrong one and the clients stay
> in this state forever:
>
> sata-OST0000_UUID   : Resource temporarily unavailable
>
>    Here is my heartbeat config:
>
> [root@ib3-st02 ~]# cat /etc/ha.d/ha.cf
> # log file settings
> # write debug output to /var/log/ha-debug
> debugfile /var/log/ha-debug
> # write log messages to /var/log/ha-log
> logfile /var/log/ha-log
> # use syslog to write to logfiles
> logfacility local0
> # set some time-outs. these values are only recommendations, which
> # depend e.g. on the OSS load
> # send keep-alive packages every 2 seconds
> keepalive 2
> # wait 90 seconds before declaring a node dead
> deadtime 90
> # write a warning to the logfile after 30 seconds without an answer
> # from the failover node
> warntime 30
> # wait for 120 seconds before declaring a node dead after heartbeat
> # is brought up
> initdead 120
> # define communication channels
> # use port 12345 to communicate with fail-over node
> udpport 12345
> # use network interfaces eth0 and ib0 to detect a failed node
> bcast eth0 bond0
> # Use manual failback
> auto_failback off
> # node names in this failover-pair. These names must match the
> # output of `hostname`
> node ib3-st01
> node ib3-st02
> node ib3-st03
> node ib3-st04
>
> [root@ib3-st02 ~]# cat /etc/ha.d/haresources
> ib3-st01 Filesystem::/dev/emcssd-1/mdt-sata::/mnt/mdt-colosse::lustre
> ib3-st01 Filesystem::/dev/mpath/colosse4-lun53-sata::/mnt/data/clun53::lustre
> ib3-st02 Filesystem::/dev/mpath/colosse4-lun54-sata::/mnt/data/clun54::lustre
> ib3-st03 Filesystem::/dev/mpath/colosse4-lun55-sata::/mnt/data/clun55::lustre
> ib3-st04 Filesystem::/dev/mpath/colosse4-lun56-sata::/mnt/data/clun56::lustre
> ib3-st01 Filesystem::/dev/mpath/colosse4-lun57-sata::/mnt/data/clun57::lustre
> ib3-st02 Filesystem::/dev/mpath/colosse4-lun58-sata::/mnt/data/clun58::lustre
> ib3-st03 Filesystem::/dev/mpath/colosse4-lun59-sata::/mnt/data/clun59::lustre
> ib3-st04 Filesystem::/dev/mpath/colosse4-lun60-sata::/mnt/data/clun60::lustre
>
>
>    It is the same on all OSSs.
>
> Has anybody ever encountered that problem?
> Thanks for your help.
>

------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
------------------------------------------------------------------------------------------------