[Lustre-discuss] Heartbeat problem

Patrice Hamelin patrice.hamelin at ec.gc.ca
Fri Dec 23 05:49:35 PST 2011


Thanks Frank,

   Works just great!
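
   For anyone reading the archive later, here is a minimal sketch of that
change. It assumes the resource agent lives at the path Frank gives and
contains a start action line of the form he quotes. The helper name
set_start_timeout is made up for illustration, and 300 is simply the value
from his example. A package update may overwrite the edited script.

```shell
#!/bin/sh
# set_start_timeout FILE SECONDS
# Hypothetical helper (not part of Linux-HA): rewrites the timeout of the
# <action name="start" .../> line in a resource agent script.
set_start_timeout() {
    ra_file=$1
    new_timeout=$2

    # Accept only digits for the number of seconds.
    case $new_timeout in
        ''|*[!0-9]*) echo "timeout must be a whole number of seconds" >&2; return 1 ;;
    esac

    # Keep an untouched copy so the change can be reverted.
    cp -p "$ra_file" "$ra_file.orig" || return 1

    # Replace only the digits of the start action's timeout; other actions
    # (stop, monitor, ...) are left alone.
    sed "s|\(<action name=\"start\" timeout=\"\)[0-9]*|\1${new_timeout}|" \
        "$ra_file.orig" > "$ra_file" || return 1

    # Show the resulting line for a quick visual check.
    grep 'action name="start"' "$ra_file"
}
```

   Run as root, it would be called as
set_start_timeout /usr/lib/ocf/resource.d/heartbeat/Filesystem 300
and the previous version stays next to it with an .orig suffix.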

Greetings!

On 12/23/11 13:18, Frank Heckes wrote:
> Hi,
>
> we had the same problem. We 'fixed' it by increasing the timeout of the
> start action in the Linux-HA resource agent script
> /usr/lib/ocf/resource.d/heartbeat/Filesystem:
>
>          ...
>          <action name="start" timeout="300" />
>          ...
>
> If you use pacemaker or RH cluster suite (although your config dir looks
> like linux-ha) there's probably a similar parameter.
>
> Cheers
>
> -Frank
>
> On Thu, 2011-12-22 at 16:38 +0100, Patrice Hamelin wrote:
>> Hi,
>>
>>     I have a heartbeat problem while trying automatic failover.  Manual
>> failover works great: unmounting a partition from an OSS and
>> remounting it on another one makes the clients recover.  It all starts
>> with this error:
>>
>> Filesystem[7650]:       2011/12/22_14:36:05 ERROR: Couldn't mount
>> filesystem /dev/mpath/colosse4-lun60-sata on /mnt/data/clun60
>> Filesystem[7639]:       2011/12/22_14:36:05 ERROR:  Generic error
>>
>>     As a result, the failover OSS is the wrong one and the clients stay
>> in this state forever:
>>
>> sata-OST0000_UUID   : Resource temporarily unavailable
>>
>>     Here is my heartbeat config:
>>
>> [root at ib3-st02 ~]# cat /etc/ha.d/ha.cf
>> # log file settings
>> # write debug output to /var/log/ha-debug
>> debugfile /var/log/ha-debug
>> # write log messages to /var/log/ha-log
>> logfile /var/log/ha-log
>> # use syslog to write to logfiles
>> logfacility local0
>> # set some time-outs. these values are only recommendations, which
>> # depend e.g. on the OSS load
>> # send keep-alive packages every 2 seconds
>> keepalive 2
>> # wait 90 seconds before declaring a node dead
>> deadtime 90
>> # write a warning to the logfile after 30 seconds without an answer
>> # from the failover node
>> warntime 30
>> # wait for 120 seconds before declaring a node dead after heartbeat
>> # is brought up
>> initdead 120
>> # define communication channels
>> # use port 12345 to communicate with fail-over node
>> udpport 12345
>> # use network interfaces eth0 and ib0 to detect a failed node
>> bcast eth0 bond0
>> # Use manual failback
>> auto_failback off
>> # node names in this failover-pair. These names must match the
>> # output of `hostname`
>> node ib3-st01
>> node ib3-st02
>> node ib3-st03
>> node ib3-st04
>>
>> [root at ib3-st02 ~]# cat /etc/ha.d/haresources
>> ib3-st01 Filesystem::/dev/emcssd-1/mdt-sata::/mnt/mdt-colosse::lustre
>> ib3-st01 Filesystem::/dev/mpath/colosse4-lun53-sata::/mnt/data/clun53::lustre
>> ib3-st02 Filesystem::/dev/mpath/colosse4-lun54-sata::/mnt/data/clun54::lustre
>> ib3-st03 Filesystem::/dev/mpath/colosse4-lun55-sata::/mnt/data/clun55::lustre
>> ib3-st04 Filesystem::/dev/mpath/colosse4-lun56-sata::/mnt/data/clun56::lustre
>> ib3-st01 Filesystem::/dev/mpath/colosse4-lun57-sata::/mnt/data/clun57::lustre
>> ib3-st02 Filesystem::/dev/mpath/colosse4-lun58-sata::/mnt/data/clun58::lustre
>> ib3-st03 Filesystem::/dev/mpath/colosse4-lun59-sata::/mnt/data/clun59::lustre
>> ib3-st04 Filesystem::/dev/mpath/colosse4-lun60-sata::/mnt/data/clun60::lustre
>>
>>
>>     The configuration is the same on all OSSs.
>>
>> Has anybody ever encountered that problem?
>> Thanks for your help.
>>
>>
>>
>>
>
>
> ------------------------------------------------------------------------------------------------
> ------------------------------------------------------------------------------------------------
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Sitz der Gesellschaft: Juelich
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
> Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
> Prof. Dr. Sebastian M. Schmidt
> ------------------------------------------------------------------------------------------------
> ------------------------------------------------------------------------------------------------

-- 
Patrice Hamelin
Specialiste sénior en systèmes d'exploitation | Senior OS specialist
Environnement Canada | Environment Canada
2121, route Transcanadienne | 2121 Transcanada Highway
Dorval, QC H9P 1J3
Téléphone | Telephone 514-421-5303
Télécopieur | Facsimile 514-421-7231
Gouvernement du Canada | Government of Canada



