[Lustre-discuss] failover software - heartbeat

Tue Jul 14 06:09:35 PDT 2009

I have been able to get pingd working for interconnect failover using Heartbeat
V2.1.3.  It works for OSTs, MDTs and the MGS.  

You will need lines like these in your ha.cf file (two ping targets are shown,
but one will do):

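# Ping nodes that Heartbeat monitors
ping 172.31.80.240
ping 172.31.64.1
# -m: value added to the pingd attribute per reachable ping node
# -d: dampening delay before a change is written to the CIB
respawn root /usr/lib64/heartbeat/pingd -m 200 -d 5s -p /var/run/pingd.pid -h 172.31.80.240 -h 172.31.64.1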

You also need to add pingd rsc_location rules to your cib.xml file within the
constraints section as shown below, one for each Lustre filesystem:

      <rsc_location id="testfsmds_connected" rsc="testfsmds">
        <rule id="testfsmds_connected_rule" score_attribute="pingd">
          <expression id="testfsmds_connected_rule_expr" attribute="pingd" operation="defined"/>
        </rule>
      </rsc_location>

This has worked well for me for InfiniBand and 10GbE systems.  
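
For anyone wondering how the two pieces fit together, here is a minimal sketch
in Python of the arithmetic as I understand it. It assumes pingd publishes an
attribute equal to the -m multiplier times the number of reachable ping hosts,
and that score_attribute="pingd" uses that value as the node's score for the
resource. The node names (mds1, mds2) and the helper functions are hypothetical;
this only models the scoring and is not part of Heartbeat:

```python
# Illustrative model only: not part of Heartbeat. All names are hypothetical.
MULTIPLIER = 200  # matches "-m 200" in the ha.cf respawn line
PING_HOSTS = ["172.31.80.240", "172.31.64.1"]


def pingd_value(reachable, multiplier=MULTIPLIER):
    """Assumed pingd attribute: multiplier times the count of reachable ping hosts."""
    return multiplier * sum(1 for host in PING_HOSTS if host in reachable)


def preferred_node(reachability):
    """Return the node with the highest location score, plus all scores.

    Ties go to the first node listed.
    """
    scores = {node: pingd_value(hosts) for node, hosts in reachability.items()}
    return max(scores, key=scores.get), scores


# mds1 has lost one interconnect; mds2 still reaches both ping hosts.
node, scores = preferred_node({
    "mds1": {"172.31.64.1"},
    "mds2": {"172.31.80.240", "172.31.64.1"},
})
print(node, scores)
```

In a real cluster other scores, such as resource stickiness, are added in as
well, so the multiplier probably needs to be large enough to outweigh them.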

HTH,
Bob

> -----Original Message-----
> From: Jim Garlick [mailto:garlick at llnl.gov]
> Sent: Monday, July 13, 2009 2:39 PM
> To: Lundgren, Andrew
> Cc: Carlos Santana; lustre-discuss at lists.lustre.org
> Subject: Re: [Lustre-discuss] failover software - heartbeat
>
> No.  I originally did have it set up like this (a v1 ha.cf snippet):
>
> # One partner losing contact with both lnet routers or MDS triggers failover.
> #ping_group lnet-router 172.16.10.254 172.16.2.254
> #ping_group tycho-mds1 172.16.10.200 172.16.2.200
> #respawn hacluster /usr/lib64/heartbeat/ipfail
>
> However, I ran into a problem when rebooting the MDS.  Apparently if one
> partner re-establishes contact with the MDS before the other one, it
> immediately triggers failover.  This is with heartbeat-2.1.4.
>
> Jim
>
> On Mon, Jul 13, 2009 at 02:25:17PM -0600, Lundgren, Andrew wrote:
> > Were you able to get monitoring working to detect network failures? (pingd?)
> >
> > I have it configured, but haven't been able to get it to trigger a
> > failover when an MDS cannot ping the network.  (I tried with 1.0 and
> > 2.0 conf files; I am currently using 2.0.)  I have a ticket open with
> > the pacemaker project (no ticket system for the HA stuff...)
> > but no resolution.  I am considering writing a script to down the
> > node when the ping fails, but don't like the idea.
> >
> > I would also like to get the hpingd functioning to detect a fiber
> > failure, but there was less available on that solution.
> >
> > --
> > Andrew