[Lustre-discuss] OSS Nodes Fencing issue in HPC

Wojciech Turek wjt27 at cam.ac.uk
Wed Jan 25 05:43:24 PST 2012


You have already had great advice from Carlos and Kevin. One more point I
would like to add: quite often people configure their HA software to send
the heartbeat over a single network, which creates a single point of
failure, and that network is often the same one that carries the main I/O
traffic. In my experience it is very important to send the HA heartbeat
over at least two networks or, even better, over two different
communication methods such as Ethernet and serial.
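
To illustrate the reasoning, here is a minimal sketch in Python. It is not
how any particular HA package implements this; the class name, channel
names and timeout value are made up for the example. The idea is that a
peer should only be suspected dead when every heartbeat channel has gone
silent, so one congested network on its own cannot trigger fencing.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from one peer on each channel.

    Hypothetical helper for illustration only, not part of any HA package.
    """
    timeout: float  # seconds of silence before a channel counts as down
    last_seen: dict = field(default_factory=dict)  # channel name -> timestamp

    def record(self, channel: str, now: float) -> None:
        """Note that a heartbeat arrived on `channel` at time `now`."""
        self.last_seen[channel] = now

    def silent_channels(self, now: float) -> list:
        """Return the channels that have been quiet longer than the timeout."""
        return sorted(c for c, t in self.last_seen.items()
                      if now - t > self.timeout)

    def peer_looks_dead(self, now: float) -> bool:
        """Suspect the peer only if every known channel is silent."""
        return bool(self.last_seen) and \
            len(self.silent_channels(now)) == len(self.last_seen)


# One heartbeat path, shared with the I/O traffic (assumed 30 s timeout).
single = HeartbeatMonitor(timeout=30.0)
single.record("io_net", 0.0)

# Two independent paths: the I/O network plus a serial link.
dual = HeartbeatMonitor(timeout=30.0)
dual.record("io_net", 0.0)
dual.record("serial", 55.0)

print(single.peer_looks_dead(60.0))  # the only channel is silent
print(dual.peer_looks_dead(60.0))    # serial is still alive
print(dual.silent_channels(60.0))    # which path needs attention
```

With a single channel, a delay on the I/O network is enough to make the
peer look dead. With the second channel the peer is still seen as alive,
and the silent channel can be reported instead of fencing the node.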

Regards,

Wojciech

On 23 January 2012 06:33, VIJESH EK <ekvijesh at gmail.com> wrote:

> Hi,
>
> I hope all of you are in good spirits.
>
> We have four OSS servers, OSS1 to OSS4, clustered in pairs:
> OSS1 with OSS2, and OSS3 with OSS4.
> They were configured six months ago, and from the very beginning there
> has been an issue: one node fences the other, which then goes into the
> shutdown state.
> This happens roughly every two to three weeks.
> /var/log/messages continuously shows errors such as
> " slow start_page_write 57s due to heavy IO load "
> Can anybody help me with this issue?
>
> Thanks & Regards
> VIJESH E K
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>