[Lustre-discuss] OSS Nodes Fencing issue in HPC

Carlos Thomaz cthomaz at ddn.com
Sun Jan 22 22:45:29 PST 2012


Hi Vijesh.

You are probably facing a so-called "split brain" issue, which is usually caused by heartbeat communication problems.
What sort of heartbeat are you using?

Some time ago we had a problem where the OSS nodes were overloaded and the heartbeat became unresponsive. This caused a "false split brain" scenario.
Basically, both nodes in the HA pair ended up being fenced via STONITH, since neither got an answer from the heartbeat device.
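
To make the failure mode concrete, here is a minimal Python sketch of the timeout logic. It is not the code of any particular heartbeat package: the function names and the 30-second timeout are assumptions for illustration only.

```python
def peer_considered_dead(last_heartbeat, now, dead_timeout=30.0):
    """Return True if no heartbeat was seen for more than dead_timeout seconds.

    dead_timeout is a hypothetical value, not taken from any real config.
    """
    return (now - last_heartbeat) > dead_timeout


def nodes_declaring_peer_dead(last_seen, now, dead_timeout=30.0):
    """Return the sorted names of nodes that would declare their peer dead.

    last_seen maps a node name to the time (in seconds) at which that node
    last received a heartbeat from its peer.
    """
    return sorted(
        node
        for node, seen_at in last_seen.items()
        if peer_considered_dead(seen_at, now, dead_timeout)
    )
```

If heartbeat processing on both nodes stalled for as long as the 57 seconds in the quoted message, then `nodes_declaring_peer_dead({"oss1": 0.0, "oss2": 0.0}, now=57.0)` would list both nodes, even though neither one is actually down.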

I suggest you start monitoring your OSS nodes to see whether the logged message makes sense (very likely it does). How is memory configured on your OSS nodes? Which OS are you running? What is your zone reclaim mode set to?
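
If it helps to collect those answers, the following is a rough Python sketch that reads the memory figures and the zone reclaim setting on a Linux node. It assumes the standard /proc/meminfo and /proc/sys/vm/zone_reclaim_mode files; the helper names are made up for this example, and a missing file is simply reported as None.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def read_zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Return the zone reclaim mode as an int, or None if the file is absent."""
    p = Path(path)
    if not p.exists():
        return None
    return int(p.read_text().strip())


def summarize(meminfo_path="/proc/meminfo",
              zone_reclaim_path="/proc/sys/vm/zone_reclaim_mode"):
    """Collect total/free memory and the zone reclaim mode in one dict."""
    p = Path(meminfo_path)
    mem = parse_meminfo(p.read_text()) if p.exists() else {}
    return {
        "MemTotal_kB": mem.get("MemTotal"),
        "MemFree_kB": mem.get("MemFree"),
        "zone_reclaim_mode": read_zone_reclaim_mode(zone_reclaim_path),
    }
```

Running `print(summarize())` on each OSS node gives a quick snapshot to include in a reply.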

Regards,
Carlos


--
Carlos Thomaz | HPC Systems Architect
Mobile: +1 (303) 519-0578
cthomaz at ddn.com | Skype ID: carlosthomaz

DataDirect Networks, Inc.
9960 Federal Dr., Ste 100 Colorado Springs, CO 80921
ddn.com<http://www.ddn.com/> | Twitter: @ddn_limitless<http://twitter.com/ddn_limitless> | 1.800.TERABYTE

From: VIJESH EK <ekvijesh at gmail.com>
Date: Sun, 22 Jan 2012 22:33:20 -0800
To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: [Lustre-discuss] OSS Nodes Fencing issue in HPC

" slow start_page_write 57s due to heavy IO load "
