[Lustre-discuss] OSS Nodes Fencing issue in HPC

Kevin Van Maren KVanMaren at fusionio.com
Sun Jan 22 22:46:23 PST 2012

Well, it sounds like an issue with your HA package configuration.  Most likely one node is not responding quickly enough to heartbeat (are-you-alive) messages, so the other node assumes it has died and fences it.  This can probably be fixed by increasing the deadtime parameter in your HA configuration (try 180 seconds if it is currently smaller than that).  It is hard to say more, as you omitted any logs and didn't say which HA package you are using.
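Since the HA package is unknown, the following is only a sketch of how one might check the configured value. It assumes a Linux-HA Heartbeat style ha.cf, where `deadtime` is a plain directive given in seconds (or with an `ms` suffix); other HA packages name and store this setting differently. The helper names and the 180-second threshold argument are illustrative, not part of any HA package.

```python
from typing import Optional


def parse_deadtime_seconds(config_text: str) -> Optional[float]:
    """Return the deadtime in seconds from ha.cf-style text, or None if unset.

    Assumption: Heartbeat-style syntax, e.g. "deadtime 30" or "deadtime 500ms".
    """
    for raw_line in config_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split()
        if parts[0] == "deadtime" and len(parts) >= 2:
            value = parts[1]
            if value.endswith("ms"):
                return float(value[:-2]) / 1000.0
            return float(value)
    return None


def deadtime_too_short(config_text: str, minimum: float = 180.0) -> bool:
    """True if a deadtime is configured and is below the suggested minimum."""
    deadtime = parse_deadtime_seconds(config_text)
    return deadtime is not None and deadtime < minimum


# Hypothetical config contents, for illustration only.
SAMPLE_HA_CF = """\
keepalive 2
deadtime 30   # seconds before the peer is declared dead
"""
```

With the sample text above, `deadtime_too_short(SAMPLE_HA_CF)` flags the 30-second value as a candidate for raising. On a real system the text would be read from wherever your HA package keeps its configuration.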

You also didn't indicate which Lustre version you are using.  One of the likely causes of the "slow start_page_write" messages is the kernel having difficulty allocating memory.  On many kernels, if /proc/sys/vm/zone_reclaim_mode is not 0, memory allocations can take a long time because the kernel keeps looking for the best pages to free until pages in the local NUMA node are available.  With the Lustre 1.8.x write cache, the memory pressure is substantial (in 1.6.x and earlier, the service threads had statically-allocated buffers, but starting with 1.8.x each incoming request allocates new pages and frees them back to the page cache).
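As a quick way to check the setting mentioned above on each OSS, here is a minimal sketch that reads /proc/sys/vm/zone_reclaim_mode and reports whether it is non-zero. The function names are illustrative; the only thing taken from the system is the /proc path itself.

```python
from pathlib import Path

ZONE_RECLAIM_PATH = Path("/proc/sys/vm/zone_reclaim_mode")


def zone_reclaim_enabled(proc_text: str) -> bool:
    """True if the contents of zone_reclaim_mode are anything other than 0."""
    return int(proc_text.strip()) != 0


def describe(proc_text: str) -> str:
    """Return a one-line, human-readable summary of the setting."""
    mode = int(proc_text.strip())
    if mode == 0:
        return "zone_reclaim_mode is 0 (zone reclaim disabled)"
    return f"zone_reclaim_mode is {mode} (non-zero: allocations may stall)"


if __name__ == "__main__":
    # Only attempt the read on systems that expose this file.
    if ZONE_RECLAIM_PATH.exists():
        print(describe(ZONE_RECLAIM_PATH.read_text()))
    else:
        print(f"{ZONE_RECLAIM_PATH} not found on this system")
```

If the value is non-zero, writing 0 to that file as root changes it for the running kernel; making it persistent across reboots depends on how your distribution manages sysctl settings.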


On Jan 22, 2012, at 11:33 PM, VIJESH EK wrote:


 I hope all of you are in good spirits.

We have four OSS servers, OSS1 to OSS4, clustered in pairs:
OSS1 with OSS2, and OSS3 with OSS4.
They were configured six months ago, and from the very beginning there has been
an issue where one node fences the other, which then goes into the shutdown state.
This happens roughly once every two to three weeks.
/var/log/messages continuously shows errors such as
" slow start_page_write 57s due to heavy IO load "
Can anybody help me with this issue?

Thanks & Regards


