[Lustre-discuss] OSS Nodes Fencing issue in HPC

Kevin Van Maren KVanMaren at fusionio.com
Mon Jan 30 10:51:04 PST 2012


As I replied earlier, those "slow" messages are often a result of memory allocations taking a long time.  Since zone_reclaim shows up in many of the stack traces, that still appears to be a good candidate.

Did you check /proc/sys/vm/zone_reclaim_mode and was it 0?  Did you change it to 0 and still have problems?
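
For what it is worth, here is a minimal sketch of that check in Python. The bit meanings come from the kernel's sysctl documentation; the helper names (parse_zone_reclaim_mode, describe_zone_reclaim_mode) are made up for illustration and are not part of any Lustre or kernel tool.

```python
# Bit meanings for vm.zone_reclaim_mode, per the Linux kernel sysctl docs.
ZONE_RECLAIM_BITS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def parse_zone_reclaim_mode(text):
    """Parse the contents of /proc/sys/vm/zone_reclaim_mode into an int."""
    return int(text.strip())


def describe_zone_reclaim_mode(mode):
    """Return the behaviours enabled by the bitmask (empty list means off)."""
    return [desc for bit, desc in sorted(ZONE_RECLAIM_BITS.items()) if mode & bit]


if __name__ == "__main__":
    path = "/proc/sys/vm/zone_reclaim_mode"
    try:
        with open(path) as f:
            mode = parse_zone_reclaim_mode(f.read())
    except OSError as exc:
        # The file is absent on non-Linux systems and some kernel builds.
        print(f"could not read {path}: {exc}")
    else:
        enabled = describe_zone_reclaim_mode(mode)
        print(f"zone_reclaim_mode = {mode}: {', '.join(enabled) or 'disabled'}")
```

If it reports anything other than 0, running "sysctl -w vm.zone_reclaim_mode=0" as root changes it on the running system; an entry in /etc/sysctl.conf is the usual way to keep it across reboots.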

The same situation that causes the Lustre threads to be slow can also stall the heartbeat processes.  Did you increase the heartbeat deadtime timeout value?
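
One rough way to judge whether the stalls are long enough to matter is to pull the durations out of the "slow" messages and compare them with the deadtime. The sketch below assumes the messages contain the fragment quoted in this thread ("slow start_page_write 57s due to heavy IO load"); the rest of the log line format is not assumed, the function names are hypothetical, and the 30-second deadtime is an invented example value, not a recommendation.

```python
import re

# Matches the fragment quoted in this thread, e.g.
# "slow start_page_write 57s due to heavy IO load".
SLOW_RE = re.compile(r"slow (\w+) (\d+)s\b")


def slow_operations(lines):
    """Return (operation, seconds) for every line carrying a 'slow' message."""
    found = []
    for line in lines:
        match = SLOW_RE.search(line)
        if match:
            found.append((match.group(1), int(match.group(2))))
    return found


def exceeds_deadtime(lines, deadtime_s):
    """Keep only the slow operations that lasted at least deadtime_s seconds."""
    return [(op, secs) for op, secs in slow_operations(lines) if secs >= deadtime_s]


if __name__ == "__main__":
    sample = ["slow start_page_write 57s due to heavy IO load"]
    # 30 is a hypothetical deadtime; substitute the value from your own setup.
    print(exceeds_deadtime(sample, deadtime_s=30))
```

If many of the reported stalls are longer than the configured deadtime, a heartbeat process stalled for the same reason could plausibly miss its deadline and trigger the fencing.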

Kevin


On Jan 27, 2012, at 1:42 AM, VIJESH EK wrote:

Dear Sir,

I have attached /var/log/messages from the OSS node.
Please go through the logs and kindly suggest a solution for this issue.

Thanks & Regards

VIJESH E K
HCL Infosystems Ltd.
Chennai-6
Mob:+91 99400 96543


On Mon, Jan 23, 2012 at 12:03 PM, VIJESH EK <ekvijesh at gmail.com> wrote:
Hi,

I hope you are all in good spirits.

We have four OSS servers, OSS1 to OSS4, clustered in pairs:
OSS1 with OSS2, and OSS3 with OSS4.
They were configured six months ago, and from the very beginning there has been
an issue: one node fences the other, which then goes into the shutdown state.
This happens about once every two to three weeks.
/var/log/messages continuously shows errors such as
"slow start_page_write 57s due to heavy IO load"
Can anybody help me with this issue?


Thanks & Regards

VIJESH E K







