[Lustre-discuss] OSS Nodes Fencing issue in HPC
Kevin Van Maren
KVanMaren at fusionio.com
Mon Jan 30 22:10:47 PST 2012
Yes, change it to 0. This will make it easier to allocate memory: although the kernel will sometimes allocate memory attached to the wrong CPU, it shouldn't get stuck for long periods in the memory allocator. Because of the Lustre OSS cache (starting in 1.8.0), service threads have to allocate new memory for every request, and your Lustre server threads are getting stuck allocating memory.
I expect that you will see many fewer "slow" messages on the servers after making that change.
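A minimal sketch of checking and disabling the setting at runtime (illustrative only, not from the original exchange; the equivalent one-liner is "echo 0 > /proc/sys/vm/zone_reclaim_mode", and adding "vm.zone_reclaim_mode = 0" to /etc/sysctl.conf makes the change persistent across reboots):

    #!/usr/bin/env python
    # Sketch only: report vm.zone_reclaim_mode and optionally set it to 0.
    # Writing requires root and takes effect immediately; it does not
    # persist across reboots (use /etc/sysctl.conf for that).
    import sys

    PATH = "/proc/sys/vm/zone_reclaim_mode"

    def get_mode():
        with open(PATH) as f:
            return int(f.read().strip())

    def set_mode(value):
        with open(PATH, "w") as f:
            f.write("%d\n" % value)

    if __name__ == "__main__":
        print("zone_reclaim_mode is currently %d" % get_mode())
        if "--disable" in sys.argv:
            set_mode(0)
            print("zone_reclaim_mode is now %d" % get_mode())

The setting only changes how the kernel reclaims memory for new allocations; it does not restart or interrupt processes that are already running.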
Kevin
On Jan 30, 2012, at 11:03 PM, VIJESH EK wrote:
Dear Sir,
I have checked the file /proc/sys/vm/zone_reclaim_mode and found that its value is 1 on all four OSS servers (OSS1 to OSS4). Should I change it to 0 on all nodes? I want to know one thing: how will this resolve the current issue? Can you please explain? What is the main function of this file?
Have you verified the log file I sent earlier? If I change the value to 0, will it affect currently running processes or jobs?
I am waiting for your reply....
Thanks & Regards
VIJESH E K
On Tue, Jan 31, 2012 at 12:21 AM, Kevin Van Maren <KVanMaren at fusionio.com<mailto:KVanMaren at fusionio.com>> wrote:
As I replied earlier, those "slow" messages are often a result of memory allocations taking a long time. Since zone_reclaim shows up in many of the stack traces, that still appears to be a good candidate.
Did you check /proc/sys/vm/zone_reclaim_mode and was it 0? Did you change it to 0 and still have problems?
The same situation that causes the Lustre threads to be slow can also stall the heartbeat processes. Did you increase the heartbeat deadtime timeout value?
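If you do raise it, the deadtime setting lives in Heartbeat's ha.cf. A hedged example with illustrative values (assuming the classic Heartbeat v1/v2 ha.cf syntax; tune the numbers for your cluster):

    # /etc/ha.d/ha.cf -- illustrative values only
    keepalive 2       # seconds between heartbeat packets
    warntime 10       # log a warning after 10s without a heartbeat
    deadtime 60       # declare the peer dead (and allow fencing) only after 60s
    initdead 120      # extra grace period at startup; commonly at least 2 * deadtime

A longer deadtime gives a node that is stalled in the memory allocator a chance to recover before its partner fences it.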
Kevin
On Jan 27, 2012, at 1:42 AM, VIJESH EK wrote:
Dear Sir,
I have attached /var/log/messages from the OSS node.
Please go through the logs and kindly suggest a solution for this issue.
Thanks & Regards
VIJESH E K
HCL Infosystems Ltd.
Chennai-6
Mob:+91 99400 96543
On Mon, Jan 23, 2012 at 12:03 PM, VIJESH EK <ekvijesh at gmail.com<mailto:ekvijesh at gmail.com>> wrote:
Hi,
I hope you are all doing well.
We have four OSS servers, OSS1 to OSS4, clustered in failover pairs:
OSS1 with OSS2, and OSS3 with OSS4.
The setup was configured six months ago, and from the beginning one node
has been fencing its partner, which then goes into a shutdown state.
This problem occurs roughly every two to three weeks.
/var/log/messages continuously shows errors such as
" slow start_page_write 57s due to heavy IO load "
Can anybody help me with this issue?
Thanks & Regards
VIJESH E K