[Lustre-discuss] OSS Nodes Fencing issue in HPC

Mon Jan 30 22:03:22 PST 2012

Dear Sir,

I have checked /proc/sys/vm/zone_reclaim_mode and found that its value is 1
on all four OSS servers (OSS1 to OSS4). Should I change it to 0 on all nodes?
How would that resolve the current issue? Could you please explain what this
file does?

Have you looked at the log files I sent earlier? If I change the value to 0,
will it affect currently running processes or jobs?

I am waiting for your reply.

Thanks & Regards

VIJESH E K

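For readers following the thread, here is a minimal sketch of how the value
discussed above could be inspected on each OSS. It assumes a Linux node that
exposes /proc/sys/vm/zone_reclaim_mode and Python 3; the function names are
illustrative and not part of Lustre or any tool mentioned in this thread. The
bit meanings follow the kernel's sysctl documentation for vm.zone_reclaim_mode.

```python
"""Decode the Linux vm.zone_reclaim_mode bitmask (illustrative sketch)."""
from pathlib import Path
from typing import List

# Bit meanings as described in the kernel's vm sysctl documentation.
ZONE_RECLAIM_BITS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def decode_zone_reclaim_mode(value: int) -> List[str]:
    """Describe each bit set in value; an empty list means zone reclaim is off."""
    return [desc for bit, desc in ZONE_RECLAIM_BITS.items() if value & bit]


def read_zone_reclaim_mode(path: str = "/proc/sys/vm/zone_reclaim_mode") -> int:
    """Read the current setting from procfs (path is the standard location)."""
    return int(Path(path).read_text().strip())


if __name__ == "__main__":
    mode = read_zone_reclaim_mode()
    flags = decode_zone_reclaim_mode(mode)
    print(f"zone_reclaim_mode = {mode}: {', '.join(flags) or 'disabled'}")
```

The setting can be changed at runtime by root, for example with
`sysctl -w vm.zone_reclaim_mode=0`. A change made this way does not persist
across reboots unless it is also added to /etc/sysctl.conf.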
On Tue, Jan 31, 2012 at 12:21 AM, Kevin Van Maren <KVanMaren at fusionio.com> wrote:

> As I replied earlier, those "slow" messages are often a result of memory
> allocations taking a long time.  Since zone_reclaim shows up in many of the
> stack traces, that still appears to be a good candidate.
>
> Did you check /proc/sys/vm/zone_reclaim_mode and was it 0?  Did you change
> it to 0 and still have problems?
>
> The same situation that causes the Lustre threads to be slow can also
> stall the heartbeat processes.  Did you increase the heartbeat deadtime
> timeout value?
>
> Kevin
>
>
> On Jan 27, 2012, at 1:42 AM, VIJESH EK wrote:
>
> Dear Sir,
>
> I have attached the /var/log/messages files from the OSS node.
> Please go through the logs and kindly suggest a solution for this issue.
>
> Thanks & Regards
>
> VIJESH E K
> HCL Infosystems Ltd.
> Chennai-6
> Mob:+91 99400 96543
>
>
> On Mon, Jan 23, 2012 at 12:03 PM, VIJESH EK <ekvijesh at gmail.com> wrote:
>
>> Hi,
>>
>> I hope you are all in good spirits.
>>
>> We have four OSS servers, OSS1 to OSS4, clustered in pairs:
>> OSS1 with OSS2, and OSS3 with OSS4.
>> They were configured six months ago, and from the very beginning there
>> has been an issue where one node fences the other node, which then goes
>> into the shutdown state.
>> This happens roughly once every two to three weeks.
>> /var/log/messages continuously shows errors such as
>> "slow start_page_write 57s due to heavy IO load"
>> Can anybody help me with this issue?
>>
>> Thanks & Regards
>>
>> VIJESH E K
>>
>>
>
>
>
>  <messages.3><messages><messages.1><messages.2><ATT00001..txt>