<html>
<head>
<meta content="text/html; charset=GB2312" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi, Qiulan<br>
<br>
LU-952 is about a deadlock issue. Was quota enabled? You could
try disabling quota and see whether the problem goes away.<br>
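<br>
For example, something along these lines (just a sketch; "someuser" and the
client mount point /mnt/bes3fs are placeholders, and the exact quota
commands depend on your Lustre version):<br>
<br>
# on a client: check whether quotas are currently enforced<br>
lfs quota -u someuser /mnt/bes3fs<br>
# temporarily switch user/group quota enforcement off (1.8-style syntax)<br>
lfs quotaoff -ug /mnt/bes3fs<br>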
<br>
Thanks<br>
- Niu<br>
<br>
<blockquote cite="mid:201205311449342959415@ihep.ac.cn" type="cite">
<meta content="text/html; charset=GB2312"
http-equiv="Content-Type">
<meta name="GENERATOR" content="MSHTML 9.00.8112.16443">
<style>@font-face {
font-family: 宋体;
}
@font-face {
font-family: Verdana;
}
@font-face {
font-family: @宋体;
}
@page Section1 {size: 595.3pt 841.9pt; margin: 72.0pt 90.0pt 72.0pt 90.0pt; layout-grid: 15.6pt; }
P.MsoNormal {
TEXT-JUSTIFY: inter-ideograph; TEXT-ALIGN: justify; MARGIN: 0cm 0cm 0pt; FONT-FAMILY: "Times New Roman"; FONT-SIZE: 10.5pt
}
LI.MsoNormal {
TEXT-JUSTIFY: inter-ideograph; TEXT-ALIGN: justify; MARGIN: 0cm 0cm 0pt; FONT-FAMILY: "Times New Roman"; FONT-SIZE: 10.5pt
}
DIV.MsoNormal {
TEXT-JUSTIFY: inter-ideograph; TEXT-ALIGN: justify; MARGIN: 0cm 0cm 0pt; FONT-FAMILY: "Times New Roman"; FONT-SIZE: 10.5pt
}
A:link {
COLOR: blue; TEXT-DECORATION: underline
}
SPAN.MsoHyperlink {
COLOR: blue; TEXT-DECORATION: underline
}
A:visited {
COLOR: purple; TEXT-DECORATION: underline
}
SPAN.MsoHyperlinkFollowed {
COLOR: purple; TEXT-DECORATION: underline
}
SPAN.EmailStyle17 {
FONT-STYLE: normal; FONT-FAMILY: Verdana; COLOR: windowtext; FONT-WEIGHT: normal; TEXT-DECORATION: none; mso-style-type: personal-compose
}
DIV.Section1 {
page: Section1
}
UNKNOWN {
FONT-SIZE: 10pt
}
BLOCKQUOTE {
MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px; MARGIN-LEFT: 2em
}
OL {
MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px
}
UL {
MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px
}
</style>
<div><font color="#000080" face="Verdana" size="2">Hi Zhen,</font></div>
<div> </div>
<div><font color="#000080">Many thanks to your prompt reply. I
have disabled the writhetrhough_cache and read_cache to see
the problem but it still hung thread when there is heavy IO.
</font></div>
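<div> </div>
<div><font color="#000080">(For reference, the caches were turned off on
each OSS with commands like the following; the parameter names may
differ slightly between Lustre versions:)</font></div>
<div>lctl set_param obdfilter.*.writethrough_cache_enable=0</div>
<div>lctl set_param obdfilter.*.read_cache_enable=0</div>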
<div> </div>
<div> </div>
<div>
<div>May 31 08:38:43 boss33 kernel: LustreError: dumping log to /tmp/lustre-log.1338424722.5303</div>
<div>May 31 08:38:43 boss33 kernel: Pid: 5262, comm: ll_ost_io_48</div>
<div>May 31 08:38:43 boss33 kernel:</div>
<div>May 31 08:38:43 boss33 kernel: Call Trace:</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff887bd451>] ksocknal_queue_tx_locked+0x451/0x490 [ksocklnd]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff800646ac>] __down_read+0x7a/0x92</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff889843df>] ldiskfs_get_blocks+0x5f/0x2e0 [ldiskfs]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff889851a0>] ldiskfs_get_block+0xc0/0x120 [ldiskfs]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff88981f60>] ldiskfs_bmap+0x0/0xf0 [ldiskfs]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff80033615>] generic_block_bmap+0x37/0x41</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff800341ad>] mapping_tagged+0x3c/0x47</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff88981f88>] ldiskfs_bmap+0x28/0xf0 [ldiskfs]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff88981f60>] ldiskfs_bmap+0x0/0xf0 [ldiskfs]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff88a4a288>] filter_commitrw_write+0x398/0x2be0 [obdfilter]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff889e6e5c>] ost_checksum_bulk+0x30c/0x5b0 [ost]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff889e6c38>] ost_checksum_bulk+0xe8/0x5b0 [ost]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff889edcf9>] ost_brw_write+0x1c99/0x2480 [ost]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff8872e658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff886f98b0>] target_committed_to_req+0x40/0x120 [ptlrpc]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff8008cf93>] default_wake_function+0x0/0xe</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff88732bc8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff889f108e>] ost_handle+0x2bae/0x55b0 [ost]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff80150d56>] __next_cpu+0x19/0x28</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff800767ae>] smp_send_reschedule+0x4e/0x53</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff8874215a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff887428a8>] ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff8008b3bd>] __wake_up_common+0x3e/0x68</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff88743817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff887428e0>] ptlrpc_main+0x0/0x10f0 [ptlrpc]</div>
<div>May 31 08:38:43 boss33 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11</div>
<div>May 31 08:38:43 boss33 kernel:</div>
<div>May 31 08:38:43 boss33 kernel: LustreError: dumping log to /tmp/lustre-log.1338424722.5262</div>
<div>May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow journal start 48s due to heavy IO load</div>
<div>May 31 08:38:43 boss33 kernel: Lustre: Skipped 1 previous similar message</div>
<div>May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow brw_start 48s due to heavy IO load</div>
<div>May 31 08:38:43 boss33 kernel: Lustre: Skipped 1 previous similar message</div>
<div>May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow journal start 187s due to heavy IO load</div>
<div>May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow brw_start 187s due to heavy IO load</div>
</div>
<div> </div>
<div><font color="#000080">I have not patched the bug because the
all servers is online. Could you know how to deal with it
without affecting users?</font></div>
<div> </div>
<div><font color="#000080">Thank you very much.</font></div>
<div> </div>
<div> </div>
<div><font color="#000080">Cheers,</font></div>
<div><font color="#000080">Qiulan</font></div>
<div><font color="#000080" face="Verdana" size="2"><font
color="#000000">====================================================================<br>
Computing Center, the Institute of High Energy Physics, China<br>
Huang, Qiulan Tel: (+86) 10 8823 6010-105<br>
P.O. Box 918-7 Fax: (+86) 10 8823 6839<br>
Beijing 100049 P.R. China Email: </font><a
moz-do-not-send="true" href="mailto:huangql@ihep.ac.cn">huangql@ihep.ac.cn</a><br>
<font color="#000000">===================================================================<span
style="WHITE-SPACE: pre" class="Apple-tab-span"> </span></font><br>
</font></div>
<div> </div>
<div><font color="#c0c0c0" face="Verdana" size="2">2012-05-31 </font></div>
<font color="#000080" face="Verdana" size="2">
<hr style="WIDTH: 100px" align="left" color="#b5c4df" size="1">
</font>
<div><font color="#c0c0c0" face="Verdana" size="2"><span>huangql</span>
</font></div>
<hr color="#b5c4df" size="1">
<div><font face="Verdana" size="2"><strong>发件人:</strong> Liang
Zhen </font></div>
<div><font face="Verdana" size="2"><strong>发送时间:</strong>
2012-05-30 19:12:15 </font></div>
<div><font face="Verdana" size="2"><strong>收件人:</strong> huangql </font></div>
<div><font face="Verdana" size="2"><strong>抄送:</strong>
lustre-discuss; wc-discuss </font></div>
<div><font face="Verdana" size="2"><strong>主题:</strong> [SPAM] Re:
[wc-discuss] The ost_connect operation failedwith -16 </font></div>
<div> </div>
<div><font face="Verdana" size="2">Hi, I think you might hit
this: <a moz-do-not-send="true"
href="http://jira.whamcloud.com/browse/LU-952">http://jira.whamcloud.com/browse/LU-952</a> ,
you can find the patch from this ticket
<div><br>
</div>
<div>Regards</div>
<div>Liang<br>
<div>
<div><br>
<div>
<div>On May 30, 2012, at 11:21 AM, huangql wrote:</div>
<br class="Apple-interchange-newline">
<blockquote type="cite">
<div>Dear all,<br>
<br>
Recently we have found a problem on the OSS: some
service threads may hang when the server is under
heavy IO load. In this case, some clients are
evicted or refused by some OSTs, with error
messages like the following:<br>
<br>
Server side:<br>
<br>
May 30 11:06:31 boss07 kernel: Lustre: Service thread pid 8011 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:<br>
May 30 11:06:31 boss07 kernel: Lustre: Skipped 1 previous similar message<br>
May 30 11:06:31 boss07 kernel: Pid: 8011, comm: ll_ost_71<br>
May 30 11:06:31 boss07 kernel:<br>
May 30 11:06:31 boss07 kernel: Call Trace:<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff886f5d0e&gt;] start_this_handle+0x301/0x3cb [jbd2]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff800a09ca&gt;] autoremove_wake_function+0x0/0x2e<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff886f5e83&gt;] jbd2_journal_start+0xab/0xdf [jbd2]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff888ce9b2&gt;] fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff88920551&gt;] filter_version_get_check+0x91/0x2a0 [obdfilter]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff80036cf4&gt;] __lookup_hash+0x61/0x12f<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff8893108d&gt;] filter_setattr_internal+0x90d/0x1de0 [obdfilter]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff800e859b&gt;] lookup_one_len+0x53/0x61<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff88925452&gt;] filter_fid2dentry+0x512/0x740 [obdfilter]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff88924e27&gt;] filter_fmd_get+0x2b7/0x320 [obdfilter]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff8003027b&gt;] __up_write+0x27/0xf2<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff88932721&gt;] filter_setattr+0x1c1/0x3b0 [obdfilter]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff8882677a&gt;] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff8881e658&gt;] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff88822b05&gt;] lustre_msg_get_version+0x35/0xf0 [ptlrpc]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff888b0abb&gt;] ost_handle+0x25db/0x55b0 [ost]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff80150d56&gt;] __next_cpu+0x19/0x28<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff800767ae&gt;] smp_send_reschedule+0x4e/0x53<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff8883215a&gt;] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff888328a8&gt;] ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff8008b3bd&gt;] __wake_up_common+0x3e/0x68<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff88833817&gt;] ptlrpc_main+0xf37/0x10f0 [ptlrpc]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff8005dfb1&gt;] child_rip+0xa/0x11<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff888328e0&gt;] ptlrpc_main+0x0/0x10f0 [ptlrpc]<br>
May 30 11:06:31 boss07 kernel: [&lt;ffffffff8005dfa7&gt;] child_rip+0x0/0x11<br>
May 30 11:06:31 boss07 kernel:<br>
May 30 11:06:31 boss07 kernel: LustreError: dumping log to /tmp/lustre-log.1338347191.8011<br>
<br>
<br>
Client side:<br>
<br>
May 30 09:58:36 ccopt kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.123@tcp. The ost_connect operation failed with -16<br>
<br>
When this error occurs, commands such as "ls",
"df", "vi" and "touch" fail, which prevents us
from doing anything in the file system.<br>
I think the ost_connect failure should report an
error message to users instead of leaving
interactive commands stuck.<br>
<br>
Could someone give us some advice or suggestions
for solving this problem?<br>
<br>
Thank you very much in advance.<br>
<br>
<br>
Best Regards<br>
Qiulan Huang<br>
2012-05-30<br>
====================================================================<br>
Computing Center, the Institute of High Energy
Physics, China<br>
Huang, Qiulan Tel: (+86) 10
8823 6010-105<br>
P.O. Box 918-7 Fax: (+86) 10
8823 6839<br>
Beijing 100049 P.R. China Email: <a
moz-do-not-send="true"
href="mailto:huangql@ihep.ac.cn">huangql@ihep.ac.cn</a><br>
===================================================================<span
style="WHITE-SPACE: pre" class="Apple-tab-span">
</span><br>
<br>
<br>
<br>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</font></div>
</blockquote>
<br>
</body>
</html>