[Lustre-discuss] [SPAM] Re: [wc-discuss] The ost_connect operation failed with -16

Niu niu at whamcloud.com
Thu May 31 02:23:00 PDT 2012


Hi, Qiulan

LU-952 is about a deadlock issue. Was quota enabled? You could try
disabling quota and see whether the problem goes away.
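
For reference, a minimal sketch of checking and disabling quota, assuming
Lustre 1.8/2.1-era quota tools and a client mount point of /mnt/bes3fs
(a hypothetical path; adjust to your setup):

  # check whether quota is currently enforced for a user
  lfs quota -u someuser /mnt/bes3fs

  # turn off user/group quota enforcement for the whole filesystem
  lfs quotaoff -ug /mnt/bes3fs

  # turn it back on later, once the servers carry the LU-952 fix
  lfs quotaon -ug /mnt/bes3fs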

Thanks
- Niu

> Hi Zhen,
> Many thanks for your prompt reply. I have disabled
> writethrough_cache and read_cache to see whether that helps, but
> threads still hang under heavy IO; the trace below shows one of them.
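>
> (For reference, a minimal sketch of how those two OSS caches are usually
> toggled, assuming the standard obdfilter tunables; run on each OSS, and
> note that lctl set_param does not persist across a remount:)
>
>   lctl get_param obdfilter.*.writethrough_cache_enable
>   lctl get_param obdfilter.*.read_cache_enable
>   lctl set_param obdfilter.*.writethrough_cache_enable=0
>   lctl set_param obdfilter.*.read_cache_enable=0
>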
> May 31 08:38:43 boss33 kernel: LustreError: dumping log to
> /tmp/lustre-log.1338424722.5303
> May 31 08:38:43 boss33 kernel: Pid: 5262, comm: ll_ost_io_48
> May 31 08:38:43 boss33 kernel:
> May 31 08:38:43 boss33 kernel: Call Trace:
> May 31 08:38:43 boss33 kernel: [<ffffffff887bd451>]
> ksocknal_queue_tx_locked+0x451/0x490 [ksocklnd]
> May 31 08:38:43 boss33 kernel: [<ffffffff800646ac>] __down_read+0x7a/0x92
> May 31 08:38:43 boss33 kernel: [<ffffffff889843df>]
> ldiskfs_get_blocks+0x5f/0x2e0 [ldiskfs]
> May 31 08:38:43 boss33 kernel: [<ffffffff889851a0>]
> ldiskfs_get_block+0xc0/0x120 [ldiskfs]
> May 31 08:38:43 boss33 kernel: [<ffffffff88981f60>]
> ldiskfs_bmap+0x0/0xf0 [ldiskfs]
> May 31 08:38:43 boss33 kernel: [<ffffffff80033615>]
> generic_block_bmap+0x37/0x41
> May 31 08:38:43 boss33 kernel: [<ffffffff800341ad>]
> mapping_tagged+0x3c/0x47
> May 31 08:38:43 boss33 kernel: [<ffffffff88981f88>]
> ldiskfs_bmap+0x28/0xf0 [ldiskfs]
> May 31 08:38:43 boss33 kernel: [<ffffffff88981f60>]
> ldiskfs_bmap+0x0/0xf0 [ldiskfs]
> May 31 08:38:43 boss33 kernel: [<ffffffff88a4a288>]
> filter_commitrw_write+0x398/0x2be0 [obdfilter]
> May 31 08:38:43 boss33 kernel: [<ffffffff889e6e5c>]
> ost_checksum_bulk+0x30c/0x5b0 [ost]
> May 31 08:38:43 boss33 kernel: [<ffffffff889e6c38>]
> ost_checksum_bulk+0xe8/0x5b0 [ost]
> May 31 08:38:43 boss33 kernel: [<ffffffff889edcf9>]
> ost_brw_write+0x1c99/0x2480 [ost]
> May 31 08:38:43 boss33 kernel: [<ffffffff8872e658>]
> ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
> May 31 08:38:43 boss33 kernel: [<ffffffff886f98b0>]
> target_committed_to_req+0x40/0x120 [ptlrpc]
> May 31 08:38:43 boss33 kernel: [<ffffffff8008cf93>]
> default_wake_function+0x0/0xe
> May 31 08:38:43 boss33 kernel: [<ffffffff88732bc8>]
> lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
> May 31 08:38:43 boss33 kernel: [<ffffffff889f108e>]
> ost_handle+0x2bae/0x55b0 [ost]
> May 31 08:38:43 boss33 kernel: [<ffffffff80150d56>] __next_cpu+0x19/0x28
> May 31 08:38:43 boss33 kernel: [<ffffffff800767ae>]
> smp_send_reschedule+0x4e/0x53
> May 31 08:38:43 boss33 kernel: [<ffffffff8874215a>]
> ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
> May 31 08:38:43 boss33 kernel: [<ffffffff887428a8>]
> ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
> May 31 08:38:43 boss33 kernel: [<ffffffff8008b3bd>]
> __wake_up_common+0x3e/0x68
> May 31 08:38:43 boss33 kernel: [<ffffffff88743817>]
> ptlrpc_main+0xf37/0x10f0 [ptlrpc]
> May 31 08:38:43 boss33 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
> May 31 08:38:43 boss33 kernel: [<ffffffff887428e0>]
> ptlrpc_main+0x0/0x10f0 [ptlrpc]
> May 31 08:38:43 boss33 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
> May 31 08:38:43 boss33 kernel:
> May 31 08:38:43 boss33 kernel: LustreError: dumping log to
> /tmp/lustre-log.1338424722.5262
> May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow journal
> start 48s due to heavy IO load
> May 31 08:38:43 boss33 kernel: Lustre: Skipped 1 previous similar message
> May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow brw_start
> 48s due to heavy IO load
> May 31 08:38:43 boss33 kernel: Lustre: Skipped 1 previous similar message
> May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow journal
> start 187s due to heavy IO load
> May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow brw_start
> 187s due to heavy IO load
> I have not applied the patch because all the servers are online. Do
> you know how to deal with this without affecting users?
> Thank you very much.
> Cheers,
> Qiulan
> ====================================================================
> Computing Center, the Institute of High Energy Physics, China
> Huang, Qiulan Tel: (+86) 10 8823 6010-105
> P.O. Box 918-7 Fax: (+86) 10 8823 6839
> Beijing 100049 P.R. China Email: huangql at ihep.ac.cn
> ===================================================================
> 2012-05-31
> ------------------------------------------------------------------------
> huangql
> ------------------------------------------------------------------------
> *From:* Liang Zhen
> *Sent:* 2012-05-30 19:12:15
> *To:* huangql
> *Cc:* lustre-discuss; wc-discuss
> *Subject:* [SPAM] Re: [wc-discuss] The ost_connect operation failed with -16
> Hi, I think you might be hitting this:
> http://jira.whamcloud.com/browse/LU-952 ; you can find the patch in
> that ticket.
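>
> (If you apply it, it may help to first confirm the exact Lustre version
> running on the servers so the patch matches; a minimal check, assuming
> RPM-based server installs:)
>
>   lctl get_param version
>   rpm -qa | grep -i lustre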
>
> Regards
> Liang
>
> On May 30, 2012, at 11:21 AM, huangql wrote:
>
>> Dear all,
>>
>> Recently we found a problem on the OSS where some threads may hang
>> when the server is under heavy IO load. In this case, some clients are
>> evicted or refused by some OSTs and get error messages like the following:
>>
>> Server side:
>>
>> May 30 11:06:31 boss07 kernel: Lustre: Service thread pid 8011 was
>> inactive for 200.00s. The thread might be hung, or it might only be
>> slow and will resume later. Dumping the stack trace for debugging
>> purposes:
>> May 30 11:06:31 boss07 kernel: Lustre: Skipped 1 previous similar message
>> May 30 11:06:31 boss07 kernel: Pid: 8011, comm: ll_ost_71
>> May 30 11:06:31 boss07 kernel:
>> May 30 11:06:31 boss07 kernel: Call Trace:
>> May 30 11:06:31 boss07 kernel: [<ffffffff886f5d0e>]
>> start_this_handle+0x301/0x3cb [jbd2]
>> May 30 11:06:31 boss07 kernel: [<ffffffff800a09ca>]
>> autoremove_wake_function+0x0/0x2e
>> May 30 11:06:31 boss07 kernel: [<ffffffff886f5e83>]
>> jbd2_journal_start+0xab/0xdf [jbd2]
>> May 30 11:06:31 boss07 kernel: [<ffffffff888ce9b2>]
>> fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs]
>> May 30 11:06:31 boss07 kernel: [<ffffffff88920551>]
>> filter_version_get_check+0x91/0x2a0 [obdfilter]
>> May 30 11:06:31 boss07 kernel: [<ffffffff80036cf4>]
>> __lookup_hash+0x61/0x12f
>> May 30 11:06:31 boss07 kernel: [<ffffffff8893108d>]
>> filter_setattr_internal+0x90d/0x1de0 [obdfilter]
>> May 30 11:06:31 boss07 kernel: [<ffffffff800e859b>]
>> lookup_one_len+0x53/0x61
>> May 30 11:06:31 boss07 kernel: [<ffffffff88925452>]
>> filter_fid2dentry+0x512/0x740 [obdfilter]
>> May 30 11:06:31 boss07 kernel: [<ffffffff88924e27>]
>> filter_fmd_get+0x2b7/0x320 [obdfilter]
>> May 30 11:06:31 boss07 kernel: [<ffffffff8003027b>] __up_write+0x27/0xf2
>> May 30 11:06:31 boss07 kernel: [<ffffffff88932721>]
>> filter_setattr+0x1c1/0x3b0 [obdfilter]
>> May 30 11:06:31 boss07 kernel: [<ffffffff8882677a>]
>> lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
>> May 30 11:06:31 boss07 kernel: [<ffffffff8881e658>]
>> ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
>> May 30 11:06:31 boss07 kernel: [<ffffffff88822b05>]
>> lustre_msg_get_version+0x35/0xf0 [ptlrpc]
>> May 30 11:06:31 boss07 kernel: [<ffffffff888b0abb>]
>> ost_handle+0x25db/0x55b0 [ost]
>> May 30 11:06:31 boss07 kernel: [<ffffffff80150d56>] __next_cpu+0x19/0x28
>> May 30 11:06:31 boss07 kernel: [<ffffffff800767ae>]
>> smp_send_reschedule+0x4e/0x53
>> May 30 11:06:31 boss07 kernel: [<ffffffff8883215a>]
>> ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
>> May 30 11:06:31 boss07 kernel: [<ffffffff888328a8>]
>> ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
>> May 30 11:06:31 boss07 kernel: [<ffffffff8008b3bd>]
>> __wake_up_common+0x3e/0x68
>> May 30 11:06:31 boss07 kernel: [<ffffffff88833817>]
>> ptlrpc_main+0xf37/0x10f0 [ptlrpc]
>> May 30 11:06:31 boss07 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
>> May 30 11:06:31 boss07 kernel: [<ffffffff888328e0>]
>> ptlrpc_main+0x0/0x10f0 [ptlrpc]
>> May 30 11:06:31 boss07 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
>> May 30 11:06:31 boss07 kernel:
>> May 30 11:06:31 boss07 kernel: LustreError: dumping log to
>> /tmp/lustre-log.1338347191.8011
>>
>>
>> Client side:
>>
>> May 30 09:58:36 ccopt kernel: LustreError: 11-0: an error occurred
>> while communicating with 192.168.50.123 at tcp. The ost_connect
>> operation failed with -16
>>
>> When we get this error message, commands such as "ls", "df", "vi" and
>> "touch" fail, which prevents us from doing anything in the file system.
>> I think the ost_connect failure could report an error message to users
>> instead of leaving interactive commands stuck.
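>>
>> (A quick way to see which OSTs an affected client can still reach; a
>> minimal sketch, assuming the standard lfs client tools, run as root on
>> the client:)
>>
>>   lfs check servers
>>   lfs df -h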
>>
>> Could someone give us some advice or any suggestions to solve this
>> problem?
>>
>> Thank you very much in advance.
>>
>>
>> Best Regards
>> Qiulan Huang
>> 2012-05-30
>> ====================================================================
>> Computing Center, the Institute of High Energy Physics, China
>> Huang, Qiulan Tel: (+86) 10 8823 6010-105
>> P.O. Box 918-7 Fax: (+86) 10 8823 6839
>> Beijing 100049 P.R. China Email: huangql at ihep.ac.cn
>> ===================================================================
>>
>>
>>
>


