[Lustre-discuss] The ost_connect operation failed with -16
huangql
huangql at ihep.ac.cn
Tue May 29 20:21:47 PDT 2012
Dear all,
Recently we found a problem on our OSS: some service threads appear to hang when the server is under heavy I/O load. When this happens, some clients are evicted or refused by some OSTs, and we see the following error messages:
Server side:
May 30 11:06:31 boss07 kernel: Lustre: Service thread pid 8011 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
May 30 11:06:31 boss07 kernel: Lustre: Skipped 1 previous similar message
May 30 11:06:31 boss07 kernel: Pid: 8011, comm: ll_ost_71
May 30 11:06:31 boss07 kernel:
May 30 11:06:31 boss07 kernel: Call Trace:
May 30 11:06:31 boss07 kernel: [<ffffffff886f5d0e>] start_this_handle+0x301/0x3cb [jbd2]
May 30 11:06:31 boss07 kernel: [<ffffffff800a09ca>] autoremove_wake_function+0x0/0x2e
May 30 11:06:31 boss07 kernel: [<ffffffff886f5e83>] jbd2_journal_start+0xab/0xdf [jbd2]
May 30 11:06:31 boss07 kernel: [<ffffffff888ce9b2>] fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs]
May 30 11:06:31 boss07 kernel: [<ffffffff88920551>] filter_version_get_check+0x91/0x2a0 [obdfilter]
May 30 11:06:31 boss07 kernel: [<ffffffff80036cf4>] __lookup_hash+0x61/0x12f
May 30 11:06:31 boss07 kernel: [<ffffffff8893108d>] filter_setattr_internal+0x90d/0x1de0 [obdfilter]
May 30 11:06:31 boss07 kernel: [<ffffffff800e859b>] lookup_one_len+0x53/0x61
May 30 11:06:31 boss07 kernel: [<ffffffff88925452>] filter_fid2dentry+0x512/0x740 [obdfilter]
May 30 11:06:31 boss07 kernel: [<ffffffff88924e27>] filter_fmd_get+0x2b7/0x320 [obdfilter]
May 30 11:06:31 boss07 kernel: [<ffffffff8003027b>] __up_write+0x27/0xf2
May 30 11:06:31 boss07 kernel: [<ffffffff88932721>] filter_setattr+0x1c1/0x3b0 [obdfilter]
May 30 11:06:31 boss07 kernel: [<ffffffff8882677a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff8881e658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff88822b05>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff888b0abb>] ost_handle+0x25db/0x55b0 [ost]
May 30 11:06:31 boss07 kernel: [<ffffffff80150d56>] __next_cpu+0x19/0x28
May 30 11:06:31 boss07 kernel: [<ffffffff800767ae>] smp_send_reschedule+0x4e/0x53
May 30 11:06:31 boss07 kernel: [<ffffffff8883215a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff888328a8>] ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff8008b3bd>] __wake_up_common+0x3e/0x68
May 30 11:06:31 boss07 kernel: [<ffffffff88833817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
May 30 11:06:31 boss07 kernel: [<ffffffff888328e0>] ptlrpc_main+0x0/0x10f0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
May 30 11:06:31 boss07 kernel:
May 30 11:06:31 boss07 kernel: LustreError: dumping log to /tmp/lustre-log.1338347191.8011
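As a reading aid for the trace above, here is a minimal Python sketch that pulls the symbol and module name out of each call-trace line, so it is easier to see where the thread was waiting (the top frame is start_this_handle in jbd2). The function name parse_frames and the regular expression are my own assumptions based on the line format shown here; other kernels may format traces differently.

```python
import re

# Assumed frame format, taken from the trace above:
#   [<address>] symbol+0xOFFSET/0xSIZE [module]
# The trailing [module] is absent for symbols in the core kernel.
_FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)

def parse_frames(lines):
    """Return a list of (symbol, module_or_None), one per matching line."""
    frames = []
    for line in lines:
        m = _FRAME_RE.search(line)
        if m:
            frames.append((m.group(1), m.group(2)))
    return frames

# First three frames copied from the trace above.
sample = [
    "May 30 11:06:31 boss07 kernel: [<ffffffff886f5d0e>] start_this_handle+0x301/0x3cb [jbd2]",
    "May 30 11:06:31 boss07 kernel: [<ffffffff800a09ca>] autoremove_wake_function+0x0/0x2e",
    "May 30 11:06:31 boss07 kernel: [<ffffffff886f5e83>] jbd2_journal_start+0xab/0xdf [jbd2]",
]

frames = parse_frames(sample)
for symbol, module in frames:
    print(symbol, module or "(kernel)")
```

Lines that do not look like frames (the "Call Trace:" header, empty kernel lines) are skipped, so the whole syslog excerpt can be passed in unchanged.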
Client side:
May 30 09:58:36 ccopt kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.123 at tcp. The ost_connect operation failed with -16
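For reference, the -16 in this message is a negated Linux errno value, and errno 16 is EBUSY ("Device or resource busy"). A minimal Python sketch for decoding such lines follows; the function name decode_lustre_rc and the pattern are assumptions based only on the message format quoted above, not on any Lustre tool.

```python
import errno
import re

# Pattern assumed from the client log line quoted above; other Lustre
# versions may word the message differently.
_FAILED_RE = re.compile(r"The (\w+) operation failed with (-?\d+)")

def decode_lustre_rc(line):
    """Return (operation, errno_name) for a matching line, else None."""
    m = _FAILED_RE.search(line)
    if m is None:
        return None
    op, rc = m.group(1), int(m.group(2))
    # Kernel return codes are negated errno values, so -16 maps to errno 16.
    name = errno.errorcode.get(abs(rc), "UNKNOWN")
    return op, name

line = (
    "May 30 09:58:36 ccopt kernel: LustreError: 11-0: an error occurred "
    "while communicating with 192.168.50.123 at tcp. "
    "The ost_connect operation failed with -16"
)
print(decode_lustre_rc(line))
```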
Once a client gets this error message, commands such as "ls", "df", "vi", and "touch" fail to run, which prevents us from doing anything in the file system.
I think an ost_connect failure should report an error message to the user instead of leaving interactive commands stuck.
Could someone offer advice or suggestions on how to solve this problem?
Thank you very much in advance.
Best Regards
Qiulan Huang
2012-05-30
====================================================================
Computing Center, the Institute of High Energy Physics, China
Huang, Qiulan Tel: (+86) 10 8823 6010-105
P.O. Box 918-7 Fax: (+86) 10 8823 6839
Beijing 100049 P.R. China Email: huangql at ihep.ac.cn
===================================================================