[Lustre-discuss] Thread might be hung, Heavy IO Load messages

David Noriega tsk133 at my.utsa.edu
Wed Feb 1 10:57:50 PST 2012


As of late I've been seeing a lot of these messages:

Lustre: Service thread pid 22974 was inactive for 200.00s. The thread
might be hung, or it might only be slow and will resume later. Dumping
the stack trace for debugging purposes:
Pid: 22974, comm: ll_ost_io_233

Call Trace:
 [<ffffffff8006e1db>] do_gettimeofday+0x40/0x90
 [<ffffffff8001546f>] sync_buffer+0x0/0x3f
 [<ffffffff800637ea>] io_schedule+0x3f/0x67
 [<ffffffff800154aa>] sync_buffer+0x3b/0x3f
 [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
 [<ffffffff8001546f>] sync_buffer+0x0/0x3f
 [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a0ae0>] wake_bit_function+0x0/0x23
 [<ffffffff88945bc8>] bh_submit_read+0x58/0x70 [ldiskfs]
 [<ffffffff88945ef8>] read_block_bitmap+0xc8/0x1c0 [ldiskfs]
 [<ffffffff88968101>] ldiskfs_mb_free_blocks+0x191/0x5d0 [ldiskfs]
 [<ffffffff8894a0d1>] ldiskfs_mark_iloc_dirty+0x411/0x480 [ldiskfs]
 [<ffffffff88030d09>] do_get_write_access+0x4f9/0x530 [jbd]
 [<ffffffff80007691>] find_get_page+0x21/0x51
 [<ffffffff80010c54>] __find_get_block_slow+0x2f/0xf7
 [<ffffffff88946d4d>] ldiskfs_free_blocks+0x8d/0xe0 [ldiskfs]
 [<ffffffff8895ee66>] ldiskfs_ext_remove_space+0x3a6/0x740 [ldiskfs]
 [<ffffffff88960401>] ldiskfs_ext_truncate+0x161/0x1f0 [ldiskfs]
 [<ffffffff8894c881>] ldiskfs_truncate+0xc1/0x610 [ldiskfs]
 [<ffffffff800cd34f>] unmap_mapping_range+0x59/0x204
 [<ffffffff8894a0d1>] ldiskfs_mark_iloc_dirty+0x411/0x480 [ldiskfs]
 [<ffffffff800cdd9d>] vmtruncate+0xa2/0xc9
 [<ffffffff800417a6>] inode_setattr+0x22/0x104
 [<ffffffff8894df3b>] ldiskfs_setattr+0x1eb/0x270 [ldiskfs]
 [<ffffffff889ca037>] fsfilt_ldiskfs_setattr+0x1a7/0x250 [fsfilt_ldiskfs]
 [<ffffffff889e5551>] filter_version_get_check+0x91/0x2a0 [obdfilter]
 [<ffffffff800645ab>] __down_write_nested+0x12/0x92
 [<ffffffff885b1378>] cfs_alloc+0x68/0xc0 [libcfs]
 [<ffffffff889f2bfb>] filter_destroy+0xd9b/0x1fb0 [obdfilter]
 [<ffffffff886f0bc0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
 [<ffffffff886f42a0>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
 [<ffffffff88703549>] ldlm_srv_pool_recalc+0x79/0x220 [ptlrpc]
 [<ffffffff88719924>] lustre_msg_add_version+0x34/0x110 [ptlrpc]
 [<ffffffff8871c62a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
 [<ffffffff886dba4c>] ldlm_resource_putref+0x34c/0x3c0 [ptlrpc]
 [<ffffffff886d68d2>] ldlm_lock_put+0x372/0x3d0 [ptlrpc]
 [<ffffffff8871c739>] lustre_pack_reply+0x29/0xb0 [ptlrpc]
 [<ffffffff889a4050>] ost_destroy+0x660/0x790 [ost]
 [<ffffffff88718a78>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
 [<ffffffff887188c5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
 [<ffffffff889ada26>] ost_handle+0x1556/0x55b0 [ost]
 [<ffffffff800dbcaa>] free_block+0x126/0x143
 [<ffffffff800dbeec>] __drain_alien_cache+0x51/0x66
 [<ffffffff88725c37>] ptlrpc_server_handle_request+0xaa7/0x1150 [ptlrpc]
 [<ffffffff800470f3>] try_to_wake_up+0x472/0x484
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff8008b4a5>] __wake_up_common+0x3e/0x68
 [<ffffffff88729698>] ptlrpc_main+0x1258/0x1420 [ptlrpc]
 [<ffffffff8008d07b>] default_wake_function+0x0/0xe
 [<ffffffff800b7a9c>] audit_syscall_exit+0x336/0x362
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff88728440>] ptlrpc_main+0x0/0x1420 [ptlrpc]
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

or

Pid: 13507, comm: ll_ost_io_205
LustreError: dumping log to /tmp/lustre-log.1328117954.22974
Lustre: lustre-OST0001: slow journal start 34s due to heavy IO load
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0001: slow brw_start 34s due to heavy IO load
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0001: slow journal start 36s due to heavy IO load
Lustre: Skipped 6 previous similar messages
Lustre: lustre-OST0001: slow journal start 37s due to heavy IO load
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0001: slow commitrw commit 37s due to heavy IO load
Lustre: lustre-OST0001: slow i_mutex 34s due to heavy IO load
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0001: slow i_mutex 34s due to heavy IO load
Lustre: lustre-OST0000: slow setattr 44s due to heavy IO load
Lustre: lustre-OST0000: slow setattr 44s due to heavy IO load
Lustre: lustre-OST0001: slow setattr 106s due to heavy IO load
Lustre: Skipped 2 previous similar messages
Lustre: lustre-OST0001: slow setattr 106s due to heavy IO load
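
If it helps to quantify how often this is happening, something like the
quick sketch below could tally the slow/hung-thread messages per hour out
of syslog (this assumes they land in /var/log/messages, which is just a
guess at the usual location; adjust for your syslog setup):

    #!/usr/bin/env python
    # Rough per-hour tally of the "slow ... due to heavy IO load" and
    # "Service thread ... was inactive" messages from syslog.
    import re
    from collections import defaultdict

    LOG = "/var/log/messages"   # assumption: default syslog location
    PATTERNS = {
        "slow (heavy IO load)":    re.compile(r"due to heavy IO load"),
        "inactive service thread": re.compile(r"Service thread pid \d+ was inactive"),
    }

    counts = defaultdict(lambda: defaultdict(int))
    with open(LOG) as f:
        for line in f:
            hour = line[:9]   # "Feb  1 10" -- month, day, hour in standard syslog format
            for name, pat in PATTERNS.items():
                if pat.search(line):
                    counts[hour][name] += 1

    for hour in sorted(counts):
        summary = ", ".join("%s: %d" % (n, c) for n, c in sorted(counts[hour].items()))
        print("%s  %s" % (hour, summary))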

On the MDS I see the following (rc = -107 would be -ENOTCONN, if I'm
reading it right):

LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0x234ca921 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0xd03d1 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0x14e2441 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0xc1861 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0xd0581 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0xd0272 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0xc8042 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0xc805c sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0xc805c sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0xd024d sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0xd0311 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0xd0316 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0xd032a sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0xd032b sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0xc8fd6 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0xc1847 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0x234ca903 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error
creating fid 0x234ca904 sub-object on OST idx 3/4: rc = -107
Lustre: Service thread pid 25775 was inactive for 222.00s. The thread
might be hung, or it might only be slow and will resume later. Dumping
the stack trace for debugging purposes:
Lustre: Skipped 3 previous similar messages
Pid: 25775, comm: ll_mdt_117

Call Trace:
 [<ffffffff800638ab>] schedule_timeout+0x8a/0xad
 [<ffffffff80097d9f>] process_timeout+0x0/0x5
 [<ffffffff88809c75>] osc_create+0xc75/0x13d0 [osc]
 [<ffffffff8008d07b>] default_wake_function+0x0/0xe
 [<ffffffff888b7cbb>] qos_remedy_create+0x45b/0x570 [lov]
 [<ffffffff888b1be3>] lov_fini_create_set+0x243/0x11e0 [lov]
 [<ffffffff888a5982>] lov_create+0x1552/0x1860 [lov]
 [<ffffffff888a65b6>] lov_iocontrol+0x926/0xf0f [lov]
 [<ffffffff8008d07b>] default_wake_function+0x0/0xe
 [<ffffffff88a6140a>] mds_finish_open+0x1fea/0x43e0 [mds]
 [<ffffffff88030d09>] do_get_write_access+0x4f9/0x530 [jbd]
 [<ffffffff889740d1>] ldiskfs_mark_iloc_dirty+0x411/0x480 [ldiskfs]
 [<ffffffff88974796>] ldiskfs_mark_inode_dirty+0x136/0x160 [ldiskfs]
 [<ffffffff889740d1>] ldiskfs_mark_iloc_dirty+0x411/0x480 [ldiskfs]
 [<ffffffff88a6845e>] mds_open+0x2cce/0x35f8 [mds]
 [<ffffffff887cddbf>] ksocknal_find_conn_locked+0xcf/0x1f0 [ksocklnd]
 [<ffffffff887cfef5>] ksocknal_alloc_tx+0x1f5/0x2a0 [ksocklnd]
 [<ffffffff88a3ef89>] mds_reint_rec+0x1d9/0x2b0 [mds]
 [<ffffffff88a6ac72>] mds_open_unpack+0x312/0x430 [mds]
 [<ffffffff88a31e7a>] mds_reint+0x35a/0x420 [mds]
 [<ffffffff88a30d8a>] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
 [<ffffffff88a3bbfc>] mds_intent_policy+0x4ac/0xc80 [mds]
 [<ffffffff887058b6>] ldlm_resource_putref+0x1b6/0x3c0 [ptlrpc]
 [<ffffffff88702eb6>] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
 [<ffffffff886ff7fd>] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
 [<ffffffff88727720>] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
 [<ffffffff88724849>] ldlm_handle_enqueue+0xbf9/0x1210 [ptlrpc]
 [<ffffffff88a3ab20>] mds_handle+0x4130/0x4d60 [mds]
 [<ffffffff88633be5>] lnet_match_blocked_msg+0x375/0x390 [lnet]
 [<ffffffff88748705>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
 [<ffffffff8874fc37>] ptlrpc_server_handle_request+0xaa7/0x1150 [ptlrpc]
 [<ffffffff800470f3>] try_to_wake_up+0x472/0x484
 [<ffffffff8008b4a5>] __wake_up_common+0x3e/0x68
 [<ffffffff88753698>] ptlrpc_main+0x1258/0x1420 [ptlrpc]
 [<ffffffff8008d07b>] default_wake_function+0x0/0xe
 [<ffffffff800b7a9c>] audit_syscall_exit+0x336/0x362
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff88752440>] ptlrpc_main+0x0/0x1420 [ptlrpc]
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

LustreError: dumping log to /tmp/lustre-log.1328058813.25775
Lustre: Service thread pid 25775 completed after 223.60s. This
indicates the system was overloaded (too many service threads, or
there were not enough hardware resources).
Lustre: Skipped 3 previous similar messages
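
Since that last message mentions "too many service threads", I assume the
OSS thread counts are worth checking. Something like the sketch below
should read them, assuming the 1.8-style lctl parameter names
(ost.OSS.ost_io.threads_*) apply to this install and it is run on the OSS
as root:

    #!/usr/bin/env python
    # Print the OSS I/O service-thread settings via lctl get_param.
    # Parameter names are the 1.8-era ones; other versions may differ.
    import subprocess

    for param in ("ost.OSS.ost_io.threads_min",
                  "ost.OSS.ost_io.threads_max",
                  "ost.OSS.ost_io.threads_started"):
        try:
            out = subprocess.check_output(["lctl", "get_param", param])
            print(out.decode().strip())
        except (OSError, subprocess.CalledProcessError):
            print("%s: could not read (different Lustre version?)" % param)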


What do these messages mean?




-- 
David Noriega
System Administrator
Computational Biology Initiative
High Performance Computing Center
University of Texas at San Antonio
One UTSA Circle
San Antonio, TX 78249
Office: BSE 3.112
Phone: 210-458-7100
http://www.cbi.utsa.edu


