[Lustre-discuss] OSTs hanging while running IOR
Rafael David Tinoco
Rafael.Tinoco at Sun.COM
Wed Sep 9 10:31:23 PDT 2009
Has anyone seen this kind of error while running IOR or other
benchmarks?
I'm running Lustre 1.8.1 on CentOS 5.3.
I have the following configuration:
4 J4400 JBODs connected to 4 OSSs.
Each OSS has 3 OSTs (RAID5, 8 disks) connected using multipathd, with mdadm on
/dev/dm* and the mptfusion driver (for the J4400 JBODs).
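For reference, the layout described above can be put into numbers with a minimal Python sketch. The OSS, OST and disk counts come from the post, reading "8 disks" as 8 disks per RAID5 OST. The 64 KiB md chunk size is purely an assumption (the post does not state it), and 1 MiB is used as the comparison size because that is the default Lustre bulk RPC size. This only illustrates the geometry; it does not by itself explain the hang.

```python
# Counts taken from the post (8 disks read as "per OST").
OSS_COUNT = 4
OSTS_PER_OSS = 3
DISKS_PER_OST = 8

# ASSUMPTION: the md chunk size is not given in the post; 64 KiB is a guess.
ASSUMED_CHUNK_KIB = 64
# Default Lustre bulk RPC size (256 pages of 4 KiB).
RPC_KIB = 1024


def raid5_geometry(disks, chunk_kib):
    """Return (data_disks, full_stripe_kib) for a single RAID5 array."""
    data_disks = disks - 1  # RAID5 spends one disk's worth of space on parity
    return data_disks, data_disks * chunk_kib


total_osts = OSS_COUNT * OSTS_PER_OSS
total_disks = total_osts * DISKS_PER_OST
data_disks, full_stripe_kib = raid5_geometry(DISKS_PER_OST, ASSUMED_CHUNK_KIB)

# True only if a 1 MiB RPC covers a whole number of full RAID5 stripes.
rpc_is_stripe_aligned = RPC_KIB % full_stripe_kib == 0

print(f"OSTs: {total_osts}, disks: {total_disks}")
print(f"data disks per OST: {data_disks}, full stripe: {full_stripe_kib} KiB")
print(f"1 MiB RPC aligned to full stripes: {rpc_is_stripe_aligned}")
```

With 7 data disks per array, no power-of-two chunk size yields a full stripe that divides 1 MiB evenly, so under these assumptions most writes would be partial-stripe writes on the md layer.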
Every time I run:
mpirun -hostfile ./lustre.hosts -np 20 /hpc/IOR -w -r -C -i 2 -b 1000M -t 128k -F -o /work/stripe12/teste
(especially with -b 1000M)
one of my OSSs crashes, sometimes one, sometimes another, with the following
error:
Sep 9 07:43:40 a01n00 kernel: ll_ost_io_64 D ffff81037fea80c0 0 20381
1 20382 20380 (L-TLB)
Sep 9 07:43:40 a01n00 kernel: ffff81036316b510 0000000000000046
0000000000000003 0000040000000282
Sep 9 07:43:40 a01n00 kernel: 0000000000000100 0000000000000009
ffff81037ac09100 ffff81037fea80c0
Sep 9 07:43:40 a01n00 kernel: 0000088160738e93 0000000000313ec1
ffff81037ac092e8 0000000328b65740
Sep 9 07:43:40 a01n00 kernel: Call Trace:
Sep 9 07:43:40 a01n00 kernel: [<ffffffff80033608>] submit_bio+0xcd/0xd4
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88b14aac>]
:obdfilter:filter_do_bio+0x95c/0xb60
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88ae0f24>]
:fsfilt_ldiskfs:fsfilt_ldiskfs_write_record+0x464/0x4b0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88b014f0>]
:obdfilter:filter_commit_cb+0x0/0x2d0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88031749>]
:jbd:journal_callback_set+0x2d/0x47
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8009daef>]
autoremove_wake_function+0x0/0x2e
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88b15974>]
:obdfilter:filter_direct_io+0xcc4/0xd50
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8892ad70>]
:lquota:filter_quota_acquire+0x0/0x120
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88b17c08>]
:obdfilter:filter_commitrw_write+0x1558/0x25b0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88730d23>]
:lnet:lnet_send+0x973/0x9a0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88790c11>]
:obdclass:class_handle2object+0xd1/0x160
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88abc048>]
:ost:ost_checksum_bulk+0x358/0x590
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88ac2b1e>]
:ost:ost_brw_write+0x1b8e/0x2310
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88837c88>]
:ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88803320>]
:ptlrpc:target_committed_to_req+0x40/0x120
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88abe67c>]
:ost:ost_brw_read+0x182c/0x19e0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8883c025>]
:ptlrpc:lustre_msg_get_version+0x35/0xf0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8883c0e8>]
:ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88ac60fb>]
:ost:ost_handle+0x2e5b/0x5a70
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88735305>]
:lnet:lnet_match_blocked_msg+0x375/0x390
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88811aea>]
:ptlrpc:ldlm_resource_foreach+0x25a/0x390
Sep 9 07:43:40 a01n00 kernel: [<ffffffff80148d4f>] __next_cpu+0x19/0x28
Sep 9 07:43:40 a01n00 kernel: [<ffffffff80148d4f>] __next_cpu+0x19/0x28
Sep 9 07:43:40 a01n00 kernel: [<ffffffff80088f32>]
find_busiest_group+0x20d/0x621
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88841a15>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8884672d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88848e67>]
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Sep 9 07:43:40 a01n00 kernel: [<ffffffff80063098>] thread_return+0x62/0xfe
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8003dc3f>]
lock_timer_base+0x1b/0x3c
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8001ceb8>] __mod_timer+0xb0/0xbe
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8884c908>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8884b6f0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Sep 9 07:43:40 a01n00 kernel:
Sep 9 07:43:40 a01n00 kernel: Lustre: 0:0:(watchdog.c:181:lcw_cb())
Watchdog triggered for pid 27733: it was inactive for 200.00s
Sep 9 07:43:40 a01n00 kernel: Lustre:
0:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process
27733
Sep 9 07:43:40 a01n00 kernel: ll_ost_io_159 D 0000000000000000 0 27733
1 27734 27732 (L-TLB)
Sep 9 07:43:40 a01n00 kernel: ffff810521239510 0000000000000046
0000000000000003 0000040000000282
Sep 9 07:43:40 a01n00 kernel: 0000000000000100 000000000000000a
ffff81067e810860 ffff81033115a040
Sep 9 07:43:40 a01n00 kernel: 00000881604f2d64 00000000000d2465
ffff81067e810a48 000000061ced4140
Sep 9 07:43:40 a01n00 kernel: Call Trace:
Sep 9 07:43:40 a01n00 kernel: [<ffffffff80033608>] submit_bio+0xcd/0xd4
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88b14aac>]
:obdfilter:filter_do_bio+0x95c/0xb60
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88ae0f24>]
:fsfilt_ldiskfs:fsfilt_ldiskfs_write_record+0x464/0x4b0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88b014f0>]
:obdfilter:filter_commit_cb+0x0/0x2d0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88031749>]
:jbd:journal_callback_set+0x2d/0x47
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8009daef>]
autoremove_wake_function+0x0/0x2e
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88b15974>]
:obdfilter:filter_direct_io+0xcc4/0xd50
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8892ad70>]
:lquota:filter_quota_acquire+0x0/0x120
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88b17c08>]
:obdfilter:filter_commitrw_write+0x1558/0x25b0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88ac2b1e>]
:ost:ost_brw_write+0x1b8e/0x2310
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88837c88>]
:ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88803320>]
:ptlrpc:target_committed_to_req+0x40/0x120
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88abe67c>]
:ost:ost_brw_read+0x182c/0x19e0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8883c025>]
:ptlrpc:lustre_msg_get_version+0x35/0xf0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8883c0e8>]
:ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88ac60fb>]
:ost:ost_handle+0x2e5b/0x5a70
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88735305>]
:lnet:lnet_match_blocked_msg+0x375/0x390
Sep 9 07:43:40 a01n00 kernel: [<ffffffff800d74d2>]
__drain_alien_cache+0x51/0x66
Sep 9 07:43:40 a01n00 kernel: [<ffffffff80148d4f>] __next_cpu+0x19/0x28
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88841a15>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff80089d89>] enqueue_task+0x41/0x56
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8884672d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Sep 9 07:43:40 a01n00 kernel: [<ffffffff88848e67>]
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Sep 9 07:43:40 a01n00 kernel: [<ffffffff80088819>]
__wake_up_common+0x3e/0x68
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8884c908>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8884b6f0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Sep 9 07:43:40 a01n00 kernel: ll_ost_io_195 D ffff81038ab8c860 0 27769
1 27770 27768 (L-TLB)
Sep 9 07:43:40 a01n00 kernel: ffff81028a541190 0000000000000046
ffff81028a541120 ffffffff8009daf8
Sep 9 07:43:40 a01n00 kernel: ffff810369dc3b18 000000000000000a
ffff81028a524820 ffff81038ab8c860
Sep 9 07:43:40 a01n00 kernel: 00000881659b85ee 0000000000000429
ffff81028a524a08 0000000000000003
Sep 9 07:43:40 a01n00 kernel: Call Trace:
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8009daf8>]
autoremove_wake_function+0x9/0x2e
Sep 9 07:43:40 a01n00 kernel: [<ffffffff8002e6ba>] __wake_up+0x38/0x4f
Sep 9 07:43:41 a01n00 kernel: [<ffffffff881b8b39>]
:dm_mod:dm_table_unplug_all+0x33/0x42
Sep 9 07:43:41 a01n00 kernel: [<ffffffff886b5e62>]
:raid456:get_active_stripe+0x247/0x4f0
Sep 9 07:43:41 a01n00 kernel: [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe
Sep 9 07:43:41 a01n00 kernel: [<ffffffff886bb4dd>]
:raid456:make_request+0x472/0x9af
Sep 9 07:43:41 a01n00 kernel: [<ffffffff8009daef>]
autoremove_wake_function+0x0/0x2e
Sep 9 07:43:41 a01n00 kernel: [<ffffffff8009daef>]
autoremove_wake_function+0x0/0x2e
Sep 9 07:43:41 a01n00 kernel: [<ffffffff8001c49b>]
generic_make_request+0x1e7/0x1fe
Sep 9 07:43:41 a01n00 kernel: [<ffffffff80023342>] mempool_alloc+0x24/0xda
Sep 9 07:43:41 a01n00 kernel: [<ffffffff80033608>] submit_bio+0xcd/0xd4
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88788656>]
:obdclass:lprocfs_oh_tally+0x26/0x50
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88adf7bc>]
:fsfilt_ldiskfs:fsfilt_ldiskfs_send_bio+0xc/0x20
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88b14711>]
:obdfilter:filter_do_bio+0x5c1/0xb60
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88ae0f24>]
:fsfilt_ldiskfs:fsfilt_ldiskfs_write_record+0x464/0x4b0
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88b014f0>]
:obdfilter:filter_commit_cb+0x0/0x2d0
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88031749>]
:jbd:journal_callback_set+0x2d/0x47
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88adfad0>]
:fsfilt_ldiskfs:fsfilt_ldiskfs_commit_async+0xd0/0x150
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88b15974>]
:obdfilter:filter_direct_io+0xcc4/0xd50
Sep 9 07:43:41 a01n00 kernel: [<ffffffff8892ad70>]
:lquota:filter_quota_acquire+0x0/0x120
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88b17c08>]
:obdfilter:filter_commitrw_write+0x1558/0x25b0
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88730d23>]
:lnet:lnet_send+0x973/0x9a0
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88790c11>]
:obdclass:class_handle2object+0xd1/0x160
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88abc02c>]
:ost:ost_checksum_bulk+0x33c/0x590
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88ac2b1e>]
:ost:ost_brw_write+0x1b8e/0x2310
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88837c88>]
:ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88803320>]
:ptlrpc:target_committed_to_req+0x40/0x120
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88abe67c>]
:ost:ost_brw_read+0x182c/0x19e0
Sep 9 07:43:41 a01n00 kernel: [<ffffffff8883c025>]
:ptlrpc:lustre_msg_get_version+0x35/0xf0
Sep 9 07:43:41 a01n00 kernel: [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe
Sep 9 07:43:41 a01n00 kernel: [<ffffffff8883c0e8>]
:ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88ac60fb>]
:ost:ost_handle+0x2e5b/0x5a70
Sep 9 07:43:41 a01n00 kernel: [<ffffffff800d7290>] free_block+0x126/0x143
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88735305>]
:lnet:lnet_match_blocked_msg+0x375/0x390
Sep 9 07:43:41 a01n00 kernel: [<ffffffff800d74d2>]
__drain_alien_cache+0x51/0x66
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88790c11>]
:obdclass:class_handle2object+0xd1/0x160
Sep 9 07:43:41 a01n00 kernel: [<ffffffff80148d4f>] __next_cpu+0x19/0x28
Sep 9 07:43:41 a01n00 kernel: [<ffffffff80088f32>]
find_busiest_group+0x20d/0x621
Sep 9 07:43:41 a01n00 kernel: [<ffffffff887f719a>]
:ptlrpc:lock_res_and_lock+0xba/0xd0
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88841a15>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Sep 9 07:43:41 a01n00 kernel: [<ffffffff80089d89>] enqueue_task+0x41/0x56
Sep 9 07:43:41 a01n00 kernel: [<ffffffff8884672d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Sep 9 07:43:41 a01n00 kernel: [<ffffffff88848e67>]
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Sep 9 07:43:41 a01n00 kernel: [<ffffffff80088819>]
__wake_up_common+0x3e/0x68
Sep 9 07:43:41 a01n00 kernel: [<ffffffff8884c908>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Sep 9 07:43:41 a01n00 kernel: [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe
Sep 9 07:43:41 a01n00 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Sep 9 07:43:41 a01n00 kernel: [<ffffffff8884b6f0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Sep 9 07:43:41 a01n00 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Sep 9 07:43:41 a01n00 kernel:
Sep 9 07:43:41 a01n00 kernel: ll_ost_io_68 D 0000000000000000 0 20385
1 20386 20384 (L-TLB)
Sep 9 07:43:41 a01n00 kernel: ffff810375ce5510 0000000000000046
0000000000000003 0000040000000282
Sep 9 07:43:41 a01n00 kernel: 0000000000000100 000000000000000a
ffff81037a2ca080 ffff810365f9e860
Sep 9 07:43:41 a01n00 kernel: 000008815549e040 00000000000df2e2
ffff81037a2ca268 0000000730b8cd40
...
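A dump like the one above is easier to triage once it is reduced to a list of blocked threads. The following is a minimal Python sketch, not a Lustre tool: it rejoins the lines that the mail client wrapped, then pulls out each task header (name, state, pid) and each watchdog message. The function names and regular expressions are illustrative assumptions based only on the line shapes visible in this log, and the sample input is copied from it.

```python
import re

# A new syslog record starts with a timestamp such as "Sep 9 07:43:40 ".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} ")
# Task header: "<name> <state> <16 hex digits> <number> <pid> ..."
TASK = re.compile(r"kernel: (\S+)\s+([A-Z])\s+[0-9a-f]{16}\s+\d+\s+(\d+)")
WATCHDOG = re.compile(
    r"Watchdog triggered for pid (\d+): it was inactive for ([\d.]+)s"
)


def unwrap(lines):
    """Join mail-wrapped continuation lines onto the record they belong to."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if TIMESTAMP.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def summarize(lines):
    """Return ({pid: (task_name, state)}, {pid: inactive_seconds})."""
    tasks, watchdogs = {}, {}
    for record in unwrap(lines):
        task = TASK.search(record)
        if task:
            tasks[int(task.group(3))] = (task.group(1), task.group(2))
        watchdog = WATCHDOG.search(record)
        if watchdog:
            watchdogs[int(watchdog.group(1))] = float(watchdog.group(2))
    return tasks, watchdogs


# Sample lines copied verbatim from the log above.
sample = [
    "Sep 9 07:43:40 a01n00 kernel: ll_ost_io_64 D ffff81037fea80c0 0 20381",
    "1 20382 20380 (L-TLB)",
    "Sep 9 07:43:40 a01n00 kernel: [<ffffffff80033608>] submit_bio+0xcd/0xd4",
    "Sep 9 07:43:40 a01n00 kernel: Lustre: 0:0:(watchdog.c:181:lcw_cb())",
    "Watchdog triggered for pid 27733: it was inactive for 200.00s",
    "Sep 9 07:43:40 a01n00 kernel: ll_ost_io_159 D 0000000000000000 0 27733",
    "1 27734 27732 (L-TLB)",
]

tasks, watchdogs = summarize(sample)
for pid, (name, state) in sorted(tasks.items()):
    print(pid, name, state, watchdogs.get(pid, "no watchdog message"))
```

State D is the kernel's uninterruptible sleep, which here means the ll_ost_io service threads are waiting on block I/O rather than spinning.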
Any ideas?
Thanks,
Rafael Tinoco
Rafael David Tinoco - Sun Microsystems
Systems Engineer - High Performance Computing
Rafael.Tinoco at Sun.COM - 55.11.5187.2194