[Lustre-discuss] OSTs hanging while running IOR

Rafael David Tinoco Rafael.Tinoco at Sun.COM
Wed Sep 9 10:31:23 PDT 2009


Has anyone seen this kind of error while running IOR or other
benchmarks?

 

I'm running Lustre 1.8.1 on CentOS 5.3.

 

I have the following configuration:

 

4 J4400 JBODs connected to 4 OSSs.

 

Each OSS has 3 OSTs (RAID5, 8 disks each) connected using multipathd, with
mdadm on /dev/dm* and the mptfusion driver (for the J4400 JBODs).

 

Every time I run:

 

mpirun -hostfile ./lustre.hosts -np 20 /hpc/IOR -w -r -C -i 2 -b 1000M -t 128k -F -o /work/stripe12/teste

(especially with -b 1000M)
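
For a sense of scale, here is a rough sketch of how much data the IOR command above moves. It assumes IOR's default segment count of 1 and that the M and k suffixes are binary units (MiB, KiB); the function name is only for illustration.

```python
# Rough I/O volume for the IOR run above.
# Assumptions: default segment count (-s 1), binary units for -b and -t.
KIB = 1024
MIB = 1024 * KIB


def ior_volume(tasks, block_bytes, transfer_bytes, iterations, segments=1):
    """Return per-task transfer count and aggregate bytes for a file-per-process run."""
    per_task = block_bytes * segments
    return {
        "transfers_per_task": per_task // transfer_bytes,
        "bytes_per_iteration": per_task * tasks,
        "bytes_written_total": per_task * tasks * iterations,
    }


# -np 20, -b 1000M, -t 128k, -i 2
run = ior_volume(tasks=20, block_bytes=1000 * MIB,
                 transfer_bytes=128 * KIB, iterations=2)
print(run["transfers_per_task"])         # 8000 transfers of 128 KiB per task
print(run["bytes_per_iteration"] / MIB)  # 20000.0 MiB written per iteration
```

Under those assumptions each iteration writes about 20000 MiB across the 20 files and then reads the same amount back (-r).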

 

One of my OSSs crashes, sometimes one and sometimes another, with the
following error:

 

Sep  9 07:43:40 a01n00 kernel: ll_ost_io_64  D ffff81037fea80c0     0 20381
1         20382 20380 (L-TLB)

Sep  9 07:43:40 a01n00 kernel:  ffff81036316b510 0000000000000046
0000000000000003 0000040000000282

Sep  9 07:43:40 a01n00 kernel:  0000000000000100 0000000000000009
ffff81037ac09100 ffff81037fea80c0

Sep  9 07:43:40 a01n00 kernel:  0000088160738e93 0000000000313ec1
ffff81037ac092e8 0000000328b65740

Sep  9 07:43:40 a01n00 kernel: Call Trace:

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff80033608>] submit_bio+0xcd/0xd4

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88b14aac>]
:obdfilter:filter_do_bio+0x95c/0xb60

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88ae0f24>]
:fsfilt_ldiskfs:fsfilt_ldiskfs_write_record+0x464/0x4b0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88b014f0>]
:obdfilter:filter_commit_cb+0x0/0x2d0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88031749>]
:jbd:journal_callback_set+0x2d/0x47

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8009daef>]
autoremove_wake_function+0x0/0x2e

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88b15974>]
:obdfilter:filter_direct_io+0xcc4/0xd50

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8892ad70>]
:lquota:filter_quota_acquire+0x0/0x120

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88b17c08>]
:obdfilter:filter_commitrw_write+0x1558/0x25b0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88730d23>]
:lnet:lnet_send+0x973/0x9a0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88790c11>]
:obdclass:class_handle2object+0xd1/0x160

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88abc048>]
:ost:ost_checksum_bulk+0x358/0x590

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88ac2b1e>]
:ost:ost_brw_write+0x1b8e/0x2310

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88837c88>]
:ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88803320>]
:ptlrpc:target_committed_to_req+0x40/0x120

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88abe67c>]
:ost:ost_brw_read+0x182c/0x19e0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8883c025>]
:ptlrpc:lustre_msg_get_version+0x35/0xf0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8883c0e8>]
:ptlrpc:lustre_msg_check_version_v2+0x8/0x20

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88ac60fb>]
:ost:ost_handle+0x2e5b/0x5a70

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88735305>]
:lnet:lnet_match_blocked_msg+0x375/0x390

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88811aea>]
:ptlrpc:ldlm_resource_foreach+0x25a/0x390

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff80148d4f>] __next_cpu+0x19/0x28

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff80148d4f>] __next_cpu+0x19/0x28

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff80088f32>]
find_busiest_group+0x20d/0x621

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88841a15>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8884672d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88848e67>]
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff80063098>] thread_return+0x62/0xfe

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8003dc3f>]
lock_timer_base+0x1b/0x3c

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8001ceb8>] __mod_timer+0xb0/0xbe

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8884c908>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8884b6f0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

Sep  9 07:43:40 a01n00 kernel:

Sep  9 07:43:40 a01n00 kernel: Lustre: 0:0:(watchdog.c:181:lcw_cb())
Watchdog triggered for pid 27733: it was inactive for 200.00s

Sep  9 07:43:40 a01n00 kernel: Lustre:
0:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process
27733

Sep  9 07:43:40 a01n00 kernel: ll_ost_io_159 D 0000000000000000     0 27733
1         27734 27732 (L-TLB)

Sep  9 07:43:40 a01n00 kernel:  ffff810521239510 0000000000000046
0000000000000003 0000040000000282

Sep  9 07:43:40 a01n00 kernel:  0000000000000100 000000000000000a
ffff81067e810860 ffff81033115a040

Sep  9 07:43:40 a01n00 kernel:  00000881604f2d64 00000000000d2465
ffff81067e810a48 000000061ced4140

Sep  9 07:43:40 a01n00 kernel: Call Trace:

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff80033608>] submit_bio+0xcd/0xd4

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88b14aac>]
:obdfilter:filter_do_bio+0x95c/0xb60

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88ae0f24>]
:fsfilt_ldiskfs:fsfilt_ldiskfs_write_record+0x464/0x4b0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88b014f0>]
:obdfilter:filter_commit_cb+0x0/0x2d0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88031749>]
:jbd:journal_callback_set+0x2d/0x47

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8009daef>]
autoremove_wake_function+0x0/0x2e

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88b15974>]
:obdfilter:filter_direct_io+0xcc4/0xd50

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8892ad70>]
:lquota:filter_quota_acquire+0x0/0x120

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88b17c08>]
:obdfilter:filter_commitrw_write+0x1558/0x25b0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88ac2b1e>]
:ost:ost_brw_write+0x1b8e/0x2310

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88837c88>]
:ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88803320>]
:ptlrpc:target_committed_to_req+0x40/0x120

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88abe67c>]
:ost:ost_brw_read+0x182c/0x19e0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8883c025>]
:ptlrpc:lustre_msg_get_version+0x35/0xf0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8883c0e8>]
:ptlrpc:lustre_msg_check_version_v2+0x8/0x20

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88ac60fb>]
:ost:ost_handle+0x2e5b/0x5a70

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88735305>]
:lnet:lnet_match_blocked_msg+0x375/0x390

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff800d74d2>]
__drain_alien_cache+0x51/0x66

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff80148d4f>] __next_cpu+0x19/0x28

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88841a15>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff80089d89>] enqueue_task+0x41/0x56

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8884672d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88848e67>]
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff80088819>]
__wake_up_common+0x3e/0x68

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8884c908>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8884b6f0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

Sep  9 07:43:40 a01n00 kernel: ll_ost_io_195 D ffff81038ab8c860     0 27769
1         27770 27768 (L-TLB)

Sep  9 07:43:40 a01n00 kernel:  ffff81028a541190 0000000000000046
ffff81028a541120 ffffffff8009daf8

Sep  9 07:43:40 a01n00 kernel:  ffff810369dc3b18 000000000000000a
ffff81028a524820 ffff81038ab8c860

Sep  9 07:43:40 a01n00 kernel:  00000881659b85ee 0000000000000429
ffff81028a524a08 0000000000000003

Sep  9 07:43:40 a01n00 kernel: Call Trace:

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8009daf8>]
autoremove_wake_function+0x9/0x2e

Sep  9 07:43:40 a01n00 kernel:  [<ffffffff8002e6ba>] __wake_up+0x38/0x4f

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff881b8b39>]
:dm_mod:dm_table_unplug_all+0x33/0x42

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff886b5e62>]
:raid456:get_active_stripe+0x247/0x4f0

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff886bb4dd>]
:raid456:make_request+0x472/0x9af

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff8009daef>]
autoremove_wake_function+0x0/0x2e

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff8009daef>]
autoremove_wake_function+0x0/0x2e

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff8001c49b>]
generic_make_request+0x1e7/0x1fe

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff80023342>] mempool_alloc+0x24/0xda

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff80033608>] submit_bio+0xcd/0xd4

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88788656>]
:obdclass:lprocfs_oh_tally+0x26/0x50

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88adf7bc>]
:fsfilt_ldiskfs:fsfilt_ldiskfs_send_bio+0xc/0x20

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88b14711>]
:obdfilter:filter_do_bio+0x5c1/0xb60

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88ae0f24>]
:fsfilt_ldiskfs:fsfilt_ldiskfs_write_record+0x464/0x4b0

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88b014f0>]
:obdfilter:filter_commit_cb+0x0/0x2d0

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88031749>]
:jbd:journal_callback_set+0x2d/0x47

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88adfad0>]
:fsfilt_ldiskfs:fsfilt_ldiskfs_commit_async+0xd0/0x150

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88b15974>]
:obdfilter:filter_direct_io+0xcc4/0xd50

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff8892ad70>]
:lquota:filter_quota_acquire+0x0/0x120

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88b17c08>]
:obdfilter:filter_commitrw_write+0x1558/0x25b0

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88730d23>]
:lnet:lnet_send+0x973/0x9a0

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88790c11>]
:obdclass:class_handle2object+0xd1/0x160

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88abc02c>]
:ost:ost_checksum_bulk+0x33c/0x590

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88ac2b1e>]
:ost:ost_brw_write+0x1b8e/0x2310

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88837c88>]
:ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88803320>]
:ptlrpc:target_committed_to_req+0x40/0x120

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88abe67c>]
:ost:ost_brw_read+0x182c/0x19e0

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff8883c025>]
:ptlrpc:lustre_msg_get_version+0x35/0xf0

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff8883c0e8>]
:ptlrpc:lustre_msg_check_version_v2+0x8/0x20

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88ac60fb>]
:ost:ost_handle+0x2e5b/0x5a70

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff800d7290>] free_block+0x126/0x143

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88735305>]
:lnet:lnet_match_blocked_msg+0x375/0x390

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff800d74d2>]
__drain_alien_cache+0x51/0x66

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88790c11>]
:obdclass:class_handle2object+0xd1/0x160

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff80148d4f>] __next_cpu+0x19/0x28

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff80088f32>]
find_busiest_group+0x20d/0x621

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff887f719a>]
:ptlrpc:lock_res_and_lock+0xba/0xd0

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88841a15>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff80089d89>] enqueue_task+0x41/0x56

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff8884672d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff88848e67>]
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff80088819>]
__wake_up_common+0x3e/0x68

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff8884c908>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff8884b6f0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0

Sep  9 07:43:41 a01n00 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

Sep  9 07:43:41 a01n00 kernel:

Sep  9 07:43:41 a01n00 kernel: ll_ost_io_68  D 0000000000000000     0 20385
1         20386 20384 (L-TLB)

Sep  9 07:43:41 a01n00 kernel:  ffff810375ce5510 0000000000000046
0000000000000003 0000040000000282

Sep  9 07:43:41 a01n00 kernel:  0000000000000100 000000000000000a
ffff81037a2ca080 ffff810365f9e860

Sep  9 07:43:41 a01n00 kernel:  000008815549e040 00000000000df2e2
ffff81037a2ca268 0000000730b8cd40

...
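
For anyone comparing against their own logs, here is a minimal sketch that summarizes dumps like the one above: which threads are in D state, which pids the watchdog reported, and the innermost frame of each trace. The regular expressions and the sample lines (rejoined from the wrapped output above, with approximated whitespace) are assumptions about the syslog layout, not anything Lustre provides.

```python
import re
from collections import Counter

# Patterns are assumptions based on the log excerpt in this post.
THREAD = re.compile(r"kernel: (\S+)\s+D\s+[0-9a-f]{16}\s+\d+\s+(\d+)")
WATCHDOG = re.compile(
    r"Watchdog triggered for pid (\d+): it was inactive for ([\d.]+)s")
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\S+?)\+0x")


def summarize(lines):
    """Collect D-state threads, watchdog reports, and the top frame of each trace."""
    blocked = {}            # pid -> thread name
    inactive = {}           # pid -> seconds reported by the watchdog
    top_frames = Counter()  # innermost function of each call trace
    expect_top = False
    for line in lines:
        m = THREAD.search(line)
        if m:
            blocked[int(m.group(2))] = m.group(1)
            continue
        m = WATCHDOG.search(line)
        if m:
            inactive[int(m.group(1))] = float(m.group(2))
            continue
        if "Call Trace:" in line:
            expect_top = True
            continue
        m = FRAME.search(line)
        if m and expect_top:
            top_frames[m.group(1)] += 1
            expect_top = False
    return blocked, inactive, top_frames


# Sample lines rejoined from the output above (whitespace approximated).
SAMPLE = """\
Sep  9 07:43:40 a01n00 kernel: ll_ost_io_64  D ffff81037fea80c0     0 20381      1         20382 20380 (L-TLB)
Sep  9 07:43:40 a01n00 kernel: Call Trace:
Sep  9 07:43:40 a01n00 kernel:  [<ffffffff80033608>] submit_bio+0xcd/0xd4
Sep  9 07:43:40 a01n00 kernel:  [<ffffffff88b14aac>] :obdfilter:filter_do_bio+0x95c/0xb60
Sep  9 07:43:40 a01n00 kernel: Lustre: 0:0:(watchdog.c:181:lcw_cb()) Watchdog triggered for pid 27733: it was inactive for 200.00s
Sep  9 07:43:40 a01n00 kernel: ll_ost_io_159 D 0000000000000000     0 27733      1         27734 27732 (L-TLB)
Sep  9 07:43:40 a01n00 kernel: Call Trace:
Sep  9 07:43:40 a01n00 kernel:  [<ffffffff80033608>] submit_bio+0xcd/0xd4
"""

blocked, inactive, top_frames = summarize(SAMPLE.splitlines())
print(blocked)
print(inactive)
print(top_frames.most_common(3))
```

To run it on a real log, pass an open file (for example /var/log/messages on the OSS) to summarize() instead of the sample.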

 

Any ideas?

 

Thanks,

 

Rafael Tinoco

 

 

Rafael David Tinoco - Sun Microsystems

Systems Engineer - High Performance Computing

Rafael.Tinoco at Sun.COM - 55.11.5187.2194

 
