[Lustre-discuss] ll_ost thread soft lockup

Kevin Van Maren KVanMaren at fusionio.com
Mon Mar 19 06:28:22 PDT 2012


You have a bad disk at 3:11; then at 9:27 you got a soft lockup (doing a read?); you failed the bad disk at 10:03; the rebuild finished at 14:25; and then you got another soft lockup at 19:10 (with a different stack trace).

You are running 1.8.5, which does not have the fix for the known MD raid5/6 rebuild corruption bug.  That fix was released in the Oracle Lustre 1.8.7 kernel patches.  Unless you already applied that patch, you might want to run a check of your raid arrays and consider an upgrade (at least patch your kernel with that fix).
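
As a rough sketch of what such a check could look like on one of the md arrays from your logs (md17 below; run it against each array, ideally while the OST is quiet, and unmount before any filesystem check):

    # start a background redundancy/parity check on md17 (standard md sysfs interface)
    echo check > /sys/block/md17/md/sync_action
    # watch progress
    cat /proc/mdstat
    # once it finishes, a non-zero value here indicates inconsistent stripes
    cat /sys/block/md17/md/mismatch_cnt
    # read-only filesystem check of the OST (with the OST unmounted)
    e2fsck -fn /dev/md17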

The fix is md-avoid-corrupted-ldiskfs-after-rebuild.patch in the 2.6-rhel5.series (note that this bug is NOT specific to RHEL5).  This fix does NOT appear to have been picked up by Whamcloud.
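
If you apply just that one patch rather than moving to the 1.8.7 kernel wholesale, something along these lines against a matching kernel source tree should do it (both source paths below are illustrative, not the actual locations on your build host):

    # point these at your actual kernel and Lustre source trees
    cd /usr/src/linux-2.6.18-194.17.1.el5_lustre
    patch -p1 --dry-run < /path/to/lustre/lustre/kernel_patches/patches/md-avoid-corrupted-ldiskfs-after-rebuild.patch
    patch -p1 < /path/to/lustre/lustre/kernel_patches/patches/md-avoid-corrupted-ldiskfs-after-rebuild.patch
    # then rebuild and reinstall the kernel as usual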

Kevin


On Mar 18, 2012, at 9:22 PM, Tae Young Hong wrote:

Recently, on one of our OSSes, an ll_ost thread hit a soft lockup; the problem started after a RAID rebuild. After rebooting the OSS that this OST belongs to, we ran e2fsck with "-fn" and "-fp",
but the symptom still occurs. In some cases, after the reboot, ll_ost_creat hangs instead of ll_ost.
While the soft lockup is occurring, we usually cannot delete or create files among those owned by this OST (listed with lfs find -O  ...uuid), and when I tried to run ll_recover_lost_found_objs after mounting the OST as ldiskfs, the recovery thread also hit a soft lockup.
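
For reference, the commands mentioned above take roughly the following form (the filesystem mount point, OST device, and ldiskfs mount point below are placeholders; the OST UUID matches the one in the logs):

    # list files striped onto the affected OST
    lfs find -O scratch-OST0034_UUID /lustre/scratch
    # mount the OST backend device directly as ldiskfs
    mount -t ldiskfs /dev/md17 /mnt/ost
    # re-link orphaned objects from lost+found back into the object hierarchy
    ll_recover_lost_found_objs -d /mnt/ost/lost+found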

Kernel and Lustre version: 2.6.18-194.17.1.el5_lustre.1.8.5smp
HW: J4400 arrays and Sun x4270 OSS servers
No other hardware problems have been found.

#########  disk error occurred
Mar 13 03:11:26 oss20 kernel: mptbase: ioc2: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Mar 13 03:11:46 oss20 kernel: sd 2:0:27:0: SCSI error: return code = 0x08000002
Mar 13 03:11:46 oss20 kernel: sdbx: Current: sense key: Medium Error
Mar 13 03:11:46 oss20 kernel:     Add. Sense: Unrecovered read error
Mar 13 03:11:46 oss20 kernel:
Mar 13 03:11:46 oss20 kernel: Info fld=0x5656e6b4
Mar 13 03:11:46 oss20 kernel: end_request: I/O error, dev sdbx, sector 1448535732
Mar 13 03:11:47 oss20 kernel: raid5:md17: read error corrected (8 sectors at 1448535728 on sdbx)
Mar 13 03:11:47 oss20 kernel: raid5:md17: read error corrected (8 sectors at 1448535736 on sdbx)
…


########### call trace occurred
Mar 13 09:12:52 oss20 kernel: Lustre: scratch-OST0034: haven't heard from client scratch-mdtlov_UUID (at 10.3.52.161 at o2ib) in 227 seconds. I think it's dead, and I am evicting it.
Mar 13 09:12:52 oss20 kernel: Lustre: Skipped 2 previous similar messages
Mar 13 09:13:47 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at ffff810450875000 x1390684636813427/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 224/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:13:47 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at ffff8101e17a5400 x1390684636813440/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:13:57 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at ffff81036449d000 x1390684636813453/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:14:34 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at ffff8104dec0f000 x1390684636813541/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 288/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:14:38 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at ffff81010fd4b800 x1390684636813561/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:18:55 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at ffff810367104c00 x1390684636814090/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 576/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:20:02 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at ffff810453592000 x1390684636814172/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 320/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:20:23 oss20 kernel: LustreError: 0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired after 147s: evicting client at 10.3.27.16 at o2ib  ns: filter-scratch-OST0034_UUID lock: ffff810239b84e00/0x8134b24415b8c002 lrc: 3/0,0 mode: PW/PW res: 8770349/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x80000020 remote: 0x2eb65052eeaac8dc expref: 17 pid: 17958 timeout 9632848008
Mar 13 09:20:24 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at ffff810341db8c00 x1390684636814189/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 224/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:20:25 oss20 kernel: LustreError: 17878:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107)  req at ffff81022a56ac00 x1390697545930782/t0 o101-><?>@<?>:0/0 lens 296/0 e 0 to 0 dl 1331598031 ref 1 fl Interpret:/0/0 rc -107/0
Mar 13 09:22:07 oss20 kernel: LustreError: 0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.3.30.72 at o2ib  ns: filter-scratch-OST0034_UUID lock: ffff81014a9c8000/0x8134b244156839db lrc: 3/0,0 mode: PW/PW res: 8770383/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x10020 remote: 0xb5fbb5ddee10ed12 expref: 24 pid: 17841 timeout 9632952155
Mar 13 09:22:08 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at ffff81016e6d1000 x1390684636814642/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 352/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:22:08 oss20 kernel: LustreError: 17966:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107)  req at ffff810217882c00 x1390697597420037/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1331598144 ref 1 fl Interpret:H/0/0 rc -107/0
Mar 13 09:22:11 oss20 kernel: LustreError: 0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.3.15.9 at o2ib  ns: filter-scratch-OST0034_UUID lock: ffff8101b899ce00/0x8134b2441567880a lrc: 3/0,0 mode: PW/PW res: 8770073/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 remote: 0xffb69ae75cb1e32e expref: 9 pid: 17861 timeout 9632956878
Mar 13 09:22:12 oss20 kernel: LustreError: 17951:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107)  req at ffff81037d4f7000 x1390697363779506/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1331598148 ref 1 fl Interpret:H/0/0 rc -107/0
Mar 13 09:25:09 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at ffff8104b2367000 x1390684636814999/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 352/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:27:44 oss20 kernel: LustreError: 0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.3.11.76 at o2ib  ns: filter-scratch-OST0034_UUID lock: ffff810460625c00/0x8134b24415b86381 lrc: 3/0,0 mode: PW/PW res: 8798589/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 remote: 0x237e7fdbefcbfed expref: 8 pid: 17872 timeout 9633289138
Mar 13 09:27:49 oss20 kernel: Lustre: Service thread pid 18061 was inactive for 442.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 13 09:27:49 oss20 kernel: Pid: 18061, comm: ll_ost_io_92
Mar 13 09:27:49 oss20 kernel:
Mar 13 09:27:49 oss20 kernel: Call Trace:
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8006462b>] __down_write_nested+0x7a/0x92
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8b2b99c0>] filter_destroy+0x9b0/0x1fb0 [obdfilter]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afb4980>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afb8060>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afdda74>] lustre_msg_add_version+0x34/0x110 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afe077a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8af9befc>] __ldlm_handle2lock+0x8c/0x360 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afe0889>] lustre_pack_reply+0x29/0xb0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8b26b050>] ost_destroy+0x660/0x790 [ost]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afdcbc8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afdca15>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8b274a36>] ost_handle+0x1556/0x55b0 [ost]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff80150e00>] __next_cpu+0x19/0x28
Mar 13 09:27:49 oss20 kernel:  [<ffffffff800767e2>] smp_send_reschedule+0x4e/0x53
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afec15a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afec8a8>] ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8008b403>] __wake_up_common+0x3e/0x68
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afed817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8005e01d>] child_rip+0xa/0x11



############## call trace occurred for the 2nd time
Mar 13 09:29:18 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at ffff810502953400 x1390684636815395/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:29:18 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Mar 13 09:29:34 oss20 kernel: Lustre: Service thread pid 17995 was inactive for 442.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 13 09:29:34 oss20 kernel: Pid: 17995, comm: ll_ost_io_26
Mar 13 09:29:34 oss20 kernel:
Mar 13 09:29:34 oss20 kernel: Call Trace:
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8006e20b>] do_gettimeofday+0x40/0x90
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80028ace>] sync_page+0x0/0x42
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80063802>] io_schedule+0x3f/0x67
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80028b0c>] sync_page+0x3e/0x42
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80063946>] __wait_on_bit_lock+0x36/0x66
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8003fc3c>] __lock_page+0x5e/0x64
Mar 13 09:29:34 oss20 kernel:  [<ffffffff800a0ab3>] wake_bit_function+0x0/0x23
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80013b32>] find_lock_page+0x69/0xa2
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80025bba>] find_or_create_page+0x22/0x72
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8b2c5315>] filter_get_page+0x35/0x70 [obdfilter]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8b2c5951>] filter_preprw_read+0x601/0xd30 [obdfilter]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8aecd3a0>] LNetPut+0x710/0x820 [lnet]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afd7ed3>] ptl_send_buf+0x3f3/0x5b0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afe2765>] lustre_msg_set_limit+0x35/0xf0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8b2c7dcc>] filter_preprw+0x1d4c/0x1db0 [obdfilter]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afd8658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afe077a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afd8658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8b26dab3>] ost_brw_read+0xb33/0x1a70 [ost]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afdca15>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8008cfd9>] default_wake_function+0x0/0xe
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afdcbc8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8b276353>] ost_handle+0x2e73/0x55b0 [ost]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80150e00>] __next_cpu+0x19/0x28
Mar 13 09:29:34 oss20 kernel:  [<ffffffff800767e2>] smp_send_reschedule+0x4e/0x53
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afec15a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afec8a8>] ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8008b403>] __wake_up_common+0x3e/0x68
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afed817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
Mar 13 09:29:35 oss20 kernel:  [<ffffffff8005e01d>] child_rip+0xa/0x11
Mar 13 09:29:35 oss20 kernel:  [<ffffffff8afec8e0>] ptlrpc_main+0x0/0x10f0 [ptlrpc]
Mar 13 09:29:35 oss20 kernel:  [<ffffffff8005e013>] child_rip+0x0/0x11
Mar 13 09:29:35 oss20 kernel:
Mar 13 09:30:05 oss20 kernel: Lustre: 18022:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-239), not sending early reply
Mar 13 09:30:05 oss20 kernel:   req at ffff8104e15ef000 x1390697425321956/t0 o4->104e3eaf-521c-6fd2-0ded-c959aeb89264 at NET_0x500000a03122f_UUID:0/0 lens 448/416 e 5 to 0 dl 1331598610 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:05 oss20 kernel: Lustre: 18022:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-239), not sending early reply
Mar 13 09:30:05 oss20 kernel:   req at ffff81041144d400 x1390697425321961/t0 o4->104e3eaf-521c-6fd2-0ded-c959aeb89264 at NET_0x500000a03122f_UUID:0/0 lens 448/416 e 5 to 0 dl 1331598610 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:07 oss20 kernel: Lustre: 17982:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-49), not sending early reply
Mar 13 09:30:07 oss20 kernel:   req at ffff81025e7f6c50 x1387502968075317/t0 o4->e6cc42f9-1723-0037-1bef-8c11b5bb0cd5@:0/0 lens 448/416 e 2 to 0 dl 1331598612 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:07 oss20 kernel: Lustre: 17982:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 7 previous similar messages
Mar 13 09:30:09 oss20 kernel: Lustre: 18027:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-50), not sending early reply
Mar 13 09:30:09 oss20 kernel:   req at ffff8105c4a83800 x1390670171393616/t0 o4->0c31049a-ac01-884e-a74a-f549ca801551@:0/0 lens 448/416 e 2 to 0 dl 1331598614 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:09 oss20 kernel: Lustre: 18027:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 8 previous similar messages
Mar 13 09:30:26 oss20 kernel: Lustre: 18079:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-196), not sending early reply
Mar 13 09:30:26 oss20 kernel:   req at ffff81047934d800 x1390668976327099/t0 o4->240e8b56-f2f0-beab-0460-c95e7af9ebf2 at NET_0x500000a031347_UUID:0/0 lens 448/416 e 3 to 0 dl 1331598631 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:26 oss20 kernel: Lustre: 18079:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 7 previous similar messages
Mar 13 09:30:56 oss20 kernel: Lustre: 18021:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-157), not sending early reply
Mar 13 09:30:56 oss20 kernel:   req at ffff810301bdd850 x1390697295264668/t0 o4->819aa1e7-48e9-c8c1-c639-4d7767aac076 at NET_0x500000a030b07_UUID:0/0 lens 448/416 e 3 to 0 dl 1331598661 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:56 oss20 kernel: Lustre: 18021:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages
Mar 13 09:31:32 oss20 kernel: Lustre: 18039:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-108), not sending early reply
Mar 13 09:31:32 oss20 kernel:   req at ffff810471a57000 x1390668308715230/t0 o4->80fad2a1-6a67-0aad-69a7-2ea3c9cf6909 at NET_0x500000a030e16_UUID:0/0 lens 448/416 e 1 to 0 dl 1331598697 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:31:32 oss20 kernel: Lustre: 18039:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
Mar 13 09:32:04 oss20 kernel: Lustre: 18046:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-49), not sending early reply
Mar 13 09:32:04 oss20 kernel:   req at ffff81010eda4800 x1390697612903849/t0 o4->cdc6f8ec-952c-4d6a-7fd2-dca6fe51a617 at NET_0x500000a032018_UUID:0/0 lens 448/416 e 2 to 0 dl 1331598729 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:32:04 oss20 kernel: Lustre: 18046:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 13 previous similar messages
Mar 13 09:32:05 oss20 kernel: Lustre: 17923:0:(ldlm_lib.c:574:target_handle_reconnect()) scratch-OST0034: 104e3eaf-521c-6fd2-0ded-c959aeb89264 reconnecting
Mar 13 09:32:05 oss20 kernel: Lustre: 17923:0:(ldlm_lib.c:874:target_handle_connect()) scratch-OST0034: refuse reconnection from 104e3eaf-521c-6fd2-0ded-c959aeb89264 at 10.3.18.47@o2ib to 0xffff810293ab2000; still busy with 9 active RPCs



################ disk disabled manually by admin using "mdadm /dev/md17 --fail /dev/sdbx"
Mar 13 10:03:22 oss20 kernel: raid5: Disk failure on sdbx, disabling device. Operation continuing on 9 devices
Mar 13 10:03:23 oss20 kernel: RAID5 conf printout:
Mar 13 10:03:23 oss20 kernel:  --- rd:10 wd:9 fd:1
Mar 13 10:03:23 oss20 kernel:  disk 0, o:1, dev:sdbv
Mar 13 10:03:23 oss20 kernel:  disk 1, o:1, dev:sdbw
Mar 13 10:03:23 oss20 kernel:  disk 2, o:0, dev:sdbx
Mar 13 10:03:23 oss20 kernel:  disk 3, o:1, dev:sdby
Mar 13 10:03:23 oss20 kernel:  disk 4, o:1, dev:sdbz
Mar 13 10:03:23 oss20 kernel:  disk 5, o:1, dev:sdca
Mar 13 10:03:23 oss20 kernel:  disk 6, o:1, dev:sdcb
Mar 13 10:03:23 oss20 kernel:  disk 7, o:1, dev:sdcc
Mar 13 10:03:23 oss20 kernel:  disk 8, o:1, dev:sdcd
Mar 13 10:03:23 oss20 kernel:  disk 9, o:1, dev:sdce
Mar 13 10:03:23 oss20 kernel: RAID5 conf printout:
Mar 13 10:03:23 oss20 kernel:  --- rd:10 wd:9 fd:1
Mar 13 10:03:23 oss20 kernel:  disk 0, o:1, dev:sdbv
Mar 13 10:03:23 oss20 kernel:  disk 1, o:1, dev:sdbw
Mar 13 10:03:23 oss20 kernel:  disk 3, o:1, dev:sdby
Mar 13 10:03:23 oss20 kernel:  disk 4, o:1, dev:sdbz
Mar 13 10:03:23 oss20 kernel:  disk 5, o:1, dev:sdca
Mar 13 10:03:23 oss20 kernel:  disk 6, o:1, dev:sdcb
Mar 13 10:03:23 oss20 kernel:  disk 7, o:1, dev:sdcc
Mar 13 10:03:23 oss20 kernel:  disk 8, o:1, dev:sdcd
Mar 13 10:03:23 oss20 kernel:  disk 9, o:1, dev:sdce
Mar 13 10:03:23 oss20 kernel: RAID5 conf printout:
Mar 13 10:03:23 oss20 kernel:  --- rd:10 wd:9 fd:1
Mar 13 10:03:23 oss20 kernel:  disk 0, o:1, dev:sdbv
Mar 13 10:03:23 oss20 kernel:  disk 1, o:1, dev:sdbw
Mar 13 10:03:23 oss20 kernel:  disk 2, o:1, dev:sdcp
Mar 13 10:03:23 oss20 kernel:  disk 3, o:1, dev:sdby
Mar 13 10:03:23 oss20 kernel:  disk 4, o:1, dev:sdbz
Mar 13 10:03:23 oss20 kernel:  disk 5, o:1, dev:sdca
Mar 13 10:03:23 oss20 kernel:  disk 6, o:1, dev:sdcb
Mar 13 10:03:23 oss20 kernel:  disk 7, o:1, dev:sdcc
Mar 13 10:03:23 oss20 kernel:  disk 8, o:1, dev:sdcd
Mar 13 10:03:23 oss20 kernel:  disk 9, o:1, dev:sdce
Mar 13 10:03:23 oss20 kernel: md: syncing RAID array md17


############## md rebuild done
Mar 13 14:07:22 oss20 kernel: md: rebuild md17 throttled due to IO
Mar 13 14:09:24 oss20 kernel: md: unbind<sdbx>
Mar 13 14:09:24 oss20 kernel: md: export_rdev(sdbx)
Mar 13 14:17:24 oss20 kernel: md: rebuild md17 throttled due to IO
Mar 13 14:25:48 oss20 kernel: md: md17: sync done.


################ ll_ost kernel thread  soft lockup occurred
Mar 13 19:10:04 oss20 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_29:17868]
Mar 13 19:10:04 oss20 kernel: CPU 2:
Mar 13 19:10:05 oss20 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) raid456(U) xor(U) raid1(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) autofs4(U) sunrpc(U) rdma_ucm(U) qlgc_vnic(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) iw_nes(U) iw_cxgb3(U) cxgb3(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) mptctl(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) sg(U) igb(U) mlx4_core(U) i2c_i801(U) 8021q(U) i2c_core(U) dca(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) ahci(U) libata(U) mptfc(U) scsi_transport_fc(U) mptspi(U) scsi_transport_spi(U) shpchp(U
Mar 13 19:10:05 oss20 kernel:  mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Mar 13 19:10:05 oss20 kernel: Pid: 17868, comm: ll_ost_29 Tainted: G      2.6.18-194.17.1.el5_lustre.1.8.5smp #1
Mar 13 19:10:05 oss20 kernel: RIP: 0010:[<ffffffff8b218cf4>]  [<ffffffff8b218cf4>] :ldiskfs:ldiskfs_find_entry+0x1d4/0x5b0
Mar 13 19:10:05 oss20 kernel: RSP: 0018:ffff8102c7f637e0  EFLAGS: 00000202
Mar 13 19:10:05 oss20 kernel: RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000036f3f4cb
Mar 13 19:10:05 oss20 kernel: RDX: ffff8104fa4d1c00 RSI: ffff8102c7f637c0 RDI: ffff81010c43e7a8
Mar 13 19:10:05 oss20 kernel: RBP: ffff810254a1d7f0 R08: ffff8100708a5ff8 R09: ffff8100708a5000
Mar 13 19:10:05 oss20 kernel: R10: ffff81010c618038 R11: 000000004f5f1cf2 R12: 0000000000000000
Mar 13 19:10:05 oss20 kernel: R13: 0000000000000002 R14: ffff8101d7e17cd0 R15: ffffffff80063ac8
Mar 13 19:10:05 oss20 kernel: FS:  00002b99048b1220(0000) GS:ffff81010c499240(0000) knlGS:0000000000000000
Mar 13 19:10:05 oss20 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 13 19:10:05 oss20 kernel: CR2: 000000316a03b6a0 CR3: 00000003738ff000 CR4: 00000000000006e0
Mar 13 19:10:05 oss20 kernel:
Mar 13 19:10:05 oss20 kernel: Call Trace:
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8001a81c>] vsnprintf+0x5e7/0x62f
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b21ab23>] :ldiskfs:ldiskfs_lookup+0x53/0x281
Mar 13 19:10:05 oss20 kernel:  [<ffffffff80036d8e>] __lookup_hash+0x10b/0x12f
Mar 13 19:10:05 oss20 kernel:  [<ffffffff800e861d>] lookup_one_len+0x53/0x61
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b2b136d>] :obdfilter:filter_fid2dentry+0x42d/0x740
Mar 13 19:10:05 oss20 kernel:  [<ffffffff80286b42>] __down_trylock+0x44/0x4e
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b2cc38b>] :obdfilter:filter_lvbo_init+0x3bb/0x69b
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8aecd3a0>] :lnet:LNetPut+0x710/0x820
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afa22c0>] :ptlrpc:ldlm_resource_get+0x900/0xa60
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b270600>] :ost:ost_blocking_ast+0x0/0x9b0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afc12a0>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8af98efa>] :ptlrpc:ldlm_lock_create+0xba/0x9f0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afddfb1>] :ptlrpc:lustre_swab_buf+0x81/0x170
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afbbd30>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afbbd30>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afc12a0>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b270600>] :ost:ost_blocking_ast+0x0/0x9b0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afbe5b0>] :ptlrpc:ldlm_handle_enqueue+0x670/0x1210
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afdcbc8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b2784c3>] :ost:ost_handle+0x4fe3/0x55b0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff80150e00>] __next_cpu+0x19/0x28
Mar 13 19:10:05 oss20 kernel:  [<ffffffff800767e2>] smp_send_reschedule+0x4e/0x53
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afec15a>] :ptlrpc:ptlrpc_server_handle_request+0x97a/0xdf0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afec8a8>] :ptlrpc:ptlrpc_wait_event+0x2d8/0x310
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8008b403>] __wake_up_common+0x3e/0x68
Mar 13 19:10:06 oss20 kernel:  [<ffffffff8afed817>] :ptlrpc:ptlrpc_main+0xf37/0x10f0
Mar 13 19:10:06 oss20 kernel:  [<ffffffff8005e01d>] child_rip+0xa/0x11
Mar 13 19:10:06 oss20 kernel:  [<ffffffff8afec8e0>] :ptlrpc:ptlrpc_main+0x0/0x10f0
Mar 13 19:10:06 oss20 kernel:  [<ffffffff8005e013>] child_rip+0x0/0x11
Mar 13 19:10:06 oss20 kernel:



Taeyoung Hong
Senior Researcher
Supercomputing Center, KISTI
Korea



