[Lustre-discuss] 1.8.1.1
Papp Tamás
tompos@martos.bme.hu
Fri Nov 27 02:13:29 PST 2009
Craig Prescott wrote, on 2009-11-19 20:42:
> We had the same problem with 1.8.x.x.
>
> We set lnet.printk=0 on our OSS nodes and it has helped us
> dramatically - we have not seen the 'soft lockup' problem since.
>
> sysctl -w lnet.printk=0
>
> This will turn off all but 'emerg' messages from lnet.
>
> It would be interesting to know if this avoided the lockups for you, too.
Well, this definitely helped, but it didn't resolve the root cause of the problem.
A few minutes ago we could not reach our cluster from the clients.
On the MDS I see a lot of this:
Nov 27 10:52:17 meta1 kernel: BUG: soft lockup - CPU#3 stuck for 10s!
[ll_evictor:6123]
Nov 27 10:52:17 meta1 kernel: CPU 3:
Nov 27 10:52:17 meta1 kernel: Modules linked in: mds(U)
fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U)
mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U)
lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U)
ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U)
lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U)
dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U)
sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U)
acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) shpchp(U)
pcspkr(U) sg(U) igb(U) dm_raid45(U) dm_message(U) dm_region_hash(U)
dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) ata_piix(U) ahci(U)
libata(U) sd_mod(U) scsi_mod(U) raid1(U) ext3(U) jbd(U) uhci_hcd(U)
ohci_hcd(U) ehci_hcd(U)
Nov 27 10:52:17 meta1 kernel: Pid: 6123, comm: ll_evictor Tainted:
G 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Nov 27 10:52:17 meta1 kernel: RIP: 0010:[<ffffffff80064af5>]
[<ffffffff80064af5>] _write_lock+0xe/0xf
Nov 27 10:52:17 meta1 kernel: RSP: 0018:ffff8102154d9c78 EFLAGS: 00000246
Nov 27 10:52:17 meta1 kernel: RAX: 000000000000ffff RBX:
00000000000050c3 RCX: 0000000000042620
Nov 27 10:52:17 meta1 kernel: RDX: 0000000000000282 RSI:
ffffffff802fae80 RDI: ffffc200108e4c3c
Nov 27 10:52:17 meta1 kernel: RBP: 0000000000000286 R08:
ffff81000101f3e0 R09: 0000000000000000
Nov 27 10:52:17 meta1 kernel: R10: ffff8101960319c0 R11:
0000000000000150 R12: 0000000000000286
Nov 27 10:52:17 meta1 kernel: R13: ffff8102154d9c20 R14:
ffff810196031a18 R15: ffff8101960319c0
Nov 27 10:52:17 meta1 kernel: FS: 00002abb113e4230(0000)
GS:ffff810107f61540(0000) knlGS:0000000000000000
Nov 27 10:52:17 meta1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Nov 27 10:52:18 meta1 kernel: CR2: 00000031d3803080 CR3:
0000000000201000 CR4: 00000000000006e0
Nov 27 10:52:18 meta1 kernel:
Nov 27 10:52:18 meta1 kernel: Call Trace:
Nov 27 10:52:18 meta1 kernel: [<ffffffff884caba7>]
:obdclass:lustre_hash_for_each_empty+0x237/0x2b0
Nov 27 10:52:18 meta1 kernel: [<ffffffff884d1ae8>]
:obdclass:class_disconnect+0x398/0x420
Nov 27 10:52:18 meta1 kernel: [<ffffffff8886d5e1>]
:mds:mds_disconnect+0x121/0xe40
Nov 27 10:52:18 meta1 kernel: [<ffffffff8014b87a>] snprintf+0x44/0x4c
Nov 27 10:52:18 meta1 kernel: [<ffffffff884cd994>]
:obdclass:class_fail_export+0x384/0x4c0
Nov 27 10:52:18 meta1 kernel: [<ffffffff8859e238>]
:ptlrpc:ping_evictor_main+0x4f8/0x7e0
Nov 27 10:52:18 meta1 kernel: [<ffffffff8008a3f3>]
default_wake_function+0x0/0xe
Nov 27 10:52:18 meta1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 27 10:52:18 meta1 kernel: [<ffffffff8859dd40>]
:ptlrpc:ping_evictor_main+0x0/0x7e0
Nov 27 10:52:18 meta1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov 27 10:52:18 meta1 kernel:
After those:
Nov 27 10:53:05 meta1 kernel: LustreError:
6009:0:(events.c:367:server_bulk_callback()) event type 4, status -5,
desc ffff81022ed54280
Nov 27 10:53:05 meta1 kernel: Lustre: dafs-MDT0000: haven't heard from
client 4a05e5fb-590d-ec56-c1b3-ff0234152564 (at 192.168.3.141@tcp) in
258 seconds. I think it's dead, and I am evicting it.
Nov 27 10:53:05 meta1 kernel: LustreError:
6009:0:(events.c:367:server_bulk_callback()) event type 4, status -5,
desc ffff8102189b0cc0
Nov 27 10:53:05 meta1 kernel: LustreError:
6145:0:(handler.c:1563:mds_handle()) operation 400 on unconnected MDS
from 12345-192.168.3.141@tcp
Nov 27 10:53:05 meta1 kernel: LustreError:
6145:0:(ldlm_lib.c:1863:target_send_reply_msg()) @@@ processing error
(-107) req at ffff8101bec0cc00 x1320487774846997/t0 o400-><?>@<?>:0/0 lens
192/0 e 0 to
0 dl 1259315591 ref 1 fl Interpret:H/0/0 rc -107/0
Nov 27 10:53:05 meta1 kernel: Lustre:
6149:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000:
b4c8f591-df5c-b6f2-93c1-57e5f64312d0 reconnecting
800; still busy with 1 active RPCs
a02b-9917-c60e-d5b6c1c1d3c8@:0/0 lens 408/0 e 0 to 0 dl 0 ref 1 fl
New:/0/0 rc 0/0
Nov 27 10:53:05 meta1 kernel: Lustre:
6282:0:(ldlm_lib.c:766:target_handle_connect()) MGS: exp
ffff81023e83c000 already connecting
l busy with 1 active RPCs
Nov 27 10:53:05 meta1 kernel: Lustre:
6282:0:(ldlm_lib.c:766:target_handle_connect()) Skipped 8 previous
similar messages
Nov 27 10:53:05 meta1 kernel: Lustre:
6121:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 10 previous
similar messages
Nov 27 10:53:05 meta1 kernel: LustreError:
6282:0:(mgs_handler.c:661:mgs_handle()) MGS handle cmd=250 rc=-114
Nov 27 10:53:05 meta1 kernel: LustreError:
6282:0:(mgs_handler.c:661:mgs_handle()) Skipped 9 previous similar messages
3400 x1320390320856467/t0
o37->56d39829-d432-2c01-0261-eee2779a4fe3@NET_0x20000c0a80390_UUID:0/0
lens 408/360 e 1 to 0 dl 1259315514 ref 1 fl Complete:/0/0 rc -110/-110
Nov 27 10:53:05 meta1 kernel: Lustre:
6841:0:(watchdog.c:363:lcw_update_time()) Expired watchdog for pid 6841
disabled after 208.42s
Nov 27 10:53:05 meta1 kernel: LustreError:
6145:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 17 previous
similar messages
Nov 27 10:53:06 meta1 kernel: Lustre:
6161:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000:
d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:06 meta1 kernel: Lustre:
6161:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 185 previous
similar messages
800; still busy with 1 active RPCs
Nov 27 10:53:06 meta1 kernel: Lustre:
6143:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 1725 previous
similar messages
Nov 27 10:53:06 meta1 kernel: Lustre:
6161:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000:
d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:06 meta1 kernel: Lustre:
6161:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1763 previous
similar messages
6703b@NET_0x20000c0a80376_UUID:0/0 lens 368/264 e 0 to 0 dl 1259315686
ref 1 fl Interpret:/0/0 rc -16/0
Nov 27 10:53:06 meta1 kernel: LustreError:
6152:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 4173 previous
similar messages
800; still busy with 1 active RPCs
Nov 27 10:53:07 meta1 kernel: Lustre:
6148:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 4913 previous
similar messages
Nov 27 10:53:07 meta1 kernel: Lustre:
6136:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000:
d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:07 meta1 kernel: Lustre:
6136:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 4916 previous
similar messages
6703b@NET_0x20000c0a80376_UUID:0/0 lens 368/264 e 0 to 0 dl 1259315688
ref 1 fl Interpret:/0/0 rc -16/0
Nov 27 10:53:08 meta1 kernel: LustreError:
6160:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 9782 previous
similar messages
Nov 27 10:53:08 meta1 kernel: Lustre: MGS: haven't heard from client
f6fe0d62-238f-d5b3-5137-d1d897bc1195 (at 192.168.3.141@tcp) in 467
seconds. I think it's dead, and I am evicting it.
Nov 27 10:53:08 meta1 kernel: Lustre: Skipped 1 previous similar message
800; still busy with 1 active RPCs
Nov 27 10:53:09 meta1 kernel: Lustre:
6156:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 9712 previous
similar messages
Nov 27 10:53:09 meta1 kernel: Lustre:
6156:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000:
d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:09 meta1 kernel: Lustre:
6156:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 9718 previous
similar messages
Nov 27 10:53:11 meta1 kernel: LustreError:
27480:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096),
evicting d3728ce6-b580-dbb3-b262-123d1516703b@NET_0x20000c0a80376_UUID
t:/0/0 rc 0/0
Nov 27 10:54:23 meta1 hpasmpld[2804]: Sensor 9b is missing. Reading
Flags = 40
d3728ce6-b580-dbb3-b262-123d1516703b@NET_0x20000c0a80376_UUID:0/0 lens
408/528 e 0 to 0 dl 1259315709 ref 1 fl Interpret:/0/0 rc 0/0
Nov 27 10:55:03 meta1 kernel: LustreError:
27480:0:(mds_open.c:1679:mds_close()) Skipped 3 previous similar messages
516703b@NET_0x20000c0a80376_UUID:0/0 lens 408/528 e 0 to 0 dl 1259315709
ref 1 fl Interpret:/0/0 rc -116/0
Nov 27 10:55:03 meta1 kernel: LustreError:
27480:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 14495 previous
similar messages
Nov 27 10:56:23 meta1 hpasmpld[2804]: Sensor 9b is missing. Reading
Flags = 40
On node2 there is a single occurrence of this:
Nov 27 10:52:43 node2 kernel: ll_ost_io_68 S ffff81023c101080 0
6222 1 6223 6221 (L-TLB)
Nov 27 10:52:43 node2 kernel: ffff810209f5b960 0000000000000046
0000000000000001 ffffffff884776b0
Nov 27 10:52:43 node2 kernel: ffff810107f24000 000000000000000a
ffff81022408a7e0 ffff81023c101080
Nov 27 10:52:43 node2 kernel: 000254bcf81e00e9 00000000000003b2
ffff81022408a9c8 0000000100000000
Nov 27 10:52:43 node2 kernel: Call Trace:
Nov 27 10:52:43 node2 kernel: [<ffffffff884776b0>]
:lnet:LNetMDBind+0x2c0/0x420
Nov 27 10:52:43 node2 kernel: [<ffffffff8003d382>]
lock_timer_base+0x1b/0x3c
Nov 27 10:52:43 node2 kernel: [<ffffffff8001c6fa>] __mod_timer+0xb0/0xbe
Nov 27 10:52:43 node2 kernel: [<ffffffff80063947>]
schedule_timeout+0x8a/0xad
Nov 27 10:52:43 node2 kernel: [<ffffffff80094f4f>] process_timeout+0x0/0x5
Nov 27 10:52:43 node2 kernel: [<ffffffff888060dc>]
:ost:ost_brw_read+0x127c/0x1a30
Nov 27 10:52:43 node2 kernel: [<ffffffff8008a3f3>]
default_wake_function+0x0/0xe
Nov 27 10:52:43 node2 kernel: [<ffffffff885865d8>]
:ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Nov 27 10:52:43 node2 kernel: [<ffffffff8880e429>]
:ost:ost_handle+0x3129/0x5a70
Nov 27 10:52:43 node2 kernel: [<ffffffff88478923>]
:lnet:lnet_ni_send+0x93/0xd0
Nov 27 10:52:43 node2 kernel: [<ffffffff8847ad23>]
:lnet:lnet_send+0x973/0x9a0
Nov 27 10:52:43 node2 kernel: [<ffffffff8001c6fa>] __mod_timer+0xb0/0xbe
Nov 27 10:52:43 node2 kernel: [<ffffffff8847f305>]
:lnet:lnet_match_blocked_msg+0x375/0x390
Nov 27 10:52:43 node2 kernel: [<ffffffff8855be8a>]
:ptlrpc:ldlm_resource_foreach+0x11a/0x390
Nov 27 10:52:43 node2 kernel: [<ffffffff884dadf1>]
:obdclass:class_handle2object+0xd1/0x160
Nov 27 10:52:43 node2 kernel: [<ffffffff80148e8c>] __next_cpu+0x19/0x28
Nov 27 10:52:43 node2 kernel: [<ffffffff80088f36>]
find_busiest_group+0x20d/0x621
Nov 27 10:52:43 node2 kernel: [<ffffffff8854119a>]
:ptlrpc:lock_res_and_lock+0xba/0xd0
Nov 27 10:52:43 node2 kernel: [<ffffffff8858bf05>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Nov 27 10:52:43 node2 kernel: [<ffffffff80089d8d>] enqueue_task+0x41/0x56
Nov 27 10:52:43 node2 kernel: [<ffffffff88590c1d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Nov 27 10:52:43 node2 kernel: [<ffffffff88593357>]
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Nov 27 10:52:43 node2 kernel: [<ffffffff8008881d>]
__wake_up_common+0x3e/0x68
Nov 27 10:52:44 node2 kernel: [<ffffffff88596e08>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Nov 27 10:52:44 node2 kernel: [<ffffffff8008a3f3>]
default_wake_function+0x0/0xe
Nov 27 10:52:44 node2 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 27 10:52:44 node2 kernel: [<ffffffff88595bf0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Nov 27 10:52:44 node2 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov 27 10:52:44 node2 kernel:
Nov 27 10:52:44 node2 kernel: LustreError: dumping log to
/tmp/lustre-log.1259315563.6222
I've attached the lustre-log.
I googled for LNetMDBind, but found no useful pages.
I do see the following in Understanding_Lustre_Filesystem_Internals.pdf, but
it's not really helpful to me:
    int LNetMDBind(lnet_md_t        umd,
                   lnet_unlink_t    unlink,
                   lnet_handle_md_t *handle);

This function creates a standalone memory descriptor, i.e., an MD that is
not attached to an ME.
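To make sure I understand what that call is doing, here is my own rough
sketch (not code taken from the Lustre tree) of how a standalone MD gets
built and bound. The lnet_md_t field names and the LNET_EQ_NONE /
LNET_UNLINK constants are my reading of lnet/types.h in 1.8.x, so treat
them as assumptions rather than the exact ptlrpc code path:

    #include <linux/string.h>       /* memset() */
    #include <lnet/lnet.h>          /* assumed LNet API header in 1.8.x */

    /*
     * Describe "buf" with a standalone MD and bind it.  The returned
     * handle would then be given to LNetPut()/LNetGet() as the local
     * buffer of an outgoing operation; "standalone" just means the MD
     * is not attached to a match entry (ME).
     */
    static int bind_send_buffer(void *buf, unsigned int len,
                                lnet_handle_md_t *mdh)
    {
            lnet_md_t md;

            memset(&md, 0, sizeof(md));
            md.start     = buf;           /* memory the MD describes      */
            md.length    = len;
            md.threshold = 1;             /* valid for a single operation */
            md.options   = 0;             /* no special MD flags          */
            md.user_ptr  = NULL;          /* cookie echoed back in events */
            md.eq_handle = LNET_EQ_NONE;  /* assumption: no event queue   */

            /* LNET_UNLINK: unlink the MD automatically once used up. */
            return LNetMDBind(md, LNET_UNLINK, mdh);
    }

If I read the node2 trace right, LNetMDBind is being hit from the bulk I/O
path under ost_brw_read, though I may well be misreading the stack.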
By the way, it's a bit strange that nobody else has run into this bug yet.
tamas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre-log.1259315563.6222.bz2
Type: application/octet-stream
Size: 9859 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20091127/d315eee5/attachment.obj>