[Lustre-discuss] 1.8.1.1

Papp Tamás tompos at martos.bme.hu
Fri Nov 27 02:13:29 PST 2009


On 2009-11-19 20:42, Craig Prescott wrote:
> We had the same problem with 1.8.x.x.
>
> We set lnet.printk=0 on our OSS nodes and it has helped us 
> dramatically - we have not seen the 'soft lockup' problem since.
>
> sysctl -w lnet.printk=0
>
> This will turn off all but 'emerg' messages from lnet.
>
> It would be interesting to know if this avoided the lockups for you, too.

Well, this definitely helped, but it didn't resolve the root of the problem.

A few minutes ago we were unable to reach our cluster from the clients.

On the MDS I see a lot of these:


Nov 27 10:52:17 meta1 kernel: BUG: soft lockup - CPU#3 stuck for 10s! 
[ll_evictor:6123]
Nov 27 10:52:17 meta1 kernel: CPU 3:
Nov 27 10:52:17 meta1 kernel: Modules linked in: mds(U) 
fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) 
mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) 
lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U) 
ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) 
lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) 
dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) 
sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) 
acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) shpchp(U) 
pcspkr(U) sg(U) igb(U) dm_raid45(U) dm_message(U) dm_region_hash(U) 
dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) ata_piix(U) ahci(U) 
libata(U) sd_mod(U) scsi_mod(U) raid1(U) ext3(U) jbd(U) uhci_hcd(U) 
ohci_hcd(U) ehci_hcd(U)
Nov 27 10:52:17 meta1 kernel: Pid: 6123, comm: ll_evictor Tainted: 
G      2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Nov 27 10:52:17 meta1 kernel: RIP: 0010:[<ffffffff80064af5>]  
[<ffffffff80064af5>] _write_lock+0xe/0xf
Nov 27 10:52:17 meta1 kernel: RSP: 0018:ffff8102154d9c78  EFLAGS: 00000246
Nov 27 10:52:17 meta1 kernel: RAX: 000000000000ffff RBX: 
00000000000050c3 RCX: 0000000000042620
Nov 27 10:52:17 meta1 kernel: RDX: 0000000000000282 RSI: 
ffffffff802fae80 RDI: ffffc200108e4c3c
Nov 27 10:52:17 meta1 kernel: RBP: 0000000000000286 R08: 
ffff81000101f3e0 R09: 0000000000000000
Nov 27 10:52:17 meta1 kernel: R10: ffff8101960319c0 R11: 
0000000000000150 R12: 0000000000000286
Nov 27 10:52:17 meta1 kernel: R13: ffff8102154d9c20 R14: 
ffff810196031a18 R15: ffff8101960319c0
Nov 27 10:52:17 meta1 kernel: FS:  00002abb113e4230(0000) 
GS:ffff810107f61540(0000) knlGS:0000000000000000
Nov 27 10:52:17 meta1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
000000008005003b
Nov 27 10:52:18 meta1 kernel: CR2: 00000031d3803080 CR3: 
0000000000201000 CR4: 00000000000006e0
Nov 27 10:52:18 meta1 kernel:
Nov 27 10:52:18 meta1 kernel: Call Trace:
Nov 27 10:52:18 meta1 kernel:  [<ffffffff884caba7>] 
:obdclass:lustre_hash_for_each_empty+0x237/0x2b0
Nov 27 10:52:18 meta1 kernel:  [<ffffffff884d1ae8>] 
:obdclass:class_disconnect+0x398/0x420
Nov 27 10:52:18 meta1 kernel:  [<ffffffff8886d5e1>] 
:mds:mds_disconnect+0x121/0xe40
Nov 27 10:52:18 meta1 kernel:  [<ffffffff8014b87a>] snprintf+0x44/0x4c
Nov 27 10:52:18 meta1 kernel:  [<ffffffff884cd994>] 
:obdclass:class_fail_export+0x384/0x4c0
Nov 27 10:52:18 meta1 kernel:  [<ffffffff8859e238>] 
:ptlrpc:ping_evictor_main+0x4f8/0x7e0
Nov 27 10:52:18 meta1 kernel:  [<ffffffff8008a3f3>] 
default_wake_function+0x0/0xe
Nov 27 10:52:18 meta1 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 27 10:52:18 meta1 kernel:  [<ffffffff8859dd40>] 
:ptlrpc:ping_evictor_main+0x0/0x7e0
Nov 27 10:52:18 meta1 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov 27 10:52:18 meta1 kernel:


After those, I see:

Nov 27 10:53:05 meta1 kernel: LustreError: 
6009:0:(events.c:367:server_bulk_callback()) event type 4, status -5, 
desc ffff81022ed54280
Nov 27 10:53:05 meta1 kernel: Lustre: dafs-MDT0000: haven't heard from 
client 4a05e5fb-590d-ec56-c1b3-ff0234152564 (at 192.168.3.141 at tcp) in 
258 seconds. I think it's dead, and I am evicting it.
Nov 27 10:53:05 meta1 kernel: LustreError: 
6009:0:(events.c:367:server_bulk_callback()) event type 4, status -5, 
desc ffff8102189b0cc0
Nov 27 10:53:05 meta1 kernel: LustreError: 
6145:0:(handler.c:1563:mds_handle()) operation 400 on unconnected MDS 
from 12345-192.168.3.141 at tcp
Nov 27 10:53:05 meta1 kernel: LustreError: 
6145:0:(ldlm_lib.c:1863:target_send_reply_msg()) @@@ processing error 
(-107)  req at ffff8101bec0cc00 x1320487774846997/t0 o400-><?>@<?>:0/0 lens 
192/0 e 0 to
 0 dl 1259315591 ref 1 fl Interpret:H/0/0 rc -107/0
Nov 27 10:53:05 meta1 kernel: Lustre: 
6149:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000: 
b4c8f591-df5c-b6f2-93c1-57e5f64312d0 reconnecting
800; still busy with 1 active RPCs
a02b-9917-c60e-d5b6c1c1d3c8@:0/0 lens 408/0 e 0 to 0 dl 0 ref 1 fl 
New:/0/0 rc 0/0
Nov 27 10:53:05 meta1 kernel: Lustre: 
6282:0:(ldlm_lib.c:766:target_handle_connect()) MGS: exp 
ffff81023e83c000 already connecting
l busy with 1 active RPCs
Nov 27 10:53:05 meta1 kernel: Lustre: 
6282:0:(ldlm_lib.c:766:target_handle_connect()) Skipped 8 previous 
similar messages
Nov 27 10:53:05 meta1 kernel: Lustre: 
6121:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 10 previous 
similar messages
Nov 27 10:53:05 meta1 kernel: LustreError: 
6282:0:(mgs_handler.c:661:mgs_handle()) MGS handle cmd=250 rc=-114
Nov 27 10:53:05 meta1 kernel: LustreError: 
6282:0:(mgs_handler.c:661:mgs_handle()) Skipped 9 previous similar messages
3400 x1320390320856467/t0 
o37->56d39829-d432-2c01-0261-eee2779a4fe3 at NET_0x20000c0a80390_UUID:0/0 
lens 408/360 e 1 to 0 dl 1259315514 ref 1 fl Complete:/0/0 rc -110/-110
Nov 27 10:53:05 meta1 kernel: Lustre: 
6841:0:(watchdog.c:363:lcw_update_time()) Expired watchdog for pid 6841 
disabled after 208.42s
Nov 27 10:53:05 meta1 kernel: LustreError: 
6145:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 17 previous 
similar messages
Nov 27 10:53:06 meta1 kernel: Lustre: 
6161:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000: 
d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:06 meta1 kernel: Lustre: 
6161:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 185 previous 
similar messages
800; still busy with 1 active RPCs
Nov 27 10:53:06 meta1 kernel: Lustre: 
6143:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 1725 previous 
similar messages
Nov 27 10:53:06 meta1 kernel: Lustre: 
6161:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000: 
d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:06 meta1 kernel: Lustre: 
6161:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1763 previous 
similar messages
6703b at NET_0x20000c0a80376_UUID:0/0 lens 368/264 e 0 to 0 dl 1259315686 
ref 1 fl Interpret:/0/0 rc -16/0
Nov 27 10:53:06 meta1 kernel: LustreError: 
6152:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 4173 previous 
similar messages
800; still busy with 1 active RPCs
Nov 27 10:53:07 meta1 kernel: Lustre: 
6148:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 4913 previous 
similar messages
Nov 27 10:53:07 meta1 kernel: Lustre: 
6136:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000: 
d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:07 meta1 kernel: Lustre: 
6136:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 4916 previous 
similar messages
6703b at NET_0x20000c0a80376_UUID:0/0 lens 368/264 e 0 to 0 dl 1259315688 
ref 1 fl Interpret:/0/0 rc -16/0
Nov 27 10:53:08 meta1 kernel: LustreError: 
6160:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 9782 previous 
similar messages
Nov 27 10:53:08 meta1 kernel: Lustre: MGS: haven't heard from client 
f6fe0d62-238f-d5b3-5137-d1d897bc1195 (at 192.168.3.141 at tcp) in 467 
seconds. I think it's dead, and I am evicting it.
Nov 27 10:53:08 meta1 kernel: Lustre: Skipped 1 previous similar message
800; still busy with 1 active RPCs
Nov 27 10:53:09 meta1 kernel: Lustre: 
6156:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 9712 previous 
similar messages
Nov 27 10:53:09 meta1 kernel: Lustre: 
6156:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000: 
d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:09 meta1 kernel: Lustre: 
6156:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 9718 previous 
similar messages
Nov 27 10:53:11 meta1 kernel: LustreError: 
27480:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096), 
evicting d3728ce6-b580-dbb3-b262-123d1516703b at NET_0x20000c0a80376_UUID
t:/0/0 rc 0/0
Nov 27 10:54:23 meta1 hpasmpld[2804]: Sensor 9b is missing. Reading 
Flags = 40
d3728ce6-b580-dbb3-b262-123d1516703b at NET_0x20000c0a80376_UUID:0/0 lens 
408/528 e 0 to 0 dl 1259315709 ref 1 fl Interpret:/0/0 rc 0/0
Nov 27 10:55:03 meta1 kernel: LustreError: 
27480:0:(mds_open.c:1679:mds_close()) Skipped 3 previous similar messages
516703b at NET_0x20000c0a80376_UUID:0/0 lens 408/528 e 0 to 0 dl 1259315709 
ref 1 fl Interpret:/0/0 rc -116/0
Nov 27 10:55:03 meta1 kernel: LustreError: 
27480:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 14495 previous 
similar messages
Nov 27 10:56:23 meta1 hpasmpld[2804]: Sensor 9b is missing. Reading 
Flags = 40


On node2 there is a single one of these:

Nov 27 10:52:43 node2 kernel: ll_ost_io_68  S ffff81023c101080     0  
6222      1          6223  6221 (L-TLB)
Nov 27 10:52:43 node2 kernel:  ffff810209f5b960 0000000000000046 
0000000000000001 ffffffff884776b0
Nov 27 10:52:43 node2 kernel:  ffff810107f24000 000000000000000a 
ffff81022408a7e0 ffff81023c101080
Nov 27 10:52:43 node2 kernel:  000254bcf81e00e9 00000000000003b2 
ffff81022408a9c8 0000000100000000
Nov 27 10:52:43 node2 kernel: Call Trace:
Nov 27 10:52:43 node2 kernel:  [<ffffffff884776b0>] 
:lnet:LNetMDBind+0x2c0/0x420
Nov 27 10:52:43 node2 kernel:  [<ffffffff8003d382>] 
lock_timer_base+0x1b/0x3c
Nov 27 10:52:43 node2 kernel:  [<ffffffff8001c6fa>] __mod_timer+0xb0/0xbe
Nov 27 10:52:43 node2 kernel:  [<ffffffff80063947>] 
schedule_timeout+0x8a/0xad
Nov 27 10:52:43 node2 kernel:  [<ffffffff80094f4f>] process_timeout+0x0/0x5
Nov 27 10:52:43 node2 kernel:  [<ffffffff888060dc>] 
:ost:ost_brw_read+0x127c/0x1a30
Nov 27 10:52:43 node2 kernel:  [<ffffffff8008a3f3>] 
default_wake_function+0x0/0xe
Nov 27 10:52:43 node2 kernel:  [<ffffffff885865d8>] 
:ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Nov 27 10:52:43 node2 kernel:  [<ffffffff8880e429>] 
:ost:ost_handle+0x3129/0x5a70
Nov 27 10:52:43 node2 kernel:  [<ffffffff88478923>] 
:lnet:lnet_ni_send+0x93/0xd0
Nov 27 10:52:43 node2 kernel:  [<ffffffff8847ad23>] 
:lnet:lnet_send+0x973/0x9a0
Nov 27 10:52:43 node2 kernel:  [<ffffffff8001c6fa>] __mod_timer+0xb0/0xbe
Nov 27 10:52:43 node2 kernel:  [<ffffffff8847f305>] 
:lnet:lnet_match_blocked_msg+0x375/0x390
Nov 27 10:52:43 node2 kernel:  [<ffffffff8855be8a>] 
:ptlrpc:ldlm_resource_foreach+0x11a/0x390
Nov 27 10:52:43 node2 kernel:  [<ffffffff884dadf1>] 
:obdclass:class_handle2object+0xd1/0x160
Nov 27 10:52:43 node2 kernel:  [<ffffffff80148e8c>] __next_cpu+0x19/0x28
Nov 27 10:52:43 node2 kernel:  [<ffffffff80088f36>] 
find_busiest_group+0x20d/0x621
Nov 27 10:52:43 node2 kernel:  [<ffffffff8854119a>] 
:ptlrpc:lock_res_and_lock+0xba/0xd0
Nov 27 10:52:43 node2 kernel:  [<ffffffff8858bf05>] 
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Nov 27 10:52:43 node2 kernel:  [<ffffffff80089d8d>] enqueue_task+0x41/0x56
Nov 27 10:52:43 node2 kernel:  [<ffffffff88590c1d>] 
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Nov 27 10:52:43 node2 kernel:  [<ffffffff88593357>] 
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Nov 27 10:52:43 node2 kernel:  [<ffffffff8008881d>] 
__wake_up_common+0x3e/0x68
Nov 27 10:52:44 node2 kernel:  [<ffffffff88596e08>] 
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Nov 27 10:52:44 node2 kernel:  [<ffffffff8008a3f3>] 
default_wake_function+0x0/0xe
Nov 27 10:52:44 node2 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 27 10:52:44 node2 kernel:  [<ffffffff88595bf0>] 
:ptlrpc:ptlrpc_main+0x0/0x13e0
Nov 27 10:52:44 node2 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov 27 10:52:44 node2 kernel:
Nov 27 10:52:44 node2 kernel: LustreError: dumping log to 
/tmp/lustre-log.1259315563.6222

I attach the lustre-log.

I googled for LNetMDBind, but found nothing useful.

I see this in Understanding_Lustre_Filesystem_Internals.pdf, but it's not 
really helpful to me:

int LNetMDBind(
    lnet_md_t umd,
    lnet_unlink_t unlink,
    lnet_handle_md_t *handle)

This function creates a standalone memory descriptor, i.e., an MD that is 
not attached to an ME.
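
For context, here is a minimal sketch of how a standalone MD created by 
LNetMDBind() seems to be used for an outgoing send, pieced together from 
that document; the buffer, event-queue and peer variables (buf, buf_len, 
cookie, eqh, self_nid, peer_id, portal, match_bits) are placeholders of 
mine, not names taken from the Lustre source:

/* Sketch only: bind a standalone MD around a send buffer and push it
 * out with LNetPut(); variable names are illustrative placeholders. */
lnet_md_t md;
lnet_handle_md_t mdh;
int rc;

memset(&md, 0, sizeof(md));
md.start     = buf;        /* payload to send */
md.length    = buf_len;
md.threshold = 1;          /* MD retires after one network operation */
md.options   = 0;
md.user_ptr  = cookie;     /* handed back in the completion event */
md.eq_handle = eqh;        /* event queue that will see the send/ack events */

rc = LNetMDBind(md, LNET_UNLINK, &mdh);   /* standalone MD, no ME attached */
if (rc == 0)
        rc = LNetPut(self_nid, mdh, LNET_NOACK_REQ, peer_id,
                     portal, match_bits, 0, 0);

If I understand it right, ptlrpc does something along these lines when it 
sends replies and bulk descriptors, which would explain why LNetMDBind 
shows up in the ll_ost_io stack above.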

By the way, it's a bit strange that nobody else has hit this bug yet.


tamas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre-log.1259315563.6222.bz2
Type: application/octet-stream
Size: 9859 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20091127/d315eee5/attachment.obj>

