[Lustre-discuss] 1.8.1.1

Papp Tamás tompos at martos.bme.hu
Mon Nov 9 10:34:36 PST 2009


Hi all,

First of all, I'm sorry I cannot write a better bug report: I'm far away
from the host, and right now there is no remote access.
My colleague sent this in by email:


Nov  9 19:05:55 node1 kernel:  [<ffffffff88591357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Nov  9 19:05:55 node1 kernel:  [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
Nov  9 19:05:55 node1 kernel:  [<ffffffff8008881d>] __wake_up_common+0x3e/0x68
Nov  9 19:05:55 node1 kernel:  [<ffffffff88594e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Nov  9 19:05:55 node1 kernel:  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Nov  9 19:05:55 node1 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov  9 19:05:55 node1 kernel:  [<ffffffff88593bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Nov  9 19:05:55 node1 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov  9 19:05:55 node1 kernel:
Nov  9 19:05:55 node1 kernel: Lustre: 4381:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.3.149@tcp
Nov  9 19:06:05 node1 last message repeated 409322 times
Nov  9 19:06:05 node1 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_ost_82:4381]
Nov  9 19:06:05 node1 kernel: CPU 1:
Nov  9 19:06:05 node1 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) igb(U) shpchp(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) cciss(U) ata_piix(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Nov  9 19:06:05 node1 kernel: Pid: 4381, comm: ll_ost_82 Tainted: G      2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Nov  9 19:06:05 node1 kernel: RIP: 0010:[<ffffffff80064ae0>]  [<ffffffff80064ae0>] _spin_lock+0x3/0xa
Nov  9 19:06:05 node1 kernel: RSP: 0018:ffff8102217b7758  EFLAGS: 00000246
Nov  9 19:06:05 node1 kernel: RAX: 0000000000000008 RBX: ffff81022c80b400 RCX: 0000000000000000
Nov  9 19:06:05 node1 kernel: RDX: ffff81023df660a0 RSI: 0000000000000000 RDI: ffff81023df66250
Nov  9 19:06:05 node1 kernel: RBP: ffff81022c80b400 R08: ffff81022c80b530 R09: 0000000000000000
Nov  9 19:06:05 node1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000013
Nov  9 19:06:05 node1 kernel: R13: ffffffff8857327c R14: 0000000500000000 R15: 0000000000000007
Nov  9 19:06:05 node1 kernel: FS:  00002b4cfcfab230(0000) GS:ffff810107ed96c0(0000) knlGS:0000000000000000
Nov  9 19:06:05 node1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov  9 19:06:05 node1 kernel: CR2: 00002abf9b119158 CR3: 0000000000201000 CR4: 00000000000006e0
Nov  9 19:06:05 node1 kernel:
Nov  9 19:06:05 node1 kernel: Call Trace:
Nov  9 19:06:05 node1 kernel:  [<ffffffff8857cafc>] :ptlrpc:ptlrpc_queue_wait+0x103c/0x1690
Nov  9 19:06:05 node1 kernel:  [<ffffffff8858a515>] :ptlrpc:lustre_msg_set_opc+0x45/0x120
Nov  9 19:06:05 node1 kernel:  [<ffffffff88574085>] :ptlrpc:ptlrpc_at_set_req_timeout+0x85/0xd0
Nov  9 19:06:05 node1 kernel:  [<ffffffff885748a9>] :ptlrpc:ptlrpc_prep_req_pool+0x619/0x6b0
Nov  9 19:06:05 node1 kernel:  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Nov  9 19:06:05 node1 kernel:  [<ffffffff88564196>] :ptlrpc:ldlm_server_glimpse_ast+0x266/0x3b0
Nov  9 19:06:05 node1 kernel:  [<ffffffff88570f03>] :ptlrpc:interval_iterate_reverse+0x73/0x240
Nov  9 19:06:05 node1 kernel:  [<ffffffff88558f20>] :ptlrpc:ldlm_process_extent_lock+0x0/0xad0
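
For what it's worth, the soft lockup in ll_ost_82 (pid 4381) hits right after
that same thread logged the "No usable routes" message 409322 times in about
ten seconds, so the message flood and the lockup look related. Once we regain
access I plan to check how LNET sees that peer; this is a rough sketch of the
commands I have in mind (the stock lctl tooling from 1.8, with the NID
192.168.3.149@tcp taken from the log above):

  # on the OSS (node1):
  lctl list_nids                 # NIDs this server is configured with
  lctl ping 192.168.3.149@tcp    # can the OSS reach the peer NID at all?
  cat /proc/sys/lnet/peers       # LNET's per-peer connection state
  cat /proc/sys/lnet/routes      # any configured LNET routes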

History:

The "cluster" is up for approx. 10 days. It has only one MDS and 2 OSS 
computers.
On the second day the node1 locked up with no usable messages on the 
screen, and similar messages in the log as above.
After I restarted it the cluster was running for a week without bigger 
error messages.

On Sunday we moved the servers, so we had to restart them.

This morning node2 locked up, and a few hours later node1 also started to
give up.

The systems are based on CentOS 5.4 with the official packages from Sun.
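
In case the exact versions matter, this is what I will collect once the
machines are reachable again (standard commands, nothing beyond uname and an
rpm query):

  uname -r                  # running kernel; per the trace it should be
                            # 2.6.18-128.7.1.el5_lustre.1.8.1.1
  rpm -qa | grep -i lustre  # installed Lustre packages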

Is this a known bug?

Thank you,

tamas



