[Lustre-discuss] Softlockup issues. Lustre related?

Alex Lee alee at datadirectnet.com
Wed Aug 27 20:20:47 PDT 2008


Hello Folks,

I have a few client nodes that are getting soft lockup errors. These are patchless clients running Lustre 1.6.5.1 with kernel 2.6.18-53.1.6.el5-PAPI, which is more or less stock RHEL 5.1 with the PAPI patch added. The MDS and OSS are running Lustre 1.6.5.1 with the supplied Lustre kernels and OFED 1.3.1.

I remember there was an issue with __d_lookup in the past, but I thought it was fixed in the newest release of Lustre, so I don't know if this is related in any way. I don't see any other real Lustre error messages on the client or the MDS/OSS at the time of the soft lockup. Also, wasn't there a softirq issue? I don't think this is related to that...

Thanks,
-Alex


Here is the syslog output:

Aug 24 04:02:03 papi-0476 syslogd 1.4.1: restart.
Aug 25 00:52:24 papi-0476 kernel: Losing some ticks... checking if CPU frequency changed.
Aug 25 15:17:05 papi-0476 kernel: cpsMPI.x[10493]: segfault at 00002aaab23ce000 rip 00000000004a8fce rsp 00007fff584d6cf0 error 4
Aug 27 04:15:00 papi-0476 kernel: BUG: soft lockup detected on CPU#4!
Aug 27 04:15:00 papi-0476 kernel:
Aug 27 04:15:00 papi-0476 kernel: Call Trace:
Aug 27 04:15:00 papi-0476 kernel:  <IRQ>  [<ffffffff800b619e>] softlockup_tick+0xd5/0xe7
Aug 27 04:15:00 papi-0476 kernel:  [<ffffffff800910ab>] tasklet_action+0x62/0xac
Aug 27 04:15:00 papi-0476 kernel:  [<ffffffff800941bb>] update_process_times+0x54/0x7a
Aug 27 04:15:00 papi-0476 kernel:  [<ffffffff80075727>] smp_local_timer_interrupt+0x2c/0x61
Aug 27 04:15:00 papi-0476 kernel:  [<ffffffff8005d368>] call_softirq+0x1c/0x28
Aug 27 04:15:00 papi-0476 kernel:  [<ffffffff80075def>] smp_apic_timer_interrupt+0x41/0x47
Aug 27 04:15:00 papi-0476 kernel:  [<ffffffff8005cc8e>] apic_timer_interrupt+0x66/0x6c
Aug 27 04:15:00 papi-0476 kernel:  <EOI>  [<ffffffff8002c285>] dummy_inode_permission+0x0/0x3
Aug 27 04:15:00 papi-0476 kernel:  [<ffffffff800094bb>] __d_lookup+0xd2/0xff
Aug 27 04:15:00 papi-0476 kernel:  [<ffffffff80009499>] __d_lookup+0xb0/0xff
Aug 27 04:15:00 papi-0476 kernel:  [<ffffffff8000c990>] do_lookup+0x2c/0x1d4
Aug 27 04:15:00 papi-0476 kernel:  [<ffffffff80009ee9>] __link_path_walk+0xa01/0xf42
Aug 27 04:15:00 papi-0476 kernel:  [<ffffffff8000e5ca>] link_path_walk+0x5c/0xe5
Aug 27 04:15:00 papi-0476 kernel:  [<ffffffff8000c7e8>] do_path_lookup+0x270/0x2e8
Aug 27 04:15:01 papi-0476 kernel:  [<ffffffff800234de>] __user_walk_fd+0x37/0x4c
Aug 27 04:15:01 papi-0476 kernel:  [<ffffffff8003e6cf>] vfs_lstat_fd+0x18/0x47
Aug 27 04:15:01 papi-0476 kernel:  [<ffffffff800dc1b0>] sys_newfstatat+0x22/0x43
Aug 27 04:15:01 papi-0476 kernel:  [<ffffffff8005c116>] system_call+0x7e/0x83
Aug 27 04:15:01 papi-0476 kernel:
Aug 28 04:03:14 papi-0476 kernel: BUG: soft lockup detected on CPU#0!
Aug 28 04:03:14 papi-0476 kernel:
Aug 28 04:03:14 papi-0476 kernel: Call Trace:
Aug 28 04:03:14 papi-0476 kernel:  <IRQ>  [<ffffffff800b619e>] softlockup_tick+0xd5/0xe7
Aug 28 04:03:14 papi-0476 kernel:  [<ffffffff800941bb>] update_process_times+0x54/0x7a
Aug 28 04:03:14 papi-0476 kernel:  [<ffffffff80075727>] smp_local_timer_interrupt+0x2c/0x61
Aug 28 04:03:14 papi-0476 kernel:  [<ffffffff80075def>] smp_apic_timer_interrupt+0x41/0x47
Aug 28 04:03:14 papi-0476 kernel:  [<ffffffff8005cc8e>] apic_timer_interrupt+0x66/0x6c
Aug 28 04:03:14 papi-0476 kernel:  <EOI>  [<ffffffff8002c285>] dummy_inode_permission+0x0/0x3
Aug 28 04:03:14 papi-0476 kernel:  [<ffffffff800094cb>] __d_lookup+0xe2/0xff
Aug 28 04:03:14 papi-0476 kernel:  [<ffffffff80009499>] __d_lookup+0xb0/0xff
Aug 28 04:03:14 papi-0476 kernel:  [<ffffffff8000c990>] do_lookup+0x2c/0x1d4
Aug 28 04:03:14 papi-0476 kernel:  [<ffffffff80009ee9>] __link_path_walk+0xa01/0xf42
Aug 28 04:03:14 papi-0476 kernel:  [<ffffffff8000e5ca>] link_path_walk+0x5c/0xe5
Aug 28 04:03:14 papi-0476 kernel:  [<ffffffff8000c7e8>] do_path_lookup+0x270/0x2e8
Aug 28 04:03:14 papi-0476 kernel:  [<ffffffff800234de>] __user_walk_fd+0x37/0x4c
Aug 28 04:03:14 papi-0476 kernel:  [<ffffffff8003e6cf>] vfs_lstat_fd+0x18/0x47
Aug 28 04:03:14 papi-0476 kernel:  [<ffffffff800dc1b0>] sys_newfstatat+0x22/0x43
Aug 28 04:03:14 papi-0476 kernel:  [<ffffffff8005c116>] system_call+0x7e/0x83
Aug 28 04:03:14 papi-0476 kernel:
Aug 28 04:12:12 papi-0476 kernel: BUG: soft lockup detected on CPU#3!
Aug 28 04:12:17 papi-0476 kernel:
Aug 28 04:12:18 papi-0476 kernel: Call Trace:
Aug 28 04:12:18 papi-0476 kernel:  <IRQ>  [<ffffffff800b619e>] softlockup_tick+0xd5/0xe7
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff800941bb>] update_process_times+0x54/0x7a
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff80075727>] smp_local_timer_interrupt+0x2c/0x61
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff80075def>] smp_apic_timer_interrupt+0x41/0x47
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff8005cc8e>] apic_timer_interrupt+0x66/0x6c
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff80007660>] kmem_cache_free+0x1c0/0x1cb
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff8009a5dc>] __rcu_process_callbacks+0x122/0x1a8
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff8009a685>] rcu_process_callbacks+0x23/0x43
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff800910ab>] tasklet_action+0x62/0xac
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff80011cc5>] __do_softirq+0x5e/0xd5
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff8005d368>] call_softirq+0x1c/0x28
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff8006b565>] do_softirq+0x2c/0x85
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff8005cc8e>] apic_timer_interrupt+0x66/0x6c
Aug 28 04:12:18 papi-0476 kernel:  <EOI>  [<ffffffff800cb2c6>] __put_swap_token+0xb/0x52
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff8003b7dc>] mmput+0x68/0x83
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff80014f94>] do_exit+0x28b/0x89d
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff80046eac>] cpuset_exit+0x0/0x6c
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff8002ad92>] get_signal_to_deliver+0x42c/0x45a
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff800590f9>] do_notify_resume+0x9c/0x7a9
Aug 28 04:12:18 papi-0476 kernel:  [<ffffffff80021d2e>] __up_read+0x19/0x7f
Aug 28 04:12:19 papi-0476 kernel:  [<ffffffff80065ac9>] do_page_fault+0x4eb/0x81d
Aug 28 04:12:19 papi-0476 kernel:  [<ffffffff800270db>] do_filp_open+0x2a/0x38
Aug 28 04:12:19 papi-0476 kernel:  [<ffffffff8003fc68>] do_ioctl+0x5c/0x6b
Aug 28 04:12:19 papi-0476 kernel:  [<ffffffff8005c6dc>] retint_signal+0x3d/0x79
Aug 28 04:12:19 papi-0476 kernel:
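
For anyone comparing traces like these across several nodes, here is a minimal Python sketch that groups the frames listed after the <EOI> marker by lockup report. It is only a triage aid, not a diagnosis: the function names (summarize_lockups, count_frames) are hypothetical, the regular expressions assume the exact RHEL 5 syslog layout shown above, and frames printed by this kernel generation may include stale stack entries.

```python
import re
from collections import Counter

# Header line of a report, e.g. "BUG: soft lockup detected on CPU#4!"
LOCKUP_RE = re.compile(r"BUG: soft lockup detected on CPU#(\d+)!")
# One stack frame, e.g. "[<ffffffff800094bb>] __d_lookup+0xd2/0xff"
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_lockups(lines):
    """Return a list of (cpu, frames) tuples, one per soft lockup report.

    Only frames on or after the <EOI> line are kept, i.e. the part of the
    trace that belongs to the interrupted context rather than the timer IRQ.
    """
    reports = []
    current = None
    after_eoi = False
    for line in lines:
        header = LOCKUP_RE.search(line)
        if header:
            current = (int(header.group(1)), [])
            reports.append(current)
            after_eoi = False
            continue
        if current is None:
            continue  # ignore anything before the first report
        if "<EOI>" in line:
            after_eoi = True
        frame = FRAME_RE.search(line)
        if frame and after_eoi:
            current[1].append(frame.group(1))
    return reports


def count_frames(reports):
    """Count in how many reports each function name appears at least once."""
    counts = Counter()
    for _cpu, frames in reports:
        counts.update(set(frames))
    return counts
```

Feeding it the lines of a saved syslog file (for example, the result of open(path).readlines(), where the path is whatever your site uses) and printing count_frames(...).most_common() gives a quick view of whether the same lookup path keeps showing up.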






