[Lustre-discuss] oracle lustre 1.8.7 kernel panics on dell c6145 - AMD Opteron 6234

Rappleye, Jason (ARC-TN)[Computer Sciences Corporation] jason.rappleye at nasa.gov
Tue Jul 17 21:21:21 PDT 2012


Hi,

It's not a Lustre issue. It could be this:

https://bugzilla.kernel.org/show_bug.cgi?id=16991


The bug mentions a 2.6.32 kernel, but some googling shows that it happens
with 2.6.18 as well:

https://bugzilla.redhat.com/show_bug.cgi?id=549853


The workaround in comment 16 (of the first link above) did the trick for
us. If you're experiencing this particular crash often enough, it should
be easy enough to try the workaround to see if it fixes the problem.
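
For what it's worth, the trace itself looks like a plain divide-by-zero in
the scheduler's load balancing rather than anything in Lustre. Below is a
minimal Python sketch (my own illustration, not taken from either bug
report) that decodes the first instruction on the "Code:" line. It assumes
the "Code:" bytes start at the faulting RIP, which appears to be the case
for this kernel's x86_64 oops format. The helper name decode_div is
hypothetical, and the register values are copied from the CPU 1 oops.

```python
# Register names indexed by the 3-bit ModRM r/m field (no REX.B extension).
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_div(code_bytes):
    """Decode a register-form 64-bit DIV (REX.W + F7 /6).

    Returns the name of the divisor register, or None if the first three
    bytes are not that instruction. Only the plain REX.W prefix (0x48) is
    handled; this is a sketch, not a general disassembler.
    """
    rex, opcode, modrm = code_bytes[:3]
    if rex != 0x48 or opcode != 0xF7:
        return None
    mod = modrm >> 6          # 3 means the operand is a register
    reg = (modrm >> 3) & 7    # opcode extension; 6 selects DIV
    rm = modrm & 7            # which register is the divisor
    if mod != 3 or reg != 6:
        return None
    return REGS64[rm]


# "Code:" line and register dump as posted in the CPU 1 oops.
code_line = "48 f7 f6 49 c1 ee 07 83 7d cc 00 74 1c 48 8b 55 d0 4c 89 a5"
registers = {"rax": 0x4000, "rdx": 0x0, "rsi": 0x0}

divisor = decode_div(bytes.fromhex(code_line))
print(divisor, hex(registers[divisor]))  # prints: rsi 0x0
```

If that reading is right, the faulting instruction is "div %rsi", and RSI is
zero in all three oopses, which fits a "divide error" raised inside
find_busiest_group.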

Jason

On 7/17/12 8:48 PM, "Lenny Shovsky" <lenny at wirewalk.com> wrote:

>This has otherwise been very stable on similar Opteron 6174 platforms
>and many new Xeons, but the Opteron 6234 seems to have issues.
>
>Is it SMP related? Is anyone using AMD 6234 models, or specifically
>Dell C6145s?
>
>The full crash output is here:
>
>http://pastebin.com/AtvtCwXf
>
>A sample is below.
>
>Thanks a lot in advance!
>
>
>
>AMD Opteron(TM) Processor 6234                  stepping 02
>Brought up 48 CPUs
>testing NMI watchdog ... OK.
>time.c: Using 14.318180 MHz WALL HPET GTOD HPET/TSC timer.
>time.c: Detected 2400.187 MHz processor.
>divide error: 0000 [1] SMP
>last sysfs file:
>CPU 1
>Modules linked in:
>Pid: 0, comm: swapper Not tainted 2.6.18-194.17.1.el5_lustre.1.8.7 #1
>RIP: 0010:[<ffffffff8008bb03>]  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
>RSP: 0018:ffff81102805fdb8  EFLAGS: 00010046
>RAX: 0000000000004000 RBX: 00000000000000ff RCX: 0000000000000000
>RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000c0
>RBP: ffff81102805fea8 R08: 0000000000000006 R09: 000000000000003a
>R10: ffff810836279e08 R11: 0000000000000048 R12: ffff810836279e00
>R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000004000
>FS:  0000000000000000(0000) GS:ffff81010e95eec0(0000) knlGS:0000000000000000
>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
>Process swapper (pid: 0, threadinfo ffff810828060000, task ffff8104280537a0)
>Stack:  0000000000000000 ffff81102805fee8 ffff81102805ff10 0000000000000000
> ffff81102805ff08 000000010100caa0 ffff81000100e260 0000000000000000
> 0000000000000000 0000000000000000 0000000000000080 0000000000000000
>Call Trace:
> <IRQ>  [<ffffffff8008dba0>] rebalance_tick+0x183/0x3cb
> [<ffffffff8009829f>] update_process_times+0x68/0x78
> [<ffffffff80077bc3>] smp_local_timer_interrupt+0x2f/0x66
> [<ffffffff800781ff>] smp_apic_timer_interrupt+0x41/0x47
> [<ffffffff80057018>] mwait_idle+0x0/0x4a
> [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
> <EOI>  [<ffffffff8005704e>] mwait_idle+0x36/0x4a
> [<ffffffff80049206>] cpu_idle+0x95/0xb8
> [<ffffffff8007796b>] start_secondary+0x498/0x4a7
>
>
>Code: 48 f7 f6 49 c1 ee 07 83 7d cc 00 74 1c 48 8b 55 d0 4c 89 a5
>RIP  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
> RSP <ffff81102805fdb8>
> <0>Kernel panic - not syncing: Fatal exception
>divide error: 0000 [2] SMP
>last sysfs file:
>CPU 5
>Modules linked in:
>Pid: 0, comm: swapper Not tainted 2.6.18-194.17.1.el5_lustre.1.8.7 #1
>RIP: 0010:[<ffffffff8008bb03>]  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
>RSP: 0018:ffff81083616bdb8  EFLAGS: 00010046
>RAX: 0000000000004000 RBX: 00000000000000ff RCX: 0000000000000000
>RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000c0
>RBP: ffff81083616bea8 R08: 0000000000000006 R09: 000000000000003a
>R10: ffff810836279e08 R11: 0000000000000048 R12: ffff810836279e00
>R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000004000
>FS:  0000000000000000(0000) GS:ffff81183611a2c0(0000) knlGS:0000000000000000
>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
>Process swapper (pid: 0, threadinfo ffff81010e9d0000, task ffff811c36155080)
>Stack:  0000000000000000 ffff81083616bee8 ffff81083616bf10 0000000000000000
> ffff81083616bf08 0000000500000000 ffff81000102fc60 0000000000000000
> 0000000000000000 0000000000000000 0000000000000080 0000000000000000
>Call Trace:
> <IRQ>  [<ffffffff8008dba0>] rebalance_tick+0x183/0x3cb
> [<ffffffff8009829f>] update_process_times+0x68/0x78
> [<ffffffff80077bc3>] smp_local_timer_interrupt+0x2f/0x66
> [<ffffffff800781ff>] smp_apic_timer_interrupt+0x41/0x47
> [<ffffffff80057018>] mwait_idle+0x0/0x4a
> [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
> <EOI>  [<ffffffff8005704e>] mwait_idle+0x36/0x4a
> [<ffffffff80049206>] cpu_idle+0x95/0xb8
> [<ffffffff8007796b>] start_secondary+0x498/0x4a7
>
>
>Code: 48 f7 f6 49 c1 ee 07 83 7d cc 00 74 1c 48 8b 55 d0 4c 89 a5
>RIP  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
> RSP <ffff81083616bdb8>
> <0>Kernel panic - not syncing: Fatal exception
> <0>divide error: 0000 [3] SMP
>last sysfs file:
>CPU 8
>Modules linked in:
>Pid: 0, comm: swapper Not tainted 2.6.18-194.17.1.el5_lustre.1.8.7 #1
>RIP: 0010:[<ffffffff8008bb03>]  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
>RSP: 0018:ffff811036173db8  EFLAGS: 00010002
>RAX: 0000000000000000 RBX: 00000000000000ff RCX: 0000000000000000
>RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000c0
>RBP: ffff811036173ea8 R08: 0000000000000012 R09: 000000000000002e
>R10: ffff8104363251c8 R11: 0000000000000048 R12: ffff8104363251c0
>R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>FS:  0000000000000000(0000) GS:ffff810436286940(0000) knlGS:0000000000000000
>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
>Process swapper (pid: 0, threadinfo ffff81083618e000, task ffff810436229820)
>Stack:  0000000000000000 ffff811036173ee8 ffff811036173f10 0000000000000000
> ffff811036173f08 0000000800000000 ffff8104360b38e0 0000000000000000
> ffff810436325180 0000000000000000 0000000000000000 0000000000000000
>Call Trace:
> <IRQ>  [<ffffffff8008dba0>] rebalance_tick+0x183/0x3cb
> [<ffffffff8009829f>] update_process_times+0x68/0x78
> [<ffffffff80077bc3>] smp_local_timer_interrupt+0x2f/0x66
> [<ffffffff800781ff>] smp_apic_timer_interrupt+0x41/0x47
> [<ffffffff80057018>] mwait_idle+0x0/0x4a
> [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
> <EOI>  [<ffffffff8005704e>] mwait_idle+0x36/0x4a
> [<ffffffff80049206>] cpu_idle+0x95/0xb8
> [<ffffffff8007796b>] start_secondary+0x498/0x4a7
