[Lustre-discuss] oracle lustre 1.8.7 kernel panics on dell c6145 - AMD Opteron 6234

Lenny Shovsky lenny at wirewalk.com
Wed Jul 18 06:50:33 PDT 2012


Thank you for your reply.  I agree this isn't a Lustre problem, but
since we're using the officially distributed Lustre.org kernel, I hope
the patches mentioned can find their way into the official Lustre
kernels as well.

I'm taking this to the kernel.org team for the time being.  I found
patches for 2.6.32, but nothing for 2.6.18 yet.

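For anyone comparing their own traces against the bug reports, here is a
rough sketch of how the dump quoted below can be read. It assumes the
"Code:" line starts at the faulting RIP, which appears to be the case in
this trace: the first bytes, 48 f7 f6, encode a 64-bit "div %rsi", and RSI
is zero in the register dump, so the divide error looks like a plain
division by zero inside find_busiest_group. The helper names
(strip_quotes, decode_div, divisor_at_fault) are made up for illustration,
and the decoder recognizes only this one instruction form; it is not a
general disassembler.

```python
import re

# 64-bit register names indexed by the 3-bit ModRM r/m field (no REX.B).
REG64 = ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI"]


def strip_quotes(text):
    """Remove leading '>' mail-quote markers from every line."""
    return "\n".join(re.sub(r"^(?:>\s?)+", "", line)
                     for line in text.splitlines())


def parse_registers(text):
    """Return {name: value} for each 'REG: <16 hex digits>' pair."""
    pairs = re.findall(r"\b(R[A-Z0-9]{2}):\s*([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def decode_div(code):
    """Return the divisor register if code starts with 'div <reg64>'."""
    if len(code) < 3:
        return None
    rex, opcode, modrm = code[0], code[1], code[2]
    # REX.W (0x48), opcode F7, ModRM with mod=11 and reg field /6 (DIV).
    if (rex == 0x48 and opcode == 0xF7
            and modrm >> 6 == 0b11 and (modrm >> 3) & 7 == 6):
        return REG64[modrm & 7]
    return None


def divisor_at_fault(oops_text):
    """Return (register, value) for a faulting 'div', or None.

    Assumption: the Code: bytes begin at the faulting instruction.
    Pass a single oops block, not several concatenated ones.
    """
    text = strip_quotes(oops_text)
    match = re.search(r"^Code:[ \t]*((?:[0-9a-f]{2}[ \t]*)+)", text, re.M)
    if match is None:
        return None
    code = bytes(int(b, 16) for b in match.group(1).split())
    reg = decode_div(code)
    if reg is None:
        return None
    return reg, parse_registers(text).get(reg)


# Lines copied from the first oops in the quoted message.
SAMPLE = """\
>>divide error: 0000 [1] SMP
>>RAX: 0000000000004000 RBX: 00000000000000ff RCX: 0000000000000000
>>RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000c0
>>Code: 48 f7 f6 49 c1 ee 07 83 7d cc 00 74 1c 48 8b 55 d0 4c 89 a5
"""

if __name__ == "__main__":
    print(divisor_at_fault(SAMPLE))  # ('RSI', 0)
```

RSI is also zero in the other two oopses in the quoted message, so the
same check should give the same result for each of them.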

On Wed, Jul 18, 2012 at 12:21 AM, Rappleye, Jason  (ARC-TN)[Computer
Sciences Corporation] <jason.rappleye at nasa.gov> wrote:
> Hi,
>
> It's not a Lustre issue. It could be this:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=16991
>
>
> The bug mentions a 2.6.32 kernel, but some googling shows that it happens
> with 2.6.18 as well:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=549853
>
>
> The workaround in comment 16 (of the first link above) did the trick for
> us. If you're experiencing this particular crash often enough, it should
> be easy enough to try the workaround to see if it fixes the problem.
>
> Jason
>
> On 7/17/12 8:48 PM, "Lenny Shovsky" <lenny at wirewalk.com> wrote:
>
>>This has otherwise been very stable on similar Opteron 6174 platforms
>>and many new Xeons, but the Opteron 6234 seems to have issues.
>>
>>Is it SMP related? Is anyone using AMD 6234 models, or specifically
>>Dell C6145s?
>>
>>full crash output is here.
>>
>>http://pastebin.com/AtvtCwXf
>>
>>sample is below.
>>
>>thanks a lot in advance !
>>
>>
>>
>>AMD Opteron(TM) Processor 6234                  stepping 02
>>Brought up 48 CPUs
>>testing NMI watchdog ... OK.
>>time.c: Using 14.318180 MHz WALL HPET GTOD HPET/TSC timer.
>>time.c: Detected 2400.187 MHz processor.
>>divide error: 0000 [1] SMP
>>last sysfs file:
>>CPU 1
>>Modules linked in:
>>Pid: 0, comm: swapper Not tainted 2.6.18-194.17.1.el5_lustre.1.8.7 #1
>>RIP: 0010:[<ffffffff8008bb03>]  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
>>RSP: 0018:ffff81102805fdb8  EFLAGS: 00010046
>>RAX: 0000000000004000 RBX: 00000000000000ff RCX: 0000000000000000
>>RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000c0
>>RBP: ffff81102805fea8 R08: 0000000000000006 R09: 000000000000003a
>>R10: ffff810836279e08 R11: 0000000000000048 R12: ffff810836279e00
>>R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000004000
>>FS:  0000000000000000(0000) GS:ffff81010e95eec0(0000) knlGS:0000000000000000
>>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>>CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
>>Process swapper (pid: 0, threadinfo ffff810828060000, task ffff8104280537a0)
>>Stack:  0000000000000000 ffff81102805fee8 ffff81102805ff10 0000000000000000
>> ffff81102805ff08 000000010100caa0 ffff81000100e260 0000000000000000
>> 0000000000000000 0000000000000000 0000000000000080 0000000000000000
>>Call Trace:
>> <IRQ>  [<ffffffff8008dba0>] rebalance_tick+0x183/0x3cb
>> [<ffffffff8009829f>] update_process_times+0x68/0x78
>> [<ffffffff80077bc3>] smp_local_timer_interrupt+0x2f/0x66
>> [<ffffffff800781ff>] smp_apic_timer_interrupt+0x41/0x47
>> [<ffffffff80057018>] mwait_idle+0x0/0x4a
>> [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
>> <EOI>  [<ffffffff8005704e>] mwait_idle+0x36/0x4a
>> [<ffffffff80049206>] cpu_idle+0x95/0xb8
>> [<ffffffff8007796b>] start_secondary+0x498/0x4a7
>>
>>
>>Code: 48 f7 f6 49 c1 ee 07 83 7d cc 00 74 1c 48 8b 55 d0 4c 89 a5
>>RIP  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
>> RSP <ffff81102805fdb8>
>> <0>Kernel panic - not syncing: Fatal exception
>>divide error: 0000 [2] SMP
>>last sysfs file:
>>CPU 5
>>Modules linked in:
>>Pid: 0, comm: swapper Not tainted 2.6.18-194.17.1.el5_lustre.1.8.7 #1
>>RIP: 0010:[<ffffffff8008bb03>]  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
>>RSP: 0018:ffff81083616bdb8  EFLAGS: 00010046
>>RAX: 0000000000004000 RBX: 00000000000000ff RCX: 0000000000000000
>>RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000c0
>>RBP: ffff81083616bea8 R08: 0000000000000006 R09: 000000000000003a
>>R10: ffff810836279e08 R11: 0000000000000048 R12: ffff810836279e00
>>R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000004000
>>FS:  0000000000000000(0000) GS:ffff81183611a2c0(0000) knlGS:0000000000000000
>>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>>CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
>>Process swapper (pid: 0, threadinfo ffff81010e9d0000, task ffff811c36155080)
>>Stack:  0000000000000000 ffff81083616bee8 ffff81083616bf10 0000000000000000
>> ffff81083616bf08 0000000500000000 ffff81000102fc60 0000000000000000
>> 0000000000000000 0000000000000000 0000000000000080 0000000000000000
>>Call Trace:
>> <IRQ>  [<ffffffff8008dba0>] rebalance_tick+0x183/0x3cb
>> [<ffffffff8009829f>] update_process_times+0x68/0x78
>> [<ffffffff80077bc3>] smp_local_timer_interrupt+0x2f/0x66
>> [<ffffffff800781ff>] smp_apic_timer_interrupt+0x41/0x47
>> [<ffffffff80057018>] mwait_idle+0x0/0x4a
>> [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
>> <EOI>  [<ffffffff8005704e>] mwait_idle+0x36/0x4a
>> [<ffffffff80049206>] cpu_idle+0x95/0xb8
>> [<ffffffff8007796b>] start_secondary+0x498/0x4a7
>>
>>
>>Code: 48 f7 f6 49 c1 ee 07 83 7d cc 00 74 1c 48 8b 55 d0 4c 89 a5
>>RIP  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
>> RSP <ffff81083616bdb8>
>> <0>Kernel panic - not syncing: Fatal exception
>> <0>divide error: 0000 [3] SMP
>>last sysfs file:
>>CPU 8
>>Modules linked in:
>>Pid: 0, comm: swapper Not tainted 2.6.18-194.17.1.el5_lustre.1.8.7 #1
>>RIP: 0010:[<ffffffff8008bb03>]  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
>>RSP: 0018:ffff811036173db8  EFLAGS: 00010002
>>RAX: 0000000000000000 RBX: 00000000000000ff RCX: 0000000000000000
>>RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000c0
>>RBP: ffff811036173ea8 R08: 0000000000000012 R09: 000000000000002e
>>R10: ffff8104363251c8 R11: 0000000000000048 R12: ffff8104363251c0
>>R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>FS:  0000000000000000(0000) GS:ffff810436286940(0000) knlGS:0000000000000000
>>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>>CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
>>Process swapper (pid: 0, threadinfo ffff81083618e000, task ffff810436229820)
>>Stack:  0000000000000000 ffff811036173ee8 ffff811036173f10 0000000000000000
>> ffff811036173f08 0000000800000000 ffff8104360b38e0 0000000000000000
>> ffff810436325180 0000000000000000 0000000000000000 0000000000000000
>>Call Trace:
>> <IRQ>  [<ffffffff8008dba0>] rebalance_tick+0x183/0x3cb
>> [<ffffffff8009829f>] update_process_times+0x68/0x78
>> [<ffffffff80077bc3>] smp_local_timer_interrupt+0x2f/0x66
>> [<ffffffff800781ff>] smp_apic_timer_interrupt+0x41/0x47
>> [<ffffffff80057018>] mwait_idle+0x0/0x4a
>> [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
>> <EOI>  [<ffffffff8005704e>] mwait_idle+0x36/0x4a
>> [<ffffffff80049206>] cpu_idle+0x95/0xb8
>> [<ffffffff8007796b>] start_secondary+0x498/0x4a7