[Lustre-discuss] 1.6.5.1 OSS crashes

Robin Humble rjh+lustre at cita.utoronto.ca
Fri Jul 18 02:52:31 PDT 2008


Hi,

I'm seeing coordinated OSS crashes with Lustre 1.6.5.1.

our RHEL4 OSSes have been stable for months with these kernels:
  kernel-lustre-smp-2.6.9-67.0.4.EL_lustre.1.6.4.3
  kernel-lustre-smp-2.6.9-55.0.9.EL_lustre.1.6.4.2

but they have crashed hard twice, about 10 hours apart, as soon as we
started using this kernel:
  kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.1

the weird thing is that, as near as I can tell, all three OSSes crashed
at exactly the same time on both occasions! we couldn't even ping them,
so it was a pretty solid crash.

any ideas?
I can't see anything similar in bugzilla.

no logs got out of the nodes (we only use remote syslog) except for the
Oops below, which came from one node at the time of the first crash.

thanks for any help!

cheers,
robin

Jul 17 00:10:29 x17 kernel: ----------- [cut here ] --------- [please bite here ] --------- 
Jul 17 00:10:29 x17 kernel: Kernel BUG at spinlock:76 
Jul 17 00:10:29 x17 kernel: invalid operand: 0000 [1] SMP  
Jul 17 00:10:29 x17 kernel: CPU 2  
Jul 17 00:10:29 x17 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) jbd(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) raid5(U) xor(U) rdma_ucm(U) qlgc_vnic(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) iw_cxgb3(U) cxgb3(U) ib_ipath(U) mlx4_ib(U) mlx4_core(U) dm_mod(U) button(U) battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) hw_random(U) ib_mthca(U) ib_ipoib(U) md5(U) ipv6(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) sd_mod(U) qla2300(U) qla2xxx(U) scsi_transport_fc(U) ahci(U) ata_piix(U) libata(U) scsi_mod(U) nfs(U) nfs_acl(U) lockd(U) sunrpc(U) e1000(U) 
Jul 17 00:10:29 x17 kernel: Pid: 0, comm: swapper Not tainted 2.6.9-67.0.7.EL_lustre.1.6.5.1smp 
Jul 17 00:10:29 x17 kernel: RIP: 0010:[<ffffffff8030e97d>] <ffffffff8030e97d>{_spin_unlock_irqrestore+27} 
Jul 17 00:10:29 x17 kernel: RSP: 0018:000001009fa03ee0  EFLAGS: 00010002 
Jul 17 00:10:29 x17 kernel: RAX: 0000000000000001 RBX: 0000010198205680 RCX: 00001d6067022f90 
Jul 17 00:10:29 x17 kernel: RDX: 00000000045ba580 RSI: 0000000000000246 RDI: 0000010254bae9c0 
Jul 17 00:10:29 x17 kernel: RBP: 0000000000000000 R08: 0000000000000246 R09: 0000000000000000 
Jul 17 00:10:30 x17 kernel: R10: ffffffffa009e88c R11: ffffffffa0147cc2 R12: 0000000000000012 
Jul 17 00:10:30 x17 kernel: R13: 0000000000000001 R14: 0000000000000012 R15: 0000000000000000 
Jul 17 00:10:30 x17 kernel: FS:  0000000000000000(0000) GS:ffffffff8048e800(0000) knlGS:0000000000000000 
Jul 17 00:10:30 x17 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b 
Jul 17 00:10:30 x17 kernel: CR2: 0000002a9556c000 CR3: 000000009fb6e000 CR4: 00000000000006e0 
Jul 17 00:10:30 x17 kernel: Process swapper (pid: 0, threadinfo 000001009fb6c000, task 000001009f9de800) 
Jul 17 00:10:30 x17 kernel: Stack: ffffffffa0147f1d 0000000000002002 0000010198205680 000000000000000a  
Jul 17 00:10:30 x17 kernel:        0000000000000002 000001009fb6de98 ffffffffa009ed57 000001009fa03f18  
Jul 17 00:10:30 x17 kernel:        000001009fa03f18 000001009f5e2fc0  
Jul 17 00:10:30 x17 kernel: Call Trace:<IRQ> <ffffffffa0147f1d>{:sd_mod:sd_rw_intr+603} <ffffffffa009ed57>{:scsi_mod:scsi_softirq+213}  
Jul 17 00:10:30 x17 kernel:        <ffffffff8013c1e8>{__do_softirq+88} <ffffffff8013c291>{do_softirq+49}  
Jul 17 00:10:30 x17 kernel:        <ffffffff801130e3>{do_IRQ+328} <ffffffff801107d1>{ret_from_intr+0}  
Jul 17 00:10:30 x17 kernel:         <EOI> <ffffffff8010e80c>{mwait_idle+86} <ffffffff8010e79c>{cpu_idle+26}  
Jul 17 00:10:30 x17 kernel:         
Jul 17 00:10:30 x17 kernel:  
Jul 17 00:10:30 x17 kernel: Code: 0f 0b 37 87 32 80 ff ff ff ff 4c 00 c7 07 01 00 00 00 56 9d  
Jul 17 00:10:30 x17 kernel: RIP <ffffffff8030e97d>{_spin_unlock_irqrestore+27} RSP <000001009fa03ee0> 
Jul 17 00:10:30 x17 kernel:  <0>Kernel panic - not syncing: Oops 
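
For anyone comparing this trace against Oopses from other nodes, here is a minimal sketch (not part of the original report) that pulls the call-trace frames out of remote-syslog lines in the format shown above. The names parse_frames, FRAME_RE and SAMPLE are hypothetical, as is the "kernel" label for frames with no module prefix. The regular expression assumes the 2.6.9-era x86_64 frame layout seen in this log, and other kernels print frames differently.

```python
import re

# Matches frames such as <ffffffffa0147f1d>{:sd_mod:sd_rw_intr+603}
# or <ffffffff8013c1e8>{__do_softirq+88}. The ":module:" prefix is optional.
FRAME_RE = re.compile(r"<([0-9a-f]{16})>\{(?::(\w+):)?(\w+)\+(\d+)\}")


def parse_frames(lines):
    """Return (address, module, symbol, offset) tuples for the call trace.

    Collection starts at the line containing "Call Trace:" and stops at
    the "Code:" line, so the RIP lines are not counted as frames.
    """
    frames = []
    in_trace = False
    for line in lines:
        if "Call Trace:" in line:
            in_trace = True
        elif in_trace and "Code:" in line:
            break
        if in_trace:
            for addr, module, symbol, offset in FRAME_RE.findall(line):
                # Frames without a module prefix belong to the core kernel;
                # "kernel" is just an illustrative label for them.
                frames.append((addr, module or "kernel", symbol, int(offset)))
    return frames


# Three lines copied from the Oops above.
SAMPLE = [
    "Jul 17 00:10:30 x17 kernel: Call Trace:<IRQ> <ffffffffa0147f1d>{:sd_mod:sd_rw_intr+603} <ffffffffa009ed57>{:scsi_mod:scsi_softirq+213}  ",
    "Jul 17 00:10:30 x17 kernel:        <ffffffff8013c1e8>{__do_softirq+88} <ffffffff8013c291>{do_softirq+49}  ",
    "Jul 17 00:10:30 x17 kernel: Code: 0f 0b 37 87 32 80 ff ff ff ff 4c 00 c7 07 01 00 00 00 56 9d  ",
]

for addr, module, symbol, offset in parse_frames(SAMPLE):
    print(f"{module:10s} {symbol}+{offset}  [{addr}]")
```

Running the same function over the logs from each OSS would make it easy to check whether the nodes died in the same code path, assuming the other nodes manage to get an Oops out at all.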



