[Lustre-discuss] MDS crashing after upgrade to 1.6.7
Nirmal Seenu
nirmal at fnal.gov
Fri Mar 20 13:52:35 PDT 2009
I have been having the same problem on one of the clusters where I upgraded
the Lustre servers from 1.6.6 to 1.6.7. In this cluster all the worker
nodes/clients run the Lustre 1.6.6 kernel:
2.6.18-92.1.10.el5_lustre.1.6.6smp.
I don't get any hardware errors on the Lustre servers, but I get pretty
much the same errors in the call traces in /var/log/kernel on both the MDT
and OST servers.
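As an aside, a quick way to see how often the watchdog is firing and which threads are affected is to grep the kernel log for the soft-lockup signature. This is my own helper sketch, not something from the Lustre tools; the log path (/var/log/kernel here) and message format are assumptions based on the trace below:

```shell
# Hypothetical helper: summarize soft-lockup events per thread name from a
# kernel log file. Message format assumed to match the trace in this post.
count_soft_lockups() {
    grep -o 'soft lockup - CPU#[0-9]* stuck for [0-9]*s! \[[^]]*\]' "$1" |
        sed 's/.*\[\([^]:]*\):.*/\1/' |
        sort | uniq -c | sort -rn
}

# Usage: count_soft_lockups /var/log/kernel
```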
This was the last call trace that I got on my MDT:
Mar 20 13:05:40 lustre3 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ldlm_cn_02:4302]
Mar 20 13:05:40 lustre3 kernel: CPU 1:
Mar 20 13:05:40 lustre3 kernel: Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) sunrpc(U) vfat(U) fat(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) i2c_nforce2(U) sr_mod(U) shpchp(U) forcedeth(U) i2c_core(U) serio_raw(U) cdrom(U) pcspkr(U) sg(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) qla2xxx(U) scsi_transport_fc(U) sata_nv(U) pata_acpi(U) libata(U) sd_mod(U) scsi_mod(U) raid1(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Mar 20 13:05:40 lustre3 kernel: Pid: 4302, comm: ldlm_cn_02 Tainted: G 2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Mar 20 13:05:40 lustre3 kernel: RIP: 0010:[<ffffffff885eb3a9>] [<ffffffff885eb3a9>] :obdclass:class_handle2object+0xe9/0x160
Mar 20 13:05:40 lustre3 kernel: RSP: 0018:ffff81020720bb90 EFLAGS: 00000216
Mar 20 13:05:40 lustre3 kernel: RAX: ffffc200009be818 RBX: bbdfa43ac476d656 RCX: ffff81000100e8e0
Mar 20 13:05:40 lustre3 kernel: RDX: ffff81026da41a00 RSI: 0000000000000000 RDI: bbdfa43ac476d656
Mar 20 13:05:40 lustre3 kernel: RBP: ffffffff88686959 R08: ffff81025ca06200 R09: 5a5a5a5a5a5a5a5a
Mar 20 13:05:40 lustre3 kernel: R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff81020720bbf0
Mar 20 13:05:40 lustre3 kernel: R13: ffffffff8866e61f R14: ffffc200040f6220 R15: ffff81025ca06200
Mar 20 13:05:40 lustre3 kernel: FS: 00002ac881c2d220(0000) GS:ffff810107799440(0000) knlGS:0000000000000000
Mar 20 13:05:40 lustre3 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 20 13:05:40 lustre3 kernel: CR2: 00002acecca1d000 CR3: 000000041fe59000 CR4: 00000000000006e0
Mar 20 13:05:40 lustre3 kernel:
Mar 20 13:05:40 lustre3 kernel: Call Trace:
Mar 20 13:05:40 lustre3 kernel: [<ffffffff88656526>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
I am planning on upgrading the worker nodes and clients to the latest
Red Hat kernel, 2.6.18-128.1.1.el5, and installing the patch-less clients
to see if that fixes the problem.
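Before and after that upgrade I'll want to record what each node is actually running. A sketch of the check I have in mind (my own, not from any Lustre tooling; note that /proc/fs/lustre/version only exists while the Lustre modules are loaded):

```shell
# Hypothetical pre-upgrade check: print the running kernel and, if the Lustre
# modules are loaded, the Lustre version reported by /proc.
check_versions() {
    echo "kernel: $(uname -r)"
    if [ -r /proc/fs/lustre/version ]; then
        head -1 /proc/fs/lustre/version
    else
        echo "lustre: modules not loaded"
    fi
}

# Usage: run check_versions on each server and client and diff the results.
```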
I don't have this problem on the other cluster, where the servers are
running Lustre 1.6.7 and the clients are running a 2.6.21 kernel.org
kernel as patch-less clients.
Nirmal