[Lustre-discuss] MDS crashing after upgrade to 1.6.7

Nirmal Seenu nirmal at fnal.gov
Fri Mar 20 13:52:35 PDT 2009


I have been having the same problem on one of the clusters where I upgraded 
the Lustre servers from 1.6.6 to 1.6.7. In this cluster all of the worker 
nodes/clients run the Lustre 1.6.6 kernel: 
2.6.18-92.1.10.el5_lustre.1.6.6smp.

I don't get any hardware errors on the Lustre servers, but I see pretty 
much the same errors in the call traces logged in /var/log/kernel on both 
the MDT and OST servers.

This was the last call trace that I got on my MDT:
Mar 20 13:05:40 lustre3 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ldlm_cn_02:4302]
Mar 20 13:05:40 lustre3 kernel: CPU 1:
Mar 20 13:05:40 lustre3 kernel: Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) sunrpc(U) vfat(U) fat(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) i2c_nforce2(U) sr_mod(U) shpchp(U) forcedeth(U) i2c_core(U) serio_raw(U) cdrom(U) pcspkr(U) sg(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) qla2xxx(U) scsi_transport_fc(U) sata_nv(U) pata_acpi(U) libata(U) sd_mod(U) scsi_mod(U) raid1(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Mar 20 13:05:40 lustre3 kernel: Pid: 4302, comm: ldlm_cn_02 Tainted: G      2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Mar 20 13:05:40 lustre3 kernel: RIP: 0010:[<ffffffff885eb3a9>]  [<ffffffff885eb3a9>] :obdclass:class_handle2object+0xe9/0x160
Mar 20 13:05:40 lustre3 kernel: RSP: 0018:ffff81020720bb90  EFLAGS: 00000216
Mar 20 13:05:40 lustre3 kernel: RAX: ffffc200009be818 RBX: bbdfa43ac476d656 RCX: ffff81000100e8e0
Mar 20 13:05:40 lustre3 kernel: RDX: ffff81026da41a00 RSI: 0000000000000000 RDI: bbdfa43ac476d656
Mar 20 13:05:40 lustre3 kernel: RBP: ffffffff88686959 R08: ffff81025ca06200 R09: 5a5a5a5a5a5a5a5a
Mar 20 13:05:40 lustre3 kernel: R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff81020720bbf0
Mar 20 13:05:40 lustre3 kernel: R13: ffffffff8866e61f R14: ffffc200040f6220 R15: ffff81025ca06200
Mar 20 13:05:40 lustre3 kernel: FS:  00002ac881c2d220(0000) GS:ffff810107799440(0000) knlGS:0000000000000000
Mar 20 13:05:40 lustre3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 20 13:05:40 lustre3 kernel: CR2: 00002acecca1d000 CR3: 000000041fe59000 CR4: 00000000000006e0
Mar 20 13:05:40 lustre3 kernel:
Mar 20 13:05:40 lustre3 kernel: Call Trace:
Mar 20 13:05:40 lustre3 kernel:  [<ffffffff88656526>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0

I am planning to upgrade the worker nodes/clients to the latest Red Hat 
kernel, 2.6.18-128.1.1.el5, and install the patchless client to see if 
that fixes the problem, roughly along the lines sketched below.
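The per-client procedure would be something like the following (the MGS 
NID "mgs@tcp0", the filesystem name "lustre", the mount point and the RPM 
name patterns are only placeholders/approximations, not our exact values):

    # check what the node is running now
    uname -r
    rpm -qa | grep -i lustre
    cat /proc/fs/lustre/version

    # install the stock Red Hat kernel plus the matching patchless
    # lustre-client/lustre-client-modules RPMs, then reboot into it
    rpm -ivh kernel-2.6.18-128.1.1.el5.*.rpm
    rpm -ivh lustre-client*.rpm
    reboot

    # remount the filesystem with the patchless client
    mount -t lustre mgs@tcp0:/lustre /mnt/lustre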

I don't have this problem on the other cluster, where the servers run 
Lustre 1.6.7 and the clients run a 2.6.21 kernel.org kernel as patchless 
clients.

Nirmal
