[Lustre-discuss] soft lockups on NFS server/Lustre client

Frederik Ferner frederik.ferner at diamond.ac.uk
Mon Oct 12 09:06:28 PDT 2009


Hi List,

On our NFS server, which exports our Lustre file system to a number of 
NFS clients, we've recently started to see "kernel: BUG: soft lockup" 
messages. As the locked processes include nfsd, our users are obviously 
not happy.

Around the time when the soft lockup occurs we also see a lot of 
"kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags()" 
messages, but I don't know if this is related.

We are using Lustre 1.6.6 on all machines (MDS, OSS, clients). The NFS 
server/Lustre client with the lockups is running RHEL5.4 with an 
unpatched Red Hat kernel (kernel-2.6.18-92.1.10.el5) with the Lustre 
modules from Sun.

See below for sample logs from the Lustre client/NFS server. I can 
provide more logs if required.

I'm not sure if this is a Lustre issue, but I would appreciate it if 
someone could help. We've not seen it on any other NFS server so far, and 
there seem to be at least some Lustre-related functions in the stack trace.

Is this a known issue, and how can we avoid it? I have not found 
anything using Google or the search on bugzilla.lustre.org. At least 
the BUG warning seems to be a known issue on this kernel.

I hope the logs below are readable enough. I tried to find entries where 
the stack traces don't overlap, but this seems to be the best I can find.

Oct  9 15:21:27 cs04r-sc-serv-07 kernel: BUG: warning at 
fs/inotify.c:181/set_dentry_child_flags() (Tainted: G     )
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:
Oct  9 15:21:27 cs04r-sc-serv-07 kernel: Call Trace:
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed7d1>] 
set_dentry_child_flags+0xef/0x14d
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed867>] 
remove_watch_no_event+0x38/0x47
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed88e>] 
inotify_remove_watch_locked+0x18/0x3b
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed97c>] 
inotify_rm_wd+0x7e/0xa1
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ede6e>] 
sys_inotify_rm_watch+0x46/0x63
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff8005d28d>] 
tracesys+0xd5/0xe0
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:
Oct  9 15:21:27 cs04r-sc-serv-07 kernel: BUG: warning at 
fs/inotify.c:181/set_dentry_child_flags() (Tainted: G     )
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:
Oct  9 15:21:27 cs04r-sc-serv-07 kernel: Call Trace:
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed7d1>] 
set_dentry_child_flags+0xef/0x14d
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed867>] 
remove_watch_no_event+0x38/0x47
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed88e>] 
inotify_remove_watch_locked+0x18/0x3b
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed97c>] 
inotify_rm_wd+0x7e/0xa1
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ede6e>] 
sys_inotify_rm_watch+0x46/0x63
Oct  9 15:21:27 cs04r-sc-serv-07 kernel: BUG: soft lockup - CPU#5 stuck 
for 10s! [nfsd:22221]
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: CPU 5:
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: Modules linked in: vfat fat 
usb_storage dell_rbu mptctl ipmi_devintf ipmi_si ipmi_msghandler nfs 
fscache nfsd exportfs lockd nfs_acl auth_rpcgss autofs4 hidp mgc(U) 
lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) 
obdclass(U) lnet(U) lvfs(U) libcfs(U) rfcomm l2cap bluetooth sunrpc ipv6 
xfrm_nalgo crypto_api mlx4_en(U) dm_multipath video sbs backlight i2c_ec 
i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp 
parport joydev sr_mod cdrom mlx4_core(U) bnx2 serio_raw pcspkr sg 
dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata shpchp mptsas 
mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd 
ohci_hcd ehci_hcd
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: Pid: 22221, comm: nfsd Tainted: 
G      2.6.18-92.1.10.el5 #1
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RIP: 0010:[<ffffffff80064ba7>] 
  [<ffffffff80064ba7>] .text.lock.spinlock+0x5/0x30
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RSP: 0018:ffff810044241ac8 
EFLAGS: 00000286
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RAX: ffff81006cb6a1a8 RBX: 
ffff81006cb6a178 RCX: ffff810044241b50
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RDX: 0000000000000000 RSI: 
ffff810044241c90 RDI: ffffffff803c7480
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RBP: ffff81005d609e90 R08: 
0000000000000001 R09: ffff810044241b50
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: R10: ffffffff887cf72a R11: 
00000000000189ef R12: 000000a800000000
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: R13: ffff810044241c90 R14: 
0000000000000000 R15: ffffffff8001d54c
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: FS:  00002b637558e6e0(0000) 
GS:ffff810037c0c540(0000) knlGS:0000000000000000
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: CS:  0010 DS: 0000 ES: 0000 
CR0: 000000008005003b
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: CR2: 00002b473a3a4000 CR3: 
000000006934d000 CR4: 00000000000006e0
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: Call Trace:
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8004fc47>] 
d_find_alias+0x1c/0x38
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff800e271e>] 
d_alloc_anon+0xc/0xf8
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff88703338>] 
:lustre:ll_iget_for_nfs+0x608/0x7e0
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887c2366>] 
:exportfs:find_exported_dentry+0x43/0x47b
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cf72a>] 
:nfsd:nfsd_acceptable+0x0/0xd8
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887d35ff>] 
:nfsd:exp_get_by_name+0x5b/0x71
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887d3bee>] 
:nfsd:exp_find_key+0x89/0x9c
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8002e28d>] 
__wake_up+0x38/0x4f
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8009884c>] 
set_current_groups+0x159/0x164
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887c27e9>] 
:exportfs:export_decode_fh+0x4b/0x52
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cfaa4>] 
:nfsd:fh_verify+0x2a2/0x4c6
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8008a788>] 
__activate_task+0x27/0x39
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887d00ca>] 
:nfsd:nfsd_access+0x29/0xfc
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887d76d0>] 
:nfsd:nfsd3_proc_access+0xa4/0xb0
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cd1db>] 
:nfsd:nfsd_dispatch+0xd8/0x1d6
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8839a4fb>] 
:sunrpc:svc_process+0x454/0x71b
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff800645ec>] 
__down_read+0x12/0x92
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cd5a1>] 
:nfsd:nfsd+0x0/0x2cb
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cd746>] 
:nfsd:nfsd+0x1a5/0x2cb
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8005dfb1>] 
child_rip+0xa/0x11
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cd5a1>] 
:nfsd:nfsd+0x0/0x2cb
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cd5a1>] 
:nfsd:nfsd+0x0/0x2cb
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8005dfa7>] 
child_rip+0x0/0x11
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8005d28d>] 
tracesys+0xd5/0xe0
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:
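
In case it helps with reading an excerpt like the one above, here is a 
minimal Python sketch that rejoins the lines wrapped by the mail client 
and collects the function names of the stack frames under the most recent 
"BUG:" header. The function names (unwrap, split_events) and the regular 
expressions are assumptions based only on the log format shown here, not 
part of any Lustre or kernel tooling. Because the traces overlap, frames 
from one trace can still end up under another trace's header.

```python
import re

# Syslog prefix as it appears in the excerpt, e.g.
# "Oct  9 15:21:27 cs04r-sc-serv-07 kernel:" (format assumed from the logs above)
PREFIX = re.compile(r"^[A-Z][a-z]{2}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+kernel:\s?")
# One stack frame, e.g. "[<ffffffff8004fc47>] d_find_alias+0x1c/0x38"
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\S+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def unwrap(lines):
    """Rejoin wrapped lines and return the kernel messages without prefix."""
    messages = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = PREFIX.match(line)
        if match:
            # A new syslog record starts here.
            messages.append(line[match.end():])
        elif messages and line.strip():
            # No prefix: treat as the wrapped tail of the previous record.
            messages[-1] = messages[-1].rstrip() + " " + line.strip()
    return messages


def split_events(messages):
    """Group frame function names under the most recent 'BUG:' header."""
    events = []
    for msg in messages:
        text = msg.strip()
        if text.startswith("BUG:"):
            events.append({"header": text, "frames": []})
        elif events:
            frame = FRAME.match(text)
            if frame:
                events[-1]["frames"].append(frame.group(1))
    return events
```

With the log saved to a file, split_events(unwrap(open(path))) gives one 
entry per "BUG:" line, which makes it easier to see which traces involve 
:lustre: functions such as ll_iget_for_nfs.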


Kind regards,
Frederik
-- 
Frederik Ferner
Computer Systems Administrator		phone: +44 1235 77 8624
Diamond Light Source Ltd.		mob:   +44 7917 08 5110


