[Lustre-discuss] soft lockups on NFS server/Lustre client

Robin Humble robin.humble+lustre at anu.edu.au
Sun Oct 18 21:31:48 PDT 2009


On Mon, Oct 12, 2009 at 05:06:28PM +0100, Frederik Ferner wrote:
>Hi List,
>
>on our NFS server exporting our Lustre file system to a number of NFS 
>clients, we've recently started to see "kernel: BUG: soft lockup" 
>messages. As the locked processes include nfsd, our users are obviously 
>not happy.
>
>Around the time when the soft lockup occurs we also see a lot of 
>"kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags()" 
>messages, but I don't know if this is related.

probably not related. we were seeing this too (no NFS involved at all)
  https://bugzilla.lustre.org/show_bug.cgi?id=20904
and the upshot is that I'm pretty sure it's harmless and a RHEL bug.
I filed
  https://bugzilla.redhat.com/show_bug.cgi?id=526853
but it's probably being ignored. if you have a rhel support contract
maybe you can kick it along a bit...

dunno about your soft lockups. as I understand it soft lockups
themselves aren't harmful as long as the stuck task makes progress
eventually.

Lustre 1.6.6 isn't exactly recent. have you tried 1.6.7.2 on your NFS
exporter?

presumably soft lockups could also be saying your re-exporter or OSSes
are overloaded or that you have a slow disk or 3 in a RAID... without
NFS involved are all your OSTs up to speed?

do you still get problems after
  echo 60 > /proc/sys/kernel/softlockup_thresh
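
to check whether raising the threshold changes anything, something like
the sketch below could count the lockup reports per task before and after.
it's only a rough sketch: count_soft_lockups is a made-up helper name, and
it assumes syslog writes each report on a single line (not wrapped by a
mail client like the logs quoted below).

```shell
#!/bin/sh
# count "BUG: soft lockup" reports per task name in a syslog file.
# assumes each report sits on one line, in the form
#   kernel: BUG: soft lockup - CPU#5 stuck for 10s! [nfsd:22221]
count_soft_lockups() {
    # $1: log file to scan (assumed here to be e.g. /var/log/messages)
    grep 'BUG: soft lockup' "$1" |
        # keep only the task name from the trailing [name:pid]
        sed -n 's/.*\[\([^]:]*\):[0-9]*\].*/\1/p' |
        sort | uniq -c
}
```

e.g. run "count_soft_lockups /var/log/messages" now, and again a day or so
after the echo above, and compare the nfsd counts.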

cheers,
robin

>
>We are using Lustre 1.6.6 on all machines (MDS, OSS, clients). The NFS 
>server/Lustre client with the lockups is running RHEL5.4 with an 
>unpatched RedHat kernel (kernel-2.6.18-92.1.10.el5) with the Lustre 
>modules from Sun.
>
>See below for sample logs from the Lustre client/NFS server. I can 
>provide more logs if required.
>
>I'm not sure if this is a Lustre issue but would appreciate it if someone 
>could help. We've not seen it on any other NFS server so far and there 
>seems to be at least some Lustre-related stuff in the stack trace.
>
>Is this a known issue and how can we avoid this? I have not found 
>anything using google and the search on bugzilla.lustre.org. At least 
>the BUG warning seems to be a known issue on this kernel.
>
>I hope the logs below are readable enough, I tried to find entries where 
>the stack traces don't overlap but this seems to be the best I can find.
>
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel: BUG: warning at 
>fs/inotify.c:181/set_dentry_child_flags() (Tainted: G     )
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel:
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel: Call Trace:
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed7d1>] 
>set_dentry_child_flags+0xef/0x14d
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed867>] 
>remove_watch_no_event+0x38/0x47
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed88e>] 
>inotify_remove_watch_locked+0x18/0x3b
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed97c>] 
>inotify_rm_wd+0x7e/0xa1
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ede6e>] 
>sys_inotify_rm_watch+0x46/0x63
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff8005d28d>] 
>tracesys+0xd5/0xe0
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel:
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel: BUG: warning at 
>fs/inotify.c:181/set_dentry_child_flags() (Tainted: G     )
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel:
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel: Call Trace:
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed7d1>] 
>set_dentry_child_flags+0xef/0x14d
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed867>] 
>remove_watch_no_event+0x38/0x47
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed88e>] 
>inotify_remove_watch_locked+0x18/0x3b
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed97c>] 
>inotify_rm_wd+0x7e/0xa1
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ede6e>] 
>sys_inotify_rm_watch+0x46/0x63
>Oct  9 15:21:27 cs04r-sc-serv-07 kernel: BUG: soft lockup - CPU#5 stuck 
>for 10s! [nfsd:22221]
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel: CPU 5:
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel: Modules linked in: vfat fat 
>usb_storage dell_rbu mptctl ipmi_devintf ipmi_si ipmi_msghandler nfs 
>fscache nfsd exportfs lockd nfs_acl auth_rpcgss autofs4 hidp mgc(U) 
>lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) 
>obdclass(U) lnet(U) lvfs(U) libcfs(U) rfcomm l2cap bluetooth sunrpc ipv6 
>xfrm_nalgo crypto_api mlx4_en(U) dm_multipath video sbs backlight i2c_ec 
>i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp 
>parport joydev sr_mod cdrom mlx4_core(U) bnx2 serio_raw 
>pcspkr sg dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata shpchp mptsas 
>mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd 
>ohci_hcd ehci_hcd
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel: Pid: 22221, comm: nfsd Tainted: 
>G      2.6.18-92.1.10.el5 #1
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RIP: 0010:[<ffffffff80064ba7>] 
>  [<ffffffff80064ba7>] .text.lock.spinlock+0x5/0x30
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RSP: 0018:ffff810044241ac8 
>EFLAGS: 00000286
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RAX: ffff81006cb6a1a8 RBX: 
>ffff81006cb6a178 RCX: ffff810044241b50
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RDX: 0000000000000000 RSI: 
>ffff810044241c90 RDI: ffffffff803c7480
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RBP: ffff81005d609e90 R08: 
>0000000000000001 R09: ffff810044241b50
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel: R10: ffffffff887cf72a R11: 
>00000000000189ef R12: 000000a800000000
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel: R13: ffff810044241c90 R14: 
>0000000000000000 R15: ffffffff8001d54c
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel: FS:  00002b637558e6e0(0000) 
>GS:ffff810037c0c540(0000) knlGS:0000000000000000
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel: CS:  0010 DS: 0000 ES: 0000 
>CR0: 000000008005003b
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel: CR2: 00002b473a3a4000 CR3: 
>000000006934d000 CR4: 00000000000006e0
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel: Call Trace:
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8004fc47>] 
>d_find_alias+0x1c/0x38
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff800e271e>] 
>d_alloc_anon+0xc/0xf8
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff88703338>] 
>:lustre:ll_iget_for_nfs+0x608/0x7e0
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887c2366>] 
>:exportfs:find_exported_dentry+0x43/0x47b
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cf72a>] 
>:nfsd:nfsd_acceptable+0x0/0xd8
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887d35ff>] 
>:nfsd:exp_get_by_name+0x5b/0x71
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887d3bee>] 
>:nfsd:exp_find_key+0x89/0x9c
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8002e28d>] 
>__wake_up+0x38/0x4f
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8009884c>] 
>set_current_groups+0x159/0x164
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887c27e9>] 
>:exportfs:export_decode_fh+0x4b/0x52
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cfaa4>] 
>:nfsd:fh_verify+0x2a2/0x4c6
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8008a788>] 
>__activate_task+0x27/0x39
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887d00ca>] 
>:nfsd:nfsd_access+0x29/0xfc
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887d76d0>] 
>:nfsd:nfsd3_proc_access+0xa4/0xb0
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cd1db>] 
>:nfsd:nfsd_dispatch+0xd8/0x1d6
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8839a4fb>] 
>:sunrpc:svc_process+0x454/0x71b
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff800645ec>] 
>__down_read+0x12/0x92
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cd5a1>] 
>:nfsd:nfsd+0x0/0x2cb
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cd746>] 
>:nfsd:nfsd+0x1a5/0x2cb
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8005dfb1>] 
>child_rip+0xa/0x11
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cd5a1>] 
>:nfsd:nfsd+0x0/0x2cb
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cd5a1>] 
>:nfsd:nfsd+0x0/0x2cb
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8005dfa7>] 
>child_rip+0x0/0x11
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8005d28d>] 
>tracesys+0xd5/0xe0
>Oct  9 15:21:28 cs04r-sc-serv-07 kernel:
>
>
>Kind regards,
>Frederik
>-- 
>Frederik Ferner
>Computer Systems Administrator		phone: +44 1235 77 8624
>Diamond Light Source Ltd.		mob:   +44 7917 08 5110
>(Apologies in advance for the lines below. Some bits are a legal
>requirement and I have no control over them.)


