[Lustre-discuss] MDS LBUG: ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed

Frederik Ferner frederik.ferner at diamond.ac.uk
Tue Aug 23 06:44:22 PDT 2011


All,

I'd like to follow up on this, as I can now repeatedly reproduce it on 
our test file system. So far it has occurred on every MDS version I've 
tried, up to and including lustre 1.8.6-wc1.

I've also reported it as LU-534 
(http://jira.whamcloud.com/browse/LU-534) and included current stack 
traces etc.

I'll repeat the basic steps to reproduce it here (a command sketch 
follows below; hostnames and paths in it are examples only):

1. Export a Lustre file system via NFS (v3) from a Lustre client.
2. Mount that export on another system over NFS.
3. Run racer on the file system over NFS.

After a few minutes (sometimes one or two hours), the MDS LBUGs with 
the ASSERTION in the subject.
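
For illustration, the setup looks roughly like this (hostnames, mount 
points and the racer location are examples only, not our actual 
configuration; adjust for your installation):

   # On the Lustre client that acts as NFS server:
   mount -t lustre mds-node@tcp:/testfs /mnt/lustre
   echo '/mnt/lustre *(rw,no_root_squash)' >> /etc/exports
   exportfs -ra

   # On a second machine, mount the export over NFSv3:
   mount -t nfs -o vers=3 nfs-server:/mnt/lustre /mnt/nfs

   # Run racer from the lustre test suite against the NFS mount
   # (location and invocation may vary with the Lustre version):
   cd /usr/lib64/lustre/tests/racer
   ./racer.sh /mnt/nfs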

If anyone has suggestions for debug flags to enable, or other ideas on 
how to track down the exact problem, I'd like to hear them.
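
For reference, the kind of thing I've been experimenting with on the 
MDS looks like the following; the flag selection is only a guess on my 
part, and parameter names may differ between 1.8.x releases:

   # Enlarge the debug buffer and widen the debug mask:
   lctl set_param debug_mb=256
   lctl set_param debug="+inode +dentry +dlmtrace +vfstrace"
   # (or enable everything: echo -1 > /proc/sys/lnet/debug)

   # After the LBUG the MDS dumps a binary debug log (e.g. the
   # /tmp/lustre-log.* file in the trace below); convert it to text:
   lctl df /tmp/lustre-log.1309949645.4037 /tmp/lustre-log.txt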

Kind regards,
Frederik

On 08/07/11 14:26, Frederik Ferner wrote:
> All,
>
> we are experiencing what looks like the same MDS LBUG with increasing
> frequency; see below for a sample stack trace. It seems to affect only
> one client at a time, and even that client recovers after some time
> (usually minutes, sometimes longer) and continues to work without
> requiring an immediate MDS reboot.
>
> In the recent past it seems to have affected one specific client more
> often than others; this client mainly acts as an NFS exporter for the
> Lustre file system. So far, all attempts to trigger the LBUG with known
> actions have been unsuccessful, as have attempts to trigger it on the
> test file system, but we are still working on this.
>
> As far as I can see, this could be bug
> https://bugzilla.lustre.org/show_bug.cgi?id=17764, but there has been
> no recent activity there, and I'm not entirely sure it is the same bug.
>
> As far as I can see, the log dumps don't contain any useful
> information, but I'm happy to provide a sample file if someone offers
> to look at it.
>
> I'm also looking for suggestions on how to go about debugging this
> problem, ideally initially with as little performance impact as
> possible, so that we can apply it on the production system until we can
> reproduce the problem on a test file system. Once we can reproduce it
> there, debugging with performance implications should be possible as
> well.
>
> The MDS and clients are currently running Lustre 1.8.3.ddn3.3 on Red Hat
> Enterprise 5.
>
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: LustreError: 4037:0:(mds_open.c:1295:mds_open()) ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed: dchild 8ad94b2:0cae8d46 (ffff8101995b0300) inode ffff81041d4e8548/145593522/212766022
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: LustreError: 4037:0:(mds_open.c:1295:mds_open()) LBUG
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: Lustre: 4037:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 4037
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: ll_mdt_49     R  running task       0  4037      1          4038  4036 (L-TLB)
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  ffff810226da0d00 ffff810247120000 0000000000000286 0000000000000082
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  0000008100001400 ffff8101db219ef8 0000000000000001 0000000000000001
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  ffff8101ead74db8 0000000000000000 ffff810423223e10 ffffffff8008aee7
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: Call Trace:
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff8008aee7>] __wake_up_common+0x3e/0x68
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff887acee8>] :ptlrpc:ptlrpc_main+0x1258/0x1420
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff8008cabd>] default_wake_function+0x0/0xe
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff800b7310>] audit_syscall_exit+0x336/0x362
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff887abc90>] :ptlrpc:ptlrpc_main+0x0/0x1420
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: LustreError: dumping log to /tmp/lustre-log.1309949645.4037
>
> Kind regards,
> Frederik


-- 
Frederik Ferner
Computer Systems Administrator		phone: +44 1235 77 8624
Diamond Light Source Ltd.		mob:   +44 7917 08 5110