[Lustre-discuss] Inode errors at time of job failure

Thomas Roth t.roth at gsi.de
Fri Aug 7 05:26:30 PDT 2009


Hi Oleg,

thanks for your reply. I'm not able to reproduce this error at will,
though. Our users report missing files, but I couldn't correlate these
reports with the ll_inode_revalidate_fini errors, at least not
directly. In fact, some of the missing files reappeared later, as
reported in bug 16377, while others are gone for good.
In comment #29 of bug 16377, Brian Murrell stated that this can be caused
by on-disk corruption. A file system check on the MDT claimed to correct
a large number of problems during our last downtime a month ago.
(The disappearance of files wasn't correlated with this fsck  ;-)).
So I'm still not reassured about the health of this MDT.
We are running Lustre 1.6.7.2 on the servers; most of the clients are
still on 1.6.5.1.
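
To make the correlation attempt a bit more systematic, the error messages
can be tallied per hour and per host and then compared against the times
users report files missing. This is only a minimal sketch: it assumes
classic syslog lines ("Mon DD HH:MM:SS host ..."), and the function name
count_revalidate_errors is made up for illustration. It matches on the
function name ll_inode_revalidate_fini only, since that is the only part
of the message mentioned here.

```shell
#!/bin/sh
# Tally ll_inode_revalidate_fini messages per hour and host.
# Assumption: syslog-style lines, i.e. "Mon DD HH:MM:SS host message...".
count_revalidate_errors() {
    # $1: path to a log file
    grep 'll_inode_revalidate_fini' "$1" |
        awk '{ split($3, t, ":"); print $1, $2, t[1] ":00", $4 }' |
        sort | uniq -c
}

# Run only when a log file is given on the command line.
if [ "$#" -gt 0 ]; then
    count_revalidate_errors "$1"
fi
```

Each output line is a count followed by month, day, hour and host. An hour
with many hits that matches a user report would be a first hint; no match
would not prove much, since the clients may log the error without any file
going missing.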

Regards,
Thomas

Oleg Drokin wrote:
> Hello!
> 
> On Aug 6, 2009, at 12:57 PM, Thomas Roth wrote:
> 
>> Hi,
>> these ll_inode_revalidate_fini errors are unfortunately quite known to
>> us.
>> So what would you guess if that happens again and again, on a number of
>> clients - MDT softly dying away?
> 
> No, I do not think this is an MDT problem of any sort at present, more
> like some strange client interaction.
> Are there any negative side effects in your case aside from log clutter?
> Jobs failing or anything like that?
> 
>> Because we haven't seen any mass evictions (and no reasons for that) in
>> connection with these errors.
>> Or could the problem with the cached open files also be present if the
>> communication interruption does not show up as an eviction in the logs?
> 
> It has nothing to do with opened files if there are no evictions.
> I checked in bugzilla and found bug 16377 which looks like this report
> too. Though the logs in there are somewhat confusing.
> It almost appears as if the failing dentry is reported as a mountpoint
> by vfs, but then it is not, since the following inode_revalidate call
> ends up on lustre again.
> Do you have "lookup on mtpt" sort of errors coming from namei.c?
> If you can reproduce the problem with ls or another tool at will,
> can you please execute this on a client (comment #17 in the bug 16377):
> # script
> Script started, file is typescript
> # lctl clear
> # echo -1 > /proc/sys/lnet/debug
> [ reproduce problem ]
> # lctl dk > /tmp/ls.debug
> # exit
> Script done, file is typescript
> 
> and attach your resulting ls.debug in the bug?
> 
> Also what lustre version are you using?
> 
> Bye,
>     Oleg
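
The capture steps quoted above (lctl clear, full debug mask, reproduce,
lctl dk) could be wrapped in a small helper so they are ready the next time
the problem shows up. This is a sketch under assumptions, not a tested
procedure: the function name capture_lustre_debug and the variables LCTL and
DEBUG_MASK are invented here, and restoring the previous debug mask
afterwards is an addition that assumes /proc/sys/lnet/debug accepts the same
text it returns when read.

```shell
#!/bin/sh
# Wrapper around the capture steps from comment #17 of bug 16377.
# LCTL and DEBUG_MASK are illustrative overrides; the defaults are the
# command and path quoted in the mail.
LCTL=${LCTL:-lctl}
DEBUG_MASK=${DEBUG_MASK:-/proc/sys/lnet/debug}

capture_lustre_debug() {
    # $1: output file; remaining arguments: command that triggers the problem
    out=$1
    shift
    old_mask=$(cat "$DEBUG_MASK") || return 1
    "$LCTL" clear                      # empty the debug buffer
    echo -1 > "$DEBUG_MASK"            # enable all debug flags
    "$@"                               # reproduce the problem
    status=$?
    "$LCTL" dk > "$out"                # dump the debug log to the output file
    echo "$old_mask" > "$DEBUG_MASK"   # restore the old mask (our addition)
    return "$status"
}
```

Run as root on a client, a call might look like
"capture_lustre_debug /tmp/ls.debug ls -l /lustre/some/dir", where the
directory is a placeholder for wherever the error appears.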



