[Lustre-discuss] trying to BRW to non-existent file xyz

Tue Nov 27 01:57:29 PST 2007

Hi Bernd,

> this message just happened on a rather fresh customer system
> and is rather annoying, since it fills the logs...

Your comment that "it fills the logs" suggests to me that the client is retrying the operation indefinitely, similar to the description in CFS bug 11211, comment 8 (https://bugzilla.clusterfs.com/show_bug.cgi?id=11211#c8). At that time Andreas Dilger said "the fact that it is retrying on ENOENT is wrong... should probably only happen for -ETIMEDOUT and -EIO and not other errors", but I don't see any record of this ever being changed. Bug 11211 records the fix for a memory leak associated with the message, but no change to the looping behavior.
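
To make the suggested policy concrete, here is a minimal sketch of a retry loop that only retries on -ETIMEDOUT and -EIO and gives up immediately on anything else, such as -ENOENT. This is not Lustre code: the names should_retry and run_with_retries, the callable interface, and the attempt limit are all assumptions made for illustration.

```python
import errno

# Assumption: only these errors are treated as transient, following the
# suggestion quoted above (-ETIMEDOUT and -EIO).
RETRYABLE_ERRNOS = frozenset({errno.ETIMEDOUT, errno.EIO})


def should_retry(rc):
    """Return True if a negative errno-style return code is worth retrying."""
    return rc < 0 and -rc in RETRYABLE_ERRNOS


def run_with_retries(operation, max_attempts=5):
    """Call operation() until it succeeds, fails permanently, or attempts run out.

    operation is a hypothetical callable returning 0 on success or a
    negative errno value on failure. Returns (rc, attempts_made).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rc = 0
    for attempt in range(1, max_attempts + 1):
        rc = operation()
        # Stop on success, or on an error that retrying cannot fix.
        if rc == 0 or not should_retry(rc):
            return rc, attempt
    return rc, max_attempts
```

With a policy like this, a write to an object that no longer exists fails once with -ENOENT instead of being resent forever, and even transient errors are bounded by the attempt limit.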

> After some time there are evictions

In Lustre versions that do not have the fix for bug 11211 (pre-1.4.9, according to Bugzilla), the client retry loop will quite quickly consume all of the memory on the OSS node (about 40 minutes on a server with 2GB RAM in our experience), and the server will go down. Later Lustre versions do not leak memory, so the server will stay up, but the looping client will place a considerable load on it. I would not be surprised if this is the cause of your evictions.

Joe.

-----Original Message-----
From: Bernd Schubert [mailto:bs at q-leap.de]
Sent: 26 November 2007 18:01
To: Oleg Drokin; lustre-discuss at clusterfs.com
Subject: Re: [Lustre-discuss] trying to BRW to non-existent file xyz

On Monday 26 November 2007 18:33:02 you wrote:
> Hello!
>
> On Nov 26, 2007, at 12:08 PM, Bernd Schubert wrote:
> > when an OST reports "trying to BRW to non-existent file xyz", how can I
> > find out which file the inode xyz belongs to?
>
> Usually there is none. Can you tell us more about situations where you
> see this?
> Were there any evictions?

I think so. This message just happened on a rather fresh customer system and
is rather annoying, since it fills the logs...
I can reproduce this rather soon here; I just need to run fsstress for some
hours. After some time there are evictions.

> One common scenario for this kind of error is this:
> Client opens a file. File gets unlinked. Client is evicted from the MDS,
> the MDS notices it held the last reference to the file and issues a
> destroy request for the file objects (effectively removing the file
> objects on the OSTs).
> Now if the client continues to access the file (because it was not
> evicted from the OST or because it reconnected), you will get these errors.
>
> You can do e.g. lfs find /mountpoint -v for your fs (which I guess would
> take quite a while if it's big) and then grep the output for the
> interesting objectid (just pay attention that the ost index should also
> match).
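
A plain grep for the object id can match the same id on a different OST, so the search suggested above can be sketched as a small script that checks both values. The layout parsed here is an assumption and should be checked against what your Lustre version actually prints: a path line starting with "/", followed by stripe rows of the form "obdidx objid objid(hex) group". The function name and the sample listing are made up for illustration.

```python
def find_files_for_object(listing, ost_index, object_id):
    """Return paths whose stripe rows contain (ost_index, object_id).

    Assumed layout (verify against your Lustre version): a path line
    starting with "/", then rows of "obdidx objid objid_hex group".
    """
    matches = []
    current_path = None
    for line in listing.splitlines():
        stripped = line.strip()
        if stripped.startswith("/"):
            current_path = stripped
            continue
        fields = stripped.split()
        # Stripe rows start with two decimal integers: OST index, object id.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            if int(fields[0]) == ost_index and int(fields[1]) == object_id:
                if current_path is not None and current_path not in matches:
                    matches.append(current_path)
    return matches


# Made-up sample in the assumed layout, not real output.
SAMPLE = """\
/mnt/lustre/a
\tobdidx\t objid\t objid\t group
\t     0\t  1234\t 0x4d2\t 0
/mnt/lustre/b
\tobdidx\t objid\t objid\t group
\t     1\t  1234\t 0x4d2\t 0
"""
```

In the sample, both files have an object with id 1234, but on different OSTs, which is why the OST index has to be part of the match. If the function returns an empty list for the object named in the log message, that is consistent with the "usually there is none" case described above.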

Thanks, I will try it overnight; I don't want to disturb the people there now.

Thanks a lot,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH