[Lustre-discuss] Nodes claim error with files, then say everything is fine.

Brian J. Murrell Brian.Murrell at Sun.COM
Wed Aug 6 09:17:33 PDT 2008


On Wed, 2008-08-06 at 09:29 -0600, Chris Worley wrote:
> On Wed, Aug 6, 2008 at 9:15 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> >
> > So, now what does the MDS serving lfs-MDT0000 say about this?  Why did
> > it evict?  What version of Lustre is this?  Perhaps you said so already
> > and I have just forgotten.
> 
> 1.6.5.1 clients w/ 1.6.4.3 OSS's.
> 
> The MDS is very verbose.  I get these all the time, even prior to the error:
> 
> Lustre: lfs-OST0000: haven't heard from client 12f00621-096c-b331-8774-abfc72dfd822
> (at 36.102.36.15@o2ib) in 92 seconds. I think it's dead, and I am evicting it.

Yup.  If you can correlate those kinds of messages (they have the client
IP address in them) with the errors on the client, you have your eviction
events.

I notice that you are getting messages out of dmesg rather than syslog.
Syslog makes correlation easier and more definite due to the time
stamps.
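
As a rough sketch of that correlation step, something like the following
could pull the eviction messages out of a server's syslog and group them by
client IP, so they can be lined up against the client-side errors.  It
assumes the usual syslog layout (timestamp, host and "kernel:" ahead of
"Lustre:"); the function name find_evictions and the sample hostname are
made up for illustration and are not part of Lustre.

```python
import re
from collections import defaultdict

# Pattern for the server-side eviction message quoted in this thread.
# The NID may appear as "ip@o2ib" in a real log, or as "ip at o2ib"
# where a list archive has obfuscated the "@".
EVICT_RE = re.compile(
    r"(?P<target>\S+): haven't heard from client (?P<uuid>[0-9a-f-]+) "
    r"\(at (?P<ip>\d+\.\d+\.\d+\.\d+)(?:@| at )(?P<net>\w+)\) "
    r"in (?P<secs>\d+) seconds"
)


def find_evictions(lines):
    """Return {client_ip: [(syslog_prefix, target, seconds_silent), ...]}.

    Hypothetical helper: syslog_prefix is whatever precedes "Lustre:"
    on the line (normally timestamp, host and "kernel:"), or "" when
    the line came from dmesg and has no such prefix.
    """
    by_client = defaultdict(list)
    for line in lines:
        m = EVICT_RE.search(line)
        if m is None:
            continue
        pos = line.find("Lustre:")
        prefix = line[:pos].strip() if pos != -1 else ""
        by_client[m["ip"]].append((prefix, m["target"], int(m["secs"])))
    return dict(by_client)


if __name__ == "__main__":
    # Sample line: the timestamp and the host "mds1" are invented for the demo.
    sample = [
        "Aug  6 09:10:01 mds1 kernel: Lustre: lfs-OST0000: haven't heard "
        "from client 12f00621-096c-b331-8774-abfc72dfd822 "
        "(at 36.102.36.15@o2ib) in 92 seconds. I think it's dead, "
        "and I am evicting it.",
    ]
    for ip, events in find_evictions(sample).items():
        for prefix, target, secs in events:
            print(ip, prefix, target, secs)
```

Fed from dmesg instead of syslog, the same function still groups by client
but returns an empty prefix, which is exactly the lost time stamp mentioned
above.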

But this kind of eviction is simply due to clients that are unresponsive
from the POV of the MDS.  They are neither making filesystem RPCs nor
"ping"ing (sending keepalives), so the MDS assumes they have died and
evicts them.  That lets it reclaim any locks they could be holding, so
that a dead client does not hold up other, living clients.

So you need to investigate why the clients are dying or appear to be
dead (i.e. going silent) to the MDS.

b.
