[Lustre-discuss] lustre + nfs

Oleg Drokin Oleg.Drokin at Sun.COM
Fri Jan 11 13:31:04 PST 2008


Hello!

On Jan 11, 2008, at 1:06 PM, Aaron Knister wrote:

> I'm running lustre 1.6.4.1 with patches 14006 14007 and 14008 applied.
> They all relate to nfs. On the system serving the nfs mounts I
> frequently see this --
> LustreError: 11-0: an error occurred while communicating with
> 192.168.64.70@o2ib. The mds_getattr_lock operation failed with -13
> LustreError: 28508:0:(llite_nfs.c:243:ll_get_parent()) failure -13
> inode 22878046 get parent

So during getparent we got EACCES, which is really strange. It would be
interesting if you could isolate this to some specific use case too.
Could it be that permissions on some directories are being made more
restrictive while some nfs clients are working inside those
subdirectories?
On second thought, that should not be an issue because NFS always
operates as root, right? No, wrong! It actually operates as the
specified user, so could it be that somebody changes permissions to
something more restrictive on parts of the exported lustre tree from
time to time, for example?
You might be able to look up inode 22878046 and check what permissions
are set on it.
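
One way to do that lookup from a client is to walk the mounted tree and
compare inode numbers. Below is a minimal sketch, not a tested Lustre
tool: the mount point /mnt/lustre and the helper names find_by_inode and
describe are made up for illustration, and it assumes the inode number
in the log is the same one the client reports in st_ino. The root
directory itself is not checked, only the entries below it.

```python
import os
import stat

def find_by_inode(root, inode):
    """Yield (path, lstat result) for each entry under root with this inode number."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if st.st_ino == inode:
                yield path, st

def describe(path, st):
    """Format mode bits and ownership roughly the way ls -l does."""
    return "%s %s uid=%d gid=%d" % (
        stat.filemode(st.st_mode), path, st.st_uid, st.st_gid)

if __name__ == "__main__":
    # /mnt/lustre is a hypothetical mount point; adjust to the real one.
    for path, st in find_by_inode("/mnt/lustre", 22878046):
        print(describe(path, st))
```

On a large tree this walk is slow, so it is better run against the
subtree that the nfs clients actually use.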

> And periodically (every 8 hours or so) the server crashes under load.
> The following error is found on the MDS and OSSs. --
> LustreError: 138-a: data-MDT0000: A client on nid 192.168.64.49@o2ib
> was evicted due to a lock blocking callback to 192.168.64.49@o2ib
> timed out: rc -107
> Lustre: MGS: haven't heard from client
> c0197fd1-42b6-5517-49f4-43470769cc6d (at 192.168.64.49@o2ib) in 238
> seconds. I think it's dead, and I am evicting it.

This is not a server crash message; it just tells us that this client
(which is itself the nfs server, I guess) is presumed dead because it
no longer replies to us.
Are there any messages from that dying node? Can we see them, please?
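
For reference when reading such logs: the numeric return codes are
negated errno values, so -13 is EACCES and, on Linux, -107 is ENOTCONN.
A small sketch for decoding them follows. The helper name decode_rc is
made up for illustration, and the name printed for 107 depends on the
platform the script runs on.

```python
import errno
import os

def decode_rc(rc):
    """Map a negated errno value from a log line to (symbolic name, description)."""
    code = abs(rc)
    name = errno.errorcode.get(code, "unknown")  # table is platform specific
    return name, os.strerror(code)

if __name__ == "__main__":
    # The two return codes quoted in this thread.
    for rc in (-13, -107):
        print(rc, *decode_rc(rc))
```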

Other important patches you need for nfs: 14360 to avoid lockups,
13371 for a general i/o speedup (not strictly necessary; nothing
crashes without it), and 14379 to avoid assertions on overly long
cancel lists.

Bye,
     Oleg



