[Lustre-discuss] Client directory entry caching

Oleg Drokin oleg.drokin at oracle.com
Tue Aug 3 10:50:14 PDT 2010


Hello!

On Aug 3, 2010, at 12:49 PM, Daire Byrne wrote:
>>> So even with the metadata going over NFS the opencache in the client
>>> seems to make quite a difference (I'm not sure how much the NFS client
>>> caches though). As expected I see no mdt activity for the NFS export
>>> once cached. I think it would be really nice to be able to enable the
>>> opencache on any lustre client. A couple of potential workloads that I
>> A simple workaround for you to enable opencache on a specific client would
>> be to add cr_flags |= MDS_OPEN_LOCK; in mdc/mdc_lib.c:mds_pack_open_flags()
> Yea that works - cheers. FYI some comparisons with a simple find on a
> remote client (~33,000 files):
> 
>  find /mnt/lustre (not cached) = 41 secs
>  find /mnt/lustre (cached) = 19 secs
>  find /mnt/lustre (opencache) = 3 secs

Hm, initially I was going to say that find is not open-intensive, so it should
not benefit from opencache at all.
But then I realized that if you have a lot of directories, there would indeed
be a positive impact on subsequent reruns.
I assume the opencache result is from a second run, and the first run still
takes the same 41 seconds?
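
For reference, the hack is literally a one-line change; as a sketch (the
exact context in mdc_lib.c differs between versions):

--- a/lustre/mdc/mdc_lib.c
+++ b/lustre/mdc/mdc_lib.c
@@ mds_pack_open_flags() @@
+	cr_flags |= MDS_OPEN_LOCK;	/* always request an open lock */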

BTW, another unintended side effect you might experience on a network with a
mix of opencache-enabled and opencache-disabled clients: if you execute
something (or open it for write) on an opencache-enabled client, you might
have problems writing to (or executing) that file from the non-opencache
nodes for as long as the file handle remains cached on the first client.
This is because if the open lock was not requested, we do not try to
invalidate existing ones (that is expensive), so the MDS thinks the file is
genuinely open for write/execution and disallows conflicting accesses
with EBUSY.
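
In MDS pseudo-code the refusal amounts to something like this (an
illustrative sketch only, with made-up helper names; this is not the actual
server source):

    /* Sketch: a conflicting open arrives for a file that some client
     * still holds open for write without ever taking an open lock.
     * There is nothing to revoke, so the MDS simply refuses: */
    if (file_open_for_write(fid) && !writer_holds_open_lock(fid))
            return -EBUSY;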

> performance when compared to something simpler like NFS. Slightly off
> topic (and I've kinda asked this before) but is there a good reason
> why link() speeds in Lustre are so slow compared to something like NFS?
> A quick comparison of doing a "cp -al" from a remote Lustre client and
> an NFS client (to a fast NFS server):
> 
>  cp -fa /mnt/lustre/blah /mnt/lustre/blah2 = ~362 files/sec
>  cp -fa /mnt/nfs/blah /mnt/nfs/blah2 = ~1863 files/sec
> 
> Is it just the extra depth of the lustre stack/code path? Is there
> anything we could do to speed this up if we know that no other client
> will touch these dirs while we hardlink them?

Hm, this is the first complaint about this that I have heard.
I just looked at an strace of cp -fal (which I guess is what you meant, since
plain -fa would just copy everything).

So first we traverse the tree downward, creating the parallel directory structure as we go (in readdir order):

open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3
+1 RPC

fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
+1 RPC (if no opencache)

fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
getdents(3, /* 4 entries */, 4096)      = 96
getdents(3, /* 0 entries */, 4096)      = 0
+1 RPC

close(3)                                = 0
+1 RPC (if no opencache)

lstat("/mnt/lustre/a/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
(should be cached, so no RPC)

mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0
+1 RPC

lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
+1 RPC

stat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
(should be cached, so no RPC)

Then we get to files:
link("/mnt/lustre/a/b/c/d/e/f/g/k/8", "/mnt/lustre/blah2/b/c/d/e/f/g/k/8") = 0
+1 RPC

futimesat(AT_FDCWD, "/mnt/lustre/blah2/b/c/d/e/f/g/k", {{1280856246, 0}, {1280856291, 0}}) = 0
+1 RPC

then we start traversing the just created tree up and chowning it:
chown("/mnt/lustre/blah2/b/c/d/e/f/g/k", 0, 0) = 0
+1 RPC 

getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_access", 0x7fff519f0950, 132) = -1 ENODATA (No data available)
+1 RPC

stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
(not sure why there is another stat here; we already did it on the way down, so it should be cached)

setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_access", "\x02\x00
\x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff\xff \x00\x0
5\x00\xff\xff\xff\xff", 28, 0) = 0
+1 RPC

getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_default", 0x7fff519f09
50, 132) = -1 ENODATA (No data available)
+1 RPC

stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) =
 0
Hm, stat again? Did we not do it just a few syscalls back?

stat("/mnt/lustre/blah2/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...
}) = 0
Stat of the target: +1 RPC (the cache was invalidated by the link above).

setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x0
0\x00\x00", 4, 0) = 0
+1 RPC
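
Adding up the annotations above: without opencache that is roughly a dozen
RPCs per directory created (two fewer with opencache, since the directory
open and close become local), but only a single RPC per hardlinked file, the
link() itself.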


So I guess there are a certain number of stat RPCs that would not be present
on NFS due to the different ways the caching works, plus all the getxattrs.
I am not sure that is enough to explain a ~5x rate difference, though.

Also, you can try disabling debug (if you have not already) to see how big an
impact that makes. Debugging used to affect metadata loads a lot, though with
the recent adjustments to the default debug levels I think it has somewhat
improved.
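E.g. something like "lctl set_param debug=0" (or echo 0 >
/proc/sys/lnet/debug) should switch debugging off entirely.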

Bye,
    Oleg

