[Lustre-discuss] Client directory entry caching

Alexey Lyashkov alexey.lyashkov at clusterstor.com
Wed Aug 4 00:57:35 PDT 2010


Bug 22492 exists because someone from Sun/Oracle disabled dentry caching instead of fixing the xattr code for 1.8<>2.0 interoperability.
They killed my dentry-revalidation patch (instead of fixing the xattr code),
so the client always sends one extra RPC to the server.
See bug 17545 for some details.



On Aug 4, 2010, at 10:41, Andreas Dilger wrote:

> On 2010-08-03, at 12:45, Kevin Van Maren wrote:
>> Since Bug 22492 hit a lot of people, it sounds like opencache isn't generally useful unless it is enabled on every node. Is there an easy way to force files out of the cache (e.g., echo 3 > /proc/sys/vm/drop_caches)?
> 
> For Lustre, you can run "lctl set_param ldlm.namespaces.*.lru_size=clear", which will drop all the DLM locks on the clients and thereby flush all pages from the cache.
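Both cache-flush approaches can be run from a client shell; a sketch (the lctl command is the one quoted above and is Lustre-specific, the drop_caches write is the generic Linux mechanism Kevin mentioned, and both need root):

```shell
# Drop all DLM locks held by this Lustre client, which flushes all
# cached pages and dentries covered by those locks:
lctl set_param ldlm.namespaces.*.lru_size=clear

# Generic Linux equivalent for the page/dentry/inode caches only;
# this does not touch the Lustre DLM locks themselves:
sync
echo 3 > /proc/sys/vm/drop_caches
```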
> 
>>> I just looked into an strace of cp -fal (which I guess you meant instead of just -fa, which would copy everything).
>>> 
>>> so we traverse the tree down, creating the dir structure in parallel first (or just doing it in readdir order):
>>> 
>>> open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3
>>> +1 RPC
>>> 
>>> fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>>> +1 RPC (if no opencache)
>>> 
>>> fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
>>> getdents(3, /* 4 entries */, 4096)      = 96
>>> getdents(3, /* 0 entries */, 4096)      = 0
>>> +1 RPC
> 
> Having large readdir RPCs would help for directories with more than about 170 entries.
> 
>>> close(3)                                = 0
>>> +1 RPC (if no opencache)
>>> 
>>> lstat("/mnt/lustre/a/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>>> (should be cached, so no RPC)
>>> 
>>> mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0
>>> +1 RPC
>>> 
>>> lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>>> +1 RPC
> 
> If we do the mkdir(), the client does not cache the entry?
> 
>>> stat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>>> (should be cached, so no RPC)
>>> 
>>> Then we get to files:
>>> link("/mnt/lustre/a/b/c/d/e/f/g/k/8", "/mnt/lustre/blah2/b/c/d/e/f/g/k/8") = 0
>>> +1 RPC
>>> 
>>> futimesat(AT_FDCWD, "/mnt/lustre/blah2/b/c/d/e/f/g/k", {{1280856246, 0}, {1280856291, 0}}) = 0
>>> +1 RPC
>>> 
>>> then we start traversing the just created tree up and chowning it:
>>> chown("/mnt/lustre/blah2/b/c/d/e/f/g/k", 0, 0) = 0
>>> +1 RPC 
>>> 
>>> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_access", 0x7fff519f0950, 132) = -1 ENODATA (No data available)
>>> +1 RPC
> 
> This is gone in 1.8.4
> 
>>> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>>> (not sure why another stat here, we already did it on the way up. Should be cached)
>>> 
>>> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_access", "\x02\x00\x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff\xff \x00\x05\x00\xff\xff\xff\xff", 28, 0) = 0
>>> +1 RPC
> 
> Strange that it is setting an ACL when it didn't read one?
> 
>>> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_default", 0x7fff519f0950, 132) = -1 ENODATA (No data available)
>>> +1 RPC
>>> 
>>> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>>> Hm, stat again? did not we do it a few syscalls back?
> 
> Gotta love those GNU file utilities.  They are very stat happy.
> 
>>> stat("/mnt/lustre/blah2/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ... }) = 0
>>> stat of the target. +1 RPC (the cache got invalidated by link above).
>>> 
>>> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x00\x00\x00", 4, 0) = 0
>>> +1 RPC
> 
> Here it is also setting an ACL even though it didn't get one from the source.
> 
>>> So I guess there are a certain number of stat RPCs that would not be present on NFS, due to the different ways the caching works, plus all the getxattr calls. Not sure if this is enough to explain the 4x rate difference.
>>> 
>>> Also, you can try disabling debug (if you have not already) to see how big an impact that makes. Debug used to affect metadata loads a lot, though with the recent adjustments to the debug levels I think it has improved somewhat.
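One quick way to see where the RPCs go is to tally the metadata syscalls in an strace log like the one walked through above. A sketch (the syscall list is my reading of the "+1 RPC" annotations in this thread, not an exact RPC accounting, and it assumes plain strace output where each line starts with the syscall name):

```shell
# rpc_candidates: count, by syscall name, the metadata operations in an
# strace log that the walkthrough above pairs with "+1 RPC" on Lustre.
# Which calls actually cost an RPC depends on opencache/statahead state.
rpc_candidates() {   # usage: rpc_candidates trace-file
    grep -oE '^(open|stat|lstat|fstat|getxattr|setxattr|link|mkdir|futimesat|chown|close|getdents)\(' "$1" \
        | sort | uniq -c | sort -rn
}
```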
> 
> It would be useful to run "strace -tttT" to get timestamps for each operation, to see which operations are slower on Lustre than on NFS.
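A helper for digging through such a trace; a sketch that relies on the trailing <seconds> duration field that strace -T appends to each line (the trace file name and the cp paths are examples, not from the thread):

```shell
# Capture per-syscall wall times on the Lustre client, e.g.:
#   strace -tttT -o cp.trace cp -al /mnt/lustre/src /mnt/lustre/dst
#
# slowest_calls: print the N slowest syscalls from an strace -T log,
# sorting on the <seconds> duration at the end of each line.
slowest_calls() {   # usage: slowest_calls trace-file [count]
    awk '{ d = $NF; gsub(/[<>]/, "", d); printf "%s %s\n", d, $0 }' "$1" \
        | sort -rn | head -n "${2:-10}"
}
```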
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss



