[Lustre-discuss] Client directory entry caching

Oleg Drokin oleg.drokin at oracle.com
Tue Aug 3 11:50:06 PDT 2010


Well, you can drop all the locks on a given FS, which would in effect drop all of the metadata caches but leave
the data caches intact.

echo clear >/proc/fs/lustre/ldlm/namespaces/your_MDC_namespace/lru_size
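
If you would rather hit every metadata namespace at once instead of spelling out the MDC namespace name, a little helper along these lines would do the equivalent (untested sketch; it just walks the same /proc directory shown above and assumes the usual "-mdc-" substring in client MDC namespace names):

/* clear_mdc_locks.c - sketch: write "clear" to the lru_size of every
 * namespace under /proc/fs/lustre/ldlm/namespaces whose name contains
 * "-mdc-", i.e. drop the client's MDS locks and with them the metadata
 * caches, leaving data caches alone. */
#include <stdio.h>
#include <string.h>
#include <dirent.h>

int main(void)
{
        const char *base = "/proc/fs/lustre/ldlm/namespaces";
        struct dirent *d;
        char path[4096];
        FILE *f;
        DIR *dir = opendir(base);

        if (dir == NULL) {
                perror(base);
                return 1;
        }
        while ((d = readdir(dir)) != NULL) {
                if (strstr(d->d_name, "-mdc-") == NULL)
                        continue;
                snprintf(path, sizeof(path), "%s/%s/lru_size", base, d->d_name);
                f = fopen(path, "w");
                if (f == NULL) {
                        perror(path);
                        continue;
                }
                fprintf(f, "clear\n");
                fclose(f);
                printf("cleared %s\n", path);
        }
        closedir(dir);
        return 0;
}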

On Aug 3, 2010, at 2:45 PM, Kevin Van Maren wrote:

> Since Bug 22492 hit a lot of people, it sounds like opencache isn't generally useful unless it is enabled on every node. Is there an easy way to force files out of the cache (i.e., something like echo 3 > /proc/sys/vm/drop_caches)?
> 
> Kevin
> 
> 
> On Aug 3, 2010, at 11:50 AM, Oleg Drokin <oleg.drokin at oracle.com> wrote:
> 
>> Hello!
>> 
>> On Aug 3, 2010, at 12:49 PM, Daire Byrne wrote:
>>>>> So even with the metadata going over NFS the opencache in the client
>>>>> seems to make quite a difference (I'm not sure how much the NFS client
>>>>> caches though). As expected I see no mdt activity for the NFS export
>>>>> once cached. I think it would be really nice to be able to enable the
>>>>> opencache on any lustre client. A couple of potential workloads that I
>>>> A simple workaround for you to enable opencache on a specific client would
>>>> be to add cr_flags |= MDS_OPEN_LOCK; in mdc/mdc_lib.c:mds_pack_open_flags()
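>>>> In the 1.8 tree that looks roughly like the sketch below (the surrounding function
>>>> body is paraphrased from memory and may not match your source exactly; the added
>>>> line is the only one that matters):
>>>>
>>>> /* lustre/mdc/mdc_lib.c -- sketch, not a verbatim diff */
>>>> static __u32 mds_pack_open_flags(__u32 flags)
>>>> {
>>>>         __u32 cr_flags = 0;
>>>>
>>>>         /* ... existing translation of the VFS open flags ... */
>>>>
>>>>         /* always request an open lock from the MDS, i.e. force opencache on: */
>>>>         cr_flags |= MDS_OPEN_LOCK;
>>>>
>>>>         return cr_flags;
>>>> }
>>>>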
>>> Yea that works - cheers. FYI some comparisons with a simple find on a
>>> remote client (~33,000 files):
>>> 
>>> find /mnt/lustre (not cached) = 41 secs
>>> find /mnt/lustre (cached) = 19 secs
>>> find /mnt/lustre (opencache) = 3 secs
>> 
>> Hm, initially I was going to say that find is not open-intensive, so it should
>> not benefit from opencache at all.
>> But then I realized that if you have a lot of dirs, there would indeed be a
>> positive impact on subsequent reruns.
>> I assume the opencache result is from a second run, and that the first run still
>> takes the same 41 seconds?
>> 
>> BTW, another unintended side-effect you might see with a mix of opencache-enabled
>> and opencache-disabled clients: if you run something (or open it for write)
>> on an opencache-enabled client, you might have problems writing (or executing)
>> that file from the non-opencache-enabled nodes for as long as the file handle
>> remains cached on the first client. This is because if the open lock was not requested,
>> we don't try to invalidate the existing ones (which is expensive), so the MDS thinks
>> the file is genuinely open for write/execution and disallows conflicting accesses
>> with EBUSY.
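>>
>> A trivial way to check whether a node is hitting that case is to attempt the
>> conflicting open yourself and look for EBUSY. Illustrative sketch only; the
>> helper name is made up and the file is whatever you opened on the
>> opencache-enabled client:
>>
>> /* open_check.c - sketch: try to open a file for write and report the errno.
>>  * EBUSY here would mean the MDS still considers the file open elsewhere. */
>> #include <stdio.h>
>> #include <errno.h>
>> #include <fcntl.h>
>> #include <string.h>
>> #include <unistd.h>
>>
>> int main(int argc, char **argv)
>> {
>>         int fd;
>>
>>         if (argc < 2) {
>>                 fprintf(stderr, "usage: %s <file>\n", argv[0]);
>>                 return 2;
>>         }
>>         fd = open(argv[1], O_WRONLY);
>>         if (fd < 0) {
>>                 int err = errno;
>>                 fprintf(stderr, "open(%s) for write: %s\n", argv[1], strerror(err));
>>                 return err == EBUSY ? 1 : 2;
>>         }
>>         printf("open for write succeeded\n");
>>         close(fd);
>>         return 0;
>> }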
>> 
>>> performance when compared to something simpler like NFS. Slightly off
>>> topic (and I've kinda asked this before) but is there a good reason
>>> why link() speeds in Lustre are so slow compared to something like NFS?
>>> A quick comparison of doing a "cp -al" from a remote Lustre client and
>>> an NFS client (to a fast NFS server):
>>> 
>>> cp -fa /mnt/lustre/blah /mnt/lustre/blah2 = ~362 files/sec
>>> cp -fa /mnt/nfs/blah /mnt/nfs/blah2 = ~1863 files/sec
>>> 
>>> Is it just the extra depth of the lustre stack/code path? Is there
>>> anything we could do to speed this up if we know that no other client
>>> will touch these dirs while we hardlink them?
>> 
>> Hm, this is the first complaint about this that I have heard.
>> I just looked at an strace of cp -fal (which I guess you meant instead of just -fa, which
>> would simply copy everything).
>> 
>> So first we traverse the tree downward, creating the parallel dir structure as we go (or just doing it in readdir order):
>> 
>> open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3
>> +1 RPC
>> 
>> fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>> +1 RPC (if no opencache)
>> 
>> fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
>> getdents(3, /* 4 entries */, 4096)      = 96
>> getdents(3, /* 0 entries */, 4096)      = 0
>> +1 RPC
>> 
>> close(3)                                = 0
>> +1 RPC (if no opencache)
>> 
>> lstat("/mnt/lustre/a/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>> (should be cached, so no RPC)
>> 
>> mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0
>> +1 RPC
>> 
>> lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>> +1 RPC
>> 
>> stat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>> (should be cached, so no RPC)
>> 
>> Then we get to files:
>> link("/mnt/lustre/a/b/c/d/e/f/g/k/8", "/mnt/lustre/blah2/b/c/d/e/f/g/k/8") = 0
>> +1 RPC
>> 
>> futimesat(AT_FDCWD, "/mnt/lustre/blah2/b/c/d/e/f/g/k", {{1280856246, 0}, {1280856291, 0}}) = 0
>> +1 RPC
>> 
>> Then we start traversing the just-created tree back up and chowning it:
>> chown("/mnt/lustre/blah2/b/c/d/e/f/g/k", 0, 0) = 0
>> +1 RPC 
>> 
>> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_access", 0x7fff519f0950, 132) = -1 ENODATA (No data available)
>> +1 RPC
>> 
>> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>> (not sure why another stat here, we already did it on the way down. Should be cached)
>> 
>> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_access", "\x02\x00\x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff\xff \x00\x05\x00\xff\xff\xff\xff", 28, 0) = 0
>> +1 RPC
>> 
>> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_default", 0x7fff519f0950, 132) = -1 ENODATA (No data available)
>> +1 RPC
>> 
>> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>> Hm, stat again? Didn't we do it a few syscalls back?
>> 
>> stat("/mnt/lustre/blah2/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>> stat of the target. +1 RPC (the cache got invalidated by link above).
>> 
>> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x00\x00\x00", 4, 0) = 0
>> +1 RPC
>> 
>> 
>> So I guess there are a certain number of stat RPCs that would not be present on NFS
>> due to the different ways the caching works, plus all the getxattrs. I am not sure if that
>> is enough to explain a 4x rate difference.
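>>
>> One way to see where the RPCs actually go would be to snapshot the mdc request
>> counters around the copy, e.g. with a little helper like the sketch below
>> (the helper name is made up, and it assumes the usual
>> /proc/fs/lustre/mdc/<device>/stats file on the client):
>>
>> /* mdc_rpc_diff.c - sketch: dump one mdc stats file, run a command, dump the
>>  * stats again; diffing the "before" and "after" sections shows which RPC
>>  * types the command generated.
>>  * usage: ./mdc_rpc_diff /proc/fs/lustre/mdc/<device>/stats "cp -al src dst" */
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> static void dump(const char *path, const char *tag)
>> {
>>         char line[256];
>>         FILE *f = fopen(path, "r");
>>
>>         if (f == NULL) {
>>                 perror(path);
>>                 return;
>>         }
>>         printf("---- %s ----\n", tag);
>>         while (fgets(line, sizeof(line), f) != NULL)
>>                 fputs(line, stdout);
>>         fclose(f);
>> }
>>
>> int main(int argc, char **argv)
>> {
>>         if (argc < 3) {
>>                 fprintf(stderr, "usage: %s <mdc stats file> <command>\n", argv[0]);
>>                 return 1;
>>         }
>>         dump(argv[1], "before");
>>         if (system(argv[2]) != 0)
>>                 fprintf(stderr, "command did not exit cleanly\n");
>>         dump(argv[1], "after");
>>         return 0;
>> }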
>> 
>> Also, you can try disabling debug (if you have not already) to see how big an impact
>> that makes. It used to be that debug affected metadata loads a lot, though
>> with the recent debug level adjustments I think it has improved somewhat.
>> 
>> Bye,
>>    Oleg
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss



