[Lustre-discuss] Client directory entry caching

Kevin Van Maren Kevin.Van.Maren at oracle.com
Tue Aug 3 11:45:07 PDT 2010


Since Bug 22492 hit a lot of people, it sounds like opencache isn't
generally useful unless it is enabled on every node. Is there an easy way
to force files out of the cache (i.e., the Lustre equivalent of
echo 3 > /proc/sys/vm/drop_caches)?
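
For instance, would something along these lines be the Lustre
equivalent? (Just a sketch; I am not sure the parameter paths are the
same on every release.)

  # drop the normal VM page/dentry/inode caches
  echo 3 > /proc/sys/vm/drop_caches
  # then drop the client's cached Lustre DLM locks, which should push
  # cached opens/attributes out as well
  lctl set_param ldlm.namespaces.*.lru_size=clear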

Kevin


On Aug 3, 2010, at 11:50 AM, Oleg Drokin <oleg.drokin at oracle.com> wrote:

> Hello!
>
> On Aug 3, 2010, at 12:49 PM, Daire Byrne wrote:
>>>> So even with the metadata going over NFS the opencache in the client
>>>> seems to make quite a difference (I'm not sure how much the NFS client
>>>> caches though). As expected I see no mdt activity for the NFS export
>>>> once cached. I think it would be really nice to be able to enable the
>>>> opencache on any lustre client. A couple of potential workloads that I
>>> A simple workaround for you to enable opencache on a specific client
>>> would be to add cr_flags |= MDS_OPEN_LOCK; in
>>> mdc/mdc_lib.c:mds_pack_open_flags()
>> Yea that works - cheers. FYI some comparisons with a simple find on a
>> remote client (~33,000 files):
>>
>> find /mnt/lustre (not cached) = 41 secs
>> find /mnt/lustre (cached) = 19 secs
>> find /mnt/lustre (opencache) = 3 secs
>
> Hm, initially I was going to say that find is not open-intensive, so it
> should not benefit from opencache at all.
> But then I realized that if you have a lot of dirs, there would indeed be
> a positive impact on subsequent reruns.
> I assume the opencache result is from a second run, and that the first
> run produces the same 41 seconds?
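>
> If you want to confirm that the open locks really are being cached, you
> could watch the client's lock count across runs. A sketch (the exact
> parameter names vary a bit between releases):
>
>   lctl get_param ldlm.namespaces.*mdc*.lock_count
>
> With opencache enabled you should see many more locks still held on the
> client after the find completes.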
>
> BTW, there is another unintended side effect you might experience if you
> have a mixed network, with opencache enabled on some clients and disabled
> on others: if you run something (or open a file for write) on an
> opencache-enabled client, you might have problems writing to (or
> executing) that file from non-opencache-enabled nodes for as long as the
> file handle remains cached on the first client. This is because when no
> open lock was requested, we don't try to invalidate the current ones
> (expensive), so the MDS thinks the file is genuinely open for
> write/execution and disallows conflicting accesses with EBUSY.
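>
> To illustrate with a hypothetical two-client transcript (the file name is
> made up; assume client A has opencache enabled and client B does not):
>
>   clientA$ /mnt/lustre/bin/sometool     # the open handle stays cached
>                                         # here even after the tool exits
>   clientB$ echo fix >> /mnt/lustre/bin/sometool
>   bash: /mnt/lustre/bin/sometool: Device or resource busy
>
> The write from client B keeps failing with EBUSY until client A drops
> the cached handle.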
>
>> performance when compared to something simpler like NFS. Slightly off
>> topic (and I've kinda asked this before), but is there a good reason
>> why link() speeds in Lustre are so slow compared to something like NFS?
>> A quick comparison of doing a "cp -al" from a remote Lustre client and
>> an NFS client (to a fast NFS server):
>>
>> cp -fa /mnt/lustre/blah /mnt/lustre/blah2 = ~362 files/sec
>> cp -fa /mnt/nfs/blah /mnt/nfs/blah2 = ~1863 files/sec
>>
>> Is it just the extra depth of the lustre stack/code path? Is there
>> anything we could do to speed this up if we know that no other client
>> will touch these dirs while we hardlink them?
>
> Hm, this is the first complaint about this that I have heard.
> I just looked at an strace of cp -fal (which I guess is what you meant,
> rather than just -fa, which would copy everything).
>
> So first we traverse the tree downward, creating the parallel directory
> structure as we go (in readdir order):
>
> open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3
> +1 RPC
>
> fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> +1 RPC (if no opencache)
>
> fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
> getdents(3, /* 4 entries */, 4096)      = 96
> getdents(3, /* 0 entries */, 4096)      = 0
> +1 RPC
>
> close(3)                                = 0
> +1 RPC (if no opencache)
>
> lstat("/mnt/lustre/a/b/c/d/e/f/g", {st_mode=S_IFDIR|0755,  
> st_size=4096, ...}) = 0
> (should be cached, so no RPC)
>
> mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0
> +1 RPC
>
> lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755,  
> st_size=4096, ...}) = 0
> +1 RPC
>
> stat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755,  
> st_size=4096, ...}) = 0
> (should be cached, so no RPC)
>
> Then we get to files:
> link("/mnt/lustre/a/b/c/d/e/f/g/k/8", "/mnt/lustre/blah2/b/c/d/e/f/g/k/8") = 0
> +1 RPC
>
> futimesat(AT_FDCWD, "/mnt/lustre/blah2/b/c/d/e/f/g/k", {{1280856246, 0}, {1280856291, 0}}) = 0
> +1 RPC
>
> Then we start traversing the just-created tree back up and chowning it:
> chown("/mnt/lustre/blah2/b/c/d/e/f/g/k", 0, 0) = 0
> +1 RPC
>
> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_access", 0x7fff519f0950, 132) = -1 ENODATA (No data available)
> +1 RPC
>
> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755,  
> st_size=4096, ...}) = 0
> (not sure why there is another stat here; we already did it on the way
> down. Should be cached)
>
> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_access", "\x02\x00\x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff\xff \x00\x05\x00\xff\xff\xff\xff", 28, 0) = 0
> +1 RPC
>
> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_default", 0x7fff519f0950, 132) = -1 ENODATA (No data available)
> +1 RPC
>
> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755,  
> st_size=4096, ...}) =
> 0
> Hm, stat again? did not we do it a few syscalls back?
>
> stat("/mnt/lustre/blah2/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755,  
> st_size=4096, ...
> }) = 0
> stat of the target. +1 RPC (the cache got invalidated by link above).
>
> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x00\x00\x00", 4, 0) = 0
> +1 RPC
>
>
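> Adding up the +1s above: without opencache that is about 6 RPCs per
> directory on the way down (4 with opencache, since the fstat and close
> stay local), 1 RPC for each hardlinked file, plus the futimesat and
> another 6 per directory for the chown/ACL pass -- roughly a dozen RPCs
> per directory and one per file.
>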
> So I guess there is a certain number of stat RPCs that would not be
> present on NFS, due to the different ways the caching works, plus all the
> getxattrs. I am not sure that is enough to explain a ~5x rate difference,
> though.
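>
> By the way, an easy way to see the whole syscall mix for yourself is to
> let strace count it (the destination name here is just an example):
>
>   strace -f -c cp -fal /mnt/lustre/blah /mnt/lustre/blah3
>
> The -c summary at the end maps fairly directly onto the RPC counting
> above.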
>
> Also, you can try disabling debug (if you have not already) to see how
> big an impact that makes. Debug used to affect metadata loads a lot,
> though with the recent debug-level adjustments I think it has improved
> somewhat.
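>
> Something like the following on the client should do it; note the
> current mask first so you can restore it afterwards:
>
>   lctl get_param debug        # record the current debug mask
>   lctl set_param debug=0      # disable debugging for the test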
>
> Bye,
>    Oleg