[Lustre-discuss] Client directory entry caching

Daire Byrne daire.byrne at gmail.com
Wed Aug 4 11:04:06 PDT 2010


Oleg,

On Tue, Aug 3, 2010 at 6:50 PM, Oleg Drokin <oleg.drokin at oracle.com> wrote:
>> Yea that works - cheers. FYI some comparisons with a simple find on a
>> remote client (~33,000 files):
>>
>>  find /mnt/lustre (not cached) = 41 secs
>>  find /mnt/lustre (cached) = 19 secs
>>  find /mnt/lustre (opencache) = 3 secs
>
> Hm, initially I was going to say that find is not open-intensive so it should
> not benefit from opencache at all.
> But then I realized if you have a lot of dirs, then indeed there would be a
> positive impact on subsequent reruns.
> I assume that the opencache result is a second run and the first run produces
> the same 41 seconds?

Actually, I assumed it would, but I guess there must be some repeat
opens because the 1st run with opencache is already faster. I have
also set lnet.debug=0 to show the difference that makes:

  find /mnt/lustre (1st run) = 35 secs
  find /mnt/lustre (2nd run) = 17 secs
  find /mnt/lustre (1st run opencache) = 23 secs
  find /mnt/lustre (2nd run opencache) = 0.65 secs
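For what it's worth, to keep the "1st run" numbers honest I drop the
client-side caches between runs. A rough sketch, assuming a 1.8-era
client with the standard proc tunables (exact parameter names may
differ between versions):

```shell
# Flush cached dentries/inodes/pages on the client (needs root).
sync
echo 3 > /proc/sys/vm/drop_caches

# Also drop the client's cached Lustre DLM locks, so revalidation
# really goes back to the MDS instead of being satisfied locally.
lctl set_param ldlm.namespaces.*.lru_size=clear
```

Without the lock flush the "not cached" run can still be partly
served from the client's lock cache, which skews the comparison.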

Having lnet.debug=0 does make a difference - probably even more
noticeable over millions of dirs/files. BTW, I think having lots of
dirs on a 100TB+ filesystem is going to be a common workload for most
non-lab Lustre users, so having a few clients able to cache opens for
doing read-only scans of the filesystem would be a good performance win.
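For reference, this is how I clear the debug mask - on my client it is
just the lnet.debug sysctl, though lctl exposes the same thing as a
parameter (treat the exact spelling as version-dependent):

```shell
# Disable Lustre debug logging on the client; formatting debug
# messages shows up as CPU overhead on metadata-heavy scans.
sysctl -w lnet.debug=0
# or equivalently, via lctl:
lctl set_param debug=0
```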

BTW, all my testing so far has been on the LAN, but I'm also thinking
ahead to how all this metadata RPC traffic will behave over our >200ms
link to our office in Asia.

> BTW, another unintended side-effect you might experience if you have mixed
> opencache enabled/disabled network is if you run something (or open for write)
> on an opencache-enabled client, you might have problems writing (or executing)
> that file from non-opencache enabled nodes as long as the file handle
> would remain cached on the client. This is because if open lock was not requested,
> we don't try to invalidate current ones (expensive) and MDS would think
> the file is genuinely open for write/execution and disallow conflicting accesses
> with EBUSY.

Right - worth remembering.

>> performance when compared to something simpler like NFS. Slightly off
>> topic (and I've kinda asked this before) but is there a good reason
>> why link() speeds in Lustre are so slow compared to something like NFS?
>> A quick comparison of doing a "cp -al" from a remote Lustre client and
>> an NFS client (to a fast NFS server):
>
> Hm, this is the first complaint about this that I have heard.
> I just looked into strace of cp -fal (which I guess you meant instead of just -fa that
> would just copy everything).
>
> so we traverse the tree down creating a dir structure in parallel first (or just doing it in readdir order)
>
> open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3
> +1 RPC
>
> SNIP!
>
> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x00\x00\x00", 4, 0) = 0
> +1 RPC
>
> So I guess there is a certain number of stat RPCs that would not be present on NFS
> due to different ways the caching works, plus all the getxattrs. Not sure if this
> is enough to explain 4x rate difference.

Thanks for breaking down the strace for me - it is interesting to see
where the RPCs are coming from. Andreas suggested getting timing info
from the strace of "cp -al" which I've done for Lustre and a NFS
server (bluearc) while hardlinking some kernel source directories. I
added up the time (seconds) that all the syscalls in a run took with
opencache and the lustre 1.8.4 client:

syscall    lustre    nfs
------------------------
stat           7s  0.01s
lstat         36s     7s
link          29s    16s
getxattr       5s  0.29s
setxattr      30s  0.25s
open           1s     2s
mkdir          6s     3s
lchown        11s     2s
futimesat     11s     2s
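In case anyone wants to reproduce these totals: the per-syscall times
can be pulled out of an ordinary strace log with awk, something like
the sketch below (trace filename and source dirs are placeholders):

```shell
# Record the copy, timing each syscall (-T) and following forks (-f).
strace -f -T -o cp.trace cp -al /mnt/lustre/linux-src /mnt/lustre/linux-copy

# -T appends the elapsed time as <seconds> to every line, so split
# on '<'/'>' and sum field 2 for each syscall of interest.
for sc in stat lstat link getxattr setxattr open mkdir lchown futimesat; do
    awk -F'[<>]' -v sc="$sc" \
        '$0 ~ (" " sc "\\(") { t += $2 } END { printf "%-10s %.2fs\n", sc, t }' cp.trace
done
```

(`strace -c` gives a similar summary directly, but summing the -T
times yourself lets you separate parent/child or per-directory phases.)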

It doesn't quite explain the 4:1 speed difference, but the (l)stat
heavy "cp -al" is consistently that much faster on NFS. Is the NFS
server so much faster for get/setxattr because it returns EOPNOTSUPP
for setxattr? Could we do something similar on the Lustre client if we
don't care about extended attributes? The link() times are still
almost twice as slow on Lustre, though - that may be related to a
slowish (test) MDT disk. Like Andreas said, I don't understand why
there is a setxattr RPC when we didn't get any data back from getxattr,
but that is probably more down to "cp" than Lustre?

I also did a quick test using "rsync" to do the hardlinking (which
notably doesn't use get/setxattr) and now the difference in speed is
more like 2:1. The way rsync launches a couple of processes
(sender/receiver) probably helps parallelise things better for Lustre.
In this case the overall link() time is similar, but the overall
lstat() time is 3x slower for Lustre (67s vs 22s). The lstat() calls
should all be served from the client cache, yet each one is
consistently about 3x slower than on NFS for some reason.
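For reference, one way to get rsync to build a "cp -al"-style hardlink
tree is --link-dest - a sketch with placeholder paths (I'm not claiming
this is exactly the invocation used above):

```shell
# Build dst/ as a tree of hardlinks back into src/: with --link-dest,
# rsync hardlinks each unchanged file against the reference copy
# instead of writing a new one, much like "cp -al".
rsync -a --link-dest=/mnt/lustre/src/ /mnt/lustre/src/ /mnt/lustre/dst/
```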

Cheers,

Daire


