[lustre-devel] xattr stat mismatch in sanityn test 73

Andreas Dilger adilger at whamcloud.com
Fri Oct 15 23:33:19 PDT 2021

The simple solution is fine with me.

Is there a way to make the kernel itself more efficient in this regard?
For example, could it limit the checks to a specific directory tree rather
than the whole filesystem?  Granted, the cache avoids network RPCs, but it
would be even better if the VFS skipped these unnecessary checks.

Cheers, Andreas

On Oct 15, 2021, at 08:25, Ellis Wilson <elliswilson at microsoft.com> wrote:

Hi Andreas,

Yeah, I noticed that.  What makes it even funnier is that it makes allocation calls in statically defined chunk sizes, so the buffer isn’t even exactly the right size.  If you look at my follow-up email you’ll see that I root-caused this to auditd mechanics in the kernel; auditd was enabled on my system.

Next week I’m going to open a bug and submit a patch that makes the test more forgiving on test systems that have auditd enabled.  In short, it will assert that it saw at least two getxattr hits, rather than exactly two.  That has behaved well for me since I revised it locally.  Let me know if you see any issue with this simple solution.
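
For reference, the relaxed check could look roughly like the sketch below.  check_min_hits is a hypothetical helper name, not something that exists in test-framework.sh; in the real test the count would come from calc_stats, as in the quoted test code further down the thread.

```shell
# Hypothetical helper (not part of test-framework.sh): succeed when the
# observed getxattr_hits count is at least the expected minimum.
check_min_hits() {
	local hits=$1 min=$2
	[ "$hits" -ge "$min" ]
}

# In sanityn.sh this would stand in for the "-eq 2" comparison, roughly:
#   check_min_hits "$(calc_stats llite.*.stats getxattr_hits)" 2 ||
#       error "not cached in $DIR1"
check_min_hits 5 2 && echo "5 hits: accepted (extra hits, e.g. from auditd)"
check_min_hits 1 2 || echo "1 hit: rejected (fewer hits than getfattr needs)"
```

The lower bound still catches the case the test cares about (the xattr not being served from the cache), while tolerating additional cache hits from other kernel paths.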



From: Andreas Dilger <adilger at whamcloud.com>
Sent: Friday, October 15, 2021 2:19 AM
To: Ellis Wilson <elliswilson at microsoft.com>
Cc: lustre-devel at lists.lustre.org
Subject: [EXTERNAL] Re: [lustre-devel] xattr stat mismatch in sanityn test 73

Ellis, the getfattr code has a poor "optimization", in that it first calls getxattr(buf=NULL, size=0) to have the kernel report the xattr size to allocate a buffer of exactly the right size.  However, IMHO while this avoids a "too large" allocation (maybe a few bytes vs. 64KB), it is inefficient by doubling the number of syscalls into the filesystem.

The other getxattr cache hit during lstat() is likely because of the kernel accessing an selinux/security xattr.

On Oct 8, 2021, at 07:04, Ellis Wilson via lustre-devel <lustre-devel at lists.lustre.org> wrote:

Hi all,

I'm trying to get to the bottom of a failure I'm seeing while running the lustre unit tests, specifically here for this test in the sanityn suite:
= sanityn test 73: getxattr should not cause xattr lock cancellation =

I'm running stock lustre 2.14.0 on Ubuntu 18.04 with Linux Kernel 5.4.0.  I have six nodes total for the test: two clients, two mds, two oss.  The test is running from one of the clients.

The error I get is:
getfattr: Removing leading '/' from absolute path names
# file: mnt/lustre/f73.sanityn

ELLIS: expected 2, but got 5
sanityn test_73: @@@@@@ FAIL: not cached in /mnt/lustre
 Trace dump:
 = /usr/lib/lustre/tests/test-framework.sh:6273:error()
 = /usr/lib/lustre/tests/sanityn.sh:3557:test_73()
 = /usr/lib/lustre/tests/test-framework.sh:6581:run_one()
 = /usr/lib/lustre/tests/test-framework.sh:6628:run_one_logged()
 = /usr/lib/lustre/tests/test-framework.sh:6455:run_test()
 = /usr/lib/lustre/tests/sanityn.sh:3567:main()

I've instrumented the code to spit out the expected vs. discovered stat values.  The failure indicates the file in question wasn't cached, but in fact the inverse is occurring -- it's both cached and hit more often than expected.

The unadulterated test code follows:
3549     touch $DIR1/$tfile
3550     setfattr -n user.attr1 -v value1 $DIR1/$tfile ||
3551         error "setfattr1 failed"
3552     getfattr -n user.attr1 $DIR2/$tfile || error "getfattr1 failed"
3553     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr2 failed"
3554     clear_stats llite.*.stats
3555     # PR lock should be cached by now on both clients
3556     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr3 failed"
3557     # 2 hits for getfattr(0)+getfattr(size)
3558     [ $(calc_stats llite.*.stats getxattr_hits) -eq 2 ] ||
3559         error "not cached in $DIR1"

The failure occurs on line 3558.

Performing these actions manually confirms that the count indeed jumps by 5 (85 to 90 on the first mount), not 2:
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             85 samples [reqs]
getxattr_hits             4 samples [reqs]
~# getfattr -n user.attr1 /mnt/lustre/f73.sanityn
getfattr: Removing leading '/' from absolute path names
# file: mnt/lustre/f73.sanityn
user.attr1="value1"
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             90 samples [reqs]
getxattr_hits             4 samples [reqs]
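
The before/after arithmetic can also be done mechanically.  A minimal sketch, assuming the "name count samples [reqs]" layout shown above: sum_hits is a hypothetical helper, and the sample text is copied from the output above rather than read from a live system.

```shell
# Hypothetical helper: sum the sample counts of every getxattr_hits line
# read from stdin (one line per mounted llite instance).
sum_hits() {
	awk '$1 == "getxattr_hits" { total += $2 } END { print total + 0 }'
}

# Sample data copied from the lctl output above; on a live client this
# would be: lctl get_param llite.*.stats | sum_hits
before=$(printf '%s\n' \
	'getxattr_hits             85 samples [reqs]' \
	'getxattr_hits             4 samples [reqs]' | sum_hits)
after=$(printf '%s\n' \
	'getxattr_hits             90 samples [reqs]' \
	'getxattr_hits             4 samples [reqs]' | sum_hits)

echo "getxattr_hits delta: $((after - before))"
```

With the numbers above this reports a delta of 5 (89 before, 94 after), matching the "expected 2, but got 5" message from the instrumented test.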

I straced getfattr as run in the test and found it issues the following:
23262 lstat("/mnt/lustre/f73.sanityn", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", NULL, 0) = 6
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", "value1", 256) = 6

I built a small C program to replicate just the above without all of the other fluff in getfattr, and I see one xattr cache hit for the lstat and two xattr cache hits for each call to getxattr.  That accounts for all 5 xattr cache hits (1 + 2 + 2).  Notably, if one does NOT specify "user.attr1" and instead passes an empty string as the attribute name, each getxattr produces only a single hit.

I have a patch that revises the expected stat values from 2 to 5 and from 4 to 10, and while that works on my system I wanted to know:
1. Are these changes expected?  I don't know much about the xattr cache or when it's expected to be hit, but hitting twice for a single getxattr seemed high.
2. Is there any location online where I can look at release testing results for these unit tests?  I wanted to see if I was alone in hitting this many times, but couldn't locate such a repository of historical test results.

Thanks for any and all help!

Andreas Dilger
Lustre Principal Architect
