[lustre-devel] xattr stat mismatch in sanityn test 73

Andreas Dilger adilger at whamcloud.com
Thu Oct 14 23:18:47 PDT 2021


Ellis, the getfattr code has a poor "optimization", in that it first calls getxattr(buf=NULL, size=0) to have the kernel report the xattr size to allocate a buffer of exactly the right size.  However, IMHO while this avoids a "too large" allocation (maybe a few bytes vs. 64KB), it is inefficient by doubling the number of syscalls into the filesystem.

The other getxattr cache hit during lstat() is likely because of the kernel accessing an selinux/security xattr.

On Oct 8, 2021, at 07:04, Ellis Wilson via lustre-devel <lustre-devel at lists.lustre.org<mailto:lustre-devel at lists.lustre.org>> wrote:

Hi all,

I'm trying to get to the bottom of a failure I'm seeing while running the lustre unit tests, specifically here for this test in the sanityn suite:
= sanityn test 73: getxattr should not cause xattr lock cancellation =

I'm running stock lustre 2.14.0 on Ubuntu 18.04 with Linux Kernel 5.4.0.  I have six nodes total for the test: two clients, two mds, two oss.  The test is running from one of the clients.

The error I get is:
============
getfattr: Removing leading '/' from absolute path names
# file: mnt/lustre/f73.sanityn
user.attr1="value1"

ELLIS: expected 2, but got 5
sanityn test_73: @@@@@@ FAIL: not cached in /mnt/lustre
 Trace dump:
 = /usr/lib/lustre/tests/test-framework.sh:6273:error()
 = /usr/lib/lustre/tests/sanityn.sh:3557:test_73()
 = /usr/lib/lustre/tests/test-framework.sh:6581:run_one()
 = /usr/lib/lustre/tests/test-framework.sh:6628:run_one_logged()
 = /usr/lib/lustre/tests/test-framework.sh:6455:run_test()
 = /usr/lib/lustre/tests/sanityn.sh:3567:main()
============

I've instrumented the code to spit out the expected vs. discovered stat values.  The failure indicates the file in question wasn't cached, but in fact the inverse is occurring -- it's both cached and hit more often than expected.

The unadulterated test code follows:
================
3549     touch $DIR1/$tfile
3550     setfattr -n user.attr1 -v value1 $DIR1/$tfile ||
3551         error "setfattr1 failed"
3552     getfattr -n user.attr1 $DIR2/$tfile || error "getfattr1 failed"
3553     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr2 failed"
3554     clear_stats llite.*.stats
3555     # PR lock should be cached by now on both clients
3556     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr3 failed"
3557     # 2 hits for getfattr(0)+getfattr(size)
3558     [ $(calc_stats llite.*.stats getxattr_hits) -eq 2 ] ||
3559         error "not cached in $DIR1"
================

The failure occurs on line 3558.

Manually performing these actions validates that indeed the jump is by 5, not 2:
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             85 samples [reqs]
getxattr_hits             4 samples [reqs]
~# getfattr -n user.attr1 /mnt/lustre/f73.sanityn
getfattr: Removing leading '/' from absolute path names # file: mnt/lustre/f73.sanityn user.attr1="value1"
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             90 samples [reqs]
getxattr_hits             4 samples [reqs]

I straced getfattr as run in the test and found it issues the following:
23262 lstat("/mnt/lustre/f73.sanityn", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", NULL, 0) = 6
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", "value1", 256) = 6

I built a small C program to replicate just the above without all of the other fluff in getfattr, and I see 1 xattr cache hit occurring for the lstat, and two xattr cache hits occurring for each call of getxattr.  So it replicates the 5 xattr cache hits.  It is notable that if one does NOT specify "user.attr1" and instead just uses an empty string you only get a single hit on each getxattr.

I have a patch that revises the expected stat values from 2 to 5 and from 4 to 10, and while that works in my system I wanted to know:
1. Are these changes expected?  I don't know much about the xattr cache or when it's expected to be hit, but hitting twice for a single getxattr seemed high.
2. Is there any location online where I can look at release testing results for these unit tests?  I wanted to see if I was alone in hitting this many times, but couldn't locate such a repository of historical test results.

Thanks for any and all help!

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20211015/c33fd991/attachment.html>


More information about the lustre-devel mailing list