[lustre-devel] xattr stat mismatch in sanityn test 73

Ellis Wilson elliswilson at microsoft.com
Fri Oct 8 06:04:14 PDT 2021

Hi all,

I'm trying to get to the bottom of a failure I'm seeing while running the Lustre unit tests, specifically in this test from the sanityn suite:
= sanityn test 73: getxattr should not cause xattr lock cancellation =

I'm running stock Lustre 2.14.0 on Ubuntu 18.04 with Linux kernel 5.4.0.  I have six nodes total for the test: two clients, two MDS, two OSS.  The test is run from one of the clients.

The error I get is:
getfattr: Removing leading '/' from absolute path names
# file: mnt/lustre/f73.sanityn

ELLIS: expected 2, but got 5
 sanityn test_73: @@@@@@ FAIL: not cached in /mnt/lustre 
  Trace dump:
  = /usr/lib/lustre/tests/test-framework.sh:6273:error()
  = /usr/lib/lustre/tests/sanityn.sh:3557:test_73()
  = /usr/lib/lustre/tests/test-framework.sh:6581:run_one()
  = /usr/lib/lustre/tests/test-framework.sh:6628:run_one_logged()
  = /usr/lib/lustre/tests/test-framework.sh:6455:run_test()
  = /usr/lib/lustre/tests/sanityn.sh:3567:main()

I've instrumented the test to print the expected vs. observed stat values (the "ELLIS:" line above).  The failure message says the file in question wasn't cached, but in fact the opposite is happening: it is cached, and the cache is hit more often than the test expects.

The unadulterated test code follows:
3549     touch $DIR1/$tfile
3550     setfattr -n user.attr1 -v value1 $DIR1/$tfile ||
3551         error "setfattr1 failed"
3552     getfattr -n user.attr1 $DIR2/$tfile || error "getfattr1 failed"
3553     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr2 failed"
3554     clear_stats llite.*.stats
3555     # PR lock should be cached by now on both clients
3556     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr3 failed"
3557     # 2 hits for getfattr(0)+getfattr(size)
3558     [ $(calc_stats llite.*.stats getxattr_hits) -eq 2 ] ||
3559         error "not cached in $DIR1"

The failure occurs on line 3558.

Running these steps by hand confirms that the counter does jump by 5, not 2:
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             85 samples [reqs]
getxattr_hits             4 samples [reqs]
~# getfattr -n user.attr1 /mnt/lustre/f73.sanityn
getfattr: Removing leading '/' from absolute path names
# file: mnt/lustre/f73.sanityn
user.attr1="value1"
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             90 samples [reqs]
getxattr_hits             4 samples [reqs]

I straced getfattr as run in the test and found it issues the following:
23262 lstat("/mnt/lustre/f73.sanityn", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", NULL, 0) = 6
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", "value1", 256) = 6

I built a small C program that issues just the three calls above, without all of the other fluff in getfattr.  With it I see one xattr cache hit for the lstat and two xattr cache hits for each of the two getxattr calls, so 1 + 2 + 2 reproduces the 5 hits.  Notably, if one does NOT specify "user.attr1" and instead passes an empty string as the name, each getxattr registers only a single hit.

I have a patch that revises the expected stat values from 2 to 5 and from 4 to 10, and while that works on my system I wanted to know:
1. Are these changes expected?  I don't know much about the xattr cache or when it's expected to be hit, but hitting twice for a single getxattr seemed high.
2. Is there any location online where I can look at release testing results for these unit tests?  I wanted to see if I was alone in hitting this many times, but couldn't locate such a repository of historical test results.

Thanks for any and all help!


