[lustre-devel] xattr stat mismatch in sanityn test 73
Ellis Wilson
elliswilson at microsoft.com
Fri Oct 8 06:04:14 PDT 2021
Hi all,
I'm trying to get to the bottom of a failure I'm seeing while running the Lustre unit tests, specifically this test in the sanityn suite:
= sanityn test 73: getxattr should not cause xattr lock cancellation =
I'm running stock Lustre 2.14.0 on Ubuntu 18.04 with Linux kernel 5.4.0. I have six nodes in total for the test: two clients, two MDS, and two OSS. The test runs from one of the clients.
The error I get is:
============
getfattr: Removing leading '/' from absolute path names
# file: mnt/lustre/f73.sanityn
user.attr1="value1"
ELLIS: expected 2, but got 5
sanityn test_73: @@@@@@ FAIL: not cached in /mnt/lustre
Trace dump:
= /usr/lib/lustre/tests/test-framework.sh:6273:error()
= /usr/lib/lustre/tests/sanityn.sh:3557:test_73()
= /usr/lib/lustre/tests/test-framework.sh:6581:run_one()
= /usr/lib/lustre/tests/test-framework.sh:6628:run_one_logged()
= /usr/lib/lustre/tests/test-framework.sh:6455:run_test()
= /usr/lib/lustre/tests/sanityn.sh:3567:main()
============
I've instrumented the code to print the expected vs. observed stat values (the "ELLIS:" line above). The failure message says the file in question wasn't cached, but in fact the opposite is happening: it is cached, and the cache is hit more often than the test expects.
The unadulterated test code follows:
================
3549 touch $DIR1/$tfile
3550 setfattr -n user.attr1 -v value1 $DIR1/$tfile ||
3551 error "setfattr1 failed"
3552 getfattr -n user.attr1 $DIR2/$tfile || error "getfattr1 failed"
3553 getfattr -n user.attr1 $DIR1/$tfile || error "getfattr2 failed"
3554 clear_stats llite.*.stats
3555 # PR lock should be cached by now on both clients
3556 getfattr -n user.attr1 $DIR1/$tfile || error "getfattr3 failed"
3557 # 2 hits for getfattr(0)+getfattr(size)
3558 [ $(calc_stats llite.*.stats getxattr_hits) -eq 2 ] ||
3559 error "not cached in $DIR1"
================
The failure occurs on line 3558.
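For reference, instrumentation of the kind described above could be as small as the sketch below. `expect_stat` is a hypothetical helper name, not something from test-framework.sh; only the message wording is taken from the output shown earlier.

```shell
# Hypothetical helper (not part of test-framework.sh): compare an observed
# counter against the expected value and report both when they differ.
#   $1 = expected value, $2 = observed value, $3 = tag to prefix the message
expect_stat() {
    local want=$1 got=$2 tag=$3

    if [ "$got" -ne "$want" ]; then
        echo "$tag: expected $want, but got $got"
        return 1
    fi
    return 0
}
```

Assuming `calc_stats` behaves as it does in the excerpt above, the check on lines 3558-3559 could then be written as `expect_stat 2 "$(calc_stats llite.*.stats getxattr_hits)" ELLIS || error "not cached in $DIR1"`.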
Performing these steps manually confirms that the counter does jump by 5, not 2:
~# lctl get_param llite.*.stats | grep hits
getxattr_hits 85 samples [reqs]
getxattr_hits 4 samples [reqs]
~# getfattr -n user.attr1 /mnt/lustre/f73.sanityn
getfattr: Removing leading '/' from absolute path names
# file: mnt/lustre/f73.sanityn
user.attr1="value1"
~# lctl get_param llite.*.stats | grep hits
getxattr_hits 90 samples [reqs]
getxattr_hits 4 samples [reqs]
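The before/after comparison above can be scripted so the delta is printed directly. This is only a sketch: `read_stats`, `sum_getxattr_hits` and `hits_delta` are hypothetical names, and it assumes the stats lines keep the "name count samples [unit]" layout shown in the output above. It sums across all llite instances, much as the test does with `calc_stats llite.*.stats`.

```shell
# Hypothetical wrapper around the command used above. The pattern is quoted
# so the shell passes it through and lctl expands the wildcard itself.
read_stats() {
    lctl get_param 'llite.*.stats'
}

# Sum the sample counts of every getxattr_hits line read from stdin.
sum_getxattr_hits() {
    awk '$1 == "getxattr_hits" { total += $2 } END { print total + 0 }'
}

# Run a command and print how many getxattr hits it added in total.
hits_delta() {
    local before after

    before=$(read_stats | sum_getxattr_hits)
    "$@" > /dev/null 2>&1
    after=$(read_stats | sum_getxattr_hits)
    echo $((after - before))
}
```

With these defined, `hits_delta getfattr -n user.attr1 /mnt/lustre/f73.sanityn` prints the change in the summed counter caused by that one getfattr invocation.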
I straced getfattr as run in the test and found it issues the following:
23262 lstat("/mnt/lustre/f73.sanityn", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", NULL, 0) = 6
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", "value1", 256) = 6
I built a small C program that replicates just the three calls above, without everything else getfattr does. With it I see one xattr cache hit for the lstat and two xattr cache hits for each getxattr call, so it reproduces the 5 xattr cache hits (1 + 2 + 2). Notably, if I do NOT specify "user.attr1" and instead pass an empty string, each getxattr produces only a single hit.
I have a patch that revises the expected stat values from 2 to 5 and from 4 to 10, and while that works on my system I wanted to know:
1. Are these changes expected? I don't know much about the xattr cache or when it's expected to be hit, but hitting twice for a single getxattr seemed high.
2. Is there any location online where I can look at release testing results for these unit tests? I wanted to see if I was alone in hitting this many times, but couldn't locate such a repository of historical test results.
Thanks for any and all help!
Best,
ellis