[lustre-devel] xattr stat mismatch in sanityn test 73

Ellis Wilson elliswilson at microsoft.com
Mon Oct 11 18:31:08 PDT 2021


Apologies if somebody responded and it got spammed or I missed it -- in any event I got to root cause and figured I'd answer my own questions in case somebody went looking.

The additional three getxattrs are caused by inode audit code in the kernel, which is triggered if there are any filesystem specific rules in-place (e.g., log to audit file if directory /etc/auditd is modified).

You can see the additional stack traces with eBPF stackcount:

  ll_xattr_cache_get
  ll_xattr_get_common
  __vfs_getxattr
  vfs_getxattr
  getxattr
  path_getxattr
  __x64_sys_getxattr
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
  [unknown]
  [unknown]
    getfattr [19698]
    1  

  ll_xattr_cache_get
  ll_xattr_get_common
  __vfs_getxattr
  vfs_getxattr
  getxattr
  path_getxattr
  __x64_sys_getxattr
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
  [unknown]
  [unknown]
    getfattr [19698]
    1  

  ll_xattr_cache_get
  ll_xattr_get_common
  __vfs_getxattr
  get_vfs_caps_from_disk
  audit_copy_inode
  __audit_inode
  filename_lookup
  user_path_at_empty
  vfs_statx
  __do_sys_newlstat
  __x64_sys_newlstat
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
  [unknown]
    getfattr [19698]
    1  

  ll_xattr_cache_get
  ll_xattr_get_common
  __vfs_getxattr
  get_vfs_caps_from_disk
  audit_copy_inode
  __audit_inode
  filename_lookup
  user_path_at_empty
  path_getxattr
  __x64_sys_getxattr
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
  [unknown]
  [unknown]
    getfattr [19698]
    2

Only the ones without "__audit_inode" in them are doing "real" work.  The remaining three are simply performing getxattrs against the path being operated on as part of filename_lookup for audit.  So long as auditctl shows enabled AND there is at least a single rule that relates to the filesystem (e.g., audit all modifications to /etc/audit), every single lookup will result in a get(x)attr to get VFS caps.  Relevant call path through the kernel is:

filename_lookup
audit_inode
__audit_inode
audit_copy_inode
audit_copy_fcaps
get_vfs_caps_from_disk

Disabling auditd completely via auditctl -e0 returns the total number of getxattrs to the expected value of 2.

Overall, the check in-place for this test is arguably too overfitted.  At some point I'll put together a PR such that the check is assuring there are at least 2 getxattr_hits rather than exactly 2 unless somebody objects to that solution.

Also, I found maloo and got an account so I could review upstream unit test results.

Best,

ellis


-----Original Message-----
From: Ellis Wilson 
Sent: Friday, October 8, 2021 9:04 AM
To: lustre-devel at lists.lustre.org
Subject: xattr stat mismatch in sanityn test 73

Hi all,

I'm trying to get to the bottom of a failure I'm seeing while running the lustre unit tests, specifically here for this test in the sanityn suite:
= sanityn test 73: getxattr should not cause xattr lock cancellation =

I'm running stock lustre 2.14.0 on Ubuntu 18.04 with Linux Kernel 5.4.0.  I have six nodes total for the test: two clients, two mds, two oss.  The test is running from one of the clients.

The error I get is:
============
getfattr: Removing leading '/' from absolute path names # file: mnt/lustre/f73.sanityn user.attr1="value1"

ELLIS: expected 2, but got 5
 sanityn test_73: @@@@@@ FAIL: not cached in /mnt/lustre
  Trace dump:
  = /usr/lib/lustre/tests/test-framework.sh:6273:error()
  = /usr/lib/lustre/tests/sanityn.sh:3557:test_73()
  = /usr/lib/lustre/tests/test-framework.sh:6581:run_one()
  = /usr/lib/lustre/tests/test-framework.sh:6628:run_one_logged()
  = /usr/lib/lustre/tests/test-framework.sh:6455:run_test()
  = /usr/lib/lustre/tests/sanityn.sh:3567:main()
============

I've instrumented the code to spit out the expected vs. discovered stat values.  The failure indicates the file in question wasn't cached, but in fact the inverse is occurring -- it's both cached and hit more often than expected.

The unadulterated test code follows:
================
3549     touch $DIR1/$tfile
3550     setfattr -n user.attr1 -v value1 $DIR1/$tfile ||
3551         error "setfattr1 failed"
3552     getfattr -n user.attr1 $DIR2/$tfile || error "getfattr1 failed"
3553     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr2 failed"
3554     clear_stats llite.*.stats
3555     # PR lock should be cached by now on both clients
3556     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr3 failed"
3557     # 2 hits for getfattr(0)+getfattr(size)
3558     [ $(calc_stats llite.*.stats getxattr_hits) -eq 2 ] ||
3559         error "not cached in $DIR1"
================

The failure occurs on line 3558.

Manually performing these actions validates that indeed the jump is by 5, not 2:
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             85 samples [reqs]
getxattr_hits             4 samples [reqs]
~# getfattr -n user.attr1 /mnt/lustre/f73.sanityn
getfattr: Removing leading '/' from absolute path names # file: mnt/lustre/f73.sanityn user.attr1="value1"
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             90 samples [reqs]
getxattr_hits             4 samples [reqs]

I straced getfattr as run in the test and found it issues the following:
23262 lstat("/mnt/lustre/f73.sanityn", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", NULL, 0) = 6
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", "value1", 256) = 6

I built a small C program to replicate just the above without all of the other fluff in getfattr, and I see 1 xattr cache hit occurring for the lstat, and two xattr cache hits occurring for each call of getxattr.  So it replicates the 5 xattr cache hits.  It is notable that if one does NOT specify "user.attr1" and instead just uses an empty string you only get a single hit on each getxattr.

I have a patch that revises the expected stat values from 2 to 5 and from 4 to 10, and while that works in my system I wanted to know:
1. Are these changes expected?  I don't know much about the xattr cache or when it's expected to be hit, but hitting twice for a single getxattr seemed high.
2. Is there any location online where I can look at release testing results for these unit tests?  I wanted to see if I was alone in hitting this many times, but couldn't locate such a repository of historical test results.

Thanks for any and all help!

Best,

ellis


More information about the lustre-devel mailing list