[lustre-discuss] Understanding MDT getxattr stats

Andreas Dilger adilger at whamcloud.com
Tue Sep 25 15:58:56 PDT 2018


On Sep 25, 2018, at 23:01, Kirk, Benjamin (JSC-EG311) <benjamin.kirk at nasa.gov> wrote:
> 
> Hi all,
> 
> We’re using jobstats under SLURM and have pulled together a tool to integrate SLURM job info and lustre OST/MDT jobstats.  The idea is to correlate filesystem use cases with particular applications as targets for refactoring.
> 
> In doing so, I’m seeing some applications really trigger getxattr on the MDT, and others do not.  A particularly egregious example is below:  360 cores, ~10s of GB of output, ~6500 files, but 16,608,476 calls to getxattr during a 4-hour runtime.  And this is a nominally compute-bound problem, so the I/O pattern is likely compressed into small windows of time.
> 
> The system is CentOS 7.5 / Lustre 2.10.5 / zfs-0.7.9, with a single MDT and 12 OSS nodes with 2 OSTs each.  Default stripe count of 4.
> 
> A couple questions:
> 
> 1) Should I care about this?  We do see sporadic MDT slowness under ZFS, but that doesn’t seem rare.  I’m looking for a good way to trace that to jobs / use cases.

Having SLURM report the JobID stats from the servers seems like a great idea to me.  This makes IO much more visible to users/developers, and they can start to get a feeling for whether they are doing a lot or a little IO.
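
For reference, wiring the two together is mostly a matter of tagging client RPCs with the SLURM job ID and then scraping the per-job counters on the servers; roughly (with "testfs" as a placeholder filesystem name):

  # tag client RPCs with the SLURM job ID
  lctl conf_param testfs.sys.jobid_var=SLURM_JOB_ID

  # on the MDS/OSS nodes: dump the per-job counters, then clear them
  lctl get_param mdt.*.job_stats
  lctl get_param obdfilter.*.job_stats
  lctl set_param mdt.*.job_stats=clear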

I pushed a patch recently that reports the start time of the jobstats data in the output, so one can get a better idea of the IO rates involved.

I've also wondered whether we should keep an IO histogram for each JobID (like brw_stats), but I wonder if count and sum are enough to get the average IO size, with maybe sum_squared to calculate the stddev?
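
(For completeness: those three running counters are enough to recover both, since

  mean   = sum / count
  stddev = sqrt(sum_squared / count - mean^2)

with sum_squared being the running sum of the squared IO sizes.)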

> 2) What types of operations might be triggering the getxattr usage on a moderate number of files (e.g. what to watch for in the refactoring process…)

There are a number of different possibilities (a couple of quick client-side checks are sketched after the list):
- spurious SELinux security checks
- ACLs (which are stored as xattrs on disk)
- user xattrs (if you have this enabled)
- xattrs that are too large to fit into the client xattr cache
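
Quick client-side checks for the first and last of these might look like the following (real commands, though the lctl parameter name assumes a reasonably recent client):

  getenforce                           # is SELinux enforcing/permissive on the clients?
  lctl get_param llite.*.xattr_cache   # is the client xattr cache enabled (1) or disabled (0)?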

There is already an xattr cache on the client, but it doesn't cache very large xattrs.  You could try running strace on the running program to see what it is doing.  If you know the input/output files, you could check with getfattr and getfacl to see what xattrs are stored there.  If those ~17M calls are compressed into 5 minutes of IO, that is over 50k/sec.  While it is great that the MDS can handle this load, it isn't great that the application is doing it at all.
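
A sketch of those two checks, with <pid> and the file path as placeholders:

  # count the xattr-related syscalls the running process makes
  strace -f -c -e trace=getxattr,lgetxattr,fgetxattr -p <pid>

  # dump the xattrs and any POSIX ACLs stored on a suspect output file
  getfattr -d -m - /path/to/output/file
  getfacl /path/to/output/file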

> Thanks,
> 
> -Ben
> 
> --------------------------
> ….
> TRES                   : cpu=360,node=30,billing=360
> RunTime                : 04:59:14
> GroupId                : eg3(3000)
> ExitCode               : 0:0
> MDT:rename             : 373
> MDT:snapshot_time      : 2018-09-21 08:36:29
> MDT:setattr            : 444
> MDT:mkdir              : 361
> MDT:getattr            : 1570
> MDT:getxattr           : 16608476
> MDT:mknod              : 265
> MDT:rmdir              : 1
> MDT:samedir_rename     : 373
> MDT:close              : 6331
> MDT:unlink             : 113
> MDT:open               : 6345
> OST0009:write_bytes    : 3.46 GB
> OST0008:write_bytes    : 3.11 GB
> OST0001:write_bytes    : 1.01 GB
> OST0000:write_bytes    : 396.19 MB
> OST0005:read_bytes     : 8.19 KB
> OST0005:write_bytes    : 2.38 GB
> OST0005:setattr        : 1
> OST0004:write_bytes    : 790.65 MB
> OST0007:write_bytes    : 3.02 GB
> OST0006:write_bytes    : 817.14 MB
> OST0016:write_bytes    : 4.57 GB
> OST0017:write_bytes    : 5.15 GB
> OST0017:setattr        : 1
> OST0014:write_bytes    : 8.8 GB
> OST0015:write_bytes    : 1.37 GB
> OST0012:write_bytes    : 7 GB
> OST0012:setattr        : 1
> OST0013:read_bytes     : 8.39 MB
> OST0013:write_bytes    : 8.4 GB
> OST0013:setattr        : 1
> OST0010:write_bytes    : 1.98 GB
> OST0011:read_bytes     : 27.28 MB
> OST0011:write_bytes    : 9.42 GB
> OST000c:read_bytes     : 131.07 KB
> OST000c:write_bytes    : 5.83 GB
> OST000c:setattr        : 2
> OST000b:read_bytes     : 28.12 MB
> OST000b:write_bytes    : 4.23 GB
> OST000e:read_bytes     : 8.02 MB
> OST000e:write_bytes    : 7.48 GB
> OST000e:setattr        : 1
> OST000d:write_bytes    : 1.21 GB
> OST000f:write_bytes    : 2.88 GB
> 
> 
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
---
Andreas Dilger
CTO Whamcloud



