[lustre-discuss] Jobstats harvesting

Fri Feb 14 18:13:19 PST 2020

Hi folks,

I've finally got round to enabling jobstats on a test system. As we're
a Slurm shop, setting jobid_var=SLURM_JOB_ID works OK, but is it
possible to use a combination of variables, e.g.
${PAWSEY_CLUSTER}-${SLURM_JOB_ID} (or even SLURM_CLUSTER_NAME, which
is the same as $PAWSEY_CLUSTER)? If so, what's the syntax? (Yes, I
know that setting Slurm up as federated would extend the JobId
namespace to include a cluster identifier, but that's not happening
for now.)

However, the main reason for this mail is to find out what people use
to harvest the stats off the MDTs/OSTs. I'm aware of Roland Laifer's
LAD15 presentation (sadly his tarball omits a sample config file, so
it has taken me a few iterations over the Perl scripts to recreate the
syntax), which saves to a file-based structure, and I've seen others
using Prometheus (via https://grafana.com/grafana/dashboards/9671).

We've got InfluxDB (LNet / MDS / OST stats gathered as well as regular
collectd output) and MariaDB (slurmdbd and Robinhood) databases
available, so I'd rather go with something that feeds into those.
We're not doing serious high-throughput (financial-style) work, but
more traditional HPC with a lot (sigh) of single-node jobs across 4
production filesystems (of which 3 are non-appliance LTS releases
maintained by us).
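
To make the InfluxDB option concrete, here is a minimal sketch of a
harvester, not a tested production tool. It assumes the job_stats text
looks like the SAMPLE below (the layout may differ between Lustre
versions), and the measurement name "lustre_jobstats", the cluster
value "clusterA" and the helper names are all hypothetical. It parses
the per-job counters and emits InfluxDB line protocol, adding the
cluster name as a tag at harvest time rather than encoding it in the
job ID.

```python
import re

# Assumed sample of `lctl get_param obdfilter.*.job_stats` output; the
# exact layout is an assumption and may vary between Lustre versions.
SAMPLE = """\
obdfilter.testfs-OST0000.job_stats=
job_stats:
- job_id:          1234
  snapshot_time:   1581732799
  read_bytes:      { samples:           0, unit: bytes, min:       0, max:       0, sum:               0 }
  write_bytes:     { samples:           2, unit: bytes, min: 1048576, max: 1048576, sum:         2097152 }
  getattr:         { samples:           3, unit:  reqs }
"""

TARGET_RE = re.compile(r"^\w+\.(?P<target>[\w-]+)\.job_stats=")
JOB_RE = re.compile(r"^- job_id:\s+(?P<job>\S+)")
SNAP_RE = re.compile(r"^\s+snapshot_time:\s+(?P<ts>\d+)")
OP_RE = re.compile(r"^\s+(?P<op>\w+):\s+\{(?P<body>.*)\}")
FIELD_RE = re.compile(r"(\w+):\s*(\d+)")  # numeric counters only, skips "unit"


def parse_job_stats(text):
    """Yield one dict per job with target, job_id, snapshot_time, fields."""
    target, record = None, None
    for line in text.splitlines():
        m = TARGET_RE.match(line)
        if m:
            if record:
                yield record
            target, record = m.group("target"), None
            continue
        m = JOB_RE.match(line)
        if m:
            if record:
                yield record
            record = {"target": target, "job_id": m.group("job"),
                      "snapshot_time": None, "fields": {}}
            continue
        if record is None:
            continue
        m = SNAP_RE.match(line)
        if m:
            record["snapshot_time"] = int(m.group("ts"))
            continue
        m = OP_RE.match(line)
        if m:
            op = m.group("op")
            for key, value in FIELD_RE.findall(m.group("body")):
                record["fields"][f"{op}_{key}"] = int(value)
    if record:
        yield record


def escape_tag(value):
    """Escape commas, equals signs and spaces in a line-protocol tag value."""
    return re.sub(r"([,= ])", r"\\\1", str(value))


def to_line_protocol(record, cluster, measurement="lustre_jobstats"):
    """Format one parsed record as a line-protocol point (seconds precision)."""
    tags = {"cluster": cluster, "target": record["target"],
            "job_id": record["job_id"]}
    tag_str = ",".join(f"{k}={escape_tag(v)}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}i"
                         for k, v in sorted(record["fields"].items()))
    line = f"{measurement},{tag_str} {field_str}"
    if record["snapshot_time"] is not None:
        line += f" {record['snapshot_time']}"
    return line


if __name__ == "__main__":
    for rec in parse_job_stats(SAMPLE):
        if rec["fields"]:  # a point with no fields is not valid
            print(to_line_protocol(rec, cluster="clusterA"))
```

In real use the SAMPLE text would be replaced by the output of
`lctl get_param obdfilter.*.job_stats` on each OSS (or
`mdt.*.job_stats` on the MDS), and the resulting lines written to
InfluxDB with second precision. Note that using job_id as a tag means
one series per job, which may matter for cardinality with many
single-node jobs.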

Hopefully the discussion here will lead to some updated content at
http://wiki.lustre.org/Lustre_Monitoring_and_Statistics_Guide (hat tip
to Scott for a great start).

Many thanks

Andrew