[lustre-discuss] jobstats, SLURM_JOB_ID, array jobs and pain.

Scott Nolin scott.nolin at ssec.wisc.edu
Thu Apr 30 15:02:16 PDT 2015


Has anyone been working with the lustre jobstats feature and SLURM? We 
have been, and it's OK. But now that I'm working on systems that run a 
lot of array jobs and a fairly recent slurm version, we've found some 
ugly stuff.

Array jobs do report their SLURM_JOBID as a variable, and it's unique 
for every job. But they use other IDs too that appear only for array jobs.

http://slurm.schedmd.com/job_array.html

However, that unique SLURM_JOBID as far as I can tell is only truly 
exposed in command line tools via 'scontrol' - which is only valid while 
the job is running. If you want to look at older jobs with sacct for 
example, things are troublesome.

Here's what my coworker and I have figured out:

- You submit a (non-array) job that gets jobid 100.
- The next job gets jobid 101.
- Then submit a 10-task array job. That gets jobid 102. The sub tasks 
get 9 more job ids. If nothing else is happening with the system, that 
means you use jobid 102 to 111.

If things were that orderly, you could cope with using SLURM_JOB_ID in 
lustre jobstats pretty easily. Use sacct and you see job 102_2 - you 
know that is jobid 103 in lustre jobstats.

But, if other jobs get submitted during set up (as of course they do), 
they can take jobid 103. So, you've got problems.
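
To make the orderly case concrete, here is a minimal sketch of the naive 
mapping from an sacct-style array task (like 102_2) to the jobid that 
would show up in lustre jobstats. The function name and the assumption 
that task indices start at 1 are mine, for illustration only. It is only 
right when no other job grabs an id while the array tasks are being set 
up, which is exactly the assumption that fails in practice.

```python
def naive_lustre_jobid(array_job_id: int, task_id: int, first_task_id: int = 1) -> int:
    """Guess the unique job id of an array task in the orderly case.

    Assumes (hypothetically) that sub tasks receive consecutive job ids
    starting at the array job id, with no other submissions interleaved.
    """
    if task_id < first_task_id:
        raise ValueError("task_id is below the first array index")
    # Task first_task_id keeps the array job id; each later task is one higher.
    return array_job_id + (task_id - first_task_id)


if __name__ == "__main__":
    # sacct shows 102_2; in the orderly case jobstats would show 103.
    print(naive_lustre_jobid(102, 2))
```

If an unrelated job is submitted in the middle and takes 103, this guess 
silently points at the wrong job, so it can't be trusted on a busy system.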

I think we may try to set a magic variable in the slurm prolog and use 
that for the jobstats jobid_var, but who knows.
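
A rough sketch of what that prolog idea might look like, untested against 
a real cluster. It assumes a Slurm TaskProlog script, where lines of the 
form "export NAME=value" printed to stdout are added to the task's 
environment, and it assumes SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID 
are visible there. The variable name LUSTRE_JOBSTATS_ID is made up for 
this example; lustre would have to be told to read whatever name is 
actually chosen.

```python
#!/usr/bin/env python3
import os


def jobstats_id(env):
    """Return 'ARRAYJOBID_TASKID' for array tasks, else the plain job id."""
    array_job = env.get("SLURM_ARRAY_JOB_ID")
    array_task = env.get("SLURM_ARRAY_TASK_ID")
    if array_job and array_task:
        # Same form sacct prints for array tasks, e.g. 102_2.
        return f"{array_job}_{array_task}"
    return env.get("SLURM_JOB_ID", "")


def prolog_line(env, var_name="LUSTRE_JOBSTATS_ID"):
    """Build the export line for the prolog; var_name is hypothetical."""
    ident = jobstats_id(env)
    return f"export {var_name}={ident}" if ident else ""


if __name__ == "__main__":
    line = prolog_line(os.environ)
    if line:
        print(line)
```

The point of using the 102_2 form is that jobstats entries would then 
match what sacct shows for finished jobs, with no guessing about which 
numeric jobid a task ended up with.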

Scott
