[lustre-discuss] jobstats, SLURM_JOB_ID, array jobs and pain.

Drokin, Oleg oleg.drokin at intel.com
Fri May 8 20:32:56 PDT 2015


Hello!

On Apr 30, 2015, at 6:02 PM, Scott Nolin wrote:

> Has anyone been working with the lustre jobstats feature and SLURM? We have been, and it's OK. But now that I'm working on systems that run a lot of array jobs and a fairly recent slurm version we found some ugly stuff.
> 
> Array jobs report their SLURM_JOBID as a variable, and it's unique for every job. But they also use other IDs that appear only for array jobs.
> 
> http://slurm.schedmd.com/job_array.html
> 
> However, as far as I can tell, that unique SLURM_JOBID is only truly exposed in the command line tools via 'scontrol' - which only works while the job is running. If you want to look at completed jobs with sacct, for example, things are troublesome.
> 
> Here's what my coworker and I have figured out:
> 
> - You submit a (non-array) job that gets jobid 100.
> - The next job gets jobid 101.
> - Then submit a 10-task array job. That gets jobid 102, and the sub-tasks get 9 more job ids. If nothing else is happening on the system, that means you use jobids 102 to 111.
> 
> If things were that orderly, you could cope with using SLURM_JOB_ID in lustre jobstats pretty easily. Use sacct and you see job 102_2 - you know that is jobid 103 in lustre jobstats.
> 
> But if other jobs get submitted during setup (as of course they do), they can take jobid 103. So you've got problems.
> 
> I think we may try to set a magic variable in the slurm prolog and use that for the jobstats_var, but who knows.
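
For reference, the jobstats_var approach described above is configured roughly as sketched below. This is only an illustration - the filesystem name "testfs", the variable name MY_JOBSTATS_ID and the use of a slurm TaskProlog to export it into the job environment are assumptions for the sketch, not anything this thread settled on:

    # On the MGS, once: point jobstats at a custom environment variable
    lctl conf_param testfs.sys.jobid_var=MY_JOBSTATS_ID

    #!/bin/bash
    # slurm TaskProlog: lines printed as "export NAME=value" are added to
    # the task environment, where the lustre client can then pick them up.
    if [ -n "$SLURM_ARRAY_JOB_ID" ]; then
        echo "export MY_JOBSTATS_ID=${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
    else
        echo "export MY_JOBSTATS_ID=${SLURM_JOB_ID}"
    fi

With a value of that form the jobstats identifier matches what sacct shows for array tasks (e.g. 102_2), so there is no need to map raw job ids back to array indices.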

There's another method planned for handling the jobid, currently featured mainly in the kernel staging tree, but it will make its way into the lustre tree too.

The idea is to just write your jobid directly into lustre from your prologue script (and clear it from the epilogue).

That way you can set it to whatever you like, without ugly messing with shell variables (and equally ugly parsing of those variables in the kernel!).
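
Purely as a sketch of how that could look from slurm's side - the parameter names below (jobid_var=nodelocal, a writable jobid_name) are placeholders for whatever the final interface ends up being called, since the patch has not landed in the lustre tree yet:

    # Once on each client node: use the locally-set jobid instead of
    # parsing an environment variable out of the process environment.
    lctl set_param jobid_var=nodelocal

    # slurm Prolog (runs as root on every allocated node): publish the jobid.
    lctl set_param jobid_name="$SLURM_JOB_ID"

    # slurm Epilog: clear it again so later I/O on the node is not
    # attributed to a job that already finished.
    lctl set_param jobid_name=unknown

And since the value is just a string written from the prologue, the same script could presumably write ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID} for array tasks, sidestepping the mapping problem entirely.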

For some reason I cannot find the corresponding master patch, though I have a passing memory of writing it, so this needs to be addressed separately.

Bye,
    Oleg

