[lustre-discuss] Invalid jobid size

Sternberg, Michael G. sternberg at anl.gov
Fri Aug 12 15:26:55 PDT 2022


Einar,

The jobid strings that result from your $SLURM_JOB_ID values and host names are likely too long for the Lustre Jobstats feature. The message itself says as much: it got a size of 37 where 32 is expected.

You might try %H instead of %h in jobid_name. For reference, from the Lustre manual, https://doc.lustre.org/lustre_manual.xhtml#jobstats :

> %e print executable name
> %g print group ID number
> %h print fully-qualified hostname
> %H print short hostname
> %j print JobID from process environment variable named by the jobid_var parameter
> %p print numeric process ID
> %u print user ID number
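
To get a feel for how much %H saves, here is a rough shell sketch that expands a jobid_name-style template with sample values and checks the length. The helper names expand_jobid and jobid_fits are made up for illustration and are not part of Lustre. Only %j, %u, %h and %H are handled. The limit of 32 is taken from the log message; I have not checked whether that figure includes a terminating byte, so the check conservatively requires the length to stay strictly below it.

```shell
# Hypothetical helpers, not part of Lustre: expand a jobid_name-style
# template with sample values and compare its length to a limit.
# Only %j, %u, %h and %H are handled.
expand_jobid() {
    # $1 template, $2 job ID, $3 numeric UID, $4 fully-qualified hostname
    short=${4%%.*}    # short hostname: everything before the first dot
    printf '%s\n' "$1" |
        sed -e "s/%j/$2/g" -e "s/%u/$3/g" -e "s/%H/$short/g" -e "s/%h/$4/g"
}

jobid_fits() {
    # $1 expanded jobid, $2 limit (32 assumed, from the log message).
    # Strictly-less-than is a conservative assumption in case the
    # reported size counts a terminating byte.
    [ "${#1}" -lt "${2:-32}" ]
}

# Example use inside a job (values come from the environment):
#   id=$(expand_jobid '%j:%u:%H' "$SLURM_JOB_ID" "$(id -u)" "$(hostname -f)")
#   jobid_fits "$id" || echo "too long: $id"
```

Running this with a few of your longest host names and typical job IDs should show whether %j:%u:%H stays under the limit where %j:%u:%h does not.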


On my system (2.12), I use:

	jobid_var=PBS_JOBID
	jobid_name=%e.%u

I get job_stats by $PBS_JOBID, as expected, from processes that actually have the variable set, and synthetic %e.%u values from all others, like processes on interactive or backup nodes. This has been working just fine to pinpoint the source of occasional trouble.

Curiously, I don't think the manual spells out what happens when the variable referenced by jobid_var is unset, i.e., the above fallback logic from jobid_var to jobid_name.
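
Spelled out as a shell sketch, the fallback I observe looks roughly like the following. This only models the behavior seen on my system with jobid_name=%e.%u; it is not Lustre's actual implementation, and the function name pick_jobid is made up for illustration.

```shell
# Sketch of the observed fallback only; not Lustre's actual code.
# pick_jobid VALUE_OF_JOBID_VAR EXECUTABLE_NAME NUMERIC_UID
pick_jobid() {
    if [ -n "$1" ]; then
        # The variable named by jobid_var is set: use its value as the jobid.
        printf '%s\n' "$1"
    else
        # Variable unset or empty: fall back to the jobid_name pattern %e.%u.
        printf '%s.%s\n' "$2" "$3"
    fi
}

# Example use:
#   pick_jobid "$PBS_JOBID" rsync "$(id -u)"
```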


With best regards,
-- 
Michael Sternberg, Ph.D.
Principal Scientific Computing Administrator
Center for Nanoscale Materials
Argonne National Laboratory




> On Aug 12, 2022, at 03:37, Einar Næss Jensen <einar.nass.jensen at ntnu.no> wrote:
> logfiles on oss servers are full of these error messages:
> Invalid jobid size (37), expect(32)
> What does it mean?
> 
> we have set this:
> [root at mds-1 ~]# lctl get_param jobid_var jobid_name
> jobid_var=SLURM_JOB_ID
> jobid_name=%j:%u:%h
> 
> lustre version is 2.12.6(ddn)

