[lustre-devel] Proposal for JobID caching

Thu Feb 2 07:20:29 PST 2017

https://review.whamcloud.com/#/c/25208/ is a working version of what I had
proposed, including the suggested changes to default to procname_uid.
This is not perfect, but performance is much improved over the current
methods and, unlike inode-based caching, metadata performance isn't
negatively affected.  Multiple simultaneous jobs can be run on the same
file, and each gets appropriate metrics.

-Ben

On 1/20/17, 5:00 PM, "Ben Evans" <bevans at cray.com> wrote:

>
>
>On 1/20/17, 4:50 PM, "Dilger, Andreas" <andreas.dilger at intel.com> wrote:
>
>>On Jan 18, 2017, at 13:39, Oleg Drokin <oleg.drokin at intel.com> wrote:
>>> 
>>> 
>>> On Jan 18, 2017, at 3:08 PM, Ben Evans wrote:
>>> 
>>>> Overview
>>>>            The Lustre filesystem added the ability to track I/O
>>>>performance of a job across a cluster.  The initial algorithm was
>>>>relatively simplistic:  for every I/O, look up the job ID of the
>>>>process and include it in the RPC being sent to the server.  This
>>>>imposed a non-trivial performance impact on client I/O performance.
>>>>            An additional algorithm was introduced to handle the single
>>>>job per node case, where instead of looking up the job ID of the
>>>>process, Lustre simply accesses the value of a variable set through the
>>>>proc interface.  This improved performance greatly, but only functions
>>>>when a single job is being run.
>>>>            A new approach is needed for multiple job per node systems.
>>>> 
>>>> Proposed Solution
>>>>            The proposed solution to this is to create a small
>>>>PID->JobID table in kernel memory.  When a process performs an I/O, a
>>>>lookup is done in the table for the PID.  If a JobID exists for that
>>>>PID, it is used; otherwise it is retrieved via the same methods as the
>>>>original Jobstats algorithm.  Once located, the JobID is stored in the
>>>>PID/JobID table in memory.  The existing cfs_hash_table structure and
>>>>functions will be used to implement the table.
>>>> 
>>>> Rationale
>>>>            This reduces the number of calls into userspace, minimizing
>>>>the time taken on each I/O.  It also easily supports multiple job per
>>>>node scenarios, and like other proposed solutions has no issue with
>>>>multiple jobs performing I/O on the same file at the same time.
>>>> 
>>>> Requirements
>>>> ·      Performance cannot significantly detract from baseline
>>>>performance without jobstats
>>>> ·      Supports multiple jobs per node
>>>> ·      Coordination with the scheduler is not required, but interfaces
>>>>may be provided
>>>> ·      Supports multiple PIDs per job
>>>> 
>>>> New Data Structures
>>>>            struct pid_to_jobid {
>>>>                        struct hlist_node pj_hash;
>>>>                        u64 pj_pid;
>>>>                        char pj_jobid[LUSTRE_JOBID_SIZE];
>>>>                        spinlock_t jp_lock;
>>>>                        time_t jp_time;
>>>>            };
>>>> Proc Variables
>>>> Writing to /proc/fs/lustre/jobid_name while not in "nodelocal" mode
>>>>will cause all entries in the cache for that JobID to be removed from
>>>>the cache.
>>>> 
>>>> Populating the Cache
>>>>            When lustre_get_jobid is called in cached mode, a check is
>>>>first done in the cache for a valid PID to JobID mapping.  If none
>>>>exists, the same mechanisms are used to get the JobID, and the
>>>>appropriate PID to JobID mapping is populated.
>>>> If a lookup is performed and the PID to JobID mapping exists, but is
>>>>more than 30 seconds old, the JobID is refreshed.
>>>> Purging the Cache
>>>>            The cache can be purged of a specific job by writing the
>>>>JobID to the jobid_name proc file.  Any items in the cache that are
>>>>more than 300 seconds out of date will also be purged at this time.
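The lookup, refresh, and purge rules described in the proposal above can be sketched in userspace C. This is only an illustration under stated assumptions, not the code in the patch: the table is a fixed-size direct-mapped array rather than a cfs_hash table, there is no locking, the caller passes the current time, and slow_get_jobid() is a hypothetical stand-in for the existing environment scan. Only the 30 second and 300 second thresholds come from the proposal.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define JOBID_SIZE    32   /* assumed stand-in for LUSTRE_JOBID_SIZE */
#define CACHE_SLOTS   64   /* assumed table size */
#define REFRESH_SECS  30   /* refresh threshold from the proposal */
#define PURGE_SECS    300  /* purge threshold from the proposal */

struct pid_jobid_entry {
    int      used;
    uint64_t pid;
    char     jobid[JOBID_SIZE];
    int64_t  stamp;        /* time of last refresh, in seconds */
};

static struct pid_jobid_entry cache[CACHE_SLOTS];
static int slow_lookups;   /* counts trips through the slow path */

/* Hypothetical slow path; the real one would scan the process environment. */
static void slow_get_jobid(uint64_t pid, char *out, size_t len)
{
    assert(len > 0);
    slow_lookups++;
    snprintf(out, len, "job-%llu", (unsigned long long)(pid / 100));
}

/*
 * Return the JobID for pid. The slow path runs only when the slot is empty,
 * holds a different PID (a collision simply evicts the old entry), or the
 * entry is more than REFRESH_SECS old.
 */
const char *cached_get_jobid(uint64_t pid, int64_t now)
{
    struct pid_jobid_entry *e = &cache[pid % CACHE_SLOTS];

    if (!e->used || e->pid != pid || now - e->stamp > REFRESH_SECS) {
        slow_get_jobid(pid, e->jobid, sizeof(e->jobid));
        e->pid = pid;
        e->used = 1;
        e->stamp = now;
    }
    return e->jobid;
}

/*
 * Models a write of jobid to the jobid_name proc file: drop every entry for
 * that JobID, plus any entry more than PURGE_SECS old. Returns the number of
 * entries removed.
 */
int purge_jobid(const char *jobid, int64_t now)
{
    int removed = 0;

    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct pid_jobid_entry *e = &cache[i];

        if (!e->used)
            continue;
        if (strcmp(e->jobid, jobid) == 0 || now - e->stamp > PURGE_SECS) {
            e->used = 0;
            removed++;
        }
    }
    return removed;
}
```

Passing the time in as a parameter keeps the sketch deterministic; a kernel version would read a clock instead and would need the per-entry lock from the proposed structure.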
>>> 
>>> 
>>> I'd much rather prefer you go to the table that's populated outside of
>>>the kernel
>>> somehow.
>>> Let's be realistic, poking around in userspace process environments for
>>>random
>>> strings is not such a great idea at all even though it did look like a
>>>good idea
>>> in the past for simplicity reasons.
>>> Similar to nodelocal, we probably just switch to a method where you
>>>call a
>>> particular lctl command that would mark the whole session as belonging
>>> to some job. This might take several forms, e.g. nodelocal itself could
>>> be extended to only apply to a current namespace/container
>>> But if you do really run different jobs in the global namespace, we
>>>can
>>> probably just make lctl spawn a shell with commands that all
>>>would
>>> be marked as a particular job? Or we can probably trace the parent of
>>>lctl and
>>> mark that so that all its children become somehow marked too.
>>
>>Having lctl spawn a shell or requiring everything to run in a container
>>is impractical for users, and will just make it harder to use JobID,
>>IMHO.  The job scheduler is _already_ storing the JobID in the process
>>environment so that it is available to all of the threads running as part
>>of the job.  The question is how the job prolog script can communicate
>>the JobID directly to Lustre without using a global /proc file?  Doing an
>>upcall to userspace per JobID lookup is going to be *worse* for
>>performance than the current searching through the process environment.
>>
>>I'm not against Ben's proposal to implement a cache in the kernel for
>>different processes.  It is unfortunate that we can't have proper
>>thread-local storage for Lustre, so a hash table is probably reasonable
>>for this (there may be thousands of threads involved).  I don't think the
>>cl_env struct would be useful, since it is not tied to a specific thread
>>(AFAIK), but rather assigned as different threads enter/exit kernel
>>context.  Note that we already have similar time-limited caches for the
>>identity upcall and FMD (lustre/ofd/ofd_fmd.c), so it may be useful to
>>see whether the code can be shared.
>
>I'll take a look at those, but implementing the hash table was a pretty
>simple solution.  I need to work out a few kinks with memory leaks before
>doing real performance tests on it to make sure it performs similarly to
>nodelocal.
>
>>Another (not very nice) option to avoid looking through the environment
>>variables (which IMHO isn't so bad, even though the upstream folks don't
>>like it) is to associate the JobID set via /proc with a process group
>>internally and look the PGID up in the kernel to find the JobID.  That
>>can be repeated each time a new JobID is set via /proc, since the PGID
>>would stick around for each new job/shell/process created under the PGID.
>> It won't be as robust as looking up the JobID in the environment, but
>>probably good enough for most uses.
>>
>>I would definitely also be in favor of having some way to fall back to
>>procname_uid if the PGID cannot be found, the job environment variable is
>>not available, and there is nothing in nodelocal.
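The fallback described above can be sketched as a simple priority chain in userspace C. Everything here is hypothetical: the jobid_sources structure, the resolve_jobid() name, and the order in which the PGID, environment, and nodelocal sources are tried are assumptions for illustration, and the "procname.uid" output format is assumed to match what procname_uid produces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define JOBID_SIZE 32   /* assumed stand-in for LUSTRE_JOBID_SIZE */

/* Hypothetical inputs for one lookup; NULL or "" means "not available". */
struct jobid_sources {
    const char  *pgid_jobid;       /* JobID associated with the PGID */
    const char  *env_jobid;        /* JobID from the job environment variable */
    const char  *nodelocal_jobid;  /* JobID set node-wide via /proc */
    const char  *procname;         /* process name, always present */
    unsigned int uid;              /* user ID, always present */
};

static int have(const char *s)
{
    return s != NULL && s[0] != '\0';
}

/* Use the first available source; fall back to procname.uid if none is set. */
void resolve_jobid(const struct jobid_sources *src, char *out, size_t len)
{
    if (have(src->pgid_jobid))
        snprintf(out, len, "%s", src->pgid_jobid);
    else if (have(src->env_jobid))
        snprintf(out, len, "%s", src->env_jobid);
    else if (have(src->nodelocal_jobid))
        snprintf(out, len, "%s", src->nodelocal_jobid);
    else
        snprintf(out, len, "%s.%u", src->procname, src->uid);
}
```

With this shape a lookup always produces some identifier, so I/O from processes outside any scheduled job is still attributed to a process name and user rather than dropped.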
>
>That's simple enough.
>