[lustre-devel] Proposal for JobID caching
Ben Evans
bevans at cray.com
Thu Jan 19 07:19:35 PST 2017
On 1/18/17, 5:56 PM, "Oleg Drokin" <oleg.drokin at intel.com> wrote:
>
>On Jan 18, 2017, at 5:35 PM, Ben Evans wrote:
>
>>
>>
>> On 1/18/17, 3:39 PM, "Oleg Drokin" <oleg.drokin at intel.com> wrote:
>>
>>>
>>> On Jan 18, 2017, at 3:08 PM, Ben Evans wrote:
>>>
>>>> Overview
>>>> The Lustre filesystem added the ability to track I/O
>>>> performance of a job across a cluster. The initial algorithm was
>>>> relatively simplistic: for every I/O, look up the job ID of the
>>>> process and include it in the RPC being sent to the server. This
>>>> imposed a non-trivial impact on client I/O performance.
>>>> An additional algorithm was introduced to handle the single
>>>> job per node case, where instead of looking up the job ID of the
>>>> process, Lustre simply accesses the value of a variable set through
>>>> the proc interface. This improved performance greatly, but only
>>>> functions when a single job is being run.
>>>> A new approach is needed for multiple job per node systems.
>>>>
>>>> Proposed Solution
>>>>    The proposed solution is to create a small PID->JobID
>>>> table in kernel memory. When a process performs an I/O, a lookup is
>>>> done in the table for the PID. If a JobID exists for that PID, it is
>>>> used; otherwise it is retrieved via the same methods as the original
>>>> Jobstats algorithm and then stored in the table. The existing
>>>> cfs_hash_table structure and functions will be used to implement the
>>>> table.
>>>>
>>>> Rationale
>>>> This reduces the number of calls into userspace, minimizing
>>>> the time taken on each I/O. It also easily supports multiple job per
>>>> node scenarios, and like other proposed solutions has no issue with
>>>> multiple jobs performing I/O on the same file at the same time.
>>>>
>>>> Requirements
>>>> · Performance must not degrade significantly relative to the
>>>> baseline without jobstats
>>>> · Supports multiple jobs per node
>>>> · Coordination with the scheduler is not required, but interfaces
>>>> may be provided
>>>> · Supports multiple PIDs per job
>>>>
>>>> New Data Structures
>>>> struct pid_to_jobid {
>>>>         struct hlist_node pj_hash;   /* hash table linkage */
>>>>         u64               pj_pid;    /* PID doing the I/O */
>>>>         char              pj_jobid[LUSTRE_JOBID_SIZE];
>>>>         spinlock_t        pj_lock;   /* protects pj_jobid and pj_time */
>>>>         time_t            pj_time;   /* last refresh time */
>>>> };
>>>> Proc Variables
>>>> Writing a JobID to /proc/fs/lustre/jobid_name while not in
>>>> "nodelocal" mode will cause all entries in the cache for that JobID
>>>> to be removed.
>>>>
>>>> Populating the Cache
>>>>    When lustre_get_jobid is called in the cached mode, a check
>>>> is first done in the cache for a valid PID to JobID mapping for the
>>>> calling process. If none exists, the JobID is retrieved using the same
>>>> mechanisms as the original algorithm and the PID to JobID map is
>>>> populated. If a lookup finds an existing mapping that is more than 30
>>>> seconds old, the JobID is refreshed.
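
To make the lookup/populate path concrete, here is a rough sketch of what I
have in mind. To keep it short it uses the plain kernel hashtable API and a
single global lock rather than cfs_hash and the per-entry pj_lock, and
lustre_get_jobid_from_proc() is just a made-up name standing in for the
existing (uncached) lookup; it is not a real function.

#include <linux/hashtable.h>
#include <linux/ktime.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/string.h>

#define JOBID_CACHE_BITS        8
#define JOBID_REFRESH_SECS      30      /* refresh entries older than this */

static DEFINE_HASHTABLE(jobid_cache, JOBID_CACHE_BITS);
static DEFINE_SPINLOCK(jobid_cache_lock);

/* Stand-in for the existing environment-based lookup (name is made up). */
int lustre_get_jobid_from_proc(char *jobid, size_t len);

static int jobid_cache_get(char *jobid, size_t len)
{
        struct pid_to_jobid *pj;        /* struct from the proposal above */
        u64 pid = current->pid;
        int rc;

        /* Fast path: a fresh entry for this PID already exists. */
        spin_lock(&jobid_cache_lock);
        hash_for_each_possible(jobid_cache, pj, pj_hash, pid) {
                if (pj->pj_pid != pid)
                        continue;
                if (ktime_get_seconds() - pj->pj_time < JOBID_REFRESH_SECS) {
                        strlcpy(jobid, pj->pj_jobid, len);
                        spin_unlock(&jobid_cache_lock);
                        return 0;
                }
                /* Older than 30 seconds: drop the entry and refresh it. */
                hash_del(&pj->pj_hash);
                kfree(pj);
                break;
        }
        spin_unlock(&jobid_cache_lock);

        /* Miss or stale: fall back to the original jobstats lookup. */
        rc = lustre_get_jobid_from_proc(jobid, len);
        if (rc != 0)
                return rc;

        /* Populate the cache; an allocation failure only loses caching. */
        pj = kzalloc(sizeof(*pj), GFP_NOFS);
        if (pj != NULL) {
                pj->pj_pid = pid;
                pj->pj_time = ktime_get_seconds();
                strlcpy(pj->pj_jobid, jobid, sizeof(pj->pj_jobid));
                spin_lock(&jobid_cache_lock);
                hash_add(jobid_cache, &pj->pj_hash, pid);
                spin_unlock(&jobid_cache_lock);
        }
        return 0;
}

The real patch would use the cfs_hash_table helpers for the same logic, with
pj_pid as the hash key.
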
>>>> Purging the Cache
>>>> The cache can be purged of a specific job by writing the
>>>> JobID to the jobid_name proc file. Any items in the cache that are
>>>> more than 300 seconds out of date will also be purged at this time.
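
A similarly hedged sketch of the purge path, under the same assumptions,
triggered when a JobID is written to jobid_name outside of nodelocal mode:

#define JOBID_PURGE_SECS        300     /* drop anything older than this */

static void jobid_cache_purge(const char *purge_jobid)
{
        struct pid_to_jobid *pj;
        struct hlist_node *tmp;
        time64_t now = ktime_get_seconds();
        int bkt;

        spin_lock(&jobid_cache_lock);
        hash_for_each_safe(jobid_cache, bkt, tmp, pj, pj_hash) {
                /* Drop entries for the named job and any stale entries. */
                if (strcmp(pj->pj_jobid, purge_jobid) == 0 ||
                    now - pj->pj_time > JOBID_PURGE_SECS) {
                        hash_del(&pj->pj_hash);
                        kfree(pj);
                }
        }
        spin_unlock(&jobid_cache_lock);
}
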
>>>
>>>
>>> I'd much rather prefer you go with a table that's populated outside of
>>> the kernel somehow.
>>> Let's be realistic, poking around in userspace process environments for
>>> random strings is not such a great idea at all, even though it did look
>>> like a good idea in the past for simplicity reasons.
>>
>> On the upside, there's far less of that going on now, since the results
>> are cached via PID. I'm unaware of a table that exists in userspace
>> that maps PIDs to jobs.
>
>there is not.
>
>>> Similar to nodelocal, we could probably just switch to a method where
>>> you call a particular lctl command that would mark the whole session as
>>> belonging to some job. This might take several forms, e.g. nodelocal
>>> itself could be extended to only apply to the current namespace/container
>>
>> That would make sense, but would add the requirement that each job has
>> its own namespace/container.
>
>Only if you run multiple jobs per node at the same time,
>otherwise just do the nodelocal for the global root namespace.
Agreed, this is supposed to handle the multiple jobs per node case.
>
>>> But if you really do run different jobs in the global namespace, we
>>> could probably just have lctl spawn a shell whose commands would all
>>> be marked as a particular job? Or we can probably trace the parent of
>>> lctl and mark that so that all its children become somehow marked too.
>>
>> One of the things that came up during this is how do you handle a random
>> user who logs into a compute node and runs something like rsync? The
>> more
>
>The current scheme does not handle it either, unless you use nodelocal,
>and then their actions would be attributed to the job currently running
>(not super ideal as well). I imagine there's a legitimate reason for users
>to log into nodes running unrelated jobs?
The current scheme does handle it, if you use the procname_uid setting.
>
>> conditions we place around getting jobstats to function properly, the
>> harder these types of behaviors are to track down. One thing I was
>> thinking was that if jobstats is enabled, the fallback when no JobID
>> can be found is to simply use the procname_uid method, so an admin would
>> see rsync.1234 pop up on the monitoring dashboard.
>
>If you have every job in its own container, then the global namespace
>could be set to "unscheduledcommand-$hostname" or some such and every
>container would get its own jobid.
or simply default to the existing procname_uid setting.
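
For what it's worth, that fallback is trivial: procname_uid is just the
command name plus the UID (the rsync.1234 example above), so something like
this hypothetical helper would cover it:

#include <linux/cred.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/uidgid.h>
#include <linux/user_namespace.h>

/* Hypothetical helper: format a procname_uid style JobID, e.g. "rsync.1234". */
static void jobid_fallback_procname_uid(char *jobid, size_t len)
{
        snprintf(jobid, len, "%s.%u", current->comm,
                 from_kuid(&init_user_ns, current_fsuid()));
}
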
>
>This does require containers of course. Or if we set the id based on the
>process group, then again each job would get its own id and anything
>outside would get some default, which helps you.
>