[lustre-devel] Proposal for JobID caching

Tue Feb 28 13:17:54 PST 2017

On Feb 28, 2017, at 09:23, Ben Evans <bevans at cray.com> wrote:
> 
> 
> 
> On 2/16/17, 5:30 PM, "Dilger, Andreas" <andreas.dilger at intel.com> wrote:
> 
>> On Feb 16, 2017, at 07:36, Ben Evans <bevans at cray.com> wrote:
>>> 
>>> 
>>> 
>>> On 2/7/17, 6:01 PM, "Dilger, Andreas" <andreas.dilger at intel.com> wrote:
>>> 
>>>> On Feb 2, 2017, at 08:20, Ben Evans <bevans at cray.com> wrote:
>>>>> 
>>>>> https://review.whamcloud.com/#/c/25208/ is a working version of what I
>>>>> had proposed, including the suggested changes to default to
>>>>> procname_uid.  This is not perfect, but the performance is much
>>>>> improved over the current methods, and unlike inode-based caching,
>>>>> metadata performance isn't negatively affected.  Multiple simultaneous
>>>>> jobs can be run on the same file, and get appropriate metrics.
>>>> 
>>>> I reviewed the patch, and one question that I had is whether you've
>>>> tested that the JobID is correct when read/write RPCs are generated by
>>>> readahead or ptlrpcd.  That may be more relevant once the async
>>>> readahead threads are implemented by Dmitry.  With an inode-based JobID
>>>> cache, the JobID can (usually) be correctly determined even if the RPC
>>>> is not generated in the context of the user process.
>>>> 
>>>> I don't think that is necessarily a fault in your patch, but it may be
>>>> that the JobID determination hasn't kept pace with other changes in the
>>>> code.  It would be great if you would verify (possibly with a test
>>>> attached to your patch) that JobID is assigned to all the RPCs that
>>>> need it.
>>> 
>>> I've seen some Lustre thread names pop into the JobID under the
>>> procname_uid scheme when doing something like a dd test.  Filtering them
>>> out would be relatively straightforward, and keeping the old JobID (if
>>> available) in the lookup table would be the way to get the most reliable
>>> info.  There shouldn't be a difference from the current behavior in this
>>> regard.
>>> 
>>> My issue with putting the information in the inode stems from two cases.
>>> The first is RobinHood, which stats *everything*.  In the proposed
>>> solution, one lookup would be done every 30 seconds.  Storing it in the
>>> inode, a lookup would happen for every stat, and the result would never
>>> be used again.
>>> 
>>> The other case is less probable, but still out there: in an environment
>>> with multiple jobs per node, you may be running two different jobs on
>>> the same input set, which would corrupt the counting.
>> 
>> If there are two jobs using the same input files, I suspect the second one
>> would get the data from the client cache, and not log anything on the
>> server at all.  In any case, I don't think that would be any different
>> than if the two jobs were randomly interleaving their access to the same
>> files on the server.
>> 
>> Conversely, having "ptlrpcd/0" appear in the jobstats doesn't really help
>> anyone figure out which user/job is causing IO traffic on the server.  If
>> RPCs generated by ptlrpcd, statahead, and other service threads that do
>> work on behalf of user processes (including readahead in the near future)
>> have the proper JobID then that would be much more useful.
>> 
>> Some suggestions on how to handle this, off the top of my head:
>> - blacklist service thread PIDs at startup in the JobID hash and have them
>>   get the JobID by some other method (e.g. inode, DLM lock/resource, other)
>> - store the JobID explicitly with the IO request when it is being put into
>>   a cache/queue and use this when submitting the RPC if present, otherwise
>>   get it from the hash
>> 
>> The latter may be preferable, since it doesn't need to do anything for
>> sync RPCs generated in process context, and avoids an extra lookup when
>> processing the RPC.  You might consider the first method for debugging
>> when/where such RPCs are generated, and have the blacklisted threads dump
>> a stack once if they are being looked up in the JobID hash.
>> 
>> Cheers, Andreas
> 
> I'm thinking of a combination of approaches:  use the hash as the primary
> source, but populate the inode with the data as well and use it when one
> of the "reserved" names pops up as the JobID.
> 
> For any file access, the open would trigger a JobID lookup, which would
> put the correct info into the hash, and then into the inode.  As the JobID
> is updated the inode's store would also be updated.
> 
> For a lookup, if the table returns ptlrpcd, or any other of the Lustre
> threads, then the inode cache would be used.
> 
> This way, we're doing as few userspace lookups as possible, fixing the
> readahead hole that currently exists, and not having an issue with
> processes like find or robinhood which touch a lot of files.

Yes, this sounds the same as what I was thinking.  It should be possible to
"blacklist" the client threads (ptlrpcd, statahead, ll_ping, wherever we use
kthread_run() on the client).
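
As a rough illustration of that lookup order, here is a minimal userspace
sketch in Python.  It is not the patch and not Lustre code: every name in it
(JobIDCache, Inode, jobid_for_rpc, RESERVED_PREFIXES) is hypothetical, the
prefix list is only the thread names mentioned in this thread, and the
lookup callable stands in for whatever expensive per-process lookup is
really used.  The 30 second lifetime is the figure quoted above.

```python
import time

# Illustrative only: thread names mentioned in this thread, not a real list.
RESERVED_PREFIXES = ("ptlrpcd", "statahead", "ll_ping")


class Inode:
    """Stand-in for an inode that remembers the last real JobID seen."""

    def __init__(self):
        self.jobid = None


class JobIDCache:
    """PID-keyed table; each entry is refreshed after ttl seconds."""

    def __init__(self, lookup, ttl=30.0, clock=time.monotonic):
        self._lookup = lookup   # assumed expensive per-process lookup
        self._ttl = ttl
        self._clock = clock
        self._by_pid = {}       # pid -> (jobid, expiry time)

    def get(self, pid):
        now = self._clock()
        entry = self._by_pid.get(pid)
        if entry is not None and entry[1] > now:
            return entry[0]     # still fresh, no lookup needed
        jobid = self._lookup(pid)
        self._by_pid[pid] = (jobid, now + self._ttl)
        return jobid


def is_reserved(jobid):
    """True if the JobID is really the name of a client service thread."""
    return jobid.startswith(RESERVED_PREFIXES)


def jobid_for_rpc(cache, pid, inode):
    """Table first; fall back to the inode when a reserved name comes back."""
    jobid = cache.get(pid)
    if is_reserved(jobid):
        # RPC issued from a service thread: prefer what the inode remembers.
        return inode.jobid if inode.jobid is not None else jobid
    inode.jobid = jobid         # keep the inode's copy up to date
    return jobid
```

With this shape, a scanner that stats many files costs one lookup per
process per 30 seconds rather than one per inode, and an RPC issued from a
service thread reports the JobID of the last user process that touched the
file.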

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation