[lustre-devel] Proposal for JobID caching

Tue Feb 28 08:23:30 PST 2017

On 2/16/17, 5:30 PM, "Dilger, Andreas" <andreas.dilger at intel.com> wrote:

>On Feb 16, 2017, at 07:36, Ben Evans <bevans at cray.com> wrote:
>> 
>> 
>> 
>> On 2/7/17, 6:01 PM, "Dilger, Andreas" <andreas.dilger at intel.com> wrote:
>> 
>>> On Feb 2, 2017, at 08:20, Ben Evans <bevans at cray.com> wrote:
>>>> 
>>>> https://review.whamcloud.com/#/c/25208/ is a working version of what I
>>>> had proposed, including the suggested changes to default to procname_uid.
>>>> This is not perfect, but the performance is much improved over the
>>>> current methods, and unlike inode-based caching Metadata performance
>>>> isn't negatively affected.  Multiple simultaneous jobs can be run on the
>>>> same file, and get appropriate metrics.
>>> 
>>> I reviewed the patch, and one question that I had is whether you've
>>> tested if the JobID is correct when read/write RPCs are generated by
>>> readahead or ptlrpcd?  That may be more relevant once the async readahead
>>> threads are implemented by Dmitry.  With an inode-based JobID cache then
>>> the JobID can (usually) be correctly determined even if the RPC is not
>>> generated in the context of the user process.
>>> 
>>> I don't think that is necessarily a fault in your patch, but it may be
>>> that the JobID determination hasn't kept pace with other changes in the
>>> code.  It would be great if you would verify (possibly with a test
>>> attached to your patch) that JobID is assigned to all the RPCs that need
>>> it.
>> 
>> I've seen some lustre thread names pop into the JobID under the
>> procname_uid scheme when doing something like a dd test.  Filtering them
>> out would be relatively straightforward, and keeping the old JobID (if
>> available) in the lookup table would be the way to get the most reliable
>> info.  There shouldn't be a difference with the current behavior in this
>> regard.
>> 
>> My issue with putting the information in the inode stems from 2 cases.
>> The first is RobinHood, which stats *everything*.  In the proposed
>> solution, one lookup would be done every 30 seconds.  Storing the JobID
>> in the inode, a lookup would happen for every stat, and the result would
>> never be used again.
>> 
>> The other case is less probable, but still out there: in an environment
>> with multiple jobs per node, you may be running two different jobs on
>> the same input set, which would corrupt the counting.
>
>If there are two jobs using the same input files, I suspect the second one
>would get the data from the client cache, and not log anything on the
>server at all.  In any case, I don't think that would be any different than
>if the two jobs were randomly interleaving their access to the same files
>on the server.
>
>Conversely, having "ptlrpcd/0" appear in the jobstats doesn't really help
>anyone figure out which user/job is causing IO traffic on the server.  If
>RPCs generated by ptlrpcd, statahead, and other service threads that do
>work on behalf of user processes (including readahead in the near future)
>have the proper JobID then that would be much more useful.
>
>Some suggestions on how to handle this, off the top of my head:
>- blacklist service thread PIDs at startup in the JobID hash and have them
>  get the JobID by some other method (e.g. inode, DLM lock/resource, other)
>- store the JobID explicitly with the IO request when it is being put into
>  a cache/queue and use this when submitting the RPC if present, otherwise
>  get it from the hash
>
>The latter may be preferable, since it doesn't need to do anything for
>sync RPCs generated in process context, and avoids an extra lookup when
>processing the RPC.  You might consider the first method for debugging
>when/where such RPCs are generated, and have the blacklisted threads dump
>a stack once if they are being looked up in the JobID hash.
>
>Cheers, Andreas

I'm thinking a combination of approaches:  Use the hash as the primary
source, but populate the inode with the data as well and use it when one
of the "reserved" names pops up as the JobID.

For any file access, the open would trigger a JobID lookup, which would
put the correct info into the hash, and then into the inode.  As the JobID
is updated the inode's store would also be updated.

For a lookup, if the table returns ptlrpcd, or any of the other Lustre
service threads, then the inode cache would be used.

This way, we're doing as few userspace lookups as possible, fixing the
readahead hole that currently exists, and not having an issue with
processes like find or robinhood which touch a lot of files.
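
A rough userspace sketch of the combined scheme, in Python rather than
kernel C just to show the control flow.  Everything named here is
hypothetical and not taken from the patch: JobIDCache, Inode,
jobid_for_rpc, the lookup_fn callback standing in for the expensive
userspace lookup, and the single "ptlrpcd" entry in the reserved list.
The 30 second lifetime is the refresh interval mentioned earlier in the
thread.

```python
import time

# Assumption: JobIDs that start with a service-thread name are "reserved".
# The real list of Lustre thread names would be longer; this is a placeholder.
RESERVED_PREFIXES = ("ptlrpcd",)

CACHE_TTL = 30.0  # seconds a hash entry stays valid before a fresh lookup


def is_reserved(jobid):
    """True if the JobID looks like it came from a service thread."""
    return jobid.startswith(RESERVED_PREFIXES)


class Inode:
    """Stand-in for the per-inode state; holds the last good JobID seen."""

    def __init__(self):
        self.jobid = None


class JobIDCache:
    def __init__(self, lookup_fn, ttl=CACHE_TTL, clock=time.monotonic):
        self._lookup = lookup_fn  # expensive lookup, called on miss/expiry
        self._ttl = ttl
        self._clock = clock
        self._table = {}          # pid -> (jobid, expiry time)

    def _from_hash(self, pid):
        now = self._clock()
        entry = self._table.get(pid)
        if entry is None or entry[1] <= now:
            entry = (self._lookup(pid), now + self._ttl)
            self._table[pid] = entry
        return entry[0]

    def jobid_for_rpc(self, pid, inode):
        """Hash is the primary source; the inode copy is the fallback."""
        jobid = self._from_hash(pid)
        if is_reserved(jobid):
            # RPC issued by a service thread: prefer the JobID stored on
            # the inode by the last user process, if there is one.
            return inode.jobid if inode.jobid is not None else jobid
        # User-process context: keep the inode's copy up to date.
        inode.jobid = jobid
        return jobid
```

With this shape, a process that only stats files costs at most one
lookup per TTL window, and the inode copy is only consulted when the
hash hands back a reserved name.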

-Ben