[lustre-devel] [PATCH v2 33/33] lustre: update version to 2.9.99

Andreas Dilger adilger at whamcloud.com
Wed Jan 9 15:16:19 PST 2019


On Jan 9, 2019, at 11:28, James Simmons <jsimmons at infradead.org> wrote:
> 
> 
>>>> This might be because the upstream Lustre doesn't allow setting a per-process
>>>> JobID via environment variable, only as a single per-node value.  The really
>>>> unfortunate part is that the "get JobID from environment" code actually works
>>>> on every reasonable architecture (even the one which was originally broken
>>>> fixed it), but it got yanked anyway.  This is actually one of the features
>>>> of Lustre that lots of HPC sites like to use, since it allows them to track
>>>> on the servers which users/jobs/processes on the client are doing IO.
>>> 
>>> To give background for Neil see thread:
>>> 
>>> https://lore.kernel.org/patchwork/patch/416846
>>> 
>>> In this case I do agree with Greg. The latest jobid code does implement an
>>> upcall, and upcalls don't play nice with containers. There is also the
>>> namespace issue pointed out. I think the namespace issue might be fixed
>>> in the latest OpenSFS code.
>> 
>> I'm not sure what you mean?  AFAIK, there is no upcall for JobID, except
>> maybe in the kernel client where we weren't allowed to parse the process
>> environment directly.  I agree an upcall is problematic with namespaces,
>> in addition to being less functional (only a JobID per node instead of
>> per process), which is why direct access to JOBENV is better IMHO.
> 
> I have some evil ideas about this. Need to think about it some more since
> this is a more complex problem.

Since the environment variables set via getenv() and setenv() are just process
memory that the kernel already exposes (e.g. through /proc/<pid>/environ),
I honestly don't see why accessing them directly is a huge issue?
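
To illustrate (just a sketch, not the Lustre implementation): the same
NUL-separated "NAME=value" data that getenv() sees is visible to anything
that can read /proc/<pid>/environ, so a hypothetical userspace helper to
recover a per-process JobID could look like this (the SLURM_JOB_ID name in
the comment is only an example variable):

#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Look up "var" in the environment of process "pid" by scanning
 * /proc/<pid>/environ.  Returns a pointer into a static buffer
 * (valid until the next call), or NULL if not found.
 * e.g. jobid_from_environ(pid, "SLURM_JOB_ID") */
static char *jobid_from_environ(pid_t pid, const char *var)
{
	static char buf[65536];
	char path[64];
	size_t n, off = 0, vlen = strlen(var);
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/environ", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return NULL;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	/* entries are NUL-separated "NAME=value" strings */
	while (off < n) {
		char *entry = buf + off;

		if (!strncmp(entry, var, vlen) && entry[vlen] == '=')
			return entry + vlen + 1;
		off += strlen(entry) + 1;
	}
	return NULL;
}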

>>> The whole approach to stats in lustre is
>>> pretty awful. Take jobstats for example. Currently the approach is
>>> to poll inside the kernel at specific intervals. Part of the polling is
>>> scanning the running processes' environment space. On top of this the
>>> administrator ends up creating scripts to poll the proc / debugfs entry.
>>> Other types of lustre stat files take a similar approach. Scripts have
>>> to poll debugfs / procfs entries.
>> 
>> I think that issue is orthogonal to getting the actual JobID.  That is
>> the stats collection from the kernel.  We shouldn't be inventing a new
>> way to process that.  What does "top" do?  It reads a thousand /proc files
>> every second, because that is flexible for different use cases.  There
>> are far fewer Lustre stats files on a given node, and I haven't heard
>> that the actual stats reading interface is a performance issue.
> 
> Because the policy for the Linux kernel is not to add non-process-related
> information to procfs anymore. "top" reads process information
> from procfs, which is okay.

The location of the stats (procfs vs. sysfs vs. debugfs) wasn't my point.
My point was that a *very* core kernel performance monitoring utility is
doing open/read/close on virtual kernel files, so we should think twice
before we go ahead and invent our own performance monitoring framework
(which may be frowned upon by upstream for arbitrary reasons because it
isn't using /proc or /sys files).
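
For reference, the scraping pattern in question is nothing more than this
(a minimal sketch; the stats path is an example and depends on whether the
files live under /proc/fs/lustre or /sys/kernel/debug/lustre on a given
kernel):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char line[512];

	/* same open/read/close loop "top" does against /proc,
	 * once a second, against a single Lustre stats file */
	for (;;) {
		FILE *f = fopen("/sys/kernel/debug/lustre/llite/"
				"example-client/stats", "r");

		if (f) {
			while (fgets(line, sizeof(line), f))
				fputs(line, stdout);
			fclose(f);
		}
		sleep(1);
	}
	return 0;
}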

> This means the stats lustre generates are
> required to be placed in debugfs. The problem there is that you need to be
> root to access this information.

That is a self-inflicted problem caused by the upstream kernel policy to
move the existing files out of /proc while being unable to use /sys either.

> I told the administrator about this
> and they told me in no way will they run an application as root just
> to read stats. We really don't want to require users to mount their
> debugfs partitions to allow non-root users to access it. So I looked
> into alternatives. Actually with netlink you have far more power for
> handling stats than polling some proc file. Also, while in most cases
> the stat files are not huge, if we do end up having a stats
> seq_file with a huge amount of data then polling that file can really
> spike the load on a node.

I agree it is not ideal.  One option (AFAIK) would be a udev rule that
changes the /sys/kernel/debug/lustre/* files to be readable by a
non-root group (e.g. admin or perftools or whatever) for the collector.
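
Something along these lines, untested, and purely a sketch: the
SUBSYSTEM=="lustre" match and the "perftools" group name are assumptions,
and debugfs must already be mounted when the rule fires:

SUBSYSTEM=="lustre", ACTION=="add", \
  RUN+="/bin/sh -c 'chgrp -R perftools /sys/kernel/debug/lustre; chmod -R g+rX /sys/kernel/debug/lustre'"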

>>> I have been thinking about what would be a better approach, since I'd
>>> like to tackle this problem in the 2.13 time frame. Our admins at my work
>>> place want to be able to collect application stats without being root.
>>> So placing stats in debugfs is not an option, which is what we currently
>>> do on the Linux client :-( The stats are not a good fit for sysfs. The
>>> solution I have been pondering is using netlink. Since netlink is socket
>>> based it can be treated as a pipe. Now you are thinking you still need to
>>> poll on the netlink socket, but you don't have to. systemd does it for
>>> you :-)  We can create a systemd service file which uses
>> 
>> For the love of all that is holy, do not make Lustre stats usage depend
>> on Systemd to be usable.
> 
> I never write code that locks in one approach, ever. Take for example the
> lctl conf_param / set_param -P handling with the move to sysfs. Instead
> of the old upcall method to lctl, now we have a udev rule. That rule is not
> law!!! A site could create their own udev rule if they want to, say, log
> changes to the lustre tunables. Keep in mind udev rules need to be simple,
> since they block until completed, much like upcalls do. If you want to
> run a heavy application you can create a systemd service to handle the
> tunable uevent. If you are really clever you can use dbus to send
> the tuning event to external nodes. There are many creative options.
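
(For the blocking concern, one known udev/systemd hand-off pattern is to
tag the event and let systemd start a long-running unit, so the rule
itself returns immediately.  A sketch, with a made-up unit name and the
same assumed SUBSYSTEM match as above:

SUBSYSTEM=="lustre", ACTION=="change", TAG+="systemd", \
  ENV{SYSTEMD_WANTS}+="lustre-tunable-log.service"
)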
> 
> The same is true with the stats netlink approach. I was up late last night
> pondering a design for the netlink stats. I have to put together a list
> of my ideas and run it by my admins. So no, systemd is not a hard
> requirement. Just an option for people into that sort of thing. Using
> udev and netlink opens up a whole new stack to take advantage of.
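
As a concrete sketch of what a userspace listener on such a socket could
look like: the kernel pushes messages, the collector just blocks in recv()
with no polling (and systemd socket activation could own the fd instead).
The NETLINK_USERSOCK protocol and multicast group 1 here are placeholders,
not an existing Lustre interface:

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

int main(void)
{
	/* group 1 is a placeholder; a real interface would define its own */
	struct sockaddr_nl sa = {
		.nl_family = AF_NETLINK,
		.nl_groups = 1,
	};
	char buf[8192];
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_USERSOCK);

	if (fd < 0 || bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		perror("netlink");
		return 1;
	}
	for (;;) {
		ssize_t len = recv(fd, buf, sizeof(buf), 0);
		struct nlmsghdr *nh;

		if (len <= 0)
			break;
		/* walk the netlink messages in this datagram */
		for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
		     nh = NLMSG_NEXT(nh, len))
			printf("stats payload: %.*s\n",
			       (int)(nh->nlmsg_len - NLMSG_HDRLEN),
			       (char *)NLMSG_DATA(nh));
	}
	close(fd);
	return 0;
}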

Sorry to be negative, but I was just having fun with systemd over the
weekend on one of my home systems, and I really don't want to entangle
it into our stats.  If the existing procfs/sysfs/debugfs scraping will
continue to work in the future then I'm fine with that.

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud
