[lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
NeilBrown
neilb at suse.com
Thu Jul 5 23:13:31 PDT 2018
On Fri, Jul 06 2018, Doug Oucharek wrote:
> When the CPT code was added to LNet back in 2012, it was to address
> one primary case: a need for finer grained locking on metadata
> servers. LNet used to have global locks, and on metadata servers,
> which handle many small messages (high IOPS), much of the worker
> threads' time was spent in spinlocks. So, CPT configuration was added so
> locks/resources could be allocated per CPT. This way, users have
> control over how they want CPTs to be configured and how they want
> resources/locks to be divided. For example, users may want finer
> grained locking on the metadata servers but not on clients. Leaving
> this to be automatically configured by Linux API calls would take this
> flexibility away from the users who, for HPC, are very knowledgeable
> about what they want (i.e. we do not want to protect them from
> themselves).
>
> The CPT support in LNet and LNDs has morphed to encompass more
> traditional NUMA and core affinity performance improvements. For
> example, you can restrict a network interface to a socket (NUMA node)
> which has better affinity to the PCIe lanes that interface is
> connected to. Rather than try to do this sort of thing automatically,
> we have left it to the user to know what they are doing and configure
> the CPTs accordingly.
>
> I think the many changes to the CPT code have really clouded its
> purpose. In summary, the original purpose was finer grained locking,
> and that needs to be maintained as the IOPS requirements of metadata
> servers are paramount.
Thanks for the explanation.
I definitely get that fine-grained locking is a good thing. Lustre is
not alone in this of course.
Even better than fine-grained locking is no locking. That is not often
possible, but this
https://github.com/neilbrown/linux/commit/ac3f8fd6e61b245fa9c14e3164203c1211c5ef6b
is an example of doing exactly that.
For the read/writer usage of CPT locks, RCU is a better approach if it
can be made to work (usually it can) - and it scales even better.
When I was digging through the usage of locks I saw some hash tables.
It seems that a lock protected a whole table. It is usually sufficient
for the lock to just protect a single chain (bit spin-locks can easily
store one lock per chain) and then only for writes - RCU discipline can
allow reads to proceed with only rcu_read_lock().
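Concretely, the shape I have in mind is the kernel's hlist_bl tables.
A sketch (untested, with invented names, and glossing over the
refcounting a real lookup would need to return the object safely):

  #include <linux/list_bl.h>
  #include <linux/rculist_bl.h>
  #include <linux/rcupdate.h>
  #include <linux/hash.h>
  #include <linux/types.h>

  #define OBJ_HASH_BITS   10

  struct obj {
          struct hlist_bl_node    o_hash;
          u64                     o_key;
  };

  static struct hlist_bl_head obj_hash[1 << OBJ_HASH_BITS];

  /* Readers take no lock at all - just an RCU read-side section. */
  static struct obj *obj_lookup(u64 key)
  {
          struct hlist_bl_head *head = &obj_hash[hash_64(key, OBJ_HASH_BITS)];
          struct hlist_bl_node *pos;
          struct obj *obj;

          rcu_read_lock();
          hlist_bl_for_each_entry_rcu(obj, pos, head, o_hash) {
                  if (obj->o_key == key) {
                          rcu_read_unlock();
                          return obj;
                  }
          }
          rcu_read_unlock();
          return NULL;
  }

  /* Writers lock only their own chain - the bit spin-lock lives in
   * the low bit of the chain head, so it costs no extra memory. */
  static void obj_insert(struct obj *obj)
  {
          struct hlist_bl_head *head =
                  &obj_hash[hash_64(obj->o_key, OBJ_HASH_BITS)];

          hlist_bl_lock(head);
          hlist_bl_add_head_rcu(&obj->o_hash, head);
          hlist_bl_unlock(head);
  }
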
Would we still need per-CPT tables once that was in place? I don't know
yet, though per-node seems likely to be sufficient when locking is per-chain.
I certainly wouldn't discard CPTs without replacing them with something
better. Near the top of my list for when I return from vacation
(leaving in a couple of days) will be to look closely at the current
fine-grained locking that you have helped me to see more clearly, and
see if I can make it even better.
Thanks,
NeilBrown
>
> James: The Verbs RDMA interface has very poor support for NUMA/core affinity. I was going to try to devise some patches to address that but have been too busy on other things. Perhaps the RDMA maintainer could consider updating it?
>
> Doug
>
> On Jul 5, 2018, at 8:11 PM, NeilBrown <neilb at suse.com> wrote:
>
> On Fri, Jul 06 2018, James Simmons wrote:
>
> NeilBrown [mailto:neilb at suse.com] wrote:
>
> To help contextualize things: the Lustre code can be decomposed into three parts:
>
> 1) The filesystem proper: Lustre.
> 2) The communication protocol it uses: LNet.
> 3) Supporting code used by Lustre and LNet: CFS.
>
> Part of the supporting code is the CPT mechanism, which provides a way to
> partition the CPUs of a system. These partitions are used to distribute queues,
> locks, and threads across the system. It was originally introduced years ago, as
> far as I can tell mainly to deal with certain hot locks: these were converted into
> read/write locks with one spinlock per CPT.
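>
> The shape of that per-CPT lock is roughly this (a simplified sketch,
> not the actual libcfs code):
>
>   #include <linux/spinlock.h>
>
>   struct percpt_lock {
>           spinlock_t      *pl_locks;      /* one spinlock per CPT */
>           int             pl_ncpts;
>   };
>
>   /* "Read" side: take only your own partition's lock, so threads in
>    * different partitions never contend with each other. */
>   static void percpt_lock_one(struct percpt_lock *pl, int cpt)
>   {
>           spin_lock(&pl->pl_locks[cpt]);
>   }
>
>   /* "Write" side: take every partition's lock, in index order, to
>    * exclude everyone.  Rare and slow, but simple.  (Real code needs
>    * nested-lock annotations to keep lockdep happy.) */
>   static void percpt_lock_all(struct percpt_lock *pl)
>   {
>           int i;
>
>           for (i = 0; i < pl->pl_ncpts; i++)
>                   spin_lock(&pl->pl_locks[i]);
>   }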
>
> As a general rule, CPT boundaries should respect node and socket boundaries,
> but at the higher end, where CPUs have 20+ cores, it may make sense to split
> a CPU's cores across several CPTs.
>
> Thanks everyone for your patience in explaining things to me.
> I'm beginning to understand what to look for and where to find it.
>
> So the answers to Greg's questions:
>
> Where are you reading the host memory NUMA information from?
>
> And why would a filesystem care about this type of thing? Are you
> going to now mirror what the scheduler does with regards to NUMA
> topology issues? How are you going to handle things when the topology
> changes? What systems did you test this on? What performance
> improvements were seen? What downsides are there with all of this?
>
>
> Are:
>
> - NUMA info comes from ACPI or device-tree, just like for everyone
> else. Lustre just uses node_distance().
>
> Correct, the standard kernel interfaces for this information are used to
> obtain it, so ultimately Lustre/LNet uses the same source of truth as
> everyone else.
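>
> For example, "which node is closest" questions reduce to
> node_distance() lookups - an illustrative fragment, not actual
> Lustre code:
>
>   #include <linux/topology.h>
>   #include <linux/nodemask.h>
>   #include <linux/kernel.h>
>
>   /* Find the closest other online NUMA node to @from, using the same
>    * ACPI/DT-derived distance table the scheduler consults. */
>   static int nearest_other_node(int from)
>   {
>           int nid, best = from, best_dist = INT_MAX;
>
>           for_each_online_node(nid) {
>                   if (nid != from && node_distance(from, nid) < best_dist) {
>                           best_dist = node_distance(from, nid);
>                           best = nid;
>                   }
>           }
>           return best;
>   }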
>
> - The filesystem cares about this because... It has service
> threads that do part of the work of some filesystem operations
> (handling replies for example) and these are best handled "near"
> the CPU that initiated the request. Lustre partitions
> all CPUs into "partitions" (CPTs), each with a few cores.
> If the request thread and the reply thread are on different
> CPUs but in the same partition, then we get best throughput
> (is that close?)
>
> At the filesystem level, it does indeed seem to help to have the service
> threads that do work for requests run on a different core that is close to
> the core that originated the request. So preferably on the same CPU, and
> on certain multi-core CPUs there are also distance effects between cores.
> That too is one of the things the CPT mechanism handles.
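>
> Mechanically that comes down to pinning service threads to a
> partition's CPUs, something like the sketch below (reply_worker and
> the mask are made up for illustration):
>
>   #include <linux/kthread.h>
>   #include <linux/cpumask.h>
>   #include <linux/err.h>
>
>   static int reply_worker(void *arg);  /* hypothetical service loop */
>
>   /* Start one service thread restricted to a partition's CPUs, so
>    * replies are handled near the CPU that issued the request. */
>   static struct task_struct *start_cpt_worker(const struct cpumask *mask,
>                                               int cpt)
>   {
>           struct task_struct *t;
>
>           t = kthread_create(reply_worker, NULL, "svc_%02d", cpt);
>           if (!IS_ERR(t)) {
>                   kthread_bind_mask(t, mask);  /* pin before first wakeup */
>                   wake_up_process(t);
>           }
>           return t;
>   }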
>
> There is another very important aspect to why Lustre has a CPU partition
> layer, at least at the place I work. While the Linux kernel manages all
> the NUMA nodes and CPU cores, Lustre adds the ability for us to specify a
> subset of everything on the system. The reason is to limit the impact of
> noise on the compute nodes. Noise has a heavy impact on large scale HPC
> workloads that can run days or even weeks at a time. Let's take an
> example system:
>
> |-------------| |-------------|
> |-------| | NUMA 0 | | NUMA 1 | |-------|
> | eth0 | - | | --- | | - | eth1 |
> |_______| | CPU0 CPU1 | | CPU2 CPU3 | |_______|
> |_____________| |_____________|
>
> In such a system it is possible, with the right job scheduler, to start a
> large parallel application on NUMA 0 (CPU0 and CPU1). Normally such
> large parallel applications will communicate between nodes using MPI,
> such as openmpi, which can be configured to use eth0 only. Using the
> CPT layer in Lustre we can isolate Lustre to NUMA 1 and use only eth1.
> This greatly reduces the noise impact on the running application.
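>
> In configuration terms that isolation is just module options,
> something like this (from memory - the exact syntax is in the
> Lustre manual):
>
>   options libcfs cpu_pattern="N 0[1]"      # one CPT covering NUMA node 1
>   options lnet networks="tcp0(eth1)[0]"    # LNet uses eth1, CPT 0 only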
>
> BTW this is one of the reasons ko2iblnd for Lustre doesn't use the
> generic RDMA API. The core IB layer doesn't support such isolation.
> At least to my knowledge.
>
> Thanks for that background (and for the separate explanation of how
> jitter multiplies when jobs need to synchronize periodically).
>
> I can see that setting CPU affinity for lustre/lnet worker threads could
> be important, and that it can be valuable to tie services to a
> particular interface. I cannot yet see why we need partitions for this,
> rather than doing it at the CPU (or NODE) level.
>
> Thanks,
> NeilBrown