[lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support

NeilBrown neilb at suse.com
Thu Jul 5 20:11:42 PDT 2018


On Fri, Jul 06 2018, James Simmons wrote:

>> NeilBrown [mailto:neilb at suse.com] wrote:
>> 
>> To help contextualize things: the Lustre code can be decomposed into three parts:
>> 
>> 1) The filesystem proper: Lustre.
>> 2) The communication protocol it uses: LNet.
>> 3) Supporting code used by Lustre and LNet: CFS.
>> 
>> Part of the supporting code is the CPT mechanism, which provides a way to
>> partition the CPUs of a system. These partitions are used to distribute queues,
>> locks, and threads across the system. It was originally introduced years ago, as
>> far as I can tell mainly to deal with certain hot locks: these were converted into
>> read/write locks with one spinlock per CPT.
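>> 
>> As a hedged illustration (not the actual libcfs code; struct cpt_rwlock
>> and MAX_CPTS are names made up for the sketch), the idea is that readers
>> take only their own partition's spinlock, while a writer takes all of them:
>> 
>> 	#include <linux/spinlock.h>
>> 
>> 	struct cpt_rwlock {
>> 		spinlock_t	lock[MAX_CPTS];	/* one lock per partition */
>> 		int		ncpts;
>> 	};
>> 
>> 	/* "read" side: contend only with CPUs in the same partition */
>> 	static void cpt_read_lock(struct cpt_rwlock *l, int cpt)
>> 	{
>> 		spin_lock(&l->lock[cpt]);
>> 	}
>> 
>> 	/* "write" side: take every partition's lock, in index order
>> 	 * (real code would also need lockdep nesting annotations) */
>> 	static void cpt_write_lock(struct cpt_rwlock *l)
>> 	{
>> 		int i;
>> 
>> 		for (i = 0; i < l->ncpts; i++)
>> 			spin_lock(&l->lock[i]);
>> 	}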
>> 
>> As a general rule, CPT boundaries should respect node and socket boundaries,
>> but at the higher end, where CPUs have 20+ cores, it may make sense to split
>> a CPU's cores across several CPTs.
>> 
>> > Thanks everyone for your patience in explaining things to me.
>> > I'm beginning to understand what to look for and where to find it.
>> > 
>> > So the answers to Greg's questions:
>> > 
>> >   Where are you reading the host memory NUMA information from?
>> > 
>> >   And why would a filesystem care about this type of thing?  Are you
>> >   going to now mirror what the scheduler does with regards to NUMA
>> >   topology issues?  How are you going to handle things when the topology
>> >   changes?  What systems did you test this on?  What performance
>> >   improvements were seen?  What downsides are there with all of this?
>> > 
>> > 
>> > Are:
>> 
>> >   - NUMA info comes from ACPI or device-tree just like for everyone
>> >     else.  Lustre just uses node_distance().
>> 
>> Correct, the standard kernel interfaces for this information are used to
>> obtain it, so ultimately Lustre/LNet uses the same source of truth as
>> everyone else.
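>> 
>> For illustration, the standard interface is simple to use; a minimal
>> sketch (closest_node() is a made-up helper, not Lustre code):
>> 
>> 	#include <linux/kernel.h>
>> 	#include <linux/topology.h>
>> 	#include <linux/nodemask.h>
>> 
>> 	/* return the candidate node with the smallest NUMA distance
>> 	 * from 'from', as reported by firmware (ACPI SLIT or DT) */
>> 	static int closest_node(int from, const nodemask_t *candidates)
>> 	{
>> 		int node, best = NUMA_NO_NODE, best_dist = INT_MAX;
>> 
>> 		for_each_node_mask(node, *candidates) {
>> 			int dist = node_distance(from, node);
>> 
>> 			if (dist < best_dist) {
>> 				best_dist = dist;
>> 				best = node;
>> 			}
>> 		}
>> 		return best;
>> 	}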
>> 
>> >   - The filesystem cares about this because...  It has service
>> >     threads that do part of the work of some filesystem operations
>> >     (handling replies, for example) and these are best handled "near"
>> >     the CPU that initiated the request.  Lustre partitions
>> >     all CPUs into "partitions" (cpt), each with a few cores.
>> >     If the request thread and the reply thread are on different
>> >     CPUs but in the same partition, then we get the best throughput
>> >     (is that close?)
>> 
>> At the filesystem level, it does indeed seem to help to have the service
>> threads that do work for a request run on a different core that is close
>> to the core that originated the request: preferably on the same CPU, and
>> on certain multi-core CPUs there are also distance effects between cores.
>> That too is one of the things the CPT mechanism handles.
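>> 
>> A hedged sketch of that kind of placement using stock kernel primitives
>> (start_service_thread() is hypothetical, and a partition is approximated
>> here by a NUMA node's cpumask):
>> 
>> 	#include <linux/kthread.h>
>> 	#include <linux/cpumask.h>
>> 
>> 	static struct task_struct *
>> 	start_service_thread(int node, int (*fn)(void *), void *arg)
>> 	{
>> 		struct task_struct *t;
>> 
>> 		/* allocate the thread's structures on the right node */
>> 		t = kthread_create_on_node(fn, arg, node, "svc/%d", node);
>> 		if (IS_ERR(t))
>> 			return t;
>> 		/* only let it run on that node's CPUs */
>> 		kthread_bind_mask(t, cpumask_of_node(node));
>> 		wake_up_process(t);
>> 		return t;
>> 	}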
>
> There is another very important aspect to why Lustre has a CPU partition
> layer, at least at the place where I work. While the Linux kernel manages
> all the NUMA nodes and CPU cores, Lustre adds the ability for us to
> specify a subset of everything on the system. The reason is to limit the
> impact of noise on the compute nodes. Noise has a heavy impact on
> large-scale HPC workloads that can run for days or even weeks at a time.
> Let's take an example system:
>
>                |-------------|     |-------------|
>    |-------|   | NUMA  0     |     | NUMA  1     |   |-------|
>    | eth0  | - |             | --- |             | - | eth1  |      
>    |_______|   | CPU0  CPU1  |     | CPU2  CPU3  |   |_______|
>                |_____________|     |_____________|
>
> In such a system it is possible, with the right job scheduler, to start a
> large parallel application on NUMA 0 (CPU0 and CPU1). Normally such
> large parallel applications will communicate between nodes using MPI,
> such as Open MPI, which can be configured to use eth0 only. Using the
> CPT layer in Lustre we can isolate Lustre to NUMA 1 and use only eth1.
> This greatly reduces the noise impact on the running application.
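>
> Concretely, the isolation is driven by module options; from memory
> (treat this as illustrative, the exact syntax may differ between
> Lustre versions):
>
> 	# /etc/modprobe.d/lustre.conf
> 	# one CPU partition built from NUMA node 1 only
> 	# ("N" means the bracketed numbers are NUMA node ids)
> 	options libcfs cpu_pattern="N 0[1]"
> 	# and make LNet use eth1 only
> 	options lnet networks="tcp0(eth1)"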
>
> BTW this is one of the reasons ko2iblnd for Lustre doesn't use the
> generic RDMA API: the core IB layer doesn't support such isolation,
> at least to my knowledge.

Thanks for that background (and for the separate explanation of how
jitter multiplies when jobs need to synchronize periodically).

I can see that setting CPU affinity for lustre/lnet worker threads could
be important, and that it can be valuable to tie services to a
particular interface.  I cannot yet see why we need partitions for this,
rather than doing it at the CPU (or NODE) level.
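
For what it's worth, a sketch of what I mean by the NODE level, using
only generic facilities (pin_to_node() is a hypothetical helper):

	#include <linux/sched.h>
	#include <linux/cpumask.h>

	/* confine an existing worker thread to one NUMA node,
	 * with no partition table involved */
	static int pin_to_node(struct task_struct *task, int node)
	{
		return set_cpus_allowed_ptr(task, cpumask_of_node(node));
	}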

Thanks,
NeilBrown