[lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
NeilBrown
neilb at suse.com
Thu Jul 5 20:11:42 PDT 2018
On Fri, Jul 06 2018, James Simmons wrote:
>> NeilBrown [mailto:neilb at suse.com] wrote:
>>
>> To help contextualize things: the Lustre code can be decomposed into three parts:
>>
>> 1) The filesystem proper: Lustre.
>> 2) The communication protocol it uses: LNet.
>> 3) Supporting code used by Lustre and LNet: CFS.
>>
>> Part of the supporting code is the CPT mechanism, which provides a way to
>> partition the CPUs of a system. These partitions are used to distribute queues,
>> locks, and threads across the system. It was originally introduced years ago, as
>> far as I can tell mainly to deal with certain hot locks: these were converted into
>> read/write locks with one spinlock per CPT.
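>>
>> A minimal sketch of that locking scheme (hypothetical names, and
>> simplified -- the real libcfs code also deals with allocation and
>> lockdep annotations): readers take only their own partition's
>> spinlock, writers take all of them.
>>
>>     #include <linux/spinlock.h>
>>
>>     struct cpt_lock {
>>         int        cl_ncpts;     /* number of partitions */
>>         spinlock_t cl_locks[];   /* one spinlock per CPT */
>>     };
>>
>>     /* reader: touch only the local partition's lock, so there is
>>      * no cross-partition cacheline bouncing on the fast path */
>>     static void cpt_read_lock(struct cpt_lock *lock, int cpt)
>>     {
>>         spin_lock(&lock->cl_locks[cpt]);
>>     }
>>
>>     static void cpt_read_unlock(struct cpt_lock *lock, int cpt)
>>     {
>>         spin_unlock(&lock->cl_locks[cpt]);
>>     }
>>
>>     /* writer (rare): take every per-CPT lock, always in the same
>>      * order so that concurrent writers cannot deadlock */
>>     static void cpt_write_lock(struct cpt_lock *lock)
>>     {
>>         int i;
>>
>>         for (i = 0; i < lock->cl_ncpts; i++)
>>             spin_lock(&lock->cl_locks[i]);
>>     }
>>
>>     static void cpt_write_unlock(struct cpt_lock *lock)
>>     {
>>         int i;
>>
>>         for (i = lock->cl_ncpts - 1; i >= 0; i--)
>>             spin_unlock(&lock->cl_locks[i]);
>>     }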
>>
>> As a general rule, CPT boundaries should respect node and socket boundaries,
>> but at the higher end, where CPUs have 20+ cores, it may make sense to split
>> a CPU's cores across several CPTs.
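>>
>> For example, via the module options libcfs already exposes (the
>> values here are purely illustrative and the syntax is from memory):
>>
>>     # /etc/modprobe.d/lustre.conf
>>     # explicit layout: four partitions of ten cores each
>>     options libcfs cpu_pattern="0[0-9] 1[10-19] 2[20-29] 3[30-39]"
>>     # or just ask for a count and let libcfs follow NUMA boundaries
>>     options libcfs cpu_npartitions=4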
>>
>> > Thanks everyone for your patience in explaining things to me.
>> > I'm beginning to understand what to look for and where to find it.
>> >
>> > So the answers to Greg's questions:
>> >
>> > Where are you reading the host memory NUMA information from?
>> >
>> > And why would a filesystem care about this type of thing? Are you
>> > going to now mirror what the scheduler does with regards to NUMA
>> > topology issues? How are you going to handle things when the topology
>> > changes? What systems did you test this on? What performance
>> > improvements were seen? What downsides are there with all of this?
>> >
>> >
>> > Are:
>>
>> > - NUMA info comes from ACPI or device-tree just like for everyone
>> > else. Lustre just uses node_distance().
>>
>> Correct, the standard kernel interfaces for this information are used to
>> obtain it, so ultimately Lustre/LNet uses the same source of truth as
>> everyone else.
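>>
>> For instance, choosing the partition "closest" to some NUMA node is
>> just a node_distance() scan; a sketch with hypothetical names
>> (cpt_nodes[] holding the home node of each partition):
>>
>>     #include <linux/topology.h>
>>
>>     static int closest_cpt(int node, const int *cpt_nodes, int ncpts)
>>     {
>>         int best = 0;
>>         int i;
>>
>>         /* node_distance() is the same ACPI/device-tree derived
>>          * metric the rest of the kernel uses */
>>         for (i = 1; i < ncpts; i++)
>>             if (node_distance(node, cpt_nodes[i]) <
>>                 node_distance(node, cpt_nodes[best]))
>>                 best = i;
>>         return best;
>>     }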
>>
>> > - The filesystem cares about this because... It has service
>> > threads that do part of the work of some filesystem operations
>> > (handling replies, for example) and these are best handled "near"
>> > the CPU that initiated the request. Lustre partitions
>> > all CPUs into "partitions" (CPTs), each with a few cores.
>> > If the request thread and the reply thread are on different
>> > CPUs but in the same partition, then we get best throughput
>> > (is that close?)
>>
>> At the filesystem level, it does indeed seem to help to have the service
>> threads that do work for requests run on a different core that is close to
>> the core that originated the request. So preferably on the same CPU, and
>> on certain multi-core CPUs there are also distance effects between cores.
>> That too is one of the things the CPT mechanism handles.
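>>
>> Concretely, the service threads are pinned to the cpumask of their
>> partition; a sketch of the pattern (hypothetical helper, assuming
>> cpt_mask is the partition's cpumask):
>>
>>     #include <linux/kthread.h>
>>     #include <linux/sched.h>
>>     #include <linux/err.h>
>>
>>     static struct task_struct *
>>     start_cpt_thread(int cpt, const struct cpumask *cpt_mask,
>>                      int (*fn)(void *), void *arg)
>>     {
>>         struct task_struct *task;
>>
>>         task = kthread_create(fn, arg, "svc_%02d", cpt);
>>         if (!IS_ERR(task)) {
>>             /* replies are then handled on a core near the one
>>              * that issued the request */
>>             set_cpus_allowed_ptr(task, cpt_mask);
>>             wake_up_process(task);
>>         }
>>         return task;
>>     }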
>
> There is another very important aspect to why Lustre has a CPU partition
> layer, at least at the place I work at. While the Linux kernel manages all
> the NUMA nodes and CPU cores, Lustre adds the ability for us to specify a
> subset of everything on the system. The reason is to limit the impact of
> noise on the compute nodes. Noise has a heavy impact on large scale HPC
> workloads that can run for days or even weeks at a time. Let's take an
> example system:
>
>             |-------------|     |-------------|
> |-------|   |   NUMA 0    |     |   NUMA 1    |   |-------|
> | eth0  | - |             | --- |             | - | eth1  |
> |_______|   | CPU0   CPU1 |     | CPU2   CPU3 |   |_______|
>             |_____________|     |_____________|
>
> In such a system it is possible, with the right job scheduler, to start a
> large parallel application on NUMA 0 (CPU0 and CPU1). Normally such
> large parallel applications will communicate between nodes using MPI,
> such as openmpi, which can be configured to use eth0 only. Using the
> CPT layer in lustre we can isolate lustre to NUMA 1 and use only eth1.
> This greatly reduces the noise impact on the running application.
>
> BTW this is one of the reasons ko2iblnd for lustre doesn't use the
> generic RDMA API: the core IB layer doesn't support such isolation,
> at least to my knowledge.
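>
> For the picture above that would be roughly (from memory, so treat
> the exact option syntax as illustrative):
>
>     # confine lustre/lnet to NUMA 1: one CPT on CPU2 and CPU3
>     options libcfs cpu_pattern="0[2,3]"
>     # and make LNet use eth1 only
>     options lnet networks="tcp0(eth1)"
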
Thanks for that background (and for the separate explanation of how
jitter multiplies when jobs need to synchronize periodically).
I can see that setting CPU affinity for lustre/lnet worker threads could
be important, and that it can be valuable to tie services to a
particular interface. I cannot yet see why we need partitions for this,
rather than doing it at the CPU (or NODE) level.
Thanks,
NeilBrown