[Lustre-discuss] 1.6.4.3 changes CPU mapping in RHEL., and 10x performance loss in one application

Chris Worley worleys at gmail.com
Wed Apr 23 08:34:33 PDT 2008


On Tue, Apr 22, 2008 at 10:25 PM, Aaron Knister <aaron at iges.org> wrote:
> Did you ever find a resolution?

The core mapping change had to do with "noacpi" on my kernel command
line (ACPI, not APIC).  It seems ACPI has a lot to do with core
mapping (not just power management).  It also affected interrupt
distribution/balancing (/proc/interrupts was showing all timer
interrupts handled by CPU0, for example).  ACPI had to both be enabled
in the kernel config and not be disabled on the kernel command line.
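
For anyone checking their own nodes, a couple of quick looks will show
whether ACPI is in play (the paths assume a stock RHEL install):

# cat /proc/cmdline                           # any "noacpi"/"acpi=off" here?
# grep CONFIG_ACPI= /boot/config-`uname -r`   # want CONFIG_ACPI=y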

This did not solve the 2x to 10x performance issue with 1.6.4.3, but I
don't have that problem with 1.6.4.2 on a RHEL 2.6.9-67.0.4 kernel
that I patched manually.  My best guess is that I omitted the Quadrics
patches from my manual patching... maybe they have something to do
with the slowdown.  I have a list of system calls that I believe are
associated with the slowdown... but looking at the CPU counters, the
application takes no more CPU time, only the walltime increases... as
if the kernel is forgetting to schedule the app.  More on this at:

https://bugzilla.lustre.org/show_bug.cgi?id=15478
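
(If you want to reproduce the comparison, standard tools are enough;
the binary name below is just a placeholder:

# /usr/bin/time -v ./my_app     # "User"+"System" stay flat, "Elapsed" grows
# strace -c -f ./my_app         # per-syscall counts/time, to spot suspects

Nothing Lustre-specific there; it's just one way to narrow down a list
of suspect system calls.)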

> And out of curiosity, how did you determine
> that the core to logical processor allocation had changed? I'm trying to
> figure it out in my own set up.

A quick glance at /proc/cpuinfo shows the difference.  The "correct"
case looks like:

# cat /proc/cpuinfo | grep -e processor -e "core id"
processor       : 0
core id         : 0
processor       : 1
core id         : 2
processor       : 2
core id         : 4
processor       : 3
core id         : 6
processor       : 4
core id         : 1
processor       : 5
core id         : 3
processor       : 6
core id         : 5
processor       : 7
core id         : 7

The "incorrect" mapping shows "processor" == "core id" (as it does
above for cpu's 0 and 7... but for all processors).
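
If you want the two numbers side by side for a quick eyeball check,
plain awk over /proc/cpuinfo does it:

# awk '/^processor/ {p=$3} /^core id/ {print p, $4}' /proc/cpuinfo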

I work w/ benchmark clusters (they are only used for benchmarking and
tuning applications), and many users immediately saw the difference in
codes they'd been benchmarking.  Some folks run on fewer cores than
are available per node (e.g. to avoid sharing cache between MPI
processes, or, for some multithreaded apps, precisely because they do
want to share cache).  The optimal MPI CPU mapping for an 8-core
system (at least for this vendor's CPUs) puts logical cores 0 and 1 on
different sockets, with 2 and 3 sharing sockets with 0 and 1 but in
different L2 cache domains.  With ACPI disabled, the logical and
physical mappings were the same.  In the cases where the MPI does
process pinning, the apps were (mostly) okay... but other apps don't
explicitly pin, and, with logical==physical, all four processes ended
up on the same socket and their performance went down.  You could
argue that apps should pin if they care, but you could also argue that
it's nice to have a default CPU mapping that helps apps that don't
pin.
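
If your MPI doesn't do the pinning for you, a small wrapper script is
enough.  This is only a sketch: MPI_LOCAL_RANK is a placeholder for
whatever node-local rank variable your MPI exports, and the core list
assumes the "correct" mapping shown earlier:

#!/bin/sh
# pin_rank.sh -- launch as: mpirun -np 4 pin_rank.sh ./my_app
# One core per rank, spread across sockets/L2 domains: 0, 2, 4, 6.
RANK=${MPI_LOCAL_RANK:-0}   # placeholder: substitute your MPI's variable
CORE=`echo 0 2 4 6 | cut -d' ' -f$((RANK + 1))`
exec taskset -c $CORE "$@"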

Furthermore, others noticed that even w/ proper processor pinning,
using physical processors 0, 2, 4, 6 gave worse results than using
1, 3, 5, 7... this again turned out to be ACPI-related: interrupts
weren't being balanced across the CPUs (look at the "timer" line at
the top of /proc/interrupts and see if they all go to CPU0... that
imbalance will affect performance on MPI apps that use all cores,
too).
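
The check itself is trivial (run it twice a few seconds apart and see
whether only the CPU0 column grows):

# grep timer /proc/interrupts
# sleep 5; grep timer /proc/interrupts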

Hope that helps.

Chris
>
>  -Aaron
>
>
>
>  On Apr 10, 2008, at 2:13 PM, Chris Worley wrote:
>
>
> >
> >
> >
> > On Thu, Apr 10, 2008 at 12:05 PM, Johann Lombardi <johann at sun.com> wrote:
> >
> > > Chris,
> > >
> > >
> > > On Sat, Apr 05, 2008 at 06:11:32PM -0600, Chris Worley wrote:
> > >
> > > > I was running RHEL's 2.6.9-67.0.4 kernel w/o Lustre patches, and the
> > > >
> > >
> > > What is the CPU architecture? x86_64 or IA64?
> > >
> >
> > x86_64.
> >
> >
> > >
> > >
> > >
> > > > core to logical processor allocation was (as shown by /proc/cpuinfo):
> > > >
> > > >
> > > >      =============      =============    Socket
> > > >
> > > >      ====== ======       ======  ======    L2 cache domain
> > > >
> > > >       0    4      1     5          2      6     3     7     logical processor
> > > >
> > > >
> > > > After installing the Lustre version of the kernel, the allocation is:
> > > >
> > > >      =============      =============    Socket
> > > >
> > > >      ======  ======      ======  ======    L2 cache domain
> > > >
> > > >       0     1      2      3        4      5      6    7     logical processor
> > > >
> > >
> > > Hard to believe that one of our patches could cause this.
> > > Have you compared the kernel config files?
> > >
> >
> > This is the default from RedHat vs. the default from
> > downloads.lustre.org.  We didn't rebuild either from scratch.
> >
> > Chris
> >
> > >
> > > Cheers,
> > > Johann
> > >
> > >
> >
> > _______________________________________________
> > Lustre-discuss mailing list
> > Lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
> >
>
>  Aaron Knister
>  Associate Systems Analyst
>  Center for Ocean-Land-Atmosphere Studies
>
>  (301) 595-7000
>  aaron at iges.org
>
>
>
>
>


