[Lustre-discuss] 1.6.4.3 changes CPU mapping in RHEL., and 10x performance loss in one application
Chris Worley
worleys at gmail.com
Sat Apr 5 17:11:32 PDT 2008
Two issues with the RHEL 2.6.9-67.0.4 Lustre kernel vs. the kernel of
the same rev from RH...
1) Need to understand how the patches caused a change in the CPU
mapping in the kernel.
I was running RHEL's 2.6.9-67.0.4 kernel w/o Lustre patches, and the
core to logical processor allocation was (as shown by /proc/cpuinfo):
==============   ==============   Socket
======  ======   ======  ======   L2 cache domain
 0   4   1   5    2   6   3   7   logical processor
After installing the Lustre version of the kernel, the allocation is:
==============   ==============   Socket
======  ======   ======  ======   L2 cache domain
 0   1   2   3    4   5   6   7   logical processor
Why and where was this changed? I have a user who seems to really care.
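The enumeration order can be compared on the two kernels by parsing /proc/cpuinfo directly. A minimal sketch (assuming the x86 "processor", "physical id", and "core id" fields that 2.6.9-era SMP kernels report):

```python
# Sketch: derive the logical-processor -> (socket, core) mapping that
# /proc/cpuinfo reports, to compare enumeration order across kernels.
import os

def parse_cpuinfo(text):
    """Return a list of (logical_cpu, physical_id, core_id) tuples."""
    cpus, cur = [], {}
    for line in text.splitlines():
        if ":" not in line:
            # Blank line ends one processor's stanza.
            if cur:
                cpus.append((cur.get("processor"),
                             cur.get("physical id"),
                             cur.get("core id")))
                cur = {}
            continue
        key, _, val = line.partition(":")
        key, val = key.strip(), val.strip()
        if key in ("processor", "physical id", "core id"):
            cur[key] = int(val)
    if cur:  # last stanza may lack a trailing blank line
        cpus.append((cur.get("processor"),
                     cur.get("physical id"),
                     cur.get("core id")))
    return cpus

if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    for cpu, sock, core in parse_cpuinfo(open("/proc/cpuinfo").read()):
        print("logical %s -> socket %s, core %s" % (cpu, sock, core))
```

Running this under both kernels would make the re-ordering above explicit without eyeballing the raw /proc/cpuinfo output.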
2) One user has an application whose performance took a 10x nosedive
after the change in kernel from RHEL's 2.6.9-67.0.4 to Lustre's kernel
of the same rev.
The application can run with and without MPI; it uses HP-MPI. In a
single-node, 8-processor case, running both with and without MPI shows
the difference... so it's something in HP-MPI, but it's happening even
on a single node (no IB), and it's also peculiar to this app (another
HP-MPI user saw only an 8% degradation in performance after this
kernel change).
Looking at a histogram of the system calls made by the app having
issues, there were lots of calls to: sched_yield, the rt_sig*
functions, getppid, and "select". It also calls "close" a lot with
invalid file descriptors! This app also has >600 threads running (on a
single node!) during its lifespan.
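Given how heavily the app hammers sched_yield, one quick check is whether the yield path itself got more expensive under the patched kernel, as opposed to a change in how the scheduler re-queues yielding threads. A rough micro-benchmark sketch (os.sched_yield wraps the sched_yield(2) syscall; the function name and iteration count are just illustrative):

```python
# Rough micro-benchmark: time N back-to-back sched_yield() calls and
# report the average per-call cost. Comparing this number on the stock
# vs. patched kernel would show whether the yield syscall itself
# changed; a large per-call regression here would point at the kernel
# rather than at HP-MPI.
import os
import time

def time_yields(n=100_000):
    """Return average seconds per sched_yield() call."""
    start = time.perf_counter()
    for _ in range(n):
        os.sched_yield()
    return (time.perf_counter() - start) / n

if __name__ == "__main__" and hasattr(os, "sched_yield"):
    print("sched_yield: %.1f ns/call" % (time_yields() * 1e9))
```

If the per-call cost is unchanged, the regression more likely comes from scheduling behavior around the yields (run-queue placement) interacting with the >600 threads, not from syscall overhead.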
Any idea what happened to this app's performance in the kernel change?
Note that I didn't patch the kernel myself, I'm using the kernel from
the lustre.org web site.
Thanks,
Chris