[lustre-discuss] Issue running IOR

Kurt Strosahl strosahl at jlab.org
Fri Sep 2 07:40:02 PDT 2016


Good Afternoon,

  I'd asked earlier about other bandwidth tests that could be run after hitting an issue with IOR, and it was suggested that I bring the IOR problem itself here.

The setup is as follows...

We have a cluster of Intel KNL nodes that communicate over an Omni-Path fabric and reach Lustre through a pair of LNet gateways that bridge to an InfiniBand network.  All of these systems run CentOS 7.2; the routers run the stock kernel, while the KNL nodes run a kernel provided by Intel.  If I mount the Lustre file system on one of the LNet routers (or on a KNL node running the stock CentOS kernel), IOR works, but once I install the KNL kernel IOR breaks.

KNL node kernel: 3.10.0-327.13.1.el7.xppsl_1.4.0.3211.x86_64
LNet router kernel: 3.10.0-327.el7.x86_64
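
For illustration, the LNet side of that layout looks roughly like the sketch below; the interface names and NIDs are placeholders rather than our actual configuration:

 # /etc/modprobe.d/lnet.conf on a KNL compute node -- the OPA fabric is o2ib1,
 # the InfiniBand network behind the two gateways is o2ib0:
 options lnet networks="o2ib1(ib0)" routes="o2ib0 192.168.0.[21-22]@o2ib1"

 # Sanity checks from the node:
 lctl list_nids     # local NID on the OPA-side network
 lctl show_route    # should list both gateways as routes to o2ib0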

The OPA (Omni-Path Architecture) software stack comes with its own Open MPI build, which has been compiled against that special kernel.
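
In case it helps, these are the kinds of checks that confirm which Open MPI build the IOR binary actually picked up (output omitted here):

 which mpicc                # wrapper compiler on the PATH
 mpirun --version           # launcher that will be used at run time
 ldd ./IOR | grep -i mpi    # libmpi the binary is linked against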

When I compile IOR on these nodes and run it, I get the error below (even for a single-node test).

 xxxxxx ~> ./IOR
--------------------------------------------------------------------------
Error obtaining unique transport key from ORTE (orte_precondition_transports not present in
the environment).

  Local host: xxxxxxx
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[xxxxxx:79453] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
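
For reference, that output is from invoking the binary directly as a singleton; a normal multi-process run would look something like the line below (the block/transfer sizes and output path are placeholders, not our actual test parameters):

 mpirun -np 16 ./IOR -a POSIX -w -r -F -b 1g -t 1m -o /lustre/scratch/ior.testfile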

xxxxxx ~> mpicc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) 

xxxxxxx ~> rpm -qa | grep openmpi
openmpi-1.10.0-10.el7.x86_64
mpitests_openmpi_intel_hfi-3.2-930.x86_64
mpitests_openmpi_gcc-3.2-930.x86_64
openmpi_gcc_hfi-1.10.2-8.x86_64
openmpi_intel_hfi-1.10.2-8.x86_64
openmpi_pgi_hfi-1.10.2-8.x86_64
openmpi_gcc-1.10.2-8.x86_64
mpitests_openmpi_gcc_hfi-3.2-930.x86_64
mpitests_openmpi_pgi_hfi-3.2-930.x86_64

w/r,
Kurt J. Strosahl
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility

