[Lustre-devel] protocol backofs

Mon Mar 16 15:13:18 PDT 2009

On Mon, Mar 16, 2009 at 01:41:40PM -0700, Andrew C. Uselton wrote:
> Howdy Isaac,
>    Nice to meet you.  As Eric suggested I am also cc:ing Nick Henke, 
> since he might find this an interesting discussion.  For all you 
> lustre-devel dwellers out there, feel free to chime in.

Hi Andrew.  Yes, there is no way to avoid me...  I don't have too much
information about Lustre but I can tell you a bit about Madbench and
MPI-IO.

> b)  Why is the contention introduced only in the MPI-I/O test and not in 
> the POSIX test?  Does the MPI-I/O from Cray's xt-mpt/3.1.0 divert I/O to 
> a subset of nodes so that all the I/O is going through a smaller section 
> of the torus?

Cray's MPI-IO is old enough that it's doing "generic unix" file system
operations.  (I've committed the optimized Lustre driver, but it will
take some time for it to end up on a Cray). 

Madbench is doing independent I/O, though, so optimized or no, there
is no "aggregation" -- it's a shame, too, as it sounds like
aggregation would at least rule out your contention theory.  

You've essentially written this up on your website already, but for
the wider lustre-devel audience, the MPI-IO in Madbench is dead
simple: 

MPI_File_seek
MPI_File_read or MPI_File_write (or the nonblocking versions)
MPI_Barrier

This is *almost* an exact correspondence to the POSIX case:

fseeko64
fread or fwrite
fclose

Did you see the difference?  I know you did because you wrote
http://www.nersc.gov/~uselton/sf-mpi.html
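
For readers who have not looked at the benchmark, the POSIX-side step
can be sketched roughly as below.  This is only an illustration, not
Madbench's actual code: the helper names stdio_write_at and
stdio_read_at are made up here, and it uses fseeko (with a 64-bit
off_t, e.g. built with -D_FILE_OFFSET_BITS=64 on 32-bit systems) as a
stand-in for fseeko64.  The point is that the data goes through a
stdio FILE stream, so libc sits between the application and the
read()/write() system calls.

```c
#define _POSIX_C_SOURCE 200809L  /* for fseeko and off_t */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: one "seek, then write" step through stdio.
 * Returns 0 on success, -1 on failure. */
static int stdio_write_at(FILE *fp, off_t offset, const void *buf, size_t len)
{
    assert(fp != NULL && buf != NULL);
    if (fseeko(fp, offset, SEEK_SET) != 0)
        return -1;
    /* fwrite hands the data to libc, which may buffer it before write(). */
    if (fwrite(buf, 1, len, fp) != len)
        return -1;
    return 0;
}

/* Hypothetical helper: one "seek, then read" step through stdio.
 * The fseeko call also flushes any pending output on the stream. */
static int stdio_read_at(FILE *fp, off_t offset, void *buf, size_t len)
{
    assert(fp != NULL && buf != NULL);
    if (fseeko(fp, offset, SEEK_SET) != 0)
        return -1;
    if (fread(buf, 1, len, fp) != len)
        return -1;
    return 0;
}
```

In the benchmark each rank would call something like this once per
operation on its own region of the file, with whatever
synchronization the application does afterwards.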

How big is an individual Madbench I/O operation for you?  We ran some
I/O tests with Madbench on our Blue Gene that showed about 20 MB per
operation -- large enough that I'd be surprised if the libc buffering
was having much effect.
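
If you wanted to take libc buffering out of the picture entirely, one
possible experiment (a sketch, not something from the benchmark; the
helper name disable_stdio_buffering is invented for illustration) is
to switch the stream to unbuffered mode right after opening it and
see whether the POSIX rates change:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: turn off stdio buffering on a freshly opened
 * stream.  setvbuf has to be called before any other operation on
 * the stream.  Returns 0 on success, nonzero on failure. */
static int disable_stdio_buffering(FILE *fp)
{
    assert(fp != NULL);
    return setvbuf(fp, NULL, _IONBF, 0);
}
```

With the stream unbuffered, each fread or fwrite should go more or
less straight to the underlying read() or write(), which is closer to
what a "generic unix" MPI-IO driver would be doing.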

So, off the top of my head I don't have too many ideas from an MPI-IO
perspective.  Your graphs suggest irregular performance on Franklin
for both reads and writes
(http://www.nersc.gov/~uselton/frank_jag/20090215183709/rate.png), so
that kind of rules out interference from the lock manager.

To me, your contention idea is still in play.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B