[Lustre-devel] protocol backoffs

Mon Mar 16 15:41:44 PDT 2009

Robert Latham wrote:
> On Mon, Mar 16, 2009 at 01:41:40PM -0700, Andrew C. Uselton wrote:
>> Howdy Isaac,
...
> 
> Hi Andrew.  Yes, there is no way to avoid me...  I don't have too much
> information about Lustre but I can tell you a bit about Madbench and
> MPI-IO.
> 
Glad to hear from you :)
...
> Cray's MPI-IO is old enough that it's doing "generic unix" file system
> operations.  (I've committed the optimized Lustre driver, but it will
> take some time for it to end up on a Cray). 
> 
I am looking over David Knaak's shoulder even as we speak (electron?).

> Madbench is doing independent I/O, though, so optimized or no, there
> is no "aggregation" -- it's a shame, too, as it sounds like
> aggregation would at least rule out your contention theory.  

When you say "independent" you mean it isn't using MPI "collective" I/O, 
yes?  That is true, just making sure I understand your comment.
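
For anyone following the thread, here is a toy model of that distinction.  It is only a sketch and not how any MPI-IO implementation actually works; the function names and the extent layout are made up for illustration.  With independent I/O every request reaches the file system on its own, while a collective call gives the library a chance to merge adjacent requests first.

```python
def independent_requests(extents):
    """Each rank issues its own request; the file system sees all of them."""
    return sorted(extents)


def aggregated_requests(extents):
    """Merge touching or overlapping (offset, length) extents.

    A stand-in for aggregation, not a model of a real two-phase algorithm.
    """
    merged = []
    for offset, length in sorted(extents):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            last_offset, last_length = merged[-1]
            new_end = max(last_offset + last_length, offset + length)
            merged[-1] = (last_offset, new_end - last_offset)
        else:
            merged.append((offset, length))
    return merged


MB = 1024 * 1024
# Hypothetical layout: four ranks, each writing one 300 MB buffer back to back.
extents = [(rank * 300 * MB, 300 * MB) for rank in range(4)]
print(len(independent_requests(extents)))  # 4 separate requests
print(len(aggregated_requests(extents)))   # 1 merged request
```

In this made-up layout the four 300 MB requests collapse into a single 1200 MB request, which is the kind of change that could help separate a contention effect from everything else.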

> 
> How big is an individual madbench I/O operation for you?  We ran some

I usually run madbench "as large as possible".  That ends up with the 
target buffer for I/O in the 300 MB range.

> 
> So, off the top of my head I don't have too many ideas from an MPI-IO
> perspective.  Your graphs suggest irregular performance on franklin
> for both reads and writes
> (http://www.nersc.gov/~uselton/frank_jag/20090215183709/rate.png), so
> that kind of rules out interference from the lock manager.

There is some variability in the writes (and reads in other tests), but 
the MPI-I/O, middle-phase reads seem to be a special case.  Those delays 
are an order of magnitude higher and do not seem to correspond to any 
I/O activity.  That's why I'm hoping for a protocol backoff induced by 
congestion.  Also note that in that phase, and only in that phase, each 
node has been given 1.2 GB to send to the file and immediately asked to 
read that much back in from a different offset.  I've looked quite 
carefully and none of the I/O is outside its locked range as established 
in the first "writes" phase, so there should be no lock traffic during 
this phase.  In this middle phase, though, there may be extra resource 
contention in kernel space on each node.  So an alternative explanation 
might be a low-probability near-deadlock on those resources, where 
writes are still being drained but reads are already demanding attention.
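
The locked-range check described above can be sketched roughly as follows.  This is a minimal illustration, assuming a trace of (node, offset, length) records and one contiguous locked byte range per node; the function name, the trace format, and the numbers are all hypothetical and are not taken from the actual analysis.

```python
def outside_locked_range(locked, ops):
    """Return the ops not fully contained in their node's locked range.

    locked: dict node -> (start, end) byte range, end exclusive,
            as established during the first writes phase
    ops:    iterable of (node, offset, length) from the middle phase
    """
    bad = []
    for node, offset, length in ops:
        start, end = locked[node]
        if offset < start or offset + length > end:
            bad.append((node, offset, length))
    return bad


MB = 1024 * 1024
# Hypothetical trace: node 0 owns [0, 1200 MB), node 1 owns [1200 MB, 2400 MB).
locked = {0: (0, 1200 * MB), 1: (1200 * MB, 2400 * MB)}
ops = [
    (0, 0, 300 * MB),          # write inside node 0's range
    (0, 900 * MB, 300 * MB),   # read from a different offset, still inside
    (1, 2200 * MB, 300 * MB),  # runs past the end of node 1's range
]
print(outside_locked_range(locked, ops))
```

An empty result would back up the claim of no lock traffic in this phase.  In the made-up trace the third record is flagged because it ends 100 MB past node 1's range.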

> 
> to me, your contention idea is still in play.
> 
> ==rob
> 

I think I forgot to mention:  NERSC is soon planning to extend the 
Franklin I/O resources so they look a lot more like Jaguar's.  When they 
do we'll be able to "do the experiment", in that if the delay disappears 
that argues for contention in the torus getting to the OSSs or in the 
OSSs themselves.  I'm still stumped as to why it would only happen in the 
MPI-I/O case, though.
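
One way to make "the delay disappears" concrete is to count calls that take an order of magnitude longer than the typical call, before and after the upgrade.  The sketch below assumes per-call read times are available as a list of seconds; the threshold factor, the function name, and the sample numbers are illustrative assumptions, not measured data.

```python
from statistics import median


def slow_calls(durations, factor=10.0):
    """Return indices of calls taking more than `factor` times the median."""
    typical = median(durations)
    return [i for i, d in enumerate(durations) if d > factor * typical]


# Hypothetical per-call read times in seconds, not measured data.
before = [1.0, 1.2, 0.9, 14.0, 1.1, 1.0, 13.0]
after = [1.0, 1.1, 0.9, 1.3, 1.2, 1.0, 1.1]
print(slow_calls(before))  # [3, 6]
print(slow_calls(after))   # []
```

The median is used as the baseline because a handful of very long delays barely shifts it, whereas they would pull a mean upward and hide themselves.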
Cheers,
Andrew