[Lustre-devel] protocol backoffs
Andrew C. Uselton
acuselton at lbl.gov
Mon Mar 16 15:41:44 PDT 2009
Robert Latham wrote:
> On Mon, Mar 16, 2009 at 01:41:40PM -0700, Andrew C. Uselton wrote:
>> Howdy Isaac,
...
>
> Hi Andrew. Yes, there is no way to avoid me... I don't have too much
> information about Lustre but I can tell you a bit about Madbench and
> MPI-IO.
>
Glad to hear from you :)
...
> Cray's MPI-IO is old enough that it's doing "generic unix" file system
> operations. (I've committed the optimized Lustre driver, but it will
> take some time for it to end up on a Cray).
>
I am looking over David Knaak's shoulder even as we speak (electron?).
> Madbench is doing independent I/O, though, so optimized or no, there
> is no "aggregation" -- it's a shame, too, as it sounds like
> aggregation would at least rule out your contention theory.
When you say "independent," you mean it isn't using MPI "collective" I/O,
yes? That is true; I just want to make sure I understand your comment.
>
> How big is an individual madbench I/O operation for you? We ran some
I usually run madbench "as large as possible". That ends up with the
target buffer for I/O in the 300 MB range.
>
> So, off the top of my head I don't have too many ideas from an MPI-IO
> perspective. Your graphs suggest irregular performance on franklin
> for both reads and writes
> (http://www.nersc.gov/~uselton/frank_jag/20090215183709/rate.png), so
> that kind of rules out interference from the lock manager.
There is some variability in the writes (and in the reads in other
tests), but the middle-phase MPI-I/O reads seem to be a special case.
Those delays are an order of magnitude larger and do not seem to
correspond to any I/O activity. That's why I'm hoping for a protocol
backoff induced by congestion. Also note that in that phase, and only in
that phase, each node is given 1.2 GB to send to the file and is
immediately asked to read that much back from a different offset. I've
looked quite carefully, and none of the I/O falls outside the locked
ranges established in the first "writes" phase, so there should be no
lock traffic during this phase. There may, however, be extra resource
contention in kernel space on each node in this middle phase, so an
alternative explanation is a low-probability near-deadlock on those
resources, where writes are still being drained while reads are already
demanding attention.
>
> to me, your contention idea is still in play.
>
> ==rob
>
I think I forgot to mention: NERSC is planning to extend the Franklin
I/O resources soon so that they look a lot more like Jaguar's. When
they do, we'll be able to "do the experiment": if the delay disappears,
that argues for contention in the torus getting to the OSSs, or in the
OSSs themselves. I'm still stumped as to why it would only happen in
the MPI-I/O case, though.
Cheers,
Andrew