[Lustre-discuss] [mpich-discuss] testing new ADIO Lustre code

Wed May 23 11:28:51 PDT 2012

On Thu, May 17, 2012 at 03:02:48PM -0600, Martin Pokorny wrote:
> Hi, everyone;
> 
> I've been using MPI-IO on a Lustre file system to good effect for a
> while now in an application that has up to 32 processes writing to a
> shared file. However, seeking to understand the performance of our
> system, and improve on it, I've recently made some changes to the
> ADIO Lustre code, which show some promise, but need more testing.
> Eventually, I'd like to submit the code changes back to the mpich2
> project, but that is certainly contingent upon the results of
> testing (and various code compliance issues for mpich2/romio/adio
> that I will likely need to sort out.) This message is my request for
> volunteers to help test my code, in particular for output file
> correctness and shared-file write performance. If you're interested
> in doing shared file I/O using MPI-IO on Lustre, please continue
> reading this message.

Gosh, Martin, I really thought you'd get more attention with this
post.  

I'd like to see these patches: I can't aggressively test them on a
Lustre system, but I'd be happy to provide another set of
ROMIO-eyeballs.  

> In broad terms, the changes I made are on two fronts: changing the
> file domain partitioning algorithm, and introducing non-blocking
> operations at several points. 

Non-blocking communication or I/O?

One concern with non-blocking I/O in this path is that the
communication and I/O networks are often the same thing (e.g. InfiniBand, or
the BlueGene tree network in some situations).  
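
To make the concern concrete, here is a minimal sketch of the kind of
overlap being discussed: while one stripe is being written, the data
for the next stripe is exchanged. This is not Martin's patch and not
ROMIO code. The names exchange(), write_stripe() and pipelined_write()
are hypothetical stand-ins, and a thread pool plays the role of the
non-blocking write. The overlap only pays off if the two phases do not
compete for the same network.

```python
from concurrent.futures import ThreadPoolExecutor


def exchange(stripe_id):
    """Hypothetical stand-in for the communication phase of one stripe."""
    return bytes([stripe_id % 256]) * 4


def write_stripe(log, stripe_id, data):
    """Hypothetical stand-in for the file write of one stripe."""
    log.append((stripe_id, data))


def pipelined_write(stripe_ids):
    """Overlap the write of stripe k with the exchange of stripe k+1."""
    log = []
    # A single worker keeps the writes in submission order.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for stripe_id in stripe_ids:
            # Runs while the previous write (if any) is still in flight.
            data = exchange(stripe_id)
            if pending is not None:
                pending.result()  # wait for the previous write to finish
            pending = pool.submit(write_stripe, log, stripe_id, data)
        if pending is not None:
            pending.result()  # drain the last outstanding write
    return log
```

If the exchange and the write share one link, the two phases in this
loop contend for bandwidth instead of hiding each other's latency.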

> The file domain partitioning algorithm
> that I implemented is from the paper "Dynamically Adapting File
> Domain Partitioning Methods for Collective I/O Based on Underlying
> Parallel File System Locking Protocols" by Wei-keng Liao and Alok
> Choudhary. The non-blocking operations that I added allow the ADIO
> Lustre driver better to parallelize the data exchange and writing
> procedures over multiple stripes within each process writing to one
> Lustre OST,

I was hoping Wei-keng would chime in on this.  I'll be sure to draw
your patches to his attention.
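
For anyone following along without the paper, here is a rough sketch of
why stripe-aware partitioning matters. It assumes an idealized
round-robin layout (stripe i on OST i mod ost_count) and compares a
contiguous even split with a cyclic assignment of stripes to
aggregators. It is an illustration of the general idea only, not
necessarily the method in Martin's patch, and all function names are
made up for this example.

```python
def ost_of_stripe(stripe, ost_count):
    """Idealized round-robin layout: stripe i is stored on OST i mod ost_count."""
    return stripe % ost_count


def even_owner(stripe, total_stripes, aggregators):
    """Contiguous even split: each aggregator gets one block of stripes."""
    return stripe * aggregators // total_stripes


def cyclic_owner(stripe, aggregators):
    """Cyclic split: stripe i goes to aggregator i mod aggregators."""
    return stripe % aggregators


def osts_touched(owner_of, aggregator, total_stripes, ost_count):
    """Set of OSTs that one aggregator writes to under a given assignment."""
    return {
        ost_of_stripe(s, ost_count)
        for s in range(total_stripes)
        if owner_of(s) == aggregator
    }


# Sizes loosely modeled on the example in the post: 16 processes, 4 OSTs.
TOTAL_STRIPES, OSTS, AGGREGATORS = 64, 4, 16

even = osts_touched(
    lambda s: even_owner(s, TOTAL_STRIPES, AGGREGATORS), 0, TOTAL_STRIPES, OSTS
)
cyclic = osts_touched(
    lambda s: cyclic_owner(s, AGGREGATORS), 0, TOTAL_STRIPES, OSTS
)
```

With these sizes, the even split has aggregator 0 writing to all four
OSTs, while the cyclic split confines it to one. Whenever the
aggregator count is a multiple of the OST count, each aggregator in the
cyclic scheme talks to a single OST, which fits the case where
processes outnumber OSTs.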

> My testing so far has been limited to four nodes, up to sixteen
> processes, writing to shared files on a Lustre file system with up
> to eight OSTs. 

Right now the only concern I have is that you may have traded better
small-scale performance for worse large-scale performance (without
looking at the code I have no way of knowing).

> These tests were conducted to simulate the production
> application for which I'm responsible, but on a different cluster,
> focused only on the file output. In these rather limited tests, I've
> seen write performance gains of up to a factor of two or three. The
> new file domain partitioning algorithm is most effective when the
> number of processes exceeds the number of Lustre OSTs, but there are
> smaller gains in other cases, and I have not seen an instance in which
> the performance has decreased. As an example, in one case using
> sixteen processes, MPI over Infiniband, and a file striping factor
> of four, the new code achieves over 800 MB/s, whereas the standard
> code achieves 300 MB/s. I have hints that the relative performance
> gains when using a 1Gb Ethernet rather than Infiniband for MPI
> message passing are greater, but I have not completed my testing in
> that environment.
> 
> If you're willing to try out this code in a test environment please
> let me know. I have not yet put the code into a publicly accessible
> repository, but will do so if there is interest out there.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA