[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver

David Knaak knaak at cray.com
Tue Jun 2 09:11:25 PDT 2009


Rob,

As I mentioned earlier, other projects have made it difficult for me
to devote much time to this.  Yesterday I did build the full 1.1rc1
version of ROMIO in the Cray XT MPICH2 build.  I then ran one of the
test cases that had previously failed with my partially merged version
of the earlier Lustre code before I fixed it.  (This fix is in the Cray
release 3.2.0 that we released in April.)  The test failed yesterday in
the same way with the vanilla 1.1rc1 ROMIO.

The simplest reproducer I have reduced it to is an IOR run with the
following settings:

* clients = 7
* cb_nodes = 4
* striping_factor = 4
* striping_unit = 1048576
* romio_cb_write = enable
* Command: ./IOR -a MPIIO -cCg -w -W -b 8m -t 8m -o file

On the validation step (-W), IOR detects incorrect data.  The basic 
problem is in "ADIOI_LUSTRE_W_Exchange_data" when "buf_idx" is updated:

                buf_idx[i] += send_size[i];

When the workloads are evenly divided on each call to
"ADIOI_LUSTRE_W_Exchange_data", this works, but in a case like 7 clients
and 4 aggregators, simply advancing "buf_idx" by the amount just sent
does not give the correct starting index for the next call.
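
To illustrate one way such a running update can go wrong, here is a toy
model in C. It is not ROMIO code. It assumes a single client whose
contiguous buffer is cut into back-to-back pieces, each destined for one
aggregator. The names "true_offset" and "running_offset" and the layout
itself are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not ROMIO code. Piece k of the client buffer has length
 * len[k] and is destined for aggregator dest[k]. Pieces are laid out
 * back to back in the buffer. */

/* True buffer offset of piece k: sum of the lengths of all earlier pieces. */
size_t true_offset(const size_t *len, size_t k)
{
    assert(len != NULL);
    size_t off = 0;
    for (size_t j = 0; j < k; j++)
        off += len[j];
    return off;
}

/* Offset predicted for piece k by a running update in the style of
 * "buf_idx[i] += send_size[i]": start at the true offset of the first
 * piece sent to this aggregator, then add the length of every piece
 * already sent to it. */
size_t running_offset(const size_t *len, const int *dest, size_t k)
{
    assert(len != NULL && dest != NULL);
    size_t idx = 0;
    int seen = 0;
    for (size_t j = 0; j < k; j++) {
        if (dest[j] != dest[k])
            continue;
        if (!seen) {
            idx = true_offset(len, j);
            seen = 1;
        }
        idx += len[j];
    }
    /* If piece k is the first one for its aggregator, nothing has been
     * added yet and the prediction is trivially right. */
    return seen ? idx : true_offset(len, k);
}
```

With eight unit-length pieces dealt round-robin to four aggregators
(dest = 0,1,2,3,0,1,2,3), the second piece for aggregator 0 really starts
at offset 4, but the running update predicts offset 1. The two agree only
when every piece for an aggregator directly follows the previous piece
for that same aggregator.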

My fix is to save all the "buf_idx" values in the "ADIOI_Access my_req"
structure and use the saved values rather than trying to recompute them
in ADIOI_LUSTRE_W_Exchange_data.  So in "ADIOI_LUSTRE_Calc_my_req", a
"buf_idx" array is allocated and set in the same way that "offsets" and
"lens" are allocated and set:

            my_req[i].offsets = (ADIO_Offset *)
                                ADIOI_Malloc(count_my_req_per_proc[i] *
                                             sizeof(ADIO_Offset));
            my_req[i].lens = (int *)
                             ADIOI_Malloc(count_my_req_per_proc[i] *
                                          sizeof(int));
            my_req[i].buf_idx = (int *)
                                ADIOI_Malloc(count_my_req_per_proc[i] *
                                             sizeof(int));

and

            my_req[proc].offsets[l] = off;
            my_req[proc].lens[l] = (int) avail_len;
            my_req[proc].buf_idx[l] = curr_idx;

Then "ADIOI_LUSTRE_Exch_and_write" always has the correct "buf_idx"
value when it calls "ADIOI_LUSTRE_W_Exchange_data".
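
As a self-contained sketch of the same idea outside ROMIO, the following
C code records the buffer index of every piece while the request lists
are built. Everything named "toy_*" is hypothetical, plain malloc stands
in for ADIOI_Malloc, and the round-robin mapping of stripes to
aggregators is an assumption of the sketch rather than a statement about
the driver.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for ADIOI_Access, reduced to the fields
 * discussed in this message. */
typedef struct {
    int count;          /* number of pieces for this aggregator */
    long long *offsets; /* file offset of each piece */
    int *lens;          /* length of each piece */
    int *buf_idx;       /* saved offset of each piece in the user buffer */
} toy_access;

/* Assumed mapping: stripes are dealt round-robin to aggregators. */
static int toy_owner(long long off, int stripe, int naggr)
{
    return (int)((off / stripe) % naggr);
}

void toy_free_req(toy_access *req, int naggr)
{
    if (req == NULL)
        return;
    for (int i = 0; i < naggr; i++) {
        free(req[i].offsets);
        free(req[i].lens);
        free(req[i].buf_idx);
    }
    free(req);
}

/* Split one contiguous write of `total` bytes at file offset `start`
 * into per-aggregator request lists. Returns NULL on allocation failure. */
toy_access *toy_calc_my_req(long long start, int total, int stripe, int naggr)
{
    assert(start >= 0 && total >= 0 && stripe > 0 && naggr > 0);
    toy_access *req = calloc((size_t)naggr, sizeof *req);
    if (req == NULL)
        return NULL;

    /* Pass 1: count the pieces that fall to each aggregator. */
    long long off = start;
    int rem = total;
    while (rem > 0) {
        int avail = (int)(stripe - off % stripe);
        if (avail > rem)
            avail = rem;
        req[toy_owner(off, stripe, naggr)].count++;
        off += avail;
        rem -= avail;
    }

    for (int i = 0; i < naggr; i++) {
        size_t n = (size_t)req[i].count + 1; /* +1 avoids a zero-size malloc */
        req[i].offsets = malloc(n * sizeof *req[i].offsets);
        req[i].lens = malloc(n * sizeof *req[i].lens);
        req[i].buf_idx = malloc(n * sizeof *req[i].buf_idx);
        if (!req[i].offsets || !req[i].lens || !req[i].buf_idx) {
            toy_free_req(req, naggr);
            return NULL;
        }
        req[i].count = 0; /* reused as the fill cursor in pass 2 */
    }

    /* Pass 2: fill the lists, saving the buffer index of every piece. */
    off = start;
    rem = total;
    int curr_idx = 0;
    while (rem > 0) {
        int avail = (int)(stripe - off % stripe);
        if (avail > rem)
            avail = rem;
        toy_access *r = &req[toy_owner(off, stripe, naggr)];
        int l = r->count++;
        r->offsets[l] = off;
        r->lens[l] = avail;
        r->buf_idx[l] = curr_idx;
        off += avail;
        curr_idx += avail;
        rem -= avail;
    }
    return req;
}
```

A later exchange step can then read req[i].buf_idx[l] directly for piece
l of aggregator i, instead of deriving it from how much was sent in
earlier rounds. For example, toy_calc_my_req(2, 7, 4, 2) gives
aggregator 0 two pieces with saved indices 0 and 6, and aggregator 1 one
piece with saved index 2.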

It won't be quick and easy for me to give you a clean patch but I can
send you my version of the routines.  It is my plan to fully compare,
test, and merge code as time allows this summer.

David

On Mon, Jun 01, 2009 at 05:25:04PM -0500, Rob Latham wrote:
> On Mon, May 11, 2009 at 09:28:12AM -0500, Rob Latham wrote:
> > So, the real challenges are coll_test, noncontig_coll, hindexed,
> > aggregation1, aggregation2, split_coll... basically, collective I/O is
> > messed up. 
> 
> Hi.  I haven't had a chance to debug this.  How about any of you?
> The MPICH2 folks would like to release 1.1 tomorrow.  
> 
> I propose disabling the auto-detection of Lustre until we can fix
> this.  I don't want anybody upgrading to MPICH2-1.1 on a lustre system
> and getting corruption with collective i/o.  
> 
> At the same time I also don't want to back out all the lustre changes,
> though, since I'm sure we are close.  For testing and debugging, we
> can explicitly exercise the Lustre path by prefixing the file name
> with 'lustre:'   
> 
> If we can find the fix, we can incorporate it into the follow-on
> patch-release, roughly scheduled for end of summer.
> 
> ==rob
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
