[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]

Rob Latham robl at mcs.anl.gov
Thu Jul 2 08:25:16 PDT 2009


Hello! Sorry for not following up with you sooner.  I'll tell you what
I think of these approaches from a ROMIO perspective, and maybe the
Cray and Lustre folks can provide their perspective and clear up any
questions we've got..

On Fri, Jun 12, 2009 at 05:35:17PM +0200, pascal.deveze at bull.net wrote:
> I've made 4 modifications on the new Lustre ADIO driver. After that, all
> the romio tests succeed.
> 
> The application "coll_test" also passes on 2 nodes (4 processes - 2OST)
> with the dimension of  the 3 d array varying from 1 to 300.
> 
> The version I use comes from mpich2-1.1.
> 
> Hereafter is a quick explanation on my four modifications:
> 
> 1) The calculation of avail_cb_nodes in ADIOI_LUSTRE_Get_striping_info()
>     For example, when nprocs_for_coll=2, stripe_count=2 and CO=1 ,
> avail_cb_nodes is set to 1.
>     The value 2 should be used instead.
> 
>    I propose to change the lines by:
> 
>      /* avail_cb_nodes should divide stripe_count exactly */
>            while (stripe_count % avail_cb_nodes) {
>                      avail_cb_nodes--;
>            }

CO is a ratio of how many compute nodes should participate in
collective I/O, right?   I've kind of forgotten what this code is
attempting to do: I mean I can see that it's setting up the
stripe_size, stripe_count, and avail_cb_nodes for the subsequent call
to ADIOI_LUSTRE_Calc_my_req(), but I forget why there is both a CO and
a CO_max.


> 2) The parameter len_list_ptr has been modified in include/adioi.h, so I
> propose to change :
>                int **len_list_ptr;
>       to
>                ADIO_Offset **len_list_ptr;
> 
> in ad_lustre_aggregate.c and ad_lustre_wrcoll.c

No argument here. That's clearly the right thing to do.  I've
committed that change.  Take a look at revision 4889 and let me know
if I missed anything

> 3) Use of buf_idx[ ] in ADIOI_LUSTRE_W_Exchange_data
> 
>       I had a lot of troubles with the table of pointers buf_idx[ ]:
> "coll_test" with different dimensions of the 3d array detected a lot
>       of errors.
> 
>   I solved the problems by using only one pointer: buf_idx[0] initialized
> by 0 in
>   "ADIOI_LUSTRE_Calc_my_req" and used/modified in
> "ADIOI_LUSTRE_W_Exchange_data":
>             if (send_size[i]) {
>                 MPI_Isend(((char *) buf) + buf_idx[0], send_size[i],
>                           MPI_BYTE, i, myrank + i + 100 * iter, fd->comm,
>                           send_req + j);
>                 j++;
>                 buf_idx[0]+=send_size[i];
>                 }
>    Of course, a single variable may be used to do that, but this implies
>    more changes.
>    This solution works fine with coll_test (used with a wide range of
>    dimensions of the 3d array, on 4 nodes and 2 OST). Maybe this solution
> is too simple,
>    and I missed something. What is your opinion ?

I'm a little nervous about this one, especially if the cb_config_list
hint was given.   Could I get a second opinion on this one from
anyone?

> 4) Macro ADIOI_BUFFERED_WRITE_WITHOUT_READ in ad_lustre_wrstr.c (line 213)
>    This macro causes an abort in the tests "i_concontig" and "noncontig".
> 
>    In this macro, ADIO_WriteContig was called a lot of time with
> writebuf_len egal to 0.
>    What is wanted is to copy the data (by memcpy) in the writebuf as much
> as possible
>    to fill a stripe and then to call ADIO_WriteContig. As far as I
> understand it, this was not
>    the case.
> 
>    I replaced this macro by a call to
> 
>        memcpy(writebuf + req_off - writebuf_off, (char *)buf + user_off,
> req_len);
> 
>    This modification works only if it is possible to copy all the user data
>    in the write buffer (writebuf). (i.e. if bufsize <= writebuf_len)
>    If bufsize is higher, a loop has to be introduced.
> 
>    What do you think about this change ?

That's delicate code to say the least ... Can we just check for
writebuf_len equal to 0 and short circuit things?  The macro is kind
of hairy, but it already handles the bufsize > writebuf_len case and
has seen a decade of testing and bugfixes.

> You will find attached my modification vs mpich2-1.1. Your questions and
> comments about them are welcome.

Thanks much for including the patch. that's a huge help when having
this kind of conversation.  In the future, can you send it in "unified
diff" format (-u)?  Also, you made four kinds of changes.  I know how
much a pain it is, but is it possible next time to split up the patch
into four pieces, one for each topic?
 
Thanks for your help with the Lustre driver, and for fielding my
questions!

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA



More information about the lustre-discuss mailing list