[Lustre-discuss] [ROMIO Req #940] [Fwd: Re: [Lustre-devel] a new Lustre ADIO driver]

emoly.liu Emoly.Liu at Sun.COM
Tue Jan 13 08:16:34 PST 2009

Hi rob,

Thanks for reviewing the patch so carefully. Sorry for this delay reply.

Robert Latham wrote:
> On Fri, Nov 28, 2008 at 01:33:39PM +0800, emoly.liu wrote:
>> Hello,
>> Per Rob's comment, I send the Lustre ADIO patch here. FYI, the brief  
>> introduction is given below.
>> Any advice is welcome.
> Hi LiuYing
> I finally had a chance to look at your patch.  Sorry it took me so
> long.  I'm glad to see the Lustre folks taking MPI-IO more seriously.  
> First off, the patch looks pretty good.  It's clear why the approaches
> you're taking give you the improvements in MPI-IO performance.
> Naturally, I've got some comments.
> Your patch modifies or creates 11 files.  I'll go through it file by
> file and tell you what I think.    I've also copied Wei-keng Liao from
> Northwestern.  He has done a lot of work tuning ROMIO for lustre and
> he might have some comments or suggestions, as he has done some of
> these things in a slightly different way.  Weikuan, I presume you
> worked with LiuYing on this patch?  If not, any comments from you?
> OK, here we go!
>  ad_lustre/Makefile.in
> - just fine. building new objects
>  ad_lustre/README 
> - I'm happy to have this, and you might want to add even more content
>  ad_lustre/ad_lustre.c
> - Just a small change to call the two new lustre-specific strided and
>   strided-collective routines
>  ad_lustre/ad_lustre.h
> - bring in lustre structs/constants if "lustre_user.h" not getting
>   brought in: is this really necessary?  ROMIO won't try to build
>   ad_lustre unless 'lustre' is part of "--with-file-systems" at
>   configure time
 --with-file-systems=lustre only tells it will build lustre adio, but it doesn't show where is the lustre_user.h, which is needed by ad_open.c (for setting stripe stuff).
So if there is no lustre_user.h in /user/include (no lustre-devel rpm in your system),  hmm, maybe we should only give error msg at this moment, instead of adding this if.

>   ad_lustre/ad_lustre_aggregate.c 
> - Lots of changes.  I see you started from adio/common/ad_aggregate.c
> - ADIOI_LUSTRE_Docollect: interesting policy to avoid collective I/O
>   if any piece bigger than a big_req hint.
> - You removed a lot of comments. why?
Some comments are not suitable to lustre adio driver and some were 
probably removed during the codes cleanup. I will restore the latter.
> - here again, you should refer to ROMIO hint struct instead of calling
>   MPI_Info functions
Sorry, I am not sure what ROMIO hint structure you mean. I assume it is 
ADIOI_Hints_struct. I can add lustre hints to that struct, but they are 
lustre related, not common. Is that ok?
If I misunderstand, please advise.
> - maybe we need, as wei-keng suggests, a better separation so you
>   don't need to duplicate Calc_my_req just so you can call the
>   lustre-sepecific Calc_aggregator.   Could a hint do this for you?
yes, that would be better if there is a separation.
> - same for Calc_others_req.  I see you want to change ROMIO policy,
>   and that makes sense for you, but too bad we don't have a better way
>   than for you to copy huge chunks of romio code only to make small
>   modifications.
yes, in Calc_others_req we try to reduce MPI_Alltoall operations to get 
some performance improvement.
>  ad_lustre/ad_lustre_hints.c
> - new hints: please prefix lustre-specific with 'lustre' or
>   'romio_lustre'  Yes, we are inconsistent about this, so you probably
>   learned this bad habbit from us.  Please do as I say, not as I do
>   (or have done in the past :> )
sure, I will. :-)
> - Also, note that for all the ROMIO hints, we add to an internal-to
>   ROMIO hint structure.  It's a bit faster to read from these
>   "natural" hints instead of going through the MPI_Info interface and
>   parsing strings.
mentioned above
>  ad_lustre/ad_lustre_open.c
> -  looks fine.  I'm a little concerned about rank 0 setting
>    stripe/stride with ioctl while everyone sits in barrier.  ADIO_Open
>    will call lustre_open with COMM_SELF in create case, but that
>    doesn't help since you need to be able to adjust on open.  What
>    happens when you adjust an already-created file?
You're right.
We shouldn't set lustre striping info only according to O_CREAT flag 
because fd will return a wrong value for an existing file.
I will fix it. Thanks.
>   ad_lustre/ad_lustre_rwcontig.c  
> - just new copyright... but nothing changed?
Sorry, I will remove it.
>  ad_lustre/ad_lustre_wrcoll.c
> - this is more like a fork of adio/common/ad_write_str.c and not a
>   copy.  Watch out for changes we've made in adio/common. I can make
>   an honest effort to change ad_lustre when we change adio/common, but
>   we don't have a lustre test system.
> - missing some 64-bit Aint casting
I will add it.
> - took out most of the comments from adio/common/ad_write_coll.c -- why?
I will restore some.
> - missing the "new MPE" logging anthony put into rest of ROMIO
I had a look at MPE events defined in adioi.h. Could you tell me what 
else MPE logs you hope to see?
> - yet another implementation of "rounding domains to stripe size".  
> - interesting: disable data sieving in two-phase
In our tests, data sieving showed bad collective write performance for 
some kinds of workloads. To avoid this, we define ds_in_coll hint, 
distinguished from the one in independent I/O.
> - optimization: instead of MPI_Info_get, use ROMIO hint struct.
> - missing wei-keng's reworked type processing 
Sorry, I checked the codes but I didn't find anything missed commented 
by wei-keng.
Could you tell me what type processing is missed in this file?
>  ad_lustre/ad_lustre_wrstr.c
> - some formating changes
> - also missing some 64-bit Aint casts
I will add it.
>  common/ad_write_coll.c          
> - make ADIOI_Heap_merge "public" to ROMIO.  That's fine: sure beats copying it
> So, for the next steps, could you at the very least do the following:
> - change the hint names 
> - use the fd->hints structure instead of MPI Info routines when you
>   are strictly internal to ROMIO.  Continue to populate the users
OK, I will refine the patch per your comments and send it to you later.

> Other than that, I'll commit this to our trunk.  I will need Sun and
> Oak Ridge's help to keep the ad_lustre driver up to date.   
> Ideally, we would have most of the ad_lustre functionality as "tunable
> knobs" in the common/ code.  That's a matter of software engineering
> we just haven't tackled yet, so a separate ad_lustre is just fine for
> now.
> Thanks for the patch!
If I misunderstand or miss anything, please let me know.
Thank you so much!

> ==rob
>> -------- Original Message --------
>> Subject: 	Re: [Lustre-discuss] [Lustre-devel] a new Lustre ADIO driver
>> Date: 	Wed, 12 Nov 2008 11:35:48 -0600
>> From: 	Robert Latham <robl at mcs.anl.gov>
>> To: 	emoly.liu <Emoly.Liu at Sun.COM>
>> CC: 	lustre-discuss <lustre-discuss at lists.lustre.org>,  
>> lustre-devel at lists.lustre.org
>> References: 	<49110B1A.5030201 at sun.com> <49114151.10908 at sun.com>
>> On Wed, Nov 05, 2008 at 02:46:41PM +0800, emoly.liu wrote:
>>> Hi,
>>> Here is some description and performance results for the new Lustre 
>>> ADIO  driver. You might be interested.
>>> The main purpose of the new lustre adio driver is to collect the small  
>>> size I/O requests into fewer number of larger ones according to 
>>> Lustre's  striping feature with low overhead automatically. We also 
>>> provide some  hints for performance tuning.
>>> In the following graph, the recent FLASH I/O benchmarking result shows  
>>> that the new driver performs much better than the old one, especially   
>>> for a large-scale system.
>>> In this test, FLASH I/O blocksize is 4*4*4 that means each 3 clients   
>>> write 40KB, 40.5KB and 42KB respectively. The I/O operation iteration   
>>> (nvar) is set to 1024.
>> Hi LiuYing
>> Since this patch is for ROMIO and not really for Lustre, could you
>> send this to romio-maint at mcs.anl.gov ?  It will reach more of the
>> ROMIO developers that way.
>> Thanks
>> ==rob
Best regards,

System Software Engineer, Lustre Group
Sun Microsystems ( China ) Co. Limited

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20090114/2505f635/attachment.htm>

More information about the lustre-discuss mailing list