[Lustre-discuss] [ROMIO Req #940] [Fwd: Re: [Lustre-devel] a new Lustre ADIO driver]
emoly.liu
Emoly.Liu at Sun.COM
Tue Jan 13 08:16:34 PST 2009
Hi rob,
Thanks for reviewing the patch so carefully. Sorry for this delay reply.
Robert Latham wrote:
> On Fri, Nov 28, 2008 at 01:33:39PM +0800, emoly.liu wrote:
>
>> Hello,
>>
>> Per Rob's comment, I send the Lustre ADIO patch here. FYI, the brief
>> introduction is given below.
>> Any advice is welcome.
>>
>
> Hi LiuYing
>
> I finally had a chance to look at your patch. Sorry it took me so
> long. I'm glad to see the Lustre folks taking MPI-IO more seriously.
>
> First off, the patch looks pretty good. It's clear why the approaches
> you're taking give you the improvements in MPI-IO performance.
> Naturally, I've got some comments.
>
> Your patch modifies or creates 11 files. I'll go through it file by
> file and tell you what I think. I've also copied Wei-keng Liao from
> Northwestern. He has done a lot of work tuning ROMIO for lustre and
> he might have some comments or suggestions, as he has done some of
> these things in a slightly different way. Weikuan, I presume you
> worked with LiuYing on this patch? If not, any comments from you?
>
> OK, here we go!
>
> ad_lustre/Makefile.in
> - just fine. building new objects
>
> ad_lustre/README
> - I'm happy to have this, and you might want to add even more content
>
yes
> ad_lustre/ad_lustre.c
> - Just a small change to call the two new lustre-specific strided and
> strided-collective routines
>
> ad_lustre/ad_lustre.h
> - bring in lustre structs/constants if "lustre_user.h" not getting
> brought in: is this really necessary? ROMIO won't try to build
> ad_lustre unless 'lustre' is part of "--with-file-systems" at
> configure time
--with-file-systems=lustre only tells it will build lustre adio, but it doesn't show where is the lustre_user.h, which is needed by ad_open.c (for setting stripe stuff).
So if there is no lustre_user.h in /user/include (no lustre-devel rpm in your system), hmm, maybe we should only give error msg at this moment, instead of adding this if.
> ad_lustre/ad_lustre_aggregate.c
> - Lots of changes. I see you started from adio/common/ad_aggregate.c
>
> - ADIOI_LUSTRE_Docollect: interesting policy to avoid collective I/O
> if any piece bigger than a big_req hint.
> - You removed a lot of comments. why?
>
Some comments are not suitable to lustre adio driver and some were
probably removed during the codes cleanup. I will restore the latter.
> - here again, you should refer to ROMIO hint struct instead of calling
> MPI_Info functions
>
Sorry, I am not sure what ROMIO hint structure you mean. I assume it is
ADIOI_Hints_struct. I can add lustre hints to that struct, but they are
lustre related, not common. Is that ok?
If I misunderstand, please advise.
> - maybe we need, as wei-keng suggests, a better separation so you
> don't need to duplicate Calc_my_req just so you can call the
> lustre-sepecific Calc_aggregator. Could a hint do this for you?
>
yes, that would be better if there is a separation.
> - same for Calc_others_req. I see you want to change ROMIO policy,
> and that makes sense for you, but too bad we don't have a better way
> than for you to copy huge chunks of romio code only to make small
> modifications.
>
yes, in Calc_others_req we try to reduce MPI_Alltoall operations to get
some performance improvement.
> ad_lustre/ad_lustre_hints.c
> - new hints: please prefix lustre-specific with 'lustre' or
> 'romio_lustre' Yes, we are inconsistent about this, so you probably
> learned this bad habbit from us. Please do as I say, not as I do
> (or have done in the past :> )
>
sure, I will. :-)
> - Also, note that for all the ROMIO hints, we add to an internal-to
> ROMIO hint structure. It's a bit faster to read from these
> "natural" hints instead of going through the MPI_Info interface and
> parsing strings.
>
mentioned above
> ad_lustre/ad_lustre_open.c
> - looks fine. I'm a little concerned about rank 0 setting
> stripe/stride with ioctl while everyone sits in barrier. ADIO_Open
> will call lustre_open with COMM_SELF in create case, but that
> doesn't help since you need to be able to adjust on open. What
> happens when you adjust an already-created file?
>
You're right.
We shouldn't set lustre striping info only according to O_CREAT flag
because fd will return a wrong value for an existing file.
I will fix it. Thanks.
> ad_lustre/ad_lustre_rwcontig.c
> - just new copyright... but nothing changed?
>
Sorry, I will remove it.
> ad_lustre/ad_lustre_wrcoll.c
>
> - this is more like a fork of adio/common/ad_write_str.c and not a
> copy. Watch out for changes we've made in adio/common. I can make
> an honest effort to change ad_lustre when we change adio/common, but
> we don't have a lustre test system.
> - missing some 64-bit Aint casting
>
I will add it.
> - took out most of the comments from adio/common/ad_write_coll.c -- why?
>
I will restore some.
> - missing the "new MPE" logging anthony put into rest of ROMIO
>
I had a look at MPE events defined in adioi.h. Could you tell me what
else MPE logs you hope to see?
> - yet another implementation of "rounding domains to stripe size".
> - interesting: disable data sieving in two-phase
>
In our tests, data sieving showed bad collective write performance for
some kinds of workloads. To avoid this, we define ds_in_coll hint,
distinguished from the one in independent I/O.
> - optimization: instead of MPI_Info_get, use ROMIO hint struct.
>
fd->hints
> - missing wei-keng's reworked type processing
>
Sorry, I checked the codes but I didn't find anything missed commented
by wei-keng.
Could you tell me what type processing is missed in this file?
>
>
> ad_lustre/ad_lustre_wrstr.c
> - some formating changes
> - also missing some 64-bit Aint casts
>
I will add it.
> common/ad_write_coll.c
> - make ADIOI_Heap_merge "public" to ROMIO. That's fine: sure beats copying it
>
:-)
> So, for the next steps, could you at the very least do the following:
> - change the hint names
> - use the fd->hints structure instead of MPI Info routines when you
> are strictly internal to ROMIO. Continue to populate the users
> MPI_INFO.
>
OK, I will refine the patch per your comments and send it to you later.
> Other than that, I'll commit this to our trunk. I will need Sun and
> Oak Ridge's help to keep the ad_lustre driver up to date.
>
> Ideally, we would have most of the ad_lustre functionality as "tunable
> knobs" in the common/ code. That's a matter of software engineering
> we just haven't tackled yet, so a separate ad_lustre is just fine for
> now.
>
> Thanks for the patch!
>
If I misunderstand or miss anything, please let me know.
Thank you so much!
LiuYing
> ==rob
>
>
>> -------- Original Message --------
>> Subject: Re: [Lustre-discuss] [Lustre-devel] a new Lustre ADIO driver
>> Date: Wed, 12 Nov 2008 11:35:48 -0600
>> From: Robert Latham <robl at mcs.anl.gov>
>> To: emoly.liu <Emoly.Liu at Sun.COM>
>> CC: lustre-discuss <lustre-discuss at lists.lustre.org>,
>> lustre-devel at lists.lustre.org
>> References: <49110B1A.5030201 at sun.com> <49114151.10908 at sun.com>
>>
>>
>>
>> On Wed, Nov 05, 2008 at 02:46:41PM +0800, emoly.liu wrote:
>>
>>> Hi,
>>>
>>> Here is some description and performance results for the new Lustre
>>> ADIO driver. You might be interested.
>>>
>>> The main purpose of the new lustre adio driver is to collect the small
>>> size I/O requests into fewer number of larger ones according to
>>> Lustre's striping feature with low overhead automatically. We also
>>> provide some hints for performance tuning.
>>>
>>> In the following graph, the recent FLASH I/O benchmarking result shows
>>> that the new driver performs much better than the old one, especially
>>> for a large-scale system.
>>> In this test, FLASH I/O blocksize is 4*4*4 that means each 3 clients
>>> write 40KB, 40.5KB and 42KB respectively. The I/O operation iteration
>>> (nvar) is set to 1024.
>>>
>> Hi LiuYing
>>
>> Since this patch is for ROMIO and not really for Lustre, could you
>> send this to romio-maint at mcs.anl.gov ? It will reach more of the
>> ROMIO developers that way.
>>
>> Thanks
>> ==rob
>>
>>
>
--
Best regards,
LiuYing
System Software Engineer, Lustre Group
Sun Microsystems ( China ) Co. Limited
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20090114/2505f635/attachment.htm>
More information about the lustre-discuss
mailing list