[Lustre-discuss] Bad read performance

Alvaro Aguilera s2506578 at inf.tu-dresden.de
Fri Aug 21 06:18:41 PDT 2009


They run on different physical nodes and access the OSTs via 4x InfiniBand.

On Fri, Aug 21, 2009 at 3:15 PM, di wang <di.wang at sun.com> wrote:

> Alvaro Aguilera wrote:
>
>> Thanks for the hint, but unfortunately I can't make any updates to the
>> cluster...
>>
>> Do you think both of the problems I experienced are bugs in Lustre that
>> have been resolved in current versions?
>>
> These should be Lustre bugs. Do the 2 processes run on different nodes or
> on the same node?
>
> Thanks
> WangDi
>
>>
>> Thanks.
>> Alvaro.
>>
>>
>> On Fri, Aug 21, 2009 at 6:32 AM, di wang <di.wang at sun.com> wrote:
>>
>>    Hello,
>>
>>    You may look at bug 17197 and try to apply this patch
>>    https://bugzilla.lustre.org/attachment.cgi?id=25062  to your
>>    Lustre source, or you can wait for 1.8.2.
>>
>>    Thanks
>>    Wangdi
>>
>>    Alvaro Aguilera wrote:
>>
>>        Hello,
>>
>>        as a college project, I'm doing a behavioral comparison
>>        between Lustre and CXFS when dealing with simple strided files
>>        using POSIX semantics. In one of the tests, each participating
>>        process reads 16 chunks of data of 32MB each from a common
>>        strided file, using the following code:
>>
>>
>>  ------------------------------------------------------------------------------------------
>>        int myfile = open("thefile", O_RDONLY);
>>
>>        MPI_Barrier(MPI_COMM_WORLD); // the barriers are only there to help measure time
>>
>>        off_t distance = (numtasks-1)*p.buffersize;
>>        off_t offset = rank*p.buffersize;
>>
>>        int j;
>>        lseek(myfile, offset, SEEK_SET);
>>        for (j = 0; j < p.buffercount; j++) {
>>              read(myfile, buffers[j], p.buffersize); // buffers are aligned to the page size
>>              lseek(myfile, distance, SEEK_CUR);
>>        }
>>
>>        MPI_Barrier(MPI_COMM_WORLD);
>>
>>        close(myfile);
>>
>>  ------------------------------------------------------------------------------------------
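>>
>>        To make the access pattern concrete: with numtasks = 2 and a
>>        32MB buffersize, rank 0 reads at offsets 0, 64MB, 128MB, ...
>>        and rank 1 at 32MB, 96MB, 160MB, ..., so the two ranks
>>        interleave over the whole file. The per-read times below were
>>        taken roughly like this (a sketch; the MPI_Wtime bracketing
>>        and the printf stand in for the actual measurement code):
>>
>>        double t0, t1;
>>        for (j = 0; j < p.buffercount; j++) {
>>              t0 = MPI_Wtime();
>>              ssize_t n = read(myfile, buffers[j], p.buffersize);
>>              t1 = MPI_Wtime();
>>              if (n < 0) perror("read"); // surface I/O errors
>>              printf("rank %d, read %d: %.3f s\n", rank, j, t1 - t0);
>>              lseek(myfile, distance, SEEK_CUR);
>>        }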
>>
>>        I'm facing the following problem: when this code is run in
>>        parallel, the read operations on certain processes take longer
>>        and longer to complete. I attached a graphical trace of this
>>        when using only 2 processes.
>>        As you can see, the read operations on process 0 stay more or
>>        less constant, taking about 0.12 seconds each, while on
>>        process 1 they increase to as much as 39 seconds!
>>
>>        If I run the program with only one process, the time stays at
>>        ~0.12 seconds per read operation. The problem doesn't appear
>>        if the O_DIRECT flag is used.
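>>
>>        The O_DIRECT variant only changes the open call and the buffer
>>        allocation; a minimal sketch of what I mean (the 4096-byte
>>        alignment here is illustrative, direct I/O needs buffers and
>>        sizes that satisfy the filesystem's alignment constraints):
>>
>>        // needs _GNU_SOURCE defined before <fcntl.h> for O_DIRECT
>>        int myfile = open("thefile", O_RDONLY | O_DIRECT);
>>        void *buf = NULL;
>>        if (posix_memalign(&buf, 4096, p.buffersize) != 0)
>>              MPI_Abort(MPI_COMM_WORLD, 1); // aligned memory is mandatory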
>>
>>        Can somebody explain to me why this is happening? Since I'm
>>        very new to Lustre, I may be making some silly mistakes, so be
>>        nice to me ;)
>>
>>        I'm using Lustre 1.6.5.1 on SLES 10 Patchlevel 1, kernel
>>        2.6.16.54-0.2.5_lustre.1.6.5.1.
>>
>>
>>        Thanks!
>>
>>        Alvaro Aguilera.
>>
>>
>>  [attachment scrubbed: graphical trace of the per-read times for the 2-process run]
>>
>