<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>I reran the job, changing only --ioengine=posixaio to
  --ioengine=libaio. Since I don't intercept the functions in
  libaio, I don't see much in the way of application calls except
  for the open and close, but the plot of OSC cache usage versus
  time gives a good feel for what is going on. The first plot
  below is the OSC cache usage when using libaio. The second plot
  below is the OSC cache usage when using the pthread aio handler
  (paio). I should point out that the pthread aio handler I'm
  using is not the one in librt.so, but rather one that I wrote.
  (My apologies for having pasted the wrong URL for plot 3 in the
  previous email; it is the second plot in this one.) Note that
  the libaio run took 50 seconds, and the pthread aio handler run
  took 28 seconds.</p>
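<p>For reference, the rerun differed from the original job only in
  the ioengine; a minimal sketch of the two invocations (the
  filename is a placeholder) is:</p>
<pre>
# original run: POSIX AIO, intercepted by my I/O library (4 pthreads issuing pwrite64)
fio --randrepeat=1 --ioengine=posixaio --buffered=1 --gtod_reduce=1 \
    --name=test --filename=${fileName} --bs=1M --iodepth=64 \
    --size=40G --readwrite=randwrite

# rerun: kernel-handled AIO via libaio, everything else unchanged
fio --randrepeat=1 --ioengine=libaio --buffered=1 --gtod_reduce=1 \
    --name=test --filename=${fileName} --bs=1M --iodepth=64 \
    --size=40G --readwrite=randwrite
</pre>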
<p>This time, for completeness, I ran the job using
  --ioengine=posixaio, but using the aio calls in librt.so. The
  third plot below is the OSC cache usage for this run. The I/O
  phase here took 48 seconds. The fourth plot below is the file
  position activity for this run. Note that again there are the
  pauses that seem to stop all progress. Perhaps, since this job
  is running considerably slower, the write-behind is keeping up
  with the dirty data being generated, which is why we see fewer
  pauses than in the initial run. I will investigate the
  max_dirty_mb and max_rpcs_in_flight angle and send an update
  later.</p>
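<p>For that follow-up, something along these lines (just a sketch
  using the standard tunables Andreas mentions below; the set_param
  value is only an example) should show whether the client is being
  throttled on dirty data or RPC concurrency:</p>
<pre>
# per-OSC dirty-data limit and RPC concurrency (Andreas suggests at least 64 in flight)
lctl get_param osc.*.max_dirty_mb
lctl get_param osc.*.max_rpcs_in_flight

# client-wide cached-page limit
lctl get_param llite.*.max_cached_mb

# example only: raise the in-flight RPC count on every OSC (requires root)
lctl set_param osc.*.max_rpcs_in_flight=64
</pre>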
<p>I have no complaints about the performance in general when
  using my pthread aio handler. I am mostly curious about the
  source of the pauses. They always show up, but not in any
  regular pattern: sometimes the long pauses have very few aio
  requests between them, sometimes not.</p>
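<p>To try to correlate the pauses with cache behavior, a crude
  sampling loop like this could run alongside the job (a sketch; it
  assumes the osc_cached_mb and cur_dirty_bytes parameters are
  available on this 2.12 client):</p>
<pre>
# sample per-OSC cached and dirty data once a second while the fio job runs
while sleep 1; do
    echo "=== $(date +%T) ==="
    lctl get_param osc.*.osc_cached_mb
    lctl get_param osc.*.cur_dirty_bytes
done > osc_cache_trace.txt
</pre>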
<p>John</p>
<p>libaio.osc_cached :
<a class="moz-txt-link-freetext" href="https://www.dropbox.com/s/7z8taqr2u9g139l/libaio.osc_cached.png?dl=0">https://www.dropbox.com/s/7z8taqr2u9g139l/libaio.osc_cached.png?dl=0</a></p>
<p>paio.osc_cached :
<a class="moz-txt-link-freetext" href="https://www.dropbox.com/s/l6bq8rpz6ij7vdo/paio.osc_cached.png?dl=0">https://www.dropbox.com/s/l6bq8rpz6ij7vdo/paio.osc_cached.png?dl=0</a></p>
<p>librt.osc_cached :
<a class="moz-txt-link-freetext" href="https://www.dropbox.com/s/4lh0dz9b286ci2l/librt.osc_cached.png?dl=0">https://www.dropbox.com/s/4lh0dz9b286ci2l/librt.osc_cached.png?dl=0</a></p>
<p>librt.fpa :
<a class="moz-txt-link-freetext" href="https://www.dropbox.com/s/5sy67ol3gdl49rg/librt.fpa.png?dl=0">https://www.dropbox.com/s/5sy67ol3gdl49rg/librt.fpa.png?dl=0</a></p>
<div class="moz-cite-prefix">On 8/25/22 18:37, Andreas Dilger wrote:<br>
</div>
<blockquote type="cite"
cite="mid:D8D6837D-0F4E-4C17-8F07-F0B202E38664@ddn.com">
      No comment on the actual performance issue, but we normally test
      fio using the libaio interface (which is handled in the kernel)
      instead of posixaio (which is handled by threads in userspace,
      AFAIK), and also use DirectIO to avoid memory copies (OK if there
      are enough IO requests in flight). It should be a relatively easy
      change to see if that improves the behaviour.
<div class=""><br class="">
</div>
<div class="">Other things to check - osc.*.max_dirty_mb and
llite.*.max_cached_mb are not hitting limits and throttling IO
until the data is flushed, and osc.*.max_rpcs_in_flight across
the OSCs are *at least* 64 to keep up with the input generation.</div>
<div class=""><br class="">
</div>
<div class="">When you consider that Lustre (distributed coherent
persistent network filesystem) is "only half" of the performance
(28s vs 13s) of a local ephemeral RAM-based filesystem, it isn't
doing too badly...</div>
<div class=""><br class="">
</div>
<div class="">Cheers, Andreas<br class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 25, 2022, at 11:29, John Bauer <<a
href="mailto:bauerj@iodoctors.com"
class="moz-txt-link-freetext" moz-do-not-send="true">bauerj@iodoctors.com</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="">Hi all,<br class="">
<br class="">
              I'm trying to figure out an odd behavior when running an
              fio (<a href="https://git.kernel.dk/cgit/fio/"
                class="moz-txt-link-freetext" moz-do-not-send="true">https://git.kernel.dk/cgit/fio/</a>)
              benchmark on a Lustre file system.<br class="">
<br class="">
              fio --randrepeat=1 \<br class="">
--ioengine=posixaio \<br class="">
--buffered=1 \<br class="">
--gtod_reduce=1 \<br class="">
--name=test \<br class="">
--filename=${fileName} \<br class="">
--bs=1M \<br class="">
--iodepth=64 \<br class="">
--size=40G \<br class="">
--readwrite=randwrite<br class="">
<br class="">
              In short, the application queues 40,000 random
              aio_write64(nbyte=1M) requests to a maximum depth of 64,
              doing aio_suspend64 followed by aio_write64 to keep 64
              aio requests outstanding. My I/O library that processes
              the aio requests does so with 4 pthreads removing aio
              requests from the queue and doing the I/Os as
              pwrite64()s. The odd behavior is the intermittent
              pauses that can be seen in the first plot below. The
              X-axis is wall clock time, in seconds, and the left
              Y-axis is file position. The horizontal blue lines
              indicate the amount of time each of the pwrite64 calls is
              active and where in the file the I/O is occurring. The
              right Y-axis is the cumulative cpu times for both the
              process and the kernel during the run. There is minimal
              user cpu time, for either the process or the kernel. The
              cumulative system cpu time attributable to the process
              (the red line) runs at a slope of ~4 system cpu seconds
              per wall clock second. That makes sense, since there are 4
              pthreads at work in the user process. The cumulative
              system cpu time for the kernel as a whole (the green
              line) is ~12 system cpu seconds per wall clock second.
              Note that during the pauses the system cpu accumulation
              drops to near zero (zero slope).<br class="">
<br class="">
              This is run on a dedicated ivybridge node with 40 cores
              (Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz).<br class="">
              <br class="">
              The node has 64G of memory.<br class="">
              <br class="">
              The file is striped as a single-component PFL, 8x1M. The
              Lustre version is 2.12.8 ddn12.<br class="">
<br class="">
              Does anyone have any ideas about what is causing the
              pauses? Is there something else I could be looking at in
              the /proc file system to gain insight?<br class="">
              <br class="">
              For comparison, the 2nd plot below is the same job run on
              /tmp. Note that there are some pwrite64() calls that take
              a long time, but a single pwrite64() taking a long time
              does not stop all the other pwrite64() calls active
              during the same time period. Elapsed time for /tmp is 13
              seconds; for Lustre it is 28 seconds. Both are
              essentially memory resident.<br class="">
<br class="">
              Just for completeness, I have added a 3rd plot, which is
              the amount of memory each of the OSC clients is consuming
              over the course of the Lustre run. Nothing unusual there.
              The memory consumption rate slows down during the pauses,
              as one would expect.<br class="">
              <br class="">
              I don't think the instrumentation is the issue, as there
              is not much more instrumentation occurring in the Lustre
              run than in the /tmp run, and it amounts to less than 6MB
              in total for each.<br class="">
<br class="">
John<br class="">
<br class="">
In case the images got stripped here are some URLs to
dropbox<br class="">
<br class="">
plot1 : <a
href="https://www.dropbox.com/s/ey217o053gdyse5/plot1.png?dl=0"
class="moz-txt-link-freetext" moz-do-not-send="true">
https://www.dropbox.com/s/ey217o053gdyse5/plot1.png?dl=0</a><br class="">
<br class="">
plot2 : <a
href="https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0"
class="moz-txt-link-freetext" moz-do-not-send="true">
https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0</a><br class="">
<br class="">
plot3 : <a
href="https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0"
class="moz-txt-link-freetext" moz-do-not-send="true">
https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0</a><br class="">
<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
</div>
</div>
</blockquote>
</div>
<br class="">
<div class="">
<div dir="auto" class="">
<div dir="auto" class="">
<div dir="auto" class="">
<div dir="auto" class="">
<div dir="auto" class="">
<div dir="auto" class="">
<div>Cheers, Andreas</div>
<div>--</div>
<div>Andreas Dilger</div>
<div>Lustre Principal Architect</div>
<div>Whamcloud</div>
<div><br class="">
</div>
<div><br class="">
</div>
<div><br class="">
</div>
</div>
</div>
</div>
</div>
</div>
<br class="Apple-interchange-newline">
</div>
<br class="Apple-interchange-newline">
<br class="Apple-interchange-newline">
</div>
<br class="">
</div>
</blockquote>
</body>
</html>