<!DOCTYPE html>

<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Hello,<br>

      <br>

      Coming back to this, I have proceeded with the one-file approach.<br>

      I am using a toy cluster with 1 combined MGS/MDT, 4 OSTs and 4

      clients, each client handling a different section of the file in

      parallel. Clients are running containerized in the same VM as the

      OST. <br>

      The file is striped across all 4 OSTs and a stripe size of 1MB is

      used (unless mentioned otherwise).<br>

      I am using different file sizes to measure performance, ranging

      from ~50MB to ~2.5GB. I am measuring end to end times for

      reading/writing the file.<br>

      <br>

      I have performed the following experiments:<br>

      <br>

      A) Using buffers of different sizes (1MB, 2MB, 4MB) to perform

      read/write calls. <br>

      B) To try and see if stripe alignment is beneficial, I aligned

      read/write calls so that they only handle one stripe. If I

      understand correctly, this means that each call is in the form

      `pwrite(fd, buffer, size, offset)` (same for pread), where offset

      is a multiple of stripe_size and size=stripe_size (buffer size =

      stripe_size). For this, stripe_size = buffer size =  1MB is used.<br>

      C) Without stripe alignment and with a 1MB buffer, I tried to

      determine whether stripe_size matters by experimenting with the

      values stripe_size=65536, 655360, 6553600, 1MB.<br>

      <br>

      For a given file size, the results are almost identical for both

      read and write across all my experiments. <br>

      <br>

      My questions are:<br>

    </p>

    <p>Q1) Is the way I am trying to align calls with stripes (and in

      effect make sure each call only needs one OST) correct?<br>

      Q2) If it is indeed correct, is it expected that I don't see any

      difference when aligning calls with stripes vs. when I am not?

      Based on our discussion and best practices I found online, I would

      expect that when alignment is taken into consideration performance

      is better.<br>

      Q3) Is it expected that I don't see any difference in performance

      using variable stripe sizes (with fixed size of read/write

      operations, namely 1MB)?<br>

      Q4) Is it expected that I don't see any difference in performance

      using variable size of read/write operations (with fixed

      stripe_size

      1MB)?<br>

      Q5) If the parameters mentioned should indeed affect performance,

      any idea what the reason might be that in my setup no difference

      is observed? E.g. I was thinking that MGS/MDT node could be slow

      and thus a bottleneck, or the files are too small to see any

      significant difference etc.<br>

      <br>

      Any additional things I might be missing to better understand what

      is going on?<br>

    </p>
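    <p>For concreteness, here is a minimal sketch of how each client's
      section of the file could be rounded to stripe boundaries, so that
      every 1MB call in experiment B starts on a multiple of stripe_size.
      The names `struct section` and `client_section` are hypothetical
      and only illustrate the arithmetic; they are not taken from my
      actual test program.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Half-open byte range [start, end) handled by one client. */
struct section {
    uint64_t start;
    uint64_t end;
};

/* Split a file of file_size bytes among nclients so that every section
 * starts on a multiple of stripe_size. Only the last section may end on
 * an unaligned offset (the end of the file). Hypothetical helper. */
struct section client_section(uint64_t file_size, uint64_t stripe_size,
                              unsigned nclients, unsigned rank)
{
    assert(stripe_size > 0 && nclients > 0 && rank < nclients);

    /* Number of stripe-sized chunks in the file, rounded up. */
    uint64_t nchunks = (file_size + stripe_size - 1) / stripe_size;

    /* Distribute whole chunks as evenly as possible. */
    uint64_t first = nchunks * rank / nclients;
    uint64_t last = nchunks * (rank + 1) / nclients;

    struct section s = { first * stripe_size, last * stripe_size };
    if (s.end > file_size)
        s.end = file_size; /* last chunk may be partial */
    return s;
}
```

    <p>Each client would then loop over its section in steps of
      stripe_size and issue `pwrite(fd, buffer, stripe_size, offset)`
      (or pread) for every offset in [start, end).</p>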

    <div class="moz-cite-prefix">Thanks again for the help,<br>

      <br>

      Apostolis<br>

      <br>

      On 12/10/24 23:30, Andreas Dilger wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:D92CB071-B56F-4031-9542-49DCB3FCA928@whamcloud.com">


      On Sep 30, 2024, at 13:26, Apostolis Stamatis &lt;<a

        href="mailto:el18034@mail.ntua.gr" class="moz-txt-link-freetext"

        moz-do-not-send="true">el18034@mail.ntua.gr</a>&gt; wrote:<br

        class="">

      <div>

        <blockquote type="cite" class=""><br

            class="Apple-interchange-newline">

          <div class="">

            <div class="">

              <p class="">Thank you very much Andreas.</p>

              <p class="">Your explanation was very insightful.</p>

              <p class="">I do have the following questions/thoughts:</p>

              <p class="">Let's say I have 2 available OSTs, and 4MB of

                data. The stripe-size is 1MB. (Sizes are small for

                discussion purposes, I am trying to understand what

                solution -if any- would perform better in general)</p>

              <p class="">I would like to compare the following two

                strategies of writing/reading the data:</p>

              <p class="">A) I can store all the data in 1 single big

                lustre file, striped across the 2 OSTs.<br class="">

              </p>

              <p class="">B) I can create (e.g.) 4  smaller lustre

                files, each consisting of 1MB of data. Suppose I place

                them manually in the same way that they would be striped

                on strategy A.</p>

              <p class="">So the only difference between the 2

                strategies is whether data is in a single lustre file or

                not (meaning I make sure each OST has a similar load in

                both cases).<br class="">

              </p>

              <p class="">Then:<br class="">

              </p>

              <p class="">Q1. Suppose I have 4 simultaneous processes,

                each wanting to read 1MB of data. On strategy A, each

                process opens the file (via llapi_file_open) and then

                reads the corresponding data by calculating the offset

                from the start. On strategy B each process simply opens

                the corresponding file and reads its data. Would there

                be any difference in performance between the two

                strategies ?<br class="">

              </p>

            </div>

          </div>

        </blockquote>

        <div>For reading it is unlikely that there would be a

          significant difference in performance.  For writing, option A

          would be somewhat slower than B for large amounts of data,

          because there would be some lock contention between parallel

          writers to the same file.</div>

        <div><br class="">

        </div>

        <div>However, if this behavior is expanded to a large scale,

          then having millions or billions of 1MB files would have a

          different kind of overhead to open/close each file separately

          and having to manage so many of those files vs. having

          fewer/larger files.  Given that a single client can read/write

          GB/s, it makes sense to aggregate enough data per file to

          amortize the overhead of the lookup/open/stat/close.</div>

        <div><br class="">

        </div>

        <div>Large-scale HPC applications try to pick a middle ground,

          for example having 1 file per checkpoint timestep written in

          parallel (instead of 1M separate per-CPU files), but each

          timestep (hourly) has a different file.  Alternately, each

          timestep could write individual files into a separate

          directory, if they are reasonably large (e.g. GB).</div>

        <div><br class="">

        </div>

        <blockquote type="cite" class="">

          <div class="">

            <div class="">

              <p class="">Q2. Suppose I have 1 process, wanting to read

                the (e.g.)  3rd MB of data. Would strategy B be better,

                since it avoids the overhead of "skipping" to the offset

                that is required in strategy A ?</p>

            </div>

          </div>

        </blockquote>

        <div>Seeking the offset pointer within a file has no cost.  That

          is just changing a number in the open file descriptor on the

          client, so it doesn't involve the servers or any kind of

          locking.</div>

        <div><br class="">

        </div>

        <blockquote type="cite" class="">

          <div class="">

            <div class="">

              <p class="">Q3. For question 2, would the answer be

                different if the read is not aligned to the stripe-size?

                Meaning that in both strategies I would have to skip to

                an offset (compared to Q2 where I could just read the

                whole file in strategy B from the start), but in

                strategy A the skip is bigger.</p>

            </div>

          </div>

        </blockquote>

        <div>Same answer as 2 - the seeking itself has no cost.  The

          *read* of unaligned data in this case is likely to be somewhat

          slower than reading aligned data (it may send RPCs to two

          OSTs, needing two separate locks, etc).  However, with any

          large-sized read (e.g. 8 MB+) it is unlikely to make a

          significant difference.</div>
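        <div>As a rough illustration of why an unaligned read may involve
          two OSTs, the following sketch counts how many stripe-sized
          chunks, and therefore how many objects, a byte range overlaps.
          It assumes a plain round-robin layout; the function names are
          hypothetical and not part of any Lustre API.</div>

```c
#include <assert.h>
#include <stdint.h>

/* Number of stripe_size chunks overlapped by [offset, offset + len). */
uint64_t chunks_touched(uint64_t offset, uint64_t len, uint64_t stripe_size)
{
    assert(stripe_size > 0);
    if (len == 0)
        return 0;
    uint64_t first = offset / stripe_size;
    uint64_t last = (offset + len - 1) / stripe_size;
    return last - first + 1;
}

/* Number of distinct objects involved. Consecutive chunks go to
 * consecutive objects, so this is capped at stripe_count. */
unsigned objects_touched(uint64_t offset, uint64_t len,
                         uint64_t stripe_size, unsigned stripe_count)
{
    assert(stripe_count > 0);
    uint64_t n = chunks_touched(offset, len, stripe_size);
    return n < stripe_count ? (unsigned)n : stripe_count;
}
```

        <div>With stripe_size=1MB, a 1MB read at offset 0 overlaps one
          chunk, while the same read at offset 512KB overlaps two. An
          8MB read on a 2-stripe file involves both objects whether or
          not it is aligned.</div>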

        <br class="">

        <blockquote type="cite" class="">

          <div class="">

            <div class="">

              <p class="">Q4. One concern I have regarding strategy A is

                that all the stripes of the file that are in the same

                OST are seen -internally- as one object (as per

                "Understanding Lustre Internals"). Does this affect

                performance when different, but not overlapping, parts

                of the file (that are on the same OST) are being

                accessed (for example due to locking)? Does it matter if

                the parts being accessed are in different "chunks", e.g.

                the 1st and 3rd MB in the above example?<br class="">

              </p>

            </div>

          </div>

        </blockquote>

        <div><br class="">

        </div>

        No, Lustre can allow concurrent read access to a single object

        from multiple threads/clients.  When writing the file, there can

        also be concurrent write access to a single object, but only

        with non-overlapping regions.  That would also be true if

        writing to separate files in option B (contention if two

        processes tried to write the same small file).</div>

      <div><br class="">

        <blockquote type="cite" class="">

          <div class="">

            <p class="">Also if there are any additional docs I can read

              on those topics (apart from "Understanding Lustre

              internals") to get a better understanding, please do point

              them out.</p>

          </div>

        </blockquote>

        <div>Patrick Farrell has presented at LAD and LUG a few times

          about optimizations to the IO pipeline, which may be

          interesting:</div>

        <div><a href="https://wiki.lustre.org/Lustre_User_Group_2022"

            class="moz-txt-link-freetext" moz-do-not-send="true">https://wiki.lustre.org/Lustre_User_Group_2022</a></div>

        <div>- <a

href="https://wiki.lustre.org/images/a/a3/LUG2022-Future_IO_Path-Farrell.pdf"

            class="moz-txt-link-freetext" moz-do-not-send="true">

https://wiki.lustre.org/images/a/a3/LUG2022-Future_IO_Path-Farrell.pdf</a></div>

        <div><a href="https://www.eofs.eu/index.php/events/lad-23/"

            class="moz-txt-link-freetext" moz-do-not-send="true">https://www.eofs.eu/index.php/events/lad-23/</a></div>

        <div>- <a

href="https://www.eofs.eu/wp-content/uploads/2024/02/04-LAD-2023-Unaligned-DIO.pdf"

            class="moz-txt-link-freetext" moz-do-not-send="true">

https://www.eofs.eu/wp-content/uploads/2024/02/04-LAD-2023-Unaligned-DIO.pdf</a></div>

        <div><a href="https://wiki.lustre.org/Lustre_User_Group_2024"

            class="moz-txt-link-freetext" moz-do-not-send="true">https://wiki.lustre.org/Lustre_User_Group_2024</a></div>

        <div>- <a

href="https://wiki.lustre.org/images/a/a0/LUG2024-Hybrid_IO_Path_Update-Farrell.pdf"

            class="moz-txt-link-freetext" moz-do-not-send="true">https://wiki.lustre.org/images/a/a0/LUG2024-Hybrid_IO_Path_Update-Farrell.pdf</a></div>

        <br class="">

        <blockquote type="cite" class="">

          <div class="">

            <p class="">Thanks again for your help,</p>

            <p class="">Apostolis<br class="">

            </p>

            <p class=""><br class="">

            </p>

            <div class="moz-cite-prefix">On 9/23/24 00:42, Andreas

              Dilger wrote:<br class="">

            </div>

            <blockquote type="cite"

cite="mid:61062F8B-38EB-462E-9C05-60E5C7D1B914@whamcloud.com" class="">

              <div

style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

                class="">

                On Sep 18, 2024, at 10:47, Apostolis Stamatis &lt;<a

                  href="mailto:el18034@mail.ntua.gr"

                  class="moz-txt-link-freetext" moz-do-not-send="true">el18034@mail.ntua.gr</a>&gt;

                wrote:

                <div class="">

                  <blockquote type="cite" class="">

                    <div class="">

                      <div class="">I am trying to read/write a specific

                        stripe for files striped across multiple OSTs.

                        I've been looking around the C api but with no

                        success so far.<br class="">

                        <br class="">

                        <br class="">

                        Let's say I have a big file which is striped

                        across multiple OSTs. I have a cluster of

                        compute nodes which perform some computation on

                        the data of the file. Each node needs only a

                        subset of that data.<br class="">

                        <br class="">

                        I want each node to be able to read/write only

                        the needed information, so that all reads/writes

                        can happen in parallel. The desired data may or

                        may not be aligned with the stripes (this is

                        secondary).<br class="">

                        <br class="">

                        It is my understanding that stripes are just

                        parts of the file. Meaning that if I have an

                        array of 100 rows and stripe A contains the

                        first half, then it would contain the first 50

                        rows, is this correct?<br class="">

                      </div>

                    </div>

                  </blockquote>

                  <div class=""><br class="">

                  </div>

                  This is not totally correct.  The location of the data

                  depends on the size of the data and the stripe size.</div>

                <div class=""><br class="">

                </div>

                <div class="">For a 1-stripe file (the default unless

                  otherwise specified), all of the data would be in

                  a single object, regardless of the size of the data.</div>

                <div class=""><br class="">

                </div>

                <div class="">For a 2-stripe file with stripe_size=1MiB,

                  the first MB of data [0-1MB) is on object 0, the

                  second MB of data [1-2MB) is on object 1, and the

                  third MB of data [2-3MB) is back on object 0, etc.</div>
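                <div class="">The mapping above can be written as a small
                  calculation. This is only a sketch that assumes a plain
                  round-robin layout; the function names are hypothetical
                  and are not Lustre API calls.</div>

```c
#include <assert.h>
#include <stdint.h>

/* Stripe (object) number within the file layout that holds the byte at
 * `offset`. This is the stripe number, not the OST index. */
unsigned stripe_index(uint64_t offset, uint64_t stripe_size,
                      unsigned stripe_count)
{
    assert(stripe_size > 0 && stripe_count > 0);
    return (unsigned)((offset / stripe_size) % stripe_count);
}

/* Offset inside that object at which the byte is stored. */
uint64_t object_offset(uint64_t offset, uint64_t stripe_size,
                       unsigned stripe_count)
{
    assert(stripe_size > 0 && stripe_count > 0);
    uint64_t chunk_no = offset / stripe_size;       /* chunk within file */
    return (chunk_no / stripe_count) * stripe_size  /* earlier chunks on this object */
           + offset % stripe_size;                  /* position inside the chunk */
}
```

                <div class="">For the 2-stripe example, offsets 0MB, 1MB
                  and 2MB give stripe numbers 0, 1 and 0, and the third
                  MB of the file starts 1MB into object 0.</div>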

                <div class=""><br class="">

                </div>

                <div class="">See <a

href="https://wiki.lustre.org/Understanding_Lustre_Internals#Lustre_File_Layouts"

                    class="moz-txt-link-freetext" moz-do-not-send="true">https://wiki.lustre.org/Understanding_Lustre_Internals#Lustre_File_Layouts</a> for

                  example.</div>

                <div class=""><br class="">

                  <blockquote type="cite" class="">

                    <div class="">

                      <div class="">To sum up my questions are:<br

                          class="">

                        <br class="">

                        1) Can I read/write a specific stripe of a file

                        via the C api to achieve better

                        performance/locality?<br class="">

                      </div>

                    </div>

                  </blockquote>

                  <div class=""><br class="">

                  </div>

                  There is no Lustre llapi_* interface that provides

                  this functionality, but you can of course read the

                  file with regular read() or preferably pread() or

                  readv() calls with the right file offsets.  </div>
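                <div class="">A minimal sketch of reading one stripe-sized
                  chunk with pread() follows. The helper names
                  `read_chunk` and `read_chunk_of_file` are hypothetical;
                  only open(), pread() and close() are real calls, and
                  the chunk number is the position of the chunk within
                  the file rather than an OST index.</div>

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Read chunk number chunk_no (stripe_size bytes, fewer at end of file)
 * into buf. Returns the number of bytes read, or -1 on error. */
ssize_t read_chunk(int fd, void *buf, uint64_t chunk_no, size_t stripe_size)
{
    assert(stripe_size > 0);
    off_t start = (off_t)(chunk_no * stripe_size);
    size_t done = 0;

    while (done < stripe_size) {
        /* pread() takes an explicit offset and does not move the
         * file position, so parallel readers can share one fd. */
        ssize_t n = pread(fd, (char *)buf + done, stripe_size - done,
                          start + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted, retry */
            return -1;
        }
        if (n == 0)
            break; /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Convenience wrapper: open the file, read one chunk, close it. */
ssize_t read_chunk_of_file(const char *path, void *buf, uint64_t chunk_no,
                           size_t stripe_size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read_chunk(fd, buf, chunk_no, stripe_size);
    close(fd);
    return n;
}
```

                <div class="">The loop handles short reads, which pread()
                  is allowed to return even for regular files.</div>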

                <div class=""><br class="">

                </div>

                <div class="">

                  <blockquote type="cite" class="">

                    <div class="">

                      <div class="">2) Is it correct that stripes

                        include parts of the file, meaning the raw data?

                        If not, can the raw data be extracted from any

                        additional information stored in the stripe?<br

                          class="">

                      </div>

                    </div>

                  </blockquote>

                  <div class=""><br class="">

                  </div>

                  <div class="">For example, if you have a 4-stripe

                    file, then the application should read every 4th MB

                    of the file to stay on the same OST object. Note

                    that the *OST* index is not necessarily the same as

                    the *stripe* number of the file.  To read the file

                    from the local OST, the application should check the

                    local OST index and find the stripe of the file that

                    is stored on that OST; the offset from the start of

                    the file is then stripe_size * stripe_number.</div>
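                  <div class="">As a sketch of the "every 4th MB" pattern,
                    the helper below lists the file offsets of all chunks
                    that belong to one stripe number, assuming a plain
                    round-robin layout. The name `stripe_chunk_offsets`
                    is hypothetical. Finding which stripe number lives on
                    the local OST (for example from the output of "lfs
                    getstripe") is not covered here.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store in out[] the file offsets of the chunks belonging to
 * stripe_number, up to max_out entries. Returns how many were stored. */
size_t stripe_chunk_offsets(uint64_t file_size, uint64_t stripe_size,
                            unsigned stripe_count, unsigned stripe_number,
                            uint64_t *out, size_t max_out)
{
    assert(stripe_size > 0 && stripe_count > 0);
    assert(stripe_number < stripe_count);

    /* First chunk is at stripe_size * stripe_number; the next chunk on
     * the same object is one full stripe width further on. */
    uint64_t step = stripe_size * stripe_count;
    size_t n = 0;
    for (uint64_t off = stripe_size * stripe_number;
         off < file_size && n < max_out; off += step)
        out[n++] = off;
    return n;
}
```

                  <div class="">For a 10MB file with 4 stripes of 1MB,
                    stripe number 1 covers the chunks at offsets 1MB, 5MB
                    and 9MB.</div>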

                  <div class=""><br class="">

                  </div>

                  <div class="">However, you could also do this more

                    easily by having a bunch of 1-stripe files and doing

                    the reads directly on the local OSTs.  You would run

                    "lfs find DIR -i LOCAL_OST_IDX" to get a list of the

                    files on each OST, and then process them directly.</div>

                  <div class=""><br class="">

                  </div>

                  <blockquote type="cite" class="">

                    <div class="">

                      <div class="">3) If each compute node is run on

                        top of a different OST where stripes of the file

                        are stored, would it be better in terms of

                        performance to have the node read the stripe of

                        its OST? (because e.g. it avoids data transfer

                        over the network)<br class="">

                      </div>

                    </div>

                  </blockquote>

                  <br class="">

                </div>

                <div class="">This is not necessarily needed, if you

                  have a good network, but it depends on the workload.

                   Local PCI storage access is about the same speed as

                  remote PCI network access because they are limited by

                  the PCI bus bandwidth.  You would notice a difference

                  if you have a large number of clients that are

                  completely IO-bound and overwhelm the storage.</div>

                <br class="">

                <div class="">

                  <div dir="auto"

style="caret-color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

                    class="">

                    <div dir="auto"

style="caret-color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

                      class="">

                      <div dir="auto"

style="caret-color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

                        class="">

                        <div dir="auto"

style="caret-color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

                          class="">

                          <div dir="auto"

style="caret-color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

                            class="">

                            <div dir="auto"

style="caret-color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

                              class="">

                              <div class="">Cheers, Andreas</div>

                              <div class="">--</div>

                              <div class="">Andreas Dilger</div>

                              <div class="">Lustre Principal Architect</div>

                              <div class="">Whamcloud</div>

                              <div class=""><br class="">

                              </div>

                              <div class=""><br class="">

                              </div>

                              <div class=""><br class="">

                              </div>

                            </div>

                          </div>

                        </div>

                      </div>

                    </div>

                    <br class="Apple-interchange-newline">

                  </div>

                  <br class="Apple-interchange-newline">

                  <br class="Apple-interchange-newline">

                </div>

                <br class="">

              </div>

            </blockquote>

          </div>

        </blockquote>

      </div>

      <br class="">

      <div class="">

        <div dir="auto"

style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

          class="">

          <div dir="auto"

style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

            class="">

            <div dir="auto"

style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

              class="">

              <div dir="auto"

style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

                class="">

                <div dir="auto"

style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

                  class="">

                  <div dir="auto"

style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

                    class="">

                    <div>Cheers, Andreas</div>

                    <div>--</div>

                    <div>Andreas Dilger</div>

                    <div>Lustre Principal Architect</div>

                    <div>Whamcloud</div>

                    <div><br class="">

                    </div>

                    <div><br class="">

                    </div>

                    <div><br class="">

                    </div>

                  </div>

                </div>

              </div>

            </div>

          </div>

          <br class="Apple-interchange-newline">

        </div>

        <br class="Apple-interchange-newline">

        <br class="Apple-interchange-newline">

      </div>

      <br class="">

    </blockquote>

  </body>

</html>