<!DOCTYPE html>

<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Thank you very much Andreas.</p>

    <p>Your explanation was very insightful.</p>

    <p>I do have the following questions/thoughts:</p>

    <p>Let's say I have 2 available OSTs and 4MB of data, with a
      stripe size of 1MB. (The sizes are small for discussion purposes;
      I am trying to understand which solution, if any, would perform
      better in general.)</p>

    <p>I would like to compare the following two strategies of

      writing/reading the data:</p>

    <p>A) I can store all the data in a single big Lustre file, striped
      across the 2 OSTs.<br>
    </p>
    <p>B) I can create (e.g.) 4 smaller Lustre files, each holding 1MB
      of data, and place them manually on the OSTs in the same way that
      the stripes would be placed in strategy A.</p>
    <p>So the only difference between the 2 strategies is whether the
      data is in a single Lustre file or not (meaning I make sure each
      OST has a similar load in both cases).<br>

    </p>

    <p>Then:<br>

    </p>

    <p>Q1. Suppose I have 4 simultaneous processes, each wanting to read
      1MB of data. In strategy A, each process opens the file (via
      llapi_file_open) and then reads the corresponding data by
      calculating the offset from the start. In strategy B, each process
      simply opens the corresponding file and reads its data. Would
      there be any difference in performance between the two
      strategies?<br>

    </p>

    <p>Q2. Suppose I have 1 process wanting to read (e.g.) the 3rd MB
      of data. Would strategy B be better, since it avoids the overhead
      of "skipping" to the offset that is required in strategy A?</p>

    <p>Q3. For question 2, would the answer be different if the read is

      not aligned to the stripe-size? Meaning that in both strategies I

      would have to skip to an offset (compared to Q2 where I could just

      read the whole file in strategy B from the start), but in strategy

      A the skip is bigger.</p>

    <p>Q4. One concern I have regarding strategy A is that all the
      stripes of the file that are on the same OST are seen internally
      as one object (as per "Understanding Lustre Internals"). Does this
      affect performance (for example due to locking) when different,
      non-overlapping parts of the file that are on the same OST are
      being accessed? Does it matter if the parts being accessed are in
      different "chunks", e.g. the 1st and 3rd MB in the above
      example?<br>

    </p>

    <p>Also, if there are any additional docs I can read on these topics
      (apart from "Understanding Lustre Internals") to get a better
      understanding, please do point them out.<br>

    </p>

    <p>Thanks again for your help,</p>

    <p>Apostolis<br>

    </p>

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 9/23/24 00:42, Andreas Dilger wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:61062F8B-38EB-462E-9C05-60E5C7D1B914@whamcloud.com">

      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

      <div

style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

        class=""> On Sep 18, 2024, at 10:47, Apostolis Stamatis &lt;<a
          href="mailto:el18034@mail.ntua.gr"
          class="moz-txt-link-freetext" moz-do-not-send="true">el18034@mail.ntua.gr</a>&gt;

        wrote:

        <div class="">

          <blockquote type="cite" class="">

            <div class="">

              <div class="">I am trying to read/write a specific stripe

                for files striped across multiple OSTs. I've been

                looking around the C api but with no success so far.<br

                  class="">

                <br class="">

                <br class="">

                Let's say I have a big file which is striped across

                multiple OSTs. I have a cluster of compute nodes which

                perform some computation on the data of the file. Each

                node needs only a subset of that data.<br class="">

                <br class="">

                I want each node to be able to read/write only the

                needed information, so that all reads/writes can happen

                in parallel. The desired data may or may not be aligned

                with the stripes (this is secondary).<br class="">

                <br class="">

                It is my understanding that stripes are just parts of

                the file. Meaning that if I have an array of 100 rows

                and stripe A contains the first half, then it would

                contain the first 50 rows, is this correct?<br class="">

              </div>

            </div>

          </blockquote>

          <div class=""><br class="">

          </div>

          This is not totally correct.  The location of the data depends

          on the size of the data and the stripe size.</div>

        <div class=""><br class="">

        </div>

        <div class="">For a 1-stripe file (the default unless otherwise
          specified), all of the data would be in a single object,
          regardless of the size of the data.</div>

        <div class=""><br class="">

        </div>

        <div class="">For a 2-stripe file with stripe_size=1MiB, the
          first MiB of data [0-1MiB) is on object 0, the second MiB of
          data [1-2MiB) is on object 1, and the third MiB of data
          [2-3MiB) is back on object 0, etc.</div>
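        <div class=""><br class="">
        </div>
        <div class="">The round-robin placement described above can be
          sketched as follows. This is only an illustration, assuming a
          plain striped layout with one fixed stripe_size (no composite
          layouts); the function name locate is made up for this example
          and is not part of any Lustre API.</div>
        <div class="">
<pre>
```python
def locate(offset, stripe_size, stripe_count):
    """Map a file offset to (stripe index, offset inside that stripe's object)."""
    chunk = offset // stripe_size            # which stripe_size piece of the file
    stripe_index = chunk % stripe_count      # pieces rotate round-robin over the objects
    # Every full round of stripe_count pieces adds one piece to each object.
    object_offset = (chunk // stripe_count) * stripe_size + offset % stripe_size
    return stripe_index, object_offset


MIB = 1024 * 1024
# The 2-stripe, 1MiB example: print where each of the first four MiB lands.
for mib in range(4):
    print(mib, locate(mib * MIB, MIB, 2))
```
</pre>
        </div>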

        <div class=""><br class="">

        </div>

        <div class="">See <a

href="https://wiki.lustre.org/Understanding_Lustre_Internals#Lustre_File_Layouts"

            class="moz-txt-link-freetext" moz-do-not-send="true">https://wiki.lustre.org/Understanding_Lustre_Internals#Lustre_File_Layouts</a> for

          example.</div>

        <div class=""><br class="">

          <blockquote type="cite" class="">

            <div class="">

              <div class="">To sum up my questions are:<br class="">

                <br class="">

                1) Can I read/write a specific stripe of a file via the

                C api to achieve better performance/locality?<br

                  class="">

              </div>

            </div>

          </blockquote>

          <div class=""><br class="">

          </div>

          There is no Lustre llapi_* interface that provides this

          functionality, but you can of course read the file with

          regular read() or preferably pread() or readv() calls with the

          right file offsets.  </div>
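        <div class=""><br class="">
        </div>
        <div class="">A minimal sketch of reading one stripe-sized piece
          with an explicit offset is shown below. It is written in Python
          for brevity, where os.pread wraps the POSIX pread() call; the
          helper name read_chunk and its parameters are hypothetical, and
          the same pattern applies to pread() in C.</div>
        <div class="">
<pre>
```python
import os


def read_chunk(path, chunk_number, stripe_size):
    """Read the chunk_number-th stripe_size piece of a file with one pread()."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # pread takes the offset as an argument, so the file position is
        # never moved and several readers can work on the file independently.
        # It may return fewer bytes than requested near the end of the file.
        return os.pread(fd, stripe_size, chunk_number * stripe_size)
    finally:
        os.close(fd)
```
</pre>
        </div>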

        <div class=""><br class="">

        </div>

        <div class="">

          <blockquote type="cite" class="">

            <div class="">

              <div class="">2) Is it correct that stripes include parts

                of the file, meaning the raw data? If not, can the raw

                data be extracted from any additional information stored

                in the stripe?<br class="">

              </div>

            </div>

          </blockquote>

          <div class=""><br class="">

          </div>

          <div class="">For example, if you have a 4-stripe file (with
            stripe_size=1MiB), then the application should read every
            4th MB of the file to stay on the same OST object. Note that
            the *OST* index is not necessarily the same as the *stripe*
            number of the file.  To read from the local OST, the
            application should check the local OST index, find which
            stripe number of the file is on that OST index, and start
            reading at file offset = stripe_size * stripe_number.</div>
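          <div class=""><br class="">
          </div>
          <div class="">As a rough sketch of the offset rule above, the
            file offsets that belong to one stripe can be listed like
            this. The function name stripe_offsets is hypothetical, and
            the stripe number is assumed to be already known (finding it
            from the local OST index is not shown here).</div>
          <div class="">
<pre>
```python
def stripe_offsets(stripe_number, stripe_size, stripe_count, file_size):
    """Return the start offset of every piece of the file on one stripe."""
    first = stripe_number * stripe_size   # offset = stripe_size * stripe_number
    step = stripe_size * stripe_count     # the same stripe comes around again here
    return list(range(first, file_size, step))


MIB = 1024 * 1024
# 4-stripe file of 10MiB: stripe 1 holds every 4th MiB, starting at 1MiB.
print([off // MIB for off in stripe_offsets(1, MIB, 4, 10 * MIB)])
```
</pre>
          </div>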

          <div class=""><br class="">

          </div>

          <div class="">However, you could also do this more easily by

            having a bunch of 1-stripe files and doing the reads

            directly on the local OSTs.  You would run "lfs find DIR -i

            LOCAL_OST_IDX" to get a list of the files on each OST, and

            then process them directly.</div>

          <div class=""><br class="">

          </div>

          <blockquote type="cite" class="">

            <div class="">

              <div class="">3) If each compute node is run on top of a

                different OST where stripes of the file are stored,

                would it be better in terms of performance to have the

                node read the stripe of its OST? (because e.g. it avoids

                data transfer over the network)<br class="">

              </div>

            </div>

          </blockquote>

          <br class="">

        </div>

        <div class="">This is not necessarily needed if you have a good
          network, but it depends on the workload.  Local PCI storage
          access is about the same speed as remote PCI network access
          because both are limited by the PCI bus bandwidth.  You would
          notice a difference if you have a large number of clients that
          are completely IO-bound and overwhelm the storage.</div>

        <br class="">

        <div class="">

          <div dir="auto"

style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

            class="">

            <div dir="auto"

style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

              class="">

              <div dir="auto"

style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

                class="">

                <div dir="auto"

style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

                  class="">

                  <div dir="auto"

style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

                    class="">

                    <div dir="auto"

style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"

                      class="">

                      <div class="">Cheers, Andreas</div>

                      <div class="">--</div>

                      <div class="">Andreas Dilger</div>

                      <div class="">Lustre Principal Architect</div>

                      <div class="">Whamcloud</div>


                    </div>

                  </div>

                </div>

              </div>

            </div>

            <br class="Apple-interchange-newline">

          </div>


        </div>

        <br class="">

      </div>

    </blockquote>

  </body>

</html>