<div dir="ltr">Patrick,<div><br></div><div>You bring up an interesting point on read vs. write performance.  We can't use lfs_migrate to control the stripe count used for writes (obviously), so it is left to the application developer, or at least the user, to intelligently place shared-access files in a directory with wider striping.  Restriping a file with lfs_migrate could change *read* performance characteristics, so there is indeed some risk there... but your work implies that is not too bad.  If we only restripe files that are "old", then the likelihood that they will be read again goes way down, and balancing capacity used becomes a bigger factor.  Bottom line is that I think restriping has more potential upside than downside. :)</div><div><br></div><div>Thanks,</div><div>Nathan</div><div><br></div><div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, May 18, 2016 at 1:22 PM, Patrick Farrell <span dir="ltr"><<a href="mailto:paf@cray.com" target="_blank">paf@cray.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

  <div bgcolor="#FFFFFF" text="#000000">

    Nathan,<br>

    <br>

    This *is* excellent fodder for discussion.<br>

    <br>

    A few thoughts from a developer perspective.  When you stripe a file

    to multiple OSTs, you're spreading the data out across multiple

    targets, which (to my mind) has two purposes:<br>

    1) More even space usage across OSTs (mostly relevant for *really*

    big files, since in general, singly striped files are distributed

    across OSTs anyway)<br>

    2) Better bandwidth/parallelism for accesses to the file.<br>

    <br>

    The first one lends itself well to a file size based heuristic, but

    I'm not sure the second one does.  That's more about access

    patterns.  I'm not sure that you see much bandwidth benefit from

    striping with a single client, at least as long as an individual OST

    is fast relative to a client (increasingly common, I think, with

    flash and larger RAID arrays).  So then, whatever the file size, if

    it's accessed from one client, it should probably be singly striped.<br>

    <br>

    Also, for shared files, client count relative to stripe count has a

    huge impact on write performance.  Assuming strided I/O patterns,

    anything more than 1 client per stripe/OST is actually worse than 1

    client.  (See my lock ahead presentation at LUG'15 for more on

    this.)  Read performance doesn't share this weirdness, though.<br>

    <br>

    All that's to say that for case 2 above, at least for writing, it's

    access pattern/access parallelism, not size, which matters.  I'm

    sure there's some correlation between file size and how parallel the

    access pattern is, but it might be very loose, and at least write

    performance doesn't scale linearly with stripe count.  Instead, the

    behavior is complex.<br>

    <br>

    So in order to pick an ideal striping with case 2 in mind, you

    really need to understand the application access pattern.  I can't

    see another way to do that goal justice.  (The Lustre ADIO in the

    MPI I/O library does this, partly by controlling the I/O pattern

    through I/O aggregation for collective I/Os.)<br>

    <br>

    So I think your tool can definitely help with case 1, not so sure

    about case 2.<br>

    <br>

    - Patrick<br>

    <br>

    <div>On 05/18/2016 12:22 PM, Nathan Dauchy -

      NOAA Affiliate wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">

        <div class="gmail_quote">

          <div dir="ltr">

            <div>

              <div>Greetings All,</div>

              <div><br>

              </div>

              <div>I'm looking for your experience and perhaps some

                lively discussion regarding "best practices" for

                choosing a file stripe count.  The Lustre manual has

                good tips on "Choosing a Stripe Size", and in practice

                the default 1M rarely causes problems on our systems.

                For Stripe Count, on the other hand, it is far more difficult to

                choose a single value that is efficient for a general

                purpose and multi-use site-wide file system.</div>

              <div><br>

              </div>

              <div>Since there is the "increased overhead" of striping,

                and weather applications do unfortunately write MANY

                tiny files, we usually keep the filesystem default

                stripe count at 1.  Unfortunately, there are several

                users who then write very large and shared-access files

                with that default.  I would like to be able to tell them

                to restripe... but without digging into the specific

                application and access pattern it is hard to know what

                count to recommend.  Plus there is the "stripe these but

                not those" confusion... it is common for users to have a

                few very large data files and many small log or output

                image files in the SAME directory.</div>

              <div><br>

              </div>

              <div>What do you all recommend as a reasonable rule of

                thumb that works for "most" users' needs, where stripe

                count can be determined based only on static data

                attributes (such as file size)?  I have heard a "stripe

                per GB" idea, but some have said that escalates to too

                many stripes too fast.  ORNL has a knowledge base

                article that says use a stripe count of "File size / 100

                GB", but does that make sense for smaller, non-DOE

                sites?  Would stripe count = Log2(size_in_GB)+1 be more

                generally reasonable?  For a 1 TB file, that actually

                works out to be similar to ORNL, only gets there more

                gradually:</div>

              <div>

                <div>    <a href="https://www.olcf.ornl.gov/kb_articles/lustre-basics/#Stripe_Count" target="_blank">https://www.olcf.ornl.gov/kb_articles/lustre-basics/#Stripe_Count</a><br>

                </div>

              </div>
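              <div>To make the comparison concrete, here is a minimal sketch in plain shell (not part of lfs_migrate or the patch below) that prints the stripe count each of the three rules of thumb above would choose for a given file size. The function name stripe_counts, the rounding down to whole GB, and the clamp to a minimum of one stripe are assumptions made only for illustration.</div>

```shell
# Hypothetical helper (not part of lfs_migrate): print the stripe count that
# each rule of thumb would choose for a file of the given size in bytes.
# Output order: "stripe per GB", "file size / 100 GB", "Log2(size_in_GB) + 1".
stripe_counts() {
    gb=$(( $1 / 1073741824 ))          # whole GB, rounded down
    per_gb=$gb                         # one stripe per GB
    per_100gb=$(( gb / 100 ))          # file size / 100 GB
    log2p1=1                           # floor(Log2(size_in_GB)) + 1
    n=$gb
    while [ "$n" -ge 2 ]; do           # integer log2 by repeated halving
        n=$(( n / 2 ))
        log2p1=$(( log2p1 + 1 ))
    done
    # Assumption: never go below one stripe.
    [ "$per_gb" -ge 1 ] || per_gb=1
    [ "$per_100gb" -ge 1 ] || per_100gb=1
    echo "$per_gb $per_100gb $log2p1"
}

stripe_counts 1099511627776    # 1 TB file
```

              <div>For the 1 TB case this prints "1024 10 11": the Log2 rule lands close to the "File size / 100 GB" rule at that size, while one stripe per GB is two orders of magnitude higher.</div>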

              <div><br>

              </div>

              <div>Ideally, I would like to have a tool to give to

                users and say "go restripe your directory with this

                command" and it will do the right thing in 90% of

                cases.  See the rough patch to lfs_migrate (included

                below) which should help explain what I'm thinking. 

                Probably there are more efficient ways of doing things,

                but I have tested it lightly and it works as a

                proof-of-concept.</div>

              <div><br>

              </div>

              <div>With a good programmatic rule of thumb, we (as a

                Lustre community!) can eventually work with application

                developers to embed the stripe count selection into

                their code and get things at least closer to right up

                front.  Even if trial and error is involved to find the

                optimal setting, at least the rule of thumb can be a

                _starting_point_ for the users, and they can tweak it

                from there based on application, model, scale, dataset,

                etc.</div>

              <div><br>

              </div>

              <div>Thinking farther down the road, with progressive file

                layout, what algorithm will be used as the default?  If

                Lustre gets to the point where it can rebalance OST

                capacity behind the scenes, could it also make some

                intelligent choice about restriping very large files to

                spread out load and better balance capacity?  (Would

                that mean we need a bit set on the file to flag whether

                the stripe info was set specifically by the user or

                automatically by Lustre tools or it was just using the

                system default?)  Can the filesystem track concurrent

                access to a file, and perhaps migrate the file and

                adjust stripe count based on number of active clients?</div>

              <div><br>

              </div>

              <div>I appreciate any and all suggestions, clarifying

                questions, heckles, etc.  I know this is a lot of

                questions, and I certainly don't expect definitive

                answers on all of them, but I hope it is at least food

                for thought and discussion! :)</div>

              <div><br>

              </div>

              <div>Thanks,</div>

              <div>Nathan</div>

              <div><br>

              </div>

            </div>

            <div><br>

            </div>

            <div>

              <div>--- lfs_migrate-2.7.1<span style="white-space:pre-wrap"> </span>2016-05-13

                12:46:06.828032000 +0000</div>

              <div>+++ lfs_migrate.auto-count<span style="white-space:pre-wrap"> </span>2016-05-17

                21:37:19.036589000 +0000</div>

              <div>@@ -21,8 +21,10 @@</div>

              <div> </div>

              <div> usage() {</div>

              <div>     cat -- <<USAGE 1>&2</div>

              <div>-usage: lfs_migrate [-c <stripe_count>] [-h]

                [-l] [-n] [-q] [-R] [-s] [-y] [-0]</div>

              <div>+usage: lfs_migrate [-A] [-c <stripe_count>]

                [-h] [-l] [-n] [-q] [-R] [-s] [-v] [-y] [-0]</div>

              <div>                    [file|dir ...]</div>

              <div>+    -A restripe file using an automatically selected

                stripe count</div>

              <div>+       currently Stripe Count = Log2(size_in_GB) + 1</div>

              <div>     -c <stripe_count></div>

              <div>        restripe file using the specified stripe

                count</div>

              <div>     -h show this usage message</div>

              <div>@@ -31,11 +33,11 @@</div>

              <div>     -q run quietly (don't print filenames or status)</div>

              <div>     -R restripe file using default directory

                striping</div>

              <div>     -s skip file data comparison after migrate</div>

              <div>+    -v be verbose and print information about each

                file</div>

              <div>     -y answer 'y' to usage question</div>

              <div>     -0 input file names on stdin are separated by a

                null character</div>

              <div> </div>

              <div>-The -c <stripe_count> option may not be

                specified at the same time as</div>

              <div>-the -R option.</div>

              <div>+Only one of the '-A', '-c', or '-R' options may be

                specified at a time.</div>

              <div> </div>

              <div> If a directory is an argument, all files in the

                directory are migrated.</div>

              <div> If no file/directory is given, the file list is read

                from standard input.</div>

              <div>@@ -48,15 +50,19 @@</div>

              <div> </div>

              <div> OPT_CHECK=y</div>

              <div> OPT_STRIPE_COUNT=""</div>

              <div>+OPT_AUTOSTRIPE=""</div>

              <div>+OPT_VERBOSE=""</div>

              <div> </div>

              <div>-while getopts "c:hlnqRsy0" opt $*; do</div>

              <div>+while getopts "Ac:hlnqRsvy0" opt $*; do</div>

              <div>     case $opt in</div>

              <div>+<span style="white-space:pre-wrap"> </span>A)

                OPT_AUTOSTRIPE=y;;</div>

              <div> <span style="white-space:pre-wrap"> </span>c)

                OPT_STRIPE_COUNT=$OPTARG;;</div>

              <div> <span style="white-space:pre-wrap"> </span>l)

                OPT_NLINK=y;;</div>

              <div> <span style="white-space:pre-wrap"> </span>n)

                OPT_DRYRUN=n; OPT_YES=y;;</div>

              <div> <span style="white-space:pre-wrap"> </span>q)

                ECHO=:;;</div>

              <div> <span style="white-space:pre-wrap"> </span>R)

                OPT_RESTRIPE=y;;</div>

              <div> <span style="white-space:pre-wrap"> </span>s)

                OPT_CHECK="";;</div>

              <div>+<span style="white-space:pre-wrap"> </span>v)

                OPT_VERBOSE=y;;</div>

              <div> <span style="white-space:pre-wrap"> </span>y)

                OPT_YES=y;;</div>

              <div> <span style="white-space:pre-wrap"> </span>0)

                OPT_NULL=y;;</div>

              <div> <span style="white-space:pre-wrap"> </span>h|\?)

                usage;;</div>

              <div>@@ -69,6 +75,16 @@</div>

              <div> <span style="white-space:pre-wrap"> </span>echo

                "$(basename $0) error: The -c <stripe_count>

                option may not" 1>&2</div>

              <div> <span style="white-space:pre-wrap"> </span>echo "be

                specified at the same time as the -R option."

                1>&2</div>

              <div> <span style="white-space:pre-wrap"> </span>exit 1</div>

              <div>+elif [ "$OPT_STRIPE_COUNT" -a "$OPT_AUTOSTRIPE" ];

                then</div>

              <div>+<span style="white-space:pre-wrap"> </span>echo ""</div>

              <div>+<span style="white-space:pre-wrap"> </span>echo

                "$(basename $0) error: The -c <stripe_count>

                option may not" 1>&2</div>

              <div>+<span style="white-space:pre-wrap"> </span>echo "be

                specified at the same time as the -A option."

                1>&2</div>

              <div>+<span style="white-space:pre-wrap"> </span>exit 1</div>

              <div>+elif [ "$OPT_AUTOSTRIPE" -a "$OPT_RESTRIPE" ]; then</div>

              <div>+<span style="white-space:pre-wrap"> </span>echo ""</div>

              <div>+<span style="white-space:pre-wrap"> </span>echo

                "$(basename $0) error: The -A option may not be

                specified at" 1>&2</div>

              <div>+<span style="white-space:pre-wrap"> </span>echo

                "the same time as the -R option." 1>&2</div>

              <div>+<span style="white-space:pre-wrap"> </span>exit 1</div>

              <div> fi</div>

              <div> </div>

              <div> if [ -z "$OPT_YES" ]; then</div>

              <div>@@ -107,7 +123,7 @@</div>

              <div> <span style="white-space:pre-wrap"> </span>$ECHO -n

                "$OLDNAME: "</div>

              <div> </div>

              <div> <span style="white-space:pre-wrap"> </span># avoid

                duplicate stat if possible</div>

              <div>-<span style="white-space:pre-wrap"> </span>TYPE_LINK=($(LANG=C

                stat -c "%h %F" "$OLDNAME" || true))</div>

              <div>+<span style="white-space:pre-wrap"> </span>TYPE_LINK=($(LANG=C

                stat -c "%h %F %s" "$OLDNAME" || true))</div>

              <div> </div>

              <div> <span style="white-space:pre-wrap"> </span># skip

                non-regular files, since they don't have any objects</div>

              <div> <span style="white-space:pre-wrap"> </span># and

                there is no point in trying to migrate them.</div>

              <div>@@ -127,11 +143,6 @@</div>

              <div> <span style="white-space:pre-wrap"> </span>continue</div>

              <div> <span style="white-space:pre-wrap"> </span>fi</div>

              <div> </div>

              <div>-<span style="white-space:pre-wrap"> </span>if [

                "$OPT_DRYRUN" ]; then</div>

              <div>-<span style="white-space:pre-wrap"> </span>echo -e

                "dry run, skipped"</div>

              <div>-<span style="white-space:pre-wrap"> </span>continue</div>

              <div>-<span style="white-space:pre-wrap"> </span>fi</div>

              <div>-</div>

              <div> <span style="white-space:pre-wrap"> </span>if [

                "$OPT_RESTRIPE" ]; then</div>

              <div> <span style="white-space:pre-wrap"> </span>UNLINK=""</div>

              <div> <span style="white-space:pre-wrap"> </span>else</div>

              <div>@@ -140,16 +151,43 @@</div>

              <div> <span style="white-space:pre-wrap"> </span># then

                we don't need to do this getstripe/mktemp stuff.</div>

              <div> <span style="white-space:pre-wrap"> </span>UNLINK="-u"</div>

              <div> </div>

              <div>-<span style="white-space:pre-wrap"> </span>[

                "$OPT_STRIPE_COUNT" ] && COUNT=$OPT_STRIPE_COUNT

                ||</div>

              <div>-<span style="white-space:pre-wrap"> </span>COUNT=$($LFS

                getstripe -c "$OLDNAME" \</div>

              <div>-<span style="white-space:pre-wrap"> </span>2>

                /dev/null)</div>

              <div> <span style="white-space:pre-wrap"> </span>SIZE=$($LFS

                getstripe $LFS_SIZE_OPT "$OLDNAME" \</div>

              <div> <span style="white-space:pre-wrap"> </span>      

                2> /dev/null)</div>

              <div>+<span style="white-space:pre-wrap"> </span>if [

                "$OPT_AUTOSTRIPE" ]; then</div>

              <div>+<span style="white-space:pre-wrap"> </span>FILE_SIZE=${TYPE_LINK[${#TYPE_LINK[@]}-1]}<br>

              </div>

              <div>+<span style="white-space:pre-wrap"> </span># (math

                in bash is dumb, so depend on common tools, and there

                are options for that...)</div>

              <div>+<span style="white-space:pre-wrap"> </span># Stripe

                Count = Log2(size_in_GB) + 1</div>

              <div>+<span style="white-space:pre-wrap"> </span>#COUNT=$(echo

                $FILE_SIZE | awk '{printf

                "%.0f\n",log($1/1024/1024/1024)/log(2)}')</div>

              <div>+<span style="white-space:pre-wrap"> </span>#COUNT=$(printf

                "%.0f\n" $(echo "l($FILE_SIZE/1024/1024/1024) / l(2)" |

                bc -l))</div>

              <div>+<span style="white-space:pre-wrap"> </span>COUNT=$(echo

                "l($FILE_SIZE/1024/1024/1024) / l(2) + 1" | bc -l | cut

                -d . -f 1)</div>

              <div>+<span style="white-space:pre-wrap"> </span># Stripe

                Count = size_in_GB</div>

              <div>+<span style="white-space:pre-wrap"> </span>#COUNT=$(echo

                "scale=0; $FILE_SIZE/1024/1024/1024" | bc -l | cut -d .

                -f 1)</div>

              <div>+<span style="white-space:pre-wrap"> </span>[

                "${COUNT:-0}" -ge 1 ] 2> /dev/null || COUNT=1</div>

              <div>+<span style="white-space:pre-wrap"> </span># (does

                it make sense to skip the file if old</div>

              <div>+<span style="white-space:pre-wrap"> </span># and

                new stripe count are identical?)</div>

              <div>+<span style="white-space:pre-wrap"> </span>else</div>

              <div>+<span style="white-space:pre-wrap"> </span>[

                "$OPT_STRIPE_COUNT" ] && COUNT=$OPT_STRIPE_COUNT

                ||</div>

              <div>+<span style="white-space:pre-wrap"> </span>COUNT=$($LFS

                getstripe -c "$OLDNAME" \</div>

              <div>+<span style="white-space:pre-wrap"> </span>2>

                /dev/null)</div>

              <div>+<span style="white-space:pre-wrap"> </span>fi</div>

              <div> </div>

              <div> <span style="white-space:pre-wrap"> </span>[ -z

                "$COUNT" -o -z "$SIZE" ] && UNLINK=""</div>

              <div>-<span style="white-space:pre-wrap"> </span>SIZE=${LFS_SIZE_OPT}${SIZE}</div>

              <div> <span style="white-space:pre-wrap"> </span>fi</div>

              <div> </div>

              <div>+<span style="white-space:pre-wrap"> </span>if [

                "$OPT_DRYRUN" ]; then</div>

              <div>+<span style="white-space:pre-wrap"> </span>if [

                "$OPT_VERBOSE" ]; then</div>

              <div>+<span style="white-space:pre-wrap"> </span>echo -e

                "dry run, would use count=${COUNT} size=${SIZE}"</div>

              <div>+<span style="white-space:pre-wrap"> </span>else</div>

              <div>+<span style="white-space:pre-wrap"> </span>echo -e

                "dry run, skipped"</div>

              <div>+<span style="white-space:pre-wrap"> </span>fi</div>

              <div>+<span style="white-space:pre-wrap"> </span>continue</div>

              <div>+<span style="white-space:pre-wrap"> </span>fi</div>

              <div>+<span style="white-space:pre-wrap"> </span>if [

                "$OPT_VERBOSE" ]; then</div>

              <div>+<span style="white-space:pre-wrap"> </span>echo -n

                "(count=${COUNT} size=${SIZE}) "</div>

              <div>+<span style="white-space:pre-wrap"> </span>fi</div>

              <div>+</div>

              <div>+<span style="white-space:pre-wrap"> </span>[

                "$SIZE" ] && SIZE=${LFS_SIZE_OPT}${SIZE}</div>

              <div>+</div>

              <div> <span style="white-space:pre-wrap"> </span># first

                try to migrate inside lustre</div>

              <div> <span style="white-space:pre-wrap"> </span># if

                failed go back to old rsync mode</div>

              <div> <span style="white-space:pre-wrap"> </span>if [[

                $RSYNC_MODE == false ]]; then</div>

            </div>

            <div><br>

            </div>

          </div>

        </div>

      </div>

      <br>


      <br>

      <pre>_______________________________________________

lustre-discuss mailing list

<a href="mailto:lustre-discuss@lists.lustre.org" target="_blank">lustre-discuss@lists.lustre.org</a>

<a href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org" target="_blank">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a>

</pre>

    </blockquote>

    <br>

  </div>


<br></blockquote></div><br></div></div></div>