<div dir="ltr"><div>Greetings All,</div><div><br></div><div>After looking at this topic further and discussing it with a colleague at NASA, I could be convinced to be more aggressive in REstriping files wider with "lfs_migrate -A".  I would like to know if anyone has recent benchmark results or analysis to support or refute the following...</div><div><br></div><div>When using lfs_migrate, each file is handled with a single process, so the multi-client writing problem identified by Patrick does not apply.  Furthermore, if a user is migrating files to a "hot" tier of storage, they presumably know how that data set will be used and should specify the stripe count based on the future application read access pattern.  In other cases (such as capacity balancing), the performance of lfs_migrate is probably not critical, so the bottom line is that we should not auto-select stripe count based on *write* performance.</div><div><br></div><div>I have searched around for metrics to show whether *read* performance tails off with number of stripes and/or clients at some point.  Also relevant would be data to define just how much the increased overhead of each stripe actually affects metadata operations (particularly "ls -l").  With those numbers, we could make a more informed decision about the algorithm to use for "lfs_migrate -A" in LU-8207.</div><div><br></div><div>* Some good data from my colleague at NASA is in <a href="http://people.nas.nasa.gov/~kolano/papers/hpdic13.pdf">http://people.nas.nasa.gov/~kolano/papers/hpdic13.pdf</a> and shows stat operations clearly getting slower with stripe count, but I'm wondering if that might be outdated based on more recent MDS threading performance improvements.
That paper also shows multi-client read performance improving up to about 16 stripes, then leveling off.</div><div><br></div><div>* This paper shows single-client read performance degrading primarily after 16 or 32 stripes:</div><div><a href="https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/13A-Crosby/LCROSBY-PAPER.pdf">https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/13A-Crosby/LCROSBY-PAPER.pdf</a></div><div><br></div><div>* Another reference is at <a href="http://wiki.opensfs.org/MDS_SMP_Node_Affinity_FinalReport_wiki_version">http://wiki.opensfs.org/MDS_SMP_Node_Affinity_FinalReport_wiki_version</a> ...but it lacks the metadata read operations that are the most critical after migrating existing data.  (There is no degradation of Opencreate and Unlink IOPS up to 4 stripes, though.)</div><div><br></div><div>Therefore, the actual considerations in selecting a stripe count when REstriping files are:</div><div>  * OST capacity and load balancing (more stripes are always better?)</div><div>  * Metadata performance, primarily read ops (progressively worse with more than ~4 stripes)</div><div>  * Single-client read performance (degrades slightly with more stripes?)</div><div>  * Multi-client read performance (more stripes are better up to a point, then performance degrades?)</div><div><br></div><div>Possibly something like "stripe per GB up to 16 stripes, then stripe per 100 GB up to the number of OSTs" is better than the "Log2()" algorithm after all?  Can we even do stripe per 0.5 GB?  
What data is available to determine whether 100 GB is the right value, or should it be the 1% of smallest OST as already proposed for <a href="http://review.whamcloud.com/#/c/20552/">http://review.whamcloud.com/#/c/20552/</a> ?</div><div><br></div><div>Thanks,<br></div><div>Nathan</div><div><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, May 18, 2016 at 1:30 PM, Dilger, Andreas <span dir="ltr"><<a href="mailto:andreas.dilger@intel.com" target="_blank">andreas.dilger@intel.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">

<div style="word-wrap:break-word;color:rgb(0,0,0);font-size:14px;font-family:Calibri,sans-serif">

<div>

<div>

<div><br>

</div>

<div>

<div>

<div>

<div>Cheers, Andreas</div>

<div>-- </div>

<div>Andreas Dilger</div>

</div>

<div>Lustre Principal Architect</div>

<div>Intel High Performance Data Division</div>

</div>

</div>

</div>

</div>

<div><br>

</div>

<span>

<div>

<div>On 2016/05/18, 11:22, "Nathan Dauchy - NOAA Affiliate" <<a href="mailto:nathan.dauchy@noaa.gov" target="_blank">nathan.dauchy@noaa.gov</a>> wrote:</div>

</div>

<div><br>

</div>

<blockquote style="BORDER-LEFT:#b5c4df 5 solid;PADDING:0 0 0 5;MARGIN:0 0 0 5">

<div>

<div>

<div dir="ltr">

<div class="gmail_quote">

<div dir="ltr">

<div>

<div>Greetings All,</div>

<div><br>

</div>

<div>I'm looking for your experience and perhaps some lively discussion regarding "best practices" for choosing a file stripe count.  The Lustre manual has good tips on "Choosing a Stripe Size", and in practice the default 1M rarely causes problems on our systems.

 For Stripe Count, on the other hand, it is far more difficult to choose a single value that is efficient for a general-purpose, multi-use, site-wide file system.</div>

<div><br>

</div>

<div>Since there is the "increased overhead" of striping, and weather applications do unfortunately write MANY tiny files, we usually keep the filesystem default stripe count at 1.  Unfortunately, there are several users who then write very large and shared-access

 files with that default.  I would like to be able to tell them to restripe... but without digging into the specific application and access pattern it is hard to know what count to recommend.  Plus there is the "stripe these but not those" confusion... it is

 common for users to have a few very large data files and many small log or output image files in the SAME directory.</div>

</div>

</div>

</div>

</div>

</div>

</div>

</blockquote>

</span>

<div><br>

</div>

<div>This is exactly what the ORNL "Progressive File Layout" (PFL) project is about.  It automatically increases the stripe count of a file as the file grows.  That will allow a single default layout to describe both small and large files, and go from e.g. 1 stripe

 to 8 stripes to 256 stripes as the size increases.</div>

<div><br>

</div>

<span>

<blockquote style="BORDER-LEFT:#b5c4df 5 solid;PADDING:0 0 0 5;MARGIN:0 0 0 5">

<div>

<div>

<div dir="ltr">

<div class="gmail_quote">

<div dir="ltr">

<div>

<div>What do you all recommend as a reasonable rule of thumb that works for "most" users' needs, where stripe count can be determined based only on static data attributes (such as file size)?  I have heard a "stripe per GB" idea, but some have said that escalates

 to too many stripes too fast.  ORNL has a knowledge base article that says use a stripe count of "File size / 100 GB", but does that make sense for smaller, non-DOE sites?  Would stripe count = Log2(size_in_GB)+1 be more generally reasonable?  For a 1 TB file,

 that actually works out to be similar to ORNL, only gets there more gradually:</div>

<div>

<div>    <a href="https://www.olcf.ornl.gov/kb_articles/lustre-basics/#Stripe_Count" target="_blank">

https://www.olcf.ornl.gov/kb_articles/lustre-basics/#Stripe_Count</a></div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</blockquote>

</span>

<div><br>

</div>

<div>Using the log2() value seems reasonable.</div>

<div><br>

</div>

<span>

<blockquote style="BORDER-LEFT:#b5c4df 5 solid;PADDING:0 0 0 5;MARGIN:0 0 0 5">

<div>

<div>

<div dir="ltr">

<div class="gmail_quote">

<div dir="ltr">

<div>

<div>Ideally, I would like to have a tool to give the users and say "go restripe your directory with this command" and it will do the right thing in 90% of cases.  See the rough patch to lfs_migrate (included below) which should help explain what I'm thinking. 

 Probably there are more efficient ways of doing things, but I have tested it lightly and it works as a proof-of-concept.</div>

</div>

</div>

</div>

</div>

</div>

</div>

</blockquote>

</span>

<div><br>

</div>

<div>I'd welcome this as a patch submitted to Gerrit.</div>

<div><br>

</div>

<span>

<blockquote style="BORDER-LEFT:#b5c4df 5 solid;PADDING:0 0 0 5;MARGIN:0 0 0 5">

<div dir="ltr">

<div class="gmail_quote">

<div dir="ltr">

<div>

<div>With a good programmatic rule of thumb, we (as a Lustre community!) can eventually work with application developers to embed the stripe count selection into their code and get things at least closer to right up front.  Even if trial and error is involved

 to find the optimal setting, at least the rule of thumb can be a _starting_point_ for the users, and they can tweak it from there based on application, model, scale, dataset, etc.</div>

<div><br>

</div>

<div>Thinking farther down the road, with progressive file layout, what algorithm will be used as the default?</div>

</div>

</div>

</div>

</div>

</blockquote>

</span>

<div><br>

</div>

<div>To be clear, the PFL implementation does not currently have an algorithmic layout, rather a series of thresholds based on file size that will select different layouts (initially stripe counts, but could be anything including stripe size, OST pools, etc).

  The PFL size thresholds and stripe counts _could_ be set up (manually) as a geometric series, but they can also be totally arbitrary if you want.</div>
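<div><br>
</div>
<div>Purely as an illustration of what a geometric series of thresholds might look like, here is a small bash sketch.  The function name geometric_thresholds, the starting values, and the ratio are all hypothetical assumptions; it only prints a table and does not call any Lustre command or reflect any PFL default.</div>

```shell
#!/bin/bash
# Illustration only: print a geometric series of size thresholds and
# stripe counts. The function name and the arguments used below are
# hypothetical; nothing here is a PFL default or a Lustre command.
geometric_thresholds() {
	local limit_gib=$1   # first size threshold, in GiB
	local count=$2       # stripe count used up to that threshold
	local ratio=$3       # common ratio applied to both values
	local steps=$4       # number of thresholds to print
	local i=0

	while [ "$i" -lt "$steps" ]; do
		echo "up to ${limit_gib} GiB: stripe_count=${count}"
		limit_gib=$((limit_gib * ratio))
		count=$((count * ratio))
		i=$((i + 1))
	done
}

# Example: start at 1 GiB / 1 stripe and multiply both by 4, four times.
geometric_thresholds 1 1 4 4
```

<div>With these example arguments the last threshold printed is 64 GiB with 64 stripes.  Different ratios for the size and the stripe count would work just as well, which is the point of the thresholds being arbitrary.</div>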

<div><br>

</div>

<span>

<blockquote style="BORDER-LEFT:#b5c4df 5 solid;PADDING:0 0 0 5;MARGIN:0 0 0 5">

<div dir="ltr">

<div class="gmail_quote">

<div dir="ltr">

<div>

<div>If Lustre gets to the point where it can rebalance OST capacity behind the scenes, could it also make some intelligent choice about restriping very large files to spread out load and better balance capacity?  (Would that mean we need a bit set on the file

 to flag whether the stripe info was set specifically by the user or automatically by Lustre tools or it was just using the system default?)  Can the filesystem track concurrent access to a file, and perhaps migrate the file and adjust stripe count based on

 number of active clients?</div>

</div>

</div>

</div>

</div>

</blockquote>

</span>

<div><br>

</div>

<div>I think this would be an interesting task for RobinHood, since it already has much of this information.  It could find large files with low stripe counts and restripe them during OST rebalancing.</div>

<div><br>

</div>

<span>

<blockquote style="BORDER-LEFT:#b5c4df 5 solid;PADDING:0 0 0 5;MARGIN:0 0 0 5">

<div dir="ltr">

<div class="gmail_quote">

<div dir="ltr">

<div>

<div>I appreciate any and all suggestions, clarifying questions, heckles, etc.  I know this is a lot of questions, and I certainly don't expect definitive answers on all of them, but I hope it is at least food for thought and discussion! :)</div>

</div>

</div>

</div>

</div>

</blockquote>

</span>

<div><br>

</div>

<div>One last comment on the patch below:</div>

<div><br>

</div>

<span>

<blockquote style="BORDER-LEFT:#b5c4df 5 solid;PADDING:0 0 0 5;MARGIN:0 0 0 5">

<div dir="ltr">

<div class="gmail_quote">

<div dir="ltr">

<div>

<div>--- lfs_migrate-2.7.1<span style="white-space:pre-wrap"> </span>2016-05-13 12:46:06.828032000 +0000</div>

</div>

<div>

<div>+++ lfs_migrate.auto-count<span style="white-space:pre-wrap"> </span>2016-05-17 21:37:19.036589000 +0000</div>

<div>@@ -21,8 +21,10 @@</div>

<div> </div>

<div> usage() {</div>

<div>     cat -- <<USAGE 1>&2</div>

<div>-usage: lfs_migrate [-c <stripe_count>] [-h] [-l] [-n] [-q] [-R] [-s] [-y] [-0]</div>

<div>+usage: lfs_migrate [-A] [-c <stripe_count>] [-h] [-l] [-n] [-q] [-R] [-s] [-v] [-y] [-0]</div>

<div>                    [file|dir ...]</div>

<div>+    -A restripe file using an automatically selected stripe count</div>

<div>+       currently Stripe Count = Log2(size_in_GB) + 1</div>

<div>     -c <stripe_count></div>

<div>        restripe file using the specified stripe count</div>

<div>     -h show this usage message</div>

<div>@@ -31,11 +33,11 @@</div>

<div>     -q run quietly (don't print filenames or status)</div>

<div>     -R restripe file using default directory striping</div>

<div>     -s skip file data comparison after migrate</div>

<div>+    -v be verbose and print information about each file</div>

<div>     -y answer 'y' to usage question</div>

<div>     -0 input file names on stdin are separated by a null character</div>

<div> </div>

<div>-The -c <stripe_count> option may not be specified at the same time as</div>

<div>-the -R option.</div>

<div>+Only one of the '-A', '-c', or '-R' options may be specified at a time.</div>

<div> </div>

<div> If a directory is an argument, all files in the directory are migrated.</div>

<div> If no file/directory is given, the file list is read from standard input.</div>

<div>@@ -48,15 +50,19 @@</div>

<div> </div>

<div> OPT_CHECK=y</div>

<div> OPT_STRIPE_COUNT=""</div>

<div>+OPT_AUTOSTRIPE=""</div>

<div>+OPT_VERBOSE=""</div>

<div> </div>

<div>-while getopts "c:hlnqRsy0" opt $*; do</div>

<div>+while getopts "Ac:hlnqRsvy0" opt $*; do</div>

<div>     case $opt in</div>

<div>+<span style="white-space:pre-wrap"> </span>A) OPT_AUTOSTRIPE=y;;</div>

<div> <span style="white-space:pre-wrap"> </span>c) OPT_STRIPE_COUNT=$OPTARG;;</div>

<div> <span style="white-space:pre-wrap"> </span>l) OPT_NLINK=y;;</div>

<div> <span style="white-space:pre-wrap"> </span>n) OPT_DRYRUN=n; OPT_YES=y;;</div>

<div> <span style="white-space:pre-wrap"> </span>q) ECHO=:;;</div>

<div> <span style="white-space:pre-wrap"> </span>R) OPT_RESTRIPE=y;;</div>

<div> <span style="white-space:pre-wrap"> </span>s) OPT_CHECK="";;</div>

<div>+<span style="white-space:pre-wrap"> </span>v) OPT_VERBOSE=y;;</div>

<div> <span style="white-space:pre-wrap"> </span>y) OPT_YES=y;;</div>

<div> <span style="white-space:pre-wrap"> </span>0) OPT_NULL=y;;</div>

<div> <span style="white-space:pre-wrap"> </span>h|\?) usage;;</div>

<div>@@ -69,6 +75,16 @@</div>

<div> <span style="white-space:pre-wrap"> </span>echo "$(basename $0) error: The -c <stripe_count> option may not" 1>&2</div>

<div> <span style="white-space:pre-wrap"> </span>echo "be specified at the same time as the -R option." 1>&2</div>

<div> <span style="white-space:pre-wrap"> </span>exit 1</div>

<div>+elif [ "$OPT_STRIPE_COUNT" -a "$OPT_AUTOSTRIPE" ]; then</div>

<div>+<span style="white-space:pre-wrap"> </span>echo ""</div>

<div>+<span style="white-space:pre-wrap"> </span>echo "$(basename $0) error: The -c <stripe_count> option may not" 1>&2</div>

<div>+<span style="white-space:pre-wrap"> </span>echo "be specified at the same time as the -A option." 1>&2</div>

<div>+<span style="white-space:pre-wrap"> </span>exit 1</div>

<div>+elif [ "$OPT_AUTOSTRIPE" -a "$OPT_RESTRIPE" ]; then</div>

<div>+<span style="white-space:pre-wrap"> </span>echo ""</div>

<div>+<span style="white-space:pre-wrap"> </span>echo "$(basename $0) error: The -A option may not be specified at" 1>&2</div>

<div>+<span style="white-space:pre-wrap"> </span>echo "the same time as the -R option." 1>&2</div>

<div>+<span style="white-space:pre-wrap"> </span>exit 1</div>

<div> fi</div>

<div> </div>

<div> if [ -z "$OPT_YES" ]; then</div>

<div>@@ -107,7 +123,7 @@</div>

<div> <span style="white-space:pre-wrap"> </span>$ECHO -n "$OLDNAME: "</div>

<div> </div>

<div> <span style="white-space:pre-wrap"> </span># avoid duplicate stat if possible</div>

<div>-<span style="white-space:pre-wrap"> </span>TYPE_LINK=($(LANG=C stat -c "%h %F" "$OLDNAME" || true))</div>

<div>+<span style="white-space:pre-wrap"> </span>TYPE_LINK=($(LANG=C stat -c "%h %F %s" "$OLDNAME" || true))</div>

<div> </div>

<div> <span style="white-space:pre-wrap"> </span># skip non-regular files, since they don't have any objects</div>

<div> <span style="white-space:pre-wrap"> </span># and there is no point in trying to migrate them.</div>

<div>@@ -127,11 +143,6 @@</div>

<div> <span style="white-space:pre-wrap"> </span>continue</div>

<div> <span style="white-space:pre-wrap"> </span>fi</div>

<div> </div>

<div>-<span style="white-space:pre-wrap"> </span>if [ "$OPT_DRYRUN" ]; then</div>

<div>-<span style="white-space:pre-wrap"> </span>echo -e "dry run, skipped"</div>

<div>-<span style="white-space:pre-wrap"> </span>continue</div>

<div>-<span style="white-space:pre-wrap"> </span>fi</div>

<div>-</div>

<div> <span style="white-space:pre-wrap"> </span>if [ "$OPT_RESTRIPE" ]; then</div>

<div> <span style="white-space:pre-wrap"> </span>UNLINK=""</div>

<div> <span style="white-space:pre-wrap"> </span>else</div>

<div>@@ -140,16 +151,43 @@</div>

<div> <span style="white-space:pre-wrap"> </span># then we don't need to do this getstripe/mktemp stuff.</div>

<div> <span style="white-space:pre-wrap"> </span>UNLINK="-u"</div>

<div> </div>

<div>-<span style="white-space:pre-wrap"> </span>[ "$OPT_STRIPE_COUNT" ] && COUNT=$OPT_STRIPE_COUNT ||</div>

<div>-<span style="white-space:pre-wrap"> </span>COUNT=$($LFS getstripe -c "$OLDNAME" \</div>

<div>-<span style="white-space:pre-wrap"> </span>2> /dev/null)</div>

<div> <span style="white-space:pre-wrap"> </span>SIZE=$($LFS getstripe $LFS_SIZE_OPT "$OLDNAME" \</div>

<div> <span style="white-space:pre-wrap"> </span>      2> /dev/null)</div>

<div>+<span style="white-space:pre-wrap"> </span>if [ "$OPT_AUTOSTRIPE" ]; then</div>

<div>+<span style="white-space:pre-wrap"> </span>FILE_SIZE=${TYPE_LINK[3]}<br>

</div>

<div>+<span style="white-space:pre-wrap"> </span># (math in bash is dumb, so depend on common tools, and there are options for that...)</div>

<div>+<span style="white-space:pre-wrap"> </span># Stripe Count = Log2(size_in_GB) + 1</div>

<div>+<span style="white-space:pre-wrap"> </span>#COUNT=$(echo $FILE_SIZE | awk '{printf "%.0f\n",log($1/1024/1024/1024)/log(2)}')</div>

<div>+<span style="white-space:pre-wrap"> </span>#COUNT=$(printf "%.0f\n" $(echo "l($FILE_SIZE/1024/1024/1024) / l(2)" | bc -l))</div>

<div>+<span style="white-space:pre-wrap"> </span>COUNT=$(echo "l($FILE_SIZE/1024/1024/1024) / l(2) + 1" | bc -l | cut -d . -f 1)</div>

<div>+<span style="white-space:pre-wrap"> </span># Stripe Count = size_in_GB</div>

<div>+<span style="white-space:pre-wrap"> </span>#COUNT=$(echo "scale=0; $FILE_SIZE/1024/1024/1024" | bc -l | cut -d . -f 1)</div>

</div>

</div>

</div>

</div>

</blockquote>

</span>

<div><br>

</div>

<div>Instead of involving "bc", which is not guaranteed to be installed, why not just have a simple "divide by 2, increment stripe_count" loop after converting bytes to GiB?  That would be a few cycles for huge files, but probably still faster than fork/exec

 of an external binary as it could be at most 63 - 30 = 33 loops and usually many fewer.</div>
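<div><br>
</div>
<div>A minimal sketch of such a loop, assuming bash integer arithmetic is available (lfs_migrate is already a bash script).  The function name auto_stripe_count is hypothetical and not part of lfs_migrate; it computes floor(Log2(size_in_GiB)) + 1, with anything under 2 GiB getting a single stripe, so the separate "less than 1" check is no longer needed.</div>

```shell
#!/bin/bash
# Sketch only: stripe count = floor(log2(size_in_GiB)) + 1, using shell
# integer arithmetic instead of bc. The function name is hypothetical.
auto_stripe_count() {
	local bytes=$1
	local gib=$((bytes >> 30))   # bytes to whole GiB, rounding down
	local count=1                # files under 2 GiB get one stripe

	# Halve the size until it reaches 1 GiB, adding a stripe each time.
	while [ "$gib" -gt 1 ]; do
		gib=$((gib >> 1))
		count=$((count + 1))
	done
	echo "$count"
}

# Example: a 1 TiB file (1024 GiB) takes ten halvings.
auto_stripe_count 1099511627776
```

<div>For the 1 TiB example this prints 11, the same value the bc expression in the patch produces.  Empty files and files smaller than 1 GiB come out as 1 without any negative or fractional intermediate values.</div>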

<div><br>

</div>

<div>Cheers, Andreas</div>

<div><br>

</div>

<span>

<blockquote style="BORDER-LEFT:#b5c4df 5 solid;PADDING:0 0 0 5;MARGIN:0 0 0 5">

<div dir="ltr">

<div class="gmail_quote">

<div dir="ltr">

<div>

<div>+<span style="white-space:pre-wrap"> </span>[ "$COUNT" -lt 1 ] && COUNT=1</div>

<div>+<span style="white-space:pre-wrap"> </span># (does it make sense to skip the file if old</div>

<div>+<span style="white-space:pre-wrap"> </span># and new stripe count are identical?)</div>

<div>+<span style="white-space:pre-wrap"> </span>else</div>

<div>+<span style="white-space:pre-wrap"> </span>[ "$OPT_STRIPE_COUNT" ] && COUNT=$OPT_STRIPE_COUNT ||</div>

<div>+<span style="white-space:pre-wrap"> </span>COUNT=$($LFS getstripe -c "$OLDNAME" \</div>

<div>+<span style="white-space:pre-wrap"> </span>2> /dev/null)</div>

<div>+<span style="white-space:pre-wrap"> </span>fi</div>

<div> </div>

<div> <span style="white-space:pre-wrap"> </span>[ -z "$COUNT" -o -z "$SIZE" ] && UNLINK=""</div>

<div>-<span style="white-space:pre-wrap"> </span>SIZE=${LFS_SIZE_OPT}${SIZE}</div>

<div> <span style="white-space:pre-wrap"> </span>fi</div>

<div> </div>

<div>+<span style="white-space:pre-wrap"> </span>if [ "$OPT_DRYRUN" ]; then</div>

<div>+<span style="white-space:pre-wrap"> </span>if [ "$OPT_VERBOSE" ]; then</div>

<div>+<span style="white-space:pre-wrap"> </span>echo -e "dry run, would use count=${COUNT} size=${SIZE}"</div>

<div>+<span style="white-space:pre-wrap"> </span>else</div>

<div>+<span style="white-space:pre-wrap"> </span>echo -e "dry run, skipped"</div>

<div>+<span style="white-space:pre-wrap"> </span>fi</div>

<div>+<span style="white-space:pre-wrap"> </span>continue</div>

<div>+<span style="white-space:pre-wrap"> </span>fi</div>

<div>+<span style="white-space:pre-wrap"> </span>if [ "$OPT_VERBOSE" ]; then</div>

<div>+<span style="white-space:pre-wrap"> </span>echo -n "(count=${COUNT} size=${SIZE}) "</div>

<div>+<span style="white-space:pre-wrap"> </span>fi</div>

<div>+</div>

<div>+<span style="white-space:pre-wrap"> </span>[ "$SIZE" ] && SIZE=${LFS_SIZE_OPT}${SIZE}</div>

<div>+</div>

<div> <span style="white-space:pre-wrap"> </span># first try to migrate inside lustre</div>

<div> <span style="white-space:pre-wrap"> </span># if failed go back to old rsync mode</div>

<div> <span style="white-space:pre-wrap"> </span>if [[ $RSYNC_MODE == false ]]; then</div>

</div>

<div><br>

</div>

</div>

</div>

</div>

</blockquote>

</span>

</div>

</blockquote></div><br></div></div>