[lustre-discuss] stripe count recommendation, and proposal for auto-stripe tool

Thu May 19 08:57:08 PDT 2016

Ah, of course - We're only talking about restriping existing stuff.

Yes, that's just fine - No lock conflicts on reading.  Looks good to me.

This is probably also something we'd want to allow via HSM.  Not sure 
how the current patches interact with that (haven't looked).

- Patrick

On 05/19/2016 10:53 AM, Nathan Dauchy - NOAA Affiliate wrote:
> Patrick,
>
> You bring up an interesting point on read vs. write performance.  We 
> can't use lfs_migrate to control the stripe count used for writes 
> (obviously), so that is left up to the application developer or at 
> least the user to intelligently place shared access files in a 
> directory with wider striping. Restriping a file with lfs_migrate 
> could change *read* performance characteristics, so there is indeed 
> some risk there... but your work implies that is not too bad.  If we 
> only restripe files that are "old", then the likelihood that they will 
> be read again goes way down, and balancing capacity used plays a 
> bigger factor.  Bottom line is that I think restriping has more 
> potential for upsides than down. :)
>
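A rough sketch of the "only restripe old files" idea above, assuming GNU find and a file system where access times are actually maintained. The function name, the 90-day cutoff, and the 1 GiB size floor are illustrative assumptions, not values from this thread; the -0, -y and -c options are taken from the lfs_migrate usage text quoted further down.

```shell
# Print NUL-separated regular files under DIR whose last access is more
# than DAYS days old (as find counts whole 24-hour periods) and whose
# size exceeds MIN_KIB KiB. Runs in a subshell so variables stay local.
find_restripe_candidates() (
    dir=$1
    days=$2
    min_kib=$3
    find "$dir" -type f -atime +"$days" -size +"${min_kib}k" -print0
)

# Example (not run here; the path is a placeholder): feed the candidates
# to lfs_migrate with a fixed stripe count.
# find_restripe_candidates /path/to/dir 90 1048576 | lfs_migrate -0 -y -c 4
```

Selecting on size as well as age keeps the many small log and image files out of the migration list, which matters when they share a directory with the large data files.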
> Thanks,
> Nathan
>
>
> On Wed, May 18, 2016 at 1:22 PM, Patrick Farrell <paf at cray.com> wrote:
>
>     Nathan,
>
>     This *is* excellent fodder for discussion.
>
>     A few thoughts from a developer perspective.  When you stripe a
>     file to multiple OSTs, you're spreading the data out across
>     multiple targets, which (to my mind) has two purposes:
>     1) More even space usage across OSTs (mostly relevant for *really*
>     big files, since in general, singly striped files are distributed
>     across OSTs anyway)
>     2) Better bandwidth/parallelism for accesses to the file.
>
>     The first one lends itself well to a file size based heuristic,
>     but I'm not sure the second one does. That's more about access
>     patterns.  I'm not sure that you see much bandwidth benefit from
>     striping with a single client, at least as long as an individual
>     OST is fast relative to a client (increasingly common, I think,
>     with flash and larger RAID arrays).  So then, whatever the file
>     size, if it's accessed from one client, it should probably be
>     single striped.
>
>     Also, for shared files, client count relative to stripe count has
>     a huge impact on write performance. Assuming strided I/O patterns,
>     anything more than 1 client per stripe/OST is actually worse than
>     1 client.  (See my lock ahead presentation at LUG'15 for more on
>     this.)  Read performance doesn't share this weirdness, though.
>
>     All that's to say that for case 2 above, at least for writing,
>     it's access pattern/access parallelism, not size, which matters. 
>     I'm sure there's some correlation between file size and how
>     parallel the access pattern is, but it might be very loose, and at
>     least write performance doesn't scale linearly with stripe count. 
>     Instead, the behavior is complex.
>
>     So in order to pick an ideal striping with case 2 in mind, you
>     really need to understand the application access pattern.  I can't
>     see another way to do that goal justice.  (The Lustre ADIO in the
>     MPI I/O library does this, partly by controlling the I/O pattern
>     through I/O aggregation for collective I/Os.)
>
>     So I think your tool can definitely help with case 1, not so sure
>     about case 2.
>
>     - Patrick
>
>     On 05/18/2016 12:22 PM, Nathan Dauchy - NOAA Affiliate wrote:
>>     Greetings All,
>>
>>     I'm looking for your experience and perhaps some lively
>>     discussion regarding "best practices" for choosing a file stripe
>>     count.  The Lustre manual has good tips on "Choosing a Stripe
>>     Size", and in practice the default 1M rarely causes problems on
>>     our systems. For Stripe Count, on the other hand, it is far more
>>     difficult to choose a single value that is efficient for a
>>     general-purpose, multi-use, site-wide file system.
>>
>>     Since there is the "increased overhead" of striping, and weather
>>     applications do unfortunately write MANY tiny files, we usually
>>     keep the filesystem default stripe count at 1.  Unfortunately,
>>     there are several users who then write very large and
>>     shared-access files with that default.  I would like to be able
>>     to tell them to restripe... but without digging into the specific
>>     application and access pattern it is hard to know what count to
>>     recommend.  Plus there is the "stripe these but not those"
>>     confusion... it is common for users to have a few very large data
>>     files and many small log or output image files in the SAME directory.
>>
>>     What do you all recommend as a reasonable rule of thumb that
>>     works for "most" users' needs, where stripe count can be
>>     determined based only on static data attributes (such as file
>>     size)?  I have heard a "stripe per GB" idea, but some have said
>>     that escalates to too many stripes too fast.  ORNL has a
>>     knowledge base article that says use a stripe count of "File size
>>     / 100 GB", but does that make sense for smaller, non-DOE sites?
>>     Would stripe count = Log2(size_in_GB)+1 be more generally
>>     reasonable?  For a 1 TB file, that actually works out to be
>>     similar to ORNL, only gets there more gradually:
>>     https://www.olcf.ornl.gov/kb_articles/lustre-basics/#Stripe_Count
>>
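To make the comparison above concrete, here is a small sketch that evaluates the three rules of thumb mentioned (one stripe per GB, file size / 100 GB, and Log2(size_in_GB)+1) for a few file sizes. The function names are made up for illustration, results are rounded down to whole stripes, and every rule is clamped to a minimum of 1; none of this is taken from lfs_migrate itself.

```shell
GIB=$((1024 * 1024 * 1024))

# Clamp a computed count to at least one stripe.
at_least_one() {
    if [ "$1" -lt 1 ]; then echo 1; else echo "$1"; fi
}

# "One stripe per GB": size_in_GB, rounded down.
count_per_gb() {
    at_least_one $(( $1 / GIB ))
}

# ORNL-style rule: file size / 100 GB, rounded down.
count_per_100gb() {
    at_least_one $(( $1 / (100 * GIB) ))
}

# floor(Log2(size_in_GB)) + 1, computed by repeated halving.
# Runs in a subshell so the loop variables stay local.
count_log2() (
    gb=$(( $1 / GIB ))
    count=1
    while [ "$gb" -gt 1 ]; do
        gb=$(( gb / 2 ))
        count=$(( count + 1 ))
    done
    echo "$count"
)

for size_gb in 1 10 100 1024 10240; do
    bytes=$(( size_gb * GIB ))
    printf '%6d GB: per-GB=%-6s per-100GB=%-4s log2+1=%s\n' "$size_gb" \
        "$(count_per_gb "$bytes")" "$(count_per_100gb "$bytes")" \
        "$(count_log2 "$bytes")"
done
```

For a 1 TB (1024 GB) file this gives 1024 stripes per-GB, 10 for size / 100 GB, and 11 for Log2+1, which is the "similar to ORNL, only gets there more gradually" behaviour described above.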
>>     Ideally, I would like to have a tool to give the users and say
>>     "go restripe your directory with this command" and it will do the
>>     right thing in 90% of cases.  See the rough patch to lfs_migrate
>>     (included below) which should help explain what I'm thinking. 
>>     Probably there are more efficient ways of doing things, but I
>>     have tested it lightly and it works as a proof-of-concept.
>>
>>     With a good programmatic rule of thumb, we (as a Lustre
>>     community!) can eventually work with application developers to
>>     embed the stripe count selection into their code and get things
>>     at least closer to right up front.  Even if trial and error is
>>     involved to find the optimal setting, at least the rule of thumb
>>     can be a _starting_point_ for the users, and they can tweak it
>>     from there based on application, model, scale, dataset, etc.
>>
>>     Thinking farther down the road, with progressive file layout,
>>     what algorithm will be used as the default?  If Lustre gets to
>>     the point where it can rebalance OST capacity behind the scenes,
>>     could it also make some intelligent choice about restriping very
>>     large files to spread out load and better balance capacity?
>>      (Would that mean we need a bit set on the file to flag whether
>>     the stripe info was set specifically by the user or automatically
>>     by Lustre tools or it was just using the system default?)  Can
>>     the filesystem track concurrent access to a file, and perhaps
>>     migrate the file and adjust stripe count based on number of
>>     active clients?
>>
>>     I appreciate any and all suggestions, clarifying questions,
>>     heckles, etc.  I know this is a lot of questions, and I certainly
>>     don't expect definitive answers on all of them, but I hope it is
>>     at least food for thought and discussion! :)
>>
>>     Thanks,
>>     Nathan
>>
>>
>>     --- lfs_migrate-2.7.1 2016-05-13 12:46:06.828032000 +0000
>>     +++ lfs_migrate.auto-count 2016-05-17 21:37:19.036589000 +0000
>>     @@ -21,8 +21,10 @@
>>      usage() {
>>          cat -- <<USAGE 1>&2
>>     -usage: lfs_migrate [-c <stripe_count>] [-h] [-l] [-n] [-q] [-R] [-s] [-y] [-0]
>>     +usage: lfs_migrate [-A] [-c <stripe_count>] [-h] [-l] [-n] [-q] [-R] [-s] [-v] [-y] [-0]
>>                         [file|dir ...]
>>     +    -A restripe file using an automatically selected stripe count
>>     +       currently Stripe Count = Log2(size_in_GB) + 1
>>          -c <stripe_count>
>>             restripe file using the specified stripe count
>>          -h show this usage message
>>     @@ -31,11 +33,11 @@
>>          -q run quietly (don't print filenames or status)
>>          -R restripe file using default directory striping
>>          -s skip file data comparison after migrate
>>     +    -v be verbose and print information about each file
>>          -y answer 'y' to usage question
>>          -0 input file names on stdin are separated by a null character
>>     -The -c <stripe_count> option may not be specified at the same time as
>>     -the -R option.
>>     +Only one of the '-A', '-c', or '-R' options may be specified at a time.
>>      If a directory is an argument, all files in the directory are migrated.
>>      If no file/directory is given, the file list is read from standard input.
>>     @@ -48,15 +50,19 @@
>>      OPT_CHECK=y
>>      OPT_STRIPE_COUNT=""
>>     +OPT_AUTOSTRIPE=""
>>     +OPT_VERBOSE=""
>>     -while getopts "c:hlnqRsy0" opt $*; do
>>     +while getopts "Ac:hlnqRsvy0" opt $*; do
>>          case $opt in
>>     +A) OPT_AUTOSTRIPE=y;;
>>     c) OPT_STRIPE_COUNT=$OPTARG;;
>>     l) OPT_NLINK=y;;
>>     n) OPT_DRYRUN=n; OPT_YES=y;;
>>     q) ECHO=:;;
>>     R) OPT_RESTRIPE=y;;
>>     s) OPT_CHECK="";;
>>     +v) OPT_VERBOSE=y;;
>>     y) OPT_YES=y;;
>>     0) OPT_NULL=y;;
>>     h|\?) usage;;
>>     @@ -69,6 +75,16 @@
>>     echo "$(basename $0) error: The -c <stripe_count> option may not" 1>&2
>>     echo "be specified at the same time as the -R option." 1>&2
>>     exit 1
>>     +elif [ "$OPT_STRIPE_COUNT" -a "$OPT_AUTOSTRIPE" ]; then
>>     +echo ""
>>     +echo "$(basename $0) error: The -c <stripe_count> option may not" 1>&2
>>     +echo "be specified at the same time as the -A option." 1>&2
>>     +exit 1
>>     +elif [ "$OPT_AUTOSTRIPE" -a "$OPT_RESTRIPE" ]; then
>>     +echo ""
>>     +echo "$(basename $0) error: The -A option may not be specified at" 1>&2
>>     +echo "the same time as the -R option." 1>&2
>>     +exit 1
>>      fi
>>      if [ -z "$OPT_YES" ]; then
>>     @@ -107,7 +123,7 @@
>>     $ECHO -n "$OLDNAME: "
>>     # avoid duplicate stat if possible
>>     -TYPE_LINK=($(LANG=C stat -c "%h %F" "$OLDNAME" || true))
>>     +TYPE_LINK=($(LANG=C stat -c "%h %F %s" "$OLDNAME" || true))
>>     # skip non-regular files, since they don't have any objects
>>     # and there is no point in trying to migrate them.
>>     @@ -127,11 +143,6 @@
>>     continue
>>     fi
>>     -if [ "$OPT_DRYRUN" ]; then
>>     -echo -e "dry run, skipped"
>>     -continue
>>     -fi
>>     -
>>     if [ "$OPT_RESTRIPE" ]; then
>>     UNLINK=""
>>     else
>>     @@ -140,16 +151,43 @@
>>     # then we don't need to do this getstripe/mktemp stuff.
>>     UNLINK="-u"
>>     -[ "$OPT_STRIPE_COUNT" ] && COUNT=$OPT_STRIPE_COUNT ||
>>     -COUNT=$($LFS getstripe -c "$OLDNAME" \
>>     -2> /dev/null)
>>     SIZE=$($LFS getstripe $LFS_SIZE_OPT "$OLDNAME" \
>>           2> /dev/null)
>>     +if [ "$OPT_AUTOSTRIPE" ]; then
>>     +FILE_SIZE=${TYPE_LINK[3]}
>>     +# (math in bash is dumb, so depend on common tools, and there are options for that...)
>>     +# Stripe Count = Log2(size_in_GB)
>>     +#COUNT=$(echo $FILE_SIZE | awk '{printf "%.0f\n",log($1/1024/1024/1024)/log(2)}')
>>     +#COUNT=$(printf "%.0f\n" $(echo "l($FILE_SIZE/1024/1024/1024) / l(2)" | bc -l))
>>     +COUNT=$(echo "l($FILE_SIZE/1024/1024/1024) / l(2) + 1" | bc -l | cut -d . -f 1)
>>     +# Stripe Count = size_in_GB
>>     +#COUNT=$(echo "scale=0; $FILE_SIZE/1024/1024/1024" | bc -l | cut -d . -f 1)
>>     +[ "$COUNT" -lt 1 ] && COUNT=1
>>     +# (does it make sense to skip the file if old
>>     +# and new stripe count are identical?)
>>     +else
>>     +[ "$OPT_STRIPE_COUNT" ] && COUNT=$OPT_STRIPE_COUNT ||
>>     +COUNT=$($LFS getstripe -c "$OLDNAME" \
>>     +2> /dev/null)
>>     +fi
>>     [ -z "$COUNT" -o -z "$SIZE" ] && UNLINK=""
>>     -SIZE=${LFS_SIZE_OPT}${SIZE}
>>     fi
>>     +if [ "$OPT_DRYRUN" ]; then
>>     +if [ "$OPT_VERBOSE" ]; then
>>     +echo -e "dry run, would use count=${COUNT} size=${SIZE}"
>>     +else
>>     +echo -e "dry run, skipped"
>>     +fi
>>     +continue
>>     +fi
>>     +if [ "$OPT_VERBOSE" ]; then
>>     +echo -n "(count=${COUNT} size=${SIZE}) "
>>     +fi
>>     +
>>     +[ "$SIZE" ] && SIZE=${LFS_SIZE_OPT}${SIZE}
>>     +
>>     # first try to migrate inside lustre
>>     # if failed go back to old rsync mode
>>     if [[ $RSYNC_MODE == false ]]; then
>>
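One possible wrinkle in the -A computation above: as far as I can tell, `bc -l` prints values between -1 and 1 without a leading zero (".58", "-.41"), so `cut -d . -f 1` yields an empty string or a bare "-" for files between roughly 256 MB and 1 GB, and the `-lt` test then errors out instead of clamping to 1. Separately, `stat -c "%h %F %s"` prints a three-word file type for empty files ("regular empty file"), which shifts the size out of `${TYPE_LINK[3]}`. A minimal integer-only sketch that sidesteps both follows; the function name auto_stripe_count is hypothetical and not part of lfs_migrate.

```shell
# Integer-only stand-in for the bc pipeline: floor(log2(size_in_GB)) + 1,
# with everything below 2 GB (including empty files) mapped to 1 stripe.
# Runs in a subshell so its variables do not leak into the caller.
auto_stripe_count() (
    size_bytes=$1
    # Reject anything that is not a plain non-negative integer.
    case $size_bytes in
        ''|*[!0-9]*)
            echo "auto_stripe_count: bad size '$size_bytes'" >&2
            exit 1
            ;;
    esac
    gb=$(( size_bytes / 1073741824 ))
    count=1
    while [ "$gb" -gt 1 ]; do
        gb=$(( gb / 2 ))
        count=$(( count + 1 ))
    done
    echo "$count"
)

# Possible use inside the migrate loop (not run here): ask stat for the
# size alone, so a multi-word %F cannot shift the field.
# FILE_SIZE=$(LANG=C stat -c "%s" "$OLDNAME")
# COUNT=$(auto_stripe_count "$FILE_SIZE")
```

For files of 1 GB and larger this should match what the bc version produces after truncation, so it only changes behaviour in the sub-GB range.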
>>
>>
>>     _______________________________________________
>>     lustre-discuss mailing list
>>     lustre-discuss at lists.lustre.org
>>     http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>
