[lustre-discuss] stripe count recommendation, and proposal for auto-stripe tool
Patrick Farrell
paf at cray.com
Thu May 19 08:57:08 PDT 2016
Ah, of course - We're only talking about restriping existing stuff.
Yes, that's just fine - No lock conflicts on reading. Looks good to me.
This is probably also something we'd want to allow via HSM. Not sure
how the current patches interact with that (haven't looked).
- Patrick
On 05/19/2016 10:53 AM, Nathan Dauchy - NOAA Affiliate wrote:
> Patrick,
>
> You bring up an interesting point on read vs. write performance. We
> can't use lfs_migrate to control the stripe count used for writes
> (obviously), so that is left up to the application developer or at
> least the user to intelligently place shared access files in a
> directory with wider striping. Restriping a file with lfs_migrate
> could change *read* performance characteristics, so there is indeed
> some risk there... but your work implies that is not too bad. If we
> only restripe files that are "old", then the likelihood that they will
> be read again goes way down, and balancing capacity used plays a
> bigger factor. Bottom line is that I think restriping has more
> potential for upsides than down. :)
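
A minimal sketch of the "only restripe old files" selection described above, assuming GNU find is available on the client. The helper name list_old_large_files, the thresholds, and the /lustre/project path are illustrative and not from the thread. The -A flag exists only in the proposed patch; -0 and -y are taken from the lfs_migrate usage text quoted further down.

```shell
# list_old_large_files DIR MIN_DAYS MIN_BYTES
# Print, NUL-separated, the regular files under DIR whose last access is
# more than MIN_DAYS days ago and whose size is greater than MIN_BYTES bytes.
# (Hypothetical helper; it only wraps GNU find and touches no file data.)
list_old_large_files() {
    find "$1" -type f -atime +"$2" -size +"$3"c -print0
}

# Possible use on a Lustre client (left commented out: it needs lfs_migrate,
# and -A is only available with the proposed patch applied):
# list_old_large_files /lustre/project 90 1073741824 | lfs_migrate -A -0 -y
```

Selecting on access time rather than modification time matches the argument that files unlikely to be read again are the low-risk candidates. On file systems mounted with noatime the access time may not be meaningful, in which case -mtime could be a substitute.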
>
> Thanks,
> Nathan
>
>
> On Wed, May 18, 2016 at 1:22 PM, Patrick Farrell <paf at cray.com> wrote:
>
> Nathan,
>
> This *is* excellent fodder for discussion.
>
> A few thoughts from a developer perspective. When you stripe a
> file to multiple OSTs, you're spreading the data out across
> multiple targets, which (to my mind) has two purposes:
> 1) More even space usage across OSTs (mostly relevant for *really*
> big files, since in general, singly striped files are distributed
> across OSTs anyway)
> 2) Better bandwidth/parallelism for accesses to the file.
>
> The first one lends itself well to a file size based heuristic,
> but I'm not sure the second one does. That's more about access
> patterns. I'm not sure that you see much bandwidth benefit from
> striping with a single client, at least as long as an individual
> OST is fast relative to a client (increasingly common, I think,
> with flash and larger RAID arrays). So then, whatever the file
> size, if it's accessed from one client, it should probably be
> single striped.
>
> Also, for shared files, client count relative to stripe count has
> a huge impact on write performance. Assuming strided I/O patterns,
> anything more than 1 client per stripe/OST is actually worse than
> 1 client. (See my lock ahead presentation at LUG'15 for more on
> this.) Read performance doesn't share this weirdness, though.
>
> All that's to say that for case 2 above, at least for writing,
> it's access pattern/access parallelism, not size, which matters.
> I'm sure there's some correlation between file size and how
> parallel the access pattern is, but it might be very loose, and at
> least write performance doesn't scale linearly with stripe count.
> Instead, the behavior is complex.
>
> So in order to pick an ideal striping with case 2 in mind, you
> really need to understand the application access pattern. I can't
> see another way to do that goal justice. (The Lustre ADIO in the
> MPI I/O library does this, partly by controlling the I/O pattern
> through I/O aggregation for collective I/Os.)
>
> So I think your tool can definitely help with case 1, not so sure
> about case 2.
>
> - Patrick
>
> On 05/18/2016 12:22 PM, Nathan Dauchy - NOAA Affiliate wrote:
>> Greetings All,
>>
>> I'm looking for your experience and perhaps some lively
>> discussion regarding "best practices" for choosing a file stripe
>> count. The Lustre manual has good tips on "Choosing a Stripe
>> Size", and in practice the default 1M rarely causes problems on
>> our systems. Stripe Count, on the other hand, is far more
>> difficult: it is hard to choose a single value that is efficient
>> for a general-purpose, multi-use, site-wide file system.
>>
>> Since there is the "increased overhead" of striping, and weather
>> applications do unfortunately write MANY tiny files, we usually
>> keep the filesystem default stripe count at 1. Unfortunately,
>> there are several users who then write very large and
>> shared-access files with that default. I would like to be able
>> to tell them to restripe... but without digging into the specific
>> application and access pattern it is hard to know what count to
>> recommend. Plus there is the "stripe these but not those"
>> confusion... it is common for users to have a few very large data
>> files and many small log or output image files in the SAME directory.
>>
>> What do you all recommend as a reasonable rule of thumb that
>> works for "most" users' needs, where stripe count can be
>> determined based only on static data attributes (such as file
>> size)? I have heard a "stripe per GB" idea, but some have said
>> that escalates to too many stripes too fast. ORNL has a
>> knowledge base article that says use a stripe count of "File size
>> / 100 GB", but does that make sense for smaller, non-DOE sites?
>> Would stripe count = Log2(size_in_GB)+1 be more generally
>> reasonable? For a 1 TB file, that actually works out to be
>> similar to ORNL, only gets there more gradually:
>> https://www.olcf.ornl.gov/kb_articles/lustre-basics/#Stripe_Count
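
To make the comparison between the three rules of thumb concrete, here is a small sketch that tabulates them for a few file sizes. The function names are made up for illustration, sizes are taken as whole GB, and each rule is clamped to a minimum of one stripe, which is an assumption on my part rather than something the rules themselves state.

```shell
# Compare three stripe-count rules of thumb. Sizes are whole GB.
# All function names are illustrative only.

# "One stripe per GB", never fewer than 1.
count_per_gb() {
    if [ "$1" -lt 1 ]; then echo 1; else echo "$1"; fi
}

# "File size / 100 GB", rounded down, never fewer than 1.
count_per_100gb() {
    local n
    n=$(( $1 / 100 ))
    if [ "$n" -lt 1 ]; then n=1; fi
    echo "$n"
}

# floor(Log2(size_in_GB)) + 1, computed by repeated integer halving.
count_log2() {
    local gb n
    gb=$1
    n=1
    while [ "$gb" -gt 1 ]; do
        gb=$(( gb / 2 ))
        n=$(( n + 1 ))
    done
    echo "$n"
}

for size in 1 10 100 1024; do
    printf '%5d GB: per-GB=%d  /100GB=%d  log2+1=%d\n' \
        "$size" "$(count_per_gb "$size")" \
        "$(count_per_100gb "$size")" "$(count_log2 "$size")"
done
```

For a 1024 GB file the three rules give 1024, 10 and 11 stripes respectively, while a 10 GB file gets 10, 1 and 4. That shows both how quickly stripe-per-GB escalates and how the logarithmic rule reaches roughly the ORNL value at 1 TB while ramping up earlier for mid-sized files.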
>>
>> Ideally, I would like to have a tool to give the users and say
>> "go restripe your directory with this command" and it will do the
>> right thing in 90% of cases. See the rough patch to lfs_migrate
>> (included below) which should help explain what I'm thinking.
>> Probably there are more efficient ways of doing things, but I
>> have tested it lightly and it works as a proof-of-concept.
>>
>> With a good programmatic rule of thumb, we (as a Lustre
>> community!) can eventually work with application developers to
>> embed the stripe count selection into their code and get things
>> at least closer to right up front. Even if trial and error is
>> involved to find the optimal setting, at least the rule of thumb
>> can be a _starting_point_ for the users, and they can tweak it
>> from there based on application, model, scale, dataset, etc.
>>
>> Thinking farther down the road, with progressive file layout,
>> what algorithm will be used as the default? If Lustre gets to
>> the point where it can rebalance OST capacity behind the scenes,
>> could it also make some intelligent choice about restriping very
>> large files to spread out load and better balance capacity?
>> (Would that mean we need a bit set on the file to flag whether
>> the stripe info was set specifically by the user or automatically
>> by Lustre tools or it was just using the system default?) Can
>> the filesystem track concurrent access to a file, and perhaps
>> migrate the file and adjust stripe count based on number of
>> active clients?
>>
>> I appreciate any and all suggestions, clarifying questions,
>> heckles, etc. I know this is a lot of questions, and I certainly
>> don't expect definitive answers on all of them, but I hope it is
>> at least food for thought and discussion! :)
>>
>> Thanks,
>> Nathan
>>
>>
>> --- lfs_migrate-2.7.1 2016-05-13 12:46:06.828032000 +0000
>> +++ lfs_migrate.auto-count 2016-05-17 21:37:19.036589000 +0000
>> @@ -21,8 +21,10 @@
>> usage() {
>> cat -- <<USAGE 1>&2
>> -usage: lfs_migrate [-c <stripe_count>] [-h] [-l] [-n] [-q] [-R]
>> [-s] [-y] [-0]
>> +usage: lfs_migrate [-A] [-c <stripe_count>] [-h] [-l] [-n] [-q]
>> [-R] [-s] [-v] [-y] [-0]
>> [file|dir ...]
>> + -A restripe file using an automatically selected stripe count
>> + currently Stripe Count = Log2(size_in_GB) + 1
>> -c <stripe_count>
>> restripe file using the specified stripe count
>> -h show this usage message
>> @@ -31,11 +33,11 @@
>> -q run quietly (don't print filenames or status)
>> -R restripe file using default directory striping
>> -s skip file data comparison after migrate
>> + -v be verbose and print information about each file
>> -y answer 'y' to usage question
>> -0 input file names on stdin are separated by a null character
>> -The -c <stripe_count> option may not be specified at the same
>> time as
>> -the -R option.
>> +Only one of the '-A', '-c', or '-R' options may be specified at
>> a time.
>> If a directory is an argument, all files in the directory are
>> migrated.
>> If no file/directory is given, the file list is read from
>> standard input.
>> @@ -48,15 +50,19 @@
>> OPT_CHECK=y
>> OPT_STRIPE_COUNT=""
>> +OPT_AUTOSTRIPE=""
>> +OPT_VERBOSE=""
>> -while getopts "c:hlnqRsy0" opt $*; do
>> +while getopts "Ac:hlnqRsvy0" opt $*; do
>> case $opt in
>> +A) OPT_AUTOSTRIPE=y;;
>> c) OPT_STRIPE_COUNT=$OPTARG;;
>> l) OPT_NLINK=y;;
>> n) OPT_DRYRUN=n; OPT_YES=y;;
>> q) ECHO=:;;
>> R) OPT_RESTRIPE=y;;
>> s) OPT_CHECK="";;
>> +v) OPT_VERBOSE=y;;
>> y) OPT_YES=y;;
>> 0) OPT_NULL=y;;
>> h|\?) usage;;
>> @@ -69,6 +75,16 @@
>> echo "$(basename $0) error: The -c <stripe_count> option may not"
>> 1>&2
>> echo "be specified at the same time as the -R option." 1>&2
>> exit 1
>> +elif [ "$OPT_STRIPE_COUNT" -a "$OPT_AUTOSTRIPE" ]; then
>> +echo ""
>> +echo "$(basename $0) error: The -c <stripe_count> option may
>> not" 1>&2
>> +echo "be specified at the same time as the -A option." 1>&2
>> +exit 1
>> +elif [ "$OPT_AUTOSTRIPE" -a "$OPT_RESTRIPE" ]; then
>> +echo ""
>> +echo "$(basename $0) error: The -A option may not be specified
>> at" 1>&2
>> +echo "the same time as the -R option." 1>&2
>> +exit 1
>> fi
>> if [ -z "$OPT_YES" ]; then
>> @@ -107,7 +123,7 @@
>> $ECHO -n "$OLDNAME: "
>> # avoid duplicate stat if possible
>> -TYPE_LINK=($(LANG=C stat -c "%h %F" "$OLDNAME" || true))
>> +TYPE_LINK=($(LANG=C stat -c "%h %F %s" "$OLDNAME" || true))
>> # skip non-regular files, since they don't have any objects
>> # and there is no point in trying to migrate them.
>> @@ -127,11 +143,6 @@
>> continue
>> fi
>> -if [ "$OPT_DRYRUN" ]; then
>> -echo -e "dry run, skipped"
>> -continue
>> -fi
>> -
>> if [ "$OPT_RESTRIPE" ]; then
>> UNLINK=""
>> else
>> @@ -140,16 +151,43 @@
>> # then we don't need to do this getstripe/mktemp stuff.
>> UNLINK="-u"
>> -[ "$OPT_STRIPE_COUNT" ] && COUNT=$OPT_STRIPE_COUNT ||
>> -COUNT=$($LFS getstripe -c "$OLDNAME" \
>> -2> /dev/null)
>> SIZE=$($LFS getstripe $LFS_SIZE_OPT "$OLDNAME" \
>> 2> /dev/null)
>> +if [ "$OPT_AUTOSTRIPE" ]; then
>> +FILE_SIZE=${TYPE_LINK[3]}
>> +# (math in bash is dumb, so depend on common tools, and there
>> are options for that...)
>> +# Stripe Count = Log2(size_in_GB)
>> +#COUNT=$(echo $FILE_SIZE | awk '{printf
>> "%.0f\n",log($1/1024/1024/1024)/log(2)}')
>> +#COUNT=$(printf "%.0f\n" $(echo "l($FILE_SIZE/1024/1024/1024) /
>> l(2)" | bc -l))
>> +COUNT=$(echo "l($FILE_SIZE/1024/1024/1024) / l(2) + 1" | bc -l |
>> cut -d . -f 1)
>> +# Stripe Count = size_in_GB
>> +#COUNT=$(echo "scale=0; $FILE_SIZE/1024/1024/1024" | bc -l | cut
>> -d . -f 1)
>> +[ "$COUNT" -lt 1 ] && COUNT=1
>> +# (does it make sense to skip the file if old
>> +# and new stripe count are identical?)
>> +else
>> +[ "$OPT_STRIPE_COUNT" ] && COUNT=$OPT_STRIPE_COUNT ||
>> +COUNT=$($LFS getstripe -c "$OLDNAME" \
>> +2> /dev/null)
>> +fi
>> [ -z "$COUNT" -o -z "$SIZE" ] && UNLINK=""
>> -SIZE=${LFS_SIZE_OPT}${SIZE}
>> fi
>> +if [ "$OPT_DRYRUN" ]; then
>> +if [ "$OPT_VERBOSE" ]; then
>> +echo -e "dry run, would use count=${COUNT} size=${SIZE}"
>> +else
>> +echo -e "dry run, skipped"
>> +fi
>> +continue
>> +fi
>> +if [ "$OPT_VERBOSE" ]; then
>> +echo -n "(count=${COUNT} size=${SIZE}) "
>> +fi
>> +
>> +[ "$SIZE" ] && SIZE=${LFS_SIZE_OPT}${SIZE}
>> +
>> # first try to migrate inside lustre
>> # if failed go back to old rsync mode
>> if [[ $RSYNC_MODE == false ]]; then
>>
>>
>>
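
Two details in the proof-of-concept above look fragile, and the sketch below shows one possible way around them. First, the size is read from ${TYPE_LINK[3]}, which relies on %F expanding to exactly two words; GNU stat prints "regular empty file" for zero-length files, so putting the numeric fields before %F avoids the shifting index. Second, bc prints fractions without a leading zero, so for results between -1 and 1 the cut step appears to yield an empty string or a bare minus sign, and the -lt test then fails; integer halving sidesteps that. The function names and the optional MAX argument (for example, the number of OSTs) are hypothetical and not part of the patch.

```shell
# auto_stripe_count BYTES [MAX]
# floor(Log2(size in GiB)) + 1, clamped to at least 1 and, if MAX > 0,
# to at most MAX. Integer-only, so empty and sub-GiB files yield 1.
auto_stripe_count() {
    local gib count max
    gib=$(( $1 / 1073741824 ))
    count=1
    max=${2:-0}
    while [ "$gib" -gt 1 ]; do
        gib=$(( gib / 2 ))
        count=$(( count + 1 ))
    done
    if [ "$max" -gt 0 ] && [ "$count" -gt "$max" ]; then
        count=$max
    fi
    echo "$count"
}

# count_for_file FILE [MAX]
# Print a stripe count for a regular file; return non-zero otherwise.
# Numeric stat fields come first so the multi-word %F lands in one variable.
count_for_file() {
    local fields nlink size ftype
    fields=$(LANG=C stat -c '%h %s %F' -- "$1") || return 1
    # nlink is unused here; it is read only to mirror the script's %h field.
    read -r nlink size ftype <<EOF
$fields
EOF
    case $ftype in
        regular*) auto_stripe_count "$size" "${2:-0}" ;;
        *) return 1 ;;
    esac
}
```

With this version a 512 MiB file and a 1 GiB file both get one stripe, a 3 GiB file gets two, and a 1 TiB file gets 11 unless MAX caps it lower.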
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org