Nathan Dauchy - NOAA Affiliate nathan.dauchy at noaa.gov
Wed May 18 10:22:46 PDT 2016

Greetings All,

I'm looking for your experience and perhaps some lively discussion
regarding "best practices" for choosing a file stripe count.  The Lustre
manual has good tips on "Choosing a Stripe Size", and in practice the
default 1M rarely causes problems on our systems.  For stripe count, on the
other hand, it is far more difficult to choose a single value that is
efficient for a general-purpose, multi-use, site-wide file system.

Since there is the "increased overhead" of striping, and weather
applications do unfortunately write MANY tiny files, we usually keep the
filesystem default stripe count at 1.  Unfortunately, there are several
users who then write very large and shared-access files with that default.
I would like to be able to tell them to restripe... but without digging
into the specific application and access pattern it is hard to know what
count to recommend.  Plus there is the "stripe these but not those"
confusion... it is common for users to have a few very large data files and
many small log or output image files in the SAME directory.

What do you all recommend as a reasonable rule of thumb that works for
"most" users' needs, where stripe count can be determined based only on
static data attributes (such as file size)?  I have heard a "stripe per GB"
idea, but some have said that escalates to too many stripes too fast.  ORNL
has a knowledge base article that says use a stripe count of "File size /
100 GB", but does that make sense for smaller, non-DOE sites?  Would stripe
count = Log2(size_in_GB)+1 be more generally reasonable?  For a 1 TB file,
that actually works out to be similar to ORNL, only gets there more

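For concreteness, the three rules mentioned above can be compared side by
side.  This is only a rough sketch: the function names are made up for
illustration, sizes are taken as whole GB, every rule is clamped to a
minimum of 1, and the Log2 rule is rounded down (for an integer n >= 1,
floor(Log2(n)) + 1 is simply the number of bits in n).

```shell
#!/usr/bin/env bash
# Compare three stripe-count rules of thumb for a file size in whole GB.
# All function names are illustrative; none of this is part of Lustre.

# "Stripe per GB": one stripe per GB, minimum 1.
rule_per_gb() {
    local gb=$1
    (( gb < 1 )) && gb=1
    echo "$gb"
}

# "File size / 100 GB", minimum 1 (integer division).
rule_per_100gb() {
    local n=$(( $1 / 100 ))
    (( n < 1 )) && n=1
    echo "$n"
}

# floor(Log2(size_in_GB)) + 1, minimum 1.
# Counts the bits in the integer size, so no floating point is needed.
rule_log2() {
    local gb=$1 count=0
    while (( gb > 0 )); do
        (( gb >>= 1 ))
        (( count += 1 ))
    done
    (( count < 1 )) && count=1
    echo "$count"
}

# Print the three candidates for a few sample sizes.
for gb in 1 10 100 1024 10240; do
    printf '%6d GB: per-GB=%d  size/100GB=%d  log2+1=%d\n' \
        "$gb" "$(rule_per_gb "$gb")" "$(rule_per_100gb "$gb")" \
        "$(rule_log2 "$gb")"
done
```

For a 1 TB (1024 GB) file this gives 1024 stripes for the per-GB rule, 10
for size/100GB, and 11 for Log2+1.
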
Ideally, I would like to have a tool to give the users and say "go restripe
your directory with this command" and it will do the right thing in 90% of
cases.  See the rough patch to lfs_migrate (included below) which should
help explain what I'm thinking.  Probably there are more efficient ways of
doing things, but I have tested it lightly and it works as a

With a good programmatic rule of thumb, we (as a Lustre community!) can
eventually work with application developers to embed the stripe count
selection into their code and get things at least closer to right up
front.  Even if trial and error is involved to find the optimal setting, at
least the rule of thumb can be a _starting_point_ for the users, and they
can tweak it from there based on application, model, scale, dataset, etc.
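
As a minimal sketch of what that embedding might look like from a job
script, the helper below picks a starting stripe count from the expected
output size and prints the "lfs setstripe -c" command that would pre-create
the file with that layout.  The function names, the MAX_STRIPES cap
(assumed here to stand in for something like the number of OSTs), and the
example path are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose a starting stripe count from the expected
# file size and print the lfs setstripe command to pre-create the file.

MAX_STRIPES=${MAX_STRIPES:-16}   # assumed site-specific cap, not a Lustre default

# floor(Log2(size_in_GB)) + 1, clamped to [1, MAX_STRIPES].
# Uses integer shifts only, so sizes below 1 GB simply yield 1.
stripe_count_for_bytes() {
    local gb=$(( $1 >> 30 )) count=0
    while (( gb > 0 )); do
        (( gb >>= 1 ))
        (( count += 1 ))
    done
    (( count < 1 )) && count=1
    (( count > MAX_STRIPES )) && count=$MAX_STRIPES
    echo "$count"
}

# Print the command instead of running it, so this is safe to try anywhere.
precreate_cmd() {
    local path=$1 bytes=$2
    echo "lfs setstripe -c $(stripe_count_for_bytes "$bytes") $path"
}

# Example: an application expecting to write roughly 200 GB.
precreate_cmd /lustre/project/run42/output.nc $(( 200 * 1024 * 1024 * 1024 ))
```

With the default cap of 16, the example prints a command with "-c 8".  The
file has to be created with the layout before the application first opens
it for writing; an existing file would need to be migrated instead.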

Thinking farther down the road, with progressive file layout, what
algorithm will be used as the default?  If Lustre gets to the point where
it can rebalance OST capacity behind the scenes, could it also make some
intelligent choice about restriping very large files to spread out load and
better balance capacity?  (Would that mean we need a bit set on the file to
flag whether the stripe info was set specifically by the user or
automatically by Lustre tools or it was just using the system default?)
Can the filesystem track concurrent access to a file, and perhaps migrate
the file and adjust stripe count based on number of active clients?

I appreciate any and all suggestions, clarifying questions, heckles, etc.
I know this is a lot of questions, and I certainly don't expect definitive
answers on all of them, but I hope it is at least food for thought and
discussion! :)


--- lfs_migrate-2.7.1 2016-05-13 12:46:06.828032000 +0000
+++ lfs_migrate.auto-count 2016-05-17 21:37:19.036589000 +0000
@@ -21,8 +21,10 @@

 usage() {
     cat -- <<USAGE 1>&2
-usage: lfs_migrate [-c <stripe_count>] [-h] [-l] [-n] [-q] [-R] [-s] [-y]
+usage: lfs_migrate [-A] [-c <stripe_count>] [-h] [-l] [-n] [-q] [-R] [-s] [-v] [-y] [-0]
                    [file|dir ...]
+    -A restripe file using an automatically selected stripe count
+       currently Stripe Count = Log2(size_in_GB) + 1
     -c <stripe_count>
        restripe file using the specified stripe count
     -h show this usage message
@@ -31,11 +33,11 @@
     -q run quietly (don't print filenames or status)
     -R restripe file using default directory striping
     -s skip file data comparison after migrate
+    -v be verbose and print information about each file
     -y answer 'y' to usage question
     -0 input file names on stdin are separated by a null character

-The -c <stripe_count> option may not be specified at the same time as
-the -R option.
+Only one of the '-A', '-c', or '-R' options may be specified at a time.

 If a directory is an argument, all files in the directory are migrated.
 If no file/directory is given, the file list is read from standard input.
@@ -48,15 +50,19 @@


-while getopts "c:hlnqRsy0" opt $*; do
+while getopts "Ac:hlnqRsvy0" opt $*; do
     case $opt in
  l) OPT_NLINK=y;;
  n) OPT_DRYRUN=n; OPT_YES=y;;
  q) ECHO=:;;
  s) OPT_CHECK="";;
+ v) OPT_VERBOSE=y;;
  y) OPT_YES=y;;
  0) OPT_NULL=y;;
  h|\?) usage;;
@@ -69,6 +75,16 @@
  echo "$(basename $0) error: The -c <stripe_count> option may not" 1>&2
  echo "be specified at the same time as the -R option." 1>&2
  exit 1
+elif [ "$OPT_STRIPE_COUNT" -a "$OPT_AUTOSTRIPE" ]; then
+ echo ""
+ echo "$(basename $0) error: The -c <stripe_count> option may not" 1>&2
+ echo "be specified at the same time as the -A option." 1>&2
+ exit 1
+elif [ "$OPT_AUTOSTRIPE" -a "$OPT_RESTRIPE" ]; then
+ echo ""
+ echo "$(basename $0) error: The -A option may not be specified at" 1>&2
+ echo "the same time as the -R option." 1>&2
+ exit 1

 if [ -z "$OPT_YES" ]; then
@@ -107,7 +123,7 @@
  $ECHO -n "$OLDNAME: "

  # avoid duplicate stat if possible
- TYPE_LINK=($(LANG=C stat -c "%h %F" "$OLDNAME" || true))
+ TYPE_LINK=($(LANG=C stat -c "%h %F %s" "$OLDNAME" || true))

  # skip non-regular files, since they don't have any objects
  # and there is no point in trying to migrate them.
@@ -127,11 +143,6 @@

- if [ "$OPT_DRYRUN" ]; then
- echo -e "dry run, skipped"
- continue
- fi
  if [ "$OPT_RESTRIPE" ]; then
@@ -140,16 +151,43 @@
  # then we don't need to do this getstripe/mktemp stuff.

- COUNT=$($LFS getstripe -c "$OLDNAME" \
- 2> /dev/null)
  SIZE=$($LFS getstripe $LFS_SIZE_OPT "$OLDNAME" \
        2> /dev/null)
+ if [ "$OPT_AUTOSTRIPE" ]; then
+ # (math in bash is dumb, so depend on common tools, and there are options for that...)
+ # Stripe Count = Log2(size_in_GB) + 1
+ #COUNT=$(echo $FILE_SIZE | awk '{printf
+ #COUNT=$(printf "%.0f\n" $(echo "l($FILE_SIZE/1024/1024/1024) / l(2)" | bc -l))
+ COUNT=$(echo "l($FILE_SIZE/1024/1024/1024) / l(2) + 1" | bc -l | cut -d . -f 1)
+ # Stripe Count = size_in_GB
+ #COUNT=$(echo "scale=0; $FILE_SIZE/1024/1024/1024" | bc -l | cut -d . -f
+ case "$COUNT" in ""|-*|0) COUNT=1;; esac
+ # (does it make sense to skip the file if old
+ # and new stripe count are identical?)
+ else
+ COUNT=$($LFS getstripe -c "$OLDNAME" \
+ 2> /dev/null)
+ fi

  [ -z "$COUNT" -o -z "$SIZE" ] && UNLINK=""

+ if [ "$OPT_DRYRUN" ]; then
+ if [ "$OPT_VERBOSE" ]; then
+ echo -e "dry run, would use count=${COUNT} size=${SIZE}"
+ else
+ echo -e "dry run, skipped"
+ fi
+ continue
+ fi
+ if [ "$OPT_VERBOSE" ]; then
+ echo -n "(count=${COUNT} size=${SIZE}) "
+ fi
+ [ "$SIZE" ] && SIZE=${LFS_SIZE_OPT}${SIZE}
  # first try to migrate inside lustre
  # if failed go back to old rsync mode
  if [[ $RSYNC_MODE == false ]]; then
