[lustre-discuss] stripe count recommendation, and proposal for auto-stripe tool

Nathan Dauchy - NOAA Affiliate nathan.dauchy at noaa.gov
Thu May 19 09:46:09 PDT 2016


On Wed, May 18, 2016 at 2:04 PM, Mohr Jr, Richard Frank (Rick Mohr) <
rmohr at utk.edu> wrote:

>
> 2) Use some sort of formula (like ORNL’s “file_size/100GB” or even your
> log function)
>
> Since I mainly care about striping for large files and I want the stripe
> count to increase as file size grows, I prefer to go with option #2.  The
> second option also has the advantage that it is easier for users to
> remember (since there aren’t 3 or 4 break points for file sizes that users
> have to keep track of).  And if simplicity is a driving factor, then I
> would personally opt for something like “(file_size)/100GB” as opposed to
> "Log2(size_in_GB)+1”.
>
> For my users with large files, I recommend
> stripe_count=(file_size)/150GB.  The reason I chose this value is that
> 150GB is about 1% of my OSTs total capacity.  So if a user follows this
> recommendation and creates a very large file, any given OST’s usage will
> only increase by about 1% (which hopefully keeps any single OST’s usage
> from suddenly spiking to 90+% and prevents Nagios from paging me :-)
>
>
Rick,

Thanks for pointing out the approach of trying to keep a single file from
using too much space on an OST.  It looks like the Log2(size_in_GB) method
I proposed works well up to a point, but breaks down in the capacity
balancing department at some large file size.  If we take your rule of
thumb that no file should use more than 1% of an OST, and a typical new-ish
target made up of an 8+2 RAID6 group of 6TB disks, we should divide by 480
GB rather than 150.  (Are your targets a couple years old, or am I making a
bad assumption here?)
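
For the record, the arithmetic behind that 480 GB figure (a quick sketch;
the disk and RAID geometry are the assumptions stated above, not
measurements from any particular system):

```python
# 1%-of-an-OST rule of thumb, using the assumed geometry above:
# an 8+2 RAID group of 6 TB disks has 8 data disks of usable capacity.
data_disks = 8
disk_capacity_gb = 6 * 1000                       # 6 TB drives, in GB
ost_capacity_gb = data_disks * disk_capacity_gb   # 48000 GB per OST
divisor_gb = ost_capacity_gb // 100               # 1% of an OST
print(divisor_gb)                                 # 480
```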

Since the size of OSTs is a moving target (pun intended), it would be good
to "future proof" the size selection method.  An automated tool like
"lfs_migrate -A" could actually calculate the changeover point from log2()
your method on the fly.  stripe_count=MAX( Log2(size_in_GB)+1, size_in_GB/X
) where "X" is ~1% of the capacity of the smallest OST.  This strays away
from simplicity for the user, but the tool hides it for restriping, and
initial writes are (hopefully) tuned more for concurrent access pattern
than size anyway.
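
The combined rule could be sketched like this (a hypothetical helper, not
the actual "lfs_migrate -A" logic; the 48000 GB default for the smallest
OST is an assumed example value from the geometry discussed above):

```python
import math

def stripe_count(size_gb, smallest_ost_gb=48000):
    """Pick a stripe count as MAX(Log2(size_GB)+1, size_GB/X),
    where X is ~1% of the smallest OST's capacity."""
    if size_gb < 1:
        return 1
    x = smallest_ost_gb / 100              # ~1% of the smallest OST
    log_rule = int(math.log2(size_gb)) + 1
    capacity_rule = math.ceil(size_gb / x)
    return max(log_rule, capacity_rule)

# The log2 rule dominates for small files; the capacity rule takes
# over once size_GB/X exceeds Log2(size_GB)+1.
print(stripe_count(8))       # log2 rule: 4
print(stripe_count(10000))   # capacity rule: ceil(10000/480) = 21
```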

Thanks,
Nathan


And while I am on the topic of OST usage, I would recommend monitoring the
> distribution of your OST usage if you are not already doing so.  I have
> found that normally the distribution of OST usage follows kind of a bell
> curve with the highest OST usage being 3-4% larger than the overall file
> system usage.  But when a file is seriously misstriped, I often see the OST
> usage on the high end of the curve falling outside the normal range.  For
> example, if I run “lfs df -h /lustre/medusa | sort -nk 5” on my file system
> right now, I see something like this:
>
> medusa-OST0021_UUID        14.2T       10.2T        3.3T  75%
> /lustre/medusa[OST:33]
> medusa-OST0034_UUID        14.2T       10.3T        3.2T  76%
> /lustre/medusa[OST:52]
> medusa-OST0051_UUID        14.2T       10.2T        3.3T  76%
> /lustre/medusa[OST:81]
> medusa-OST0004_UUID        14.2T       10.4T        3.1T  77%
> /lustre/medusa[OST:4]
> medusa-OST0011_UUID        14.2T       10.4T        3.1T  77%
> /lustre/medusa[OST:17]
> medusa-OST0015_UUID        14.2T       11.0T        2.5T  81%
> /lustre/medusa[OST:21]
> medusa-OST001e_UUID        14.2T       10.9T        2.6T  81%
> /lustre/medusa[OST:30]
> medusa-OST0025_UUID        14.2T       10.9T        2.6T  81%
> /lustre/medusa[OST:37]
> medusa-OST002a_UUID        14.2T       11.0T        2.6T  81%
> /lustre/medusa[OST:42]
> medusa-OST0030_UUID        14.2T       10.9T        2.6T  81%
> /lustre/medusa[OST:48]
> medusa-OST0038_UUID        14.2T       10.9T        2.6T  81%
> /lustre/medusa[OST:56]
> medusa-OST0056_UUID        14.2T       11.0T        2.5T  81%
> /lustre/medusa[OST:86]
> medusa-OST0012_UUID        14.2T       11.1T        2.4T  82%
> /lustre/medusa[OST:18]
> medusa-OST0018_UUID        14.2T       11.1T        2.4T  82%
> /lustre/medusa[OST:24]
> medusa-OST0020_UUID        14.2T       11.0T        2.5T  82%
> /lustre/medusa[OST:32]
> medusa-OST0039_UUID        14.2T       11.0T        2.5T  82%
> /lustre/medusa[OST:57]
> medusa-OST0001_UUID        14.2T       11.5T        2.0T  85%
> /lustre/medusa[OST:1]
> medusa-OST001c_UUID        14.2T       11.6T        1.9T  86%
> /lustre/medusa[OST:28]
> filesystem summary:         1.3P      959.6T      256.8T  79%
> /lustre/medusa
>
> The least full OST is 75% (which is 4% below the overall usage).  On the
> high end, I would expect to see several OSTs at 81%, a few at 82%, and maybe
> one or two at 83%.  Instead, I see two OSTs at 85% and 86% which fall
> outside the norm.  Since the default stripe count for my file system is 2,
> this is an excellent indication that someone has a misstriped file.
>
> Which means that I need to stop typing now and track down the user who is
> messing up my nice file system….
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
>
>
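
Rick's outlier check could be automated along these lines (a sketch only;
it assumes "lfs df -h"-style lines like those quoted above, with the
filesystem summary as the last line, and real output may differ):

```python
import re

# Trimmed sample in the shape of the "lfs df -h | sort -nk 5" output above.
SAMPLE = """\
medusa-OST0021_UUID 14.2T 10.2T 3.3T 75% /lustre/medusa[OST:33]
medusa-OST0001_UUID 14.2T 11.5T 2.0T 85% /lustre/medusa[OST:1]
medusa-OST001c_UUID 14.2T 11.6T 1.9T 86% /lustre/medusa[OST:28]
filesystem summary: 1.3P 959.6T 256.8T 79% /lustre/medusa
"""

def outlier_osts(df_output, slack=4):
    """Return OSTs whose use% exceeds the summary use% by more than slack,
    since those are likely holding part of a misstriped file."""
    lines = df_output.strip().splitlines()
    pct = lambda line: int(re.search(r"(\d+)%", line).group(1))
    summary = pct(lines[-1])       # "filesystem summary" line is last
    return [l.split()[0] for l in lines[:-1] if pct(l) > summary + slack]

print(outlier_osts(SAMPLE))
# ['medusa-OST0001_UUID', 'medusa-OST001c_UUID']
```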