[lustre-discuss] stripe count recommendation, and proposal for auto-stripe tool

Mohr Jr, Richard Frank (Rick Mohr) rmohr at utk.edu
Wed May 18 13:04:35 PDT 2016


> On May 18, 2016, at 1:22 PM, Nathan Dauchy - NOAA Affiliate <nathan.dauchy at noaa.gov> wrote:
> 
> Since there is the "increased overhead" of striping, and weather applications do unfortunately write MANY tiny files, we usually keep the filesystem default stripe count at 1.  Unfortunately, there are several users who then write very large and shared-access files with that default.  I would like to be able to tell them to restripe... but without digging into the specific application and access pattern it is hard to know what count to recommend.  Plus there is the "stripe these but not those" confusion... it is common for users to have a few very large data files and many small log or output image files in the SAME directory.
> 
> What do you all recommend as a reasonable rule of thumb that works for "most" users' needs, where stripe count can be determined based only on static data attributes (such as file size)?  I have heard a "stripe per GB" idea, but some have said that escalates to too many stripes too fast.  ORNL has a knowledge base article that says to use a stripe count of "File size / 100 GB", but does that make sense for smaller, non-DOE sites?  Would stripe count = Log2(size_in_GB)+1 be more generally reasonable?  For a 1 TB file, that actually works out to be similar to ORNL's, only it gets there more gradually:
>     https://www.olcf.ornl.gov/kb_articles/lustre-basics/#Stripe_Count
> 

If you are trying to select a stripe count for a file, and all you know is the file size, then I think there are two main ways to handle it:

1) Use a bucket approach (if 0 < file_size < S1, then stripe_count=1; if S1 < file_size < S2, then stripe_count=4; etc.)

2) Use some sort of formula (like ORNL’s “file_size/100GB” or even your log function)

Since I mainly care about striping for large files and I want the stripe count to increase as file size grows, I prefer to go with option #2.  The second option also has the advantage that it is easier for users to remember (since there aren’t 3 or 4 break points for file sizes that users have to keep track of).  And if simplicity is a driving factor, then I would personally opt for something like “(file_size)/100GB” as opposed to “Log2(size_in_GB)+1”.
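
To get a feel for how the two formulas actually diverge, here is a quick shell/awk sketch (my own throwaway script, not a standard tool) that prints the stripe count each rule would pick for a handful of file sizes:

    #!/bin/sh
    # Compare two rules of thumb for picking a stripe count:
    #   rule A: ceil(file_size / 100GB), minimum 1  (ORNL-style)
    #   rule B: floor(log2(size_in_GB)) + 1
    for gb in 1 10 100 500 1000 2000; do
        awk -v gb="$gb" 'BEGIN {
            a = int((gb + 99) / 100); if (a < 1) a = 1
            b = int(log(gb) / log(2)) + 1
            printf "%6d GB  ->  size/100GB: %3d    log2+1: %3d\n", gb, a, b
        }'
    done

Run over those sizes, the division rule stays at 1 up through 100GB and then climbs linearly, while the log rule ramps up quickly for small files and flattens out; the two land in roughly the same place around 1TB, which matches the observation in the quoted message above.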

For my users with large files, I recommend stripe_count=(file_size)/150GB.  The reason I chose this value is that 150GB is about 1% of a single OST’s capacity (my OSTs are 14.2TB each, as the output below shows).  So if a user follows this recommendation and creates a very large file, any given OST’s usage will only increase by about 1% (which hopefully keeps any single OST’s usage from suddenly spiking to 90+% and prevents Nagios from paging me :-)
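
Applying that rule is straightforward with lfs setstripe; the one wrinkle is that the layout has to be set before the file has data in it, since an existing file cannot be restriped in place (it has to be copied or migrated to a new layout).  A minimal sketch, with my 150GB divisor (adjust for your OST size) and a made-up script name:

    #!/bin/sh
    # Usage: stripe-for-size <expected_size_in_GB> <new_file>
    # Creates an empty file whose stripe count is ceil(size/150GB),
    # so the application can then open and write it with that layout.
    size_gb=$1
    file=$2
    count=$(( (size_gb + 149) / 150 ))
    [ "$count" -lt 1 ] && count=1
    lfs setstripe -c "$count" "$file"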

And while I am on the topic of OST usage, I would recommend monitoring the distribution of your OST usage if you are not already doing so.  I have found that normally the distribution of OST usage follows kind of a bell curve with the highest OST usage being 3-4% larger than the overall file system usage.  But when a file is seriously misstriped, I often see the OST usage on the high end of the curve falling outside the normal range.  For example, if I run “lfs df -h /lustre/medusa | sort -nk 5” on my file system right now, I see something like this:

medusa-OST0021_UUID        14.2T       10.2T        3.3T  75% /lustre/medusa[OST:33]
medusa-OST0034_UUID        14.2T       10.3T        3.2T  76% /lustre/medusa[OST:52]
medusa-OST0051_UUID        14.2T       10.2T        3.3T  76% /lustre/medusa[OST:81]
medusa-OST0004_UUID        14.2T       10.4T        3.1T  77% /lustre/medusa[OST:4]
medusa-OST0011_UUID        14.2T       10.4T        3.1T  77% /lustre/medusa[OST:17]
…
medusa-OST0015_UUID        14.2T       11.0T        2.5T  81% /lustre/medusa[OST:21]
medusa-OST001e_UUID        14.2T       10.9T        2.6T  81% /lustre/medusa[OST:30]
medusa-OST0025_UUID        14.2T       10.9T        2.6T  81% /lustre/medusa[OST:37]
medusa-OST002a_UUID        14.2T       11.0T        2.6T  81% /lustre/medusa[OST:42]
medusa-OST0030_UUID        14.2T       10.9T        2.6T  81% /lustre/medusa[OST:48]
medusa-OST0038_UUID        14.2T       10.9T        2.6T  81% /lustre/medusa[OST:56]
medusa-OST0056_UUID        14.2T       11.0T        2.5T  81% /lustre/medusa[OST:86]
medusa-OST0012_UUID        14.2T       11.1T        2.4T  82% /lustre/medusa[OST:18]
medusa-OST0018_UUID        14.2T       11.1T        2.4T  82% /lustre/medusa[OST:24]
medusa-OST0020_UUID        14.2T       11.0T        2.5T  82% /lustre/medusa[OST:32]
medusa-OST0039_UUID        14.2T       11.0T        2.5T  82% /lustre/medusa[OST:57]
medusa-OST0001_UUID        14.2T       11.5T        2.0T  85% /lustre/medusa[OST:1]
medusa-OST001c_UUID        14.2T       11.6T        1.9T  86% /lustre/medusa[OST:28]
filesystem summary:         1.3P      959.6T      256.8T  79% /lustre/medusa

The least full OST is at 75% (which is 4% below the overall usage).  On the high end, I would expect to see several OSTs at 81%, a few at 82%, and maybe one or two at 83%.  Instead, I see two OSTs at 85% and 86%, which fall outside the norm.  Since the default stripe count for my file system is 2, this is an excellent indication that someone has a misstriped file.
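
If you want to automate that eyeballing, here is a minimal sketch of the check (it assumes the lfs df -h output format shown above, and the 5-point threshold is just my guess at where "outside the norm" starts for this curve):

    #!/bin/sh
    # Flag OSTs whose usage is more than 5 points above the overall
    # filesystem usage reported on the summary line.
    lfs df -h /lustre/medusa | awk '
        /OST/     { pct[$1] = $5 + 0 }   # per-OST Use% column
        /summary/ { fs = $6 + 0 }        # overall Use% column
        END {
            for (ost in pct)
                if (pct[ost] > fs + 5)
                    printf "%s at %d%% (filesystem at %d%%)\n", ost, pct[ost], fs
        }'

Against the output above, that would flag OST0001 (85%) and OST001c (86%) but nothing in the 81-82% pack.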

Which means that I need to stop typing now and track down the user who is messing up my nice file system….
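
For anyone who wants to run the same hunt: lfs find with --obd lists files that have objects on a particular OST, and lfs getstripe shows their layouts.  Something like this (the 500GB cutoff is just my guess at "big enough to skew an OST"):

    # List large files with objects on the overfull OST, along with
    # their stripe counts:
    lfs find /lustre/medusa --obd medusa-OST001c_UUID --size +500G |
    while read -r f; do
        printf "%s  stripe_count=%s\n" "$f" "$(lfs getstripe -c "$f")"
    done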

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu


