[lustre-discuss] stripe count recommendation, and proposal for auto-stripe tool

Dilger, Andreas andreas.dilger at intel.com
Sun May 29 11:15:04 PDT 2016


Definitely, the OST count is the upper limit on the stripe count. Fortunately, that doesn't limit the file size to 100G x stripe_count. There _is_ a limit of 16TB per object for ldiskfs and a 2^63-byte limit for ZFS (at least after LU-7890 is fixed).
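To put numbers on it: the per-file ceiling is stripe_count times the per-object limit, so the 15 OSTs mentioned below would allow files up to 15 x 16TB = 240TB on ldiskfs, far beyond the 1.5T that the 100G rule would yield on the same system.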

Cheers, Andreas

On May 29, 2016, at 11:16, E.S. Rosenberg <esr+lustre at mail.hebrew.edu> wrote:

After (finally) reading this interesting discussion, I was left with one question:
Some of the rules suggested above would imply quite a large number of stripes as files get truly big; isn't the (logical) upper limit on striping the number of OSTs in the system?
Striping more than the OST count would intuitively seem to be counter-productive (granted, even on a fairly small system we already have 15 OSTs, so depending on which rule was used, files could be up to 1.5T under the 100G rule)...

Thanks,
Eli

On Thu, May 26, 2016 at 8:13 PM, Nathan Dauchy - NOAA Affiliate <nathan.dauchy at noaa.gov> wrote:
Andreas,

Thanks very much for your comments...

On Wed, May 18, 2016 at 1:30 PM, Dilger, Andreas <andreas.dilger at intel.com> wrote:

On 2016/05/18, 11:22, "Nathan Dauchy - NOAA Affiliate" <nathan.dauchy at noaa.gov> wrote:
I'm looking for your experience, and perhaps some lively discussion, regarding "best practices" for choosing a file stripe count.  The Lustre manual has good tips on "Choosing a Stripe Size", and in practice the default 1M rarely causes problems on our systems.  Stripe count, on the other hand, is far more difficult to pin to a single value that is efficient for a general-purpose, multi-use, site-wide file system.
What do you all recommend as a reasonable rule of thumb that works for "most" users' needs, where the stripe count can be determined from static data attributes alone (such as file size)?
Using the log2() value seems reasonable.
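As a rough illustration, assuming a minimum of one stripe and a cap at the OST count, the log2(size in GiB) rule yields:

    1 GiB  ->  1 stripe
    8 GiB  ->  3 stripes
    64 GiB ->  6 stripes
    1 TiB  -> 10 stripes
    32 TiB -> 15 stripes (capped on a 15-OST system)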

Ideally, I would like to have a tool to give the users and say "go restripe your directory with this command" and have it do the right thing in 90% of cases.  See the rough patch to lfs_migrate (included below), which should help explain what I'm thinking.  There are probably more efficient ways of doing things, but I have tested it lightly and it works as a proof of concept.

I'd welcome this as a patch submitted to Gerrit.


A Jira ticket has been created:
https://jira.hpdd.intel.com/browse/LU-8207

The draft patch is there, and probably needs a bit of work before pushing into Gerrit.  If anyone wants to tackle that, assistance appreciated of course! :)

With a good programmatic rule of thumb, we (as a Lustre community!) can eventually work with application developers to embed the stripe count selection into their code and get things at least closer to right up front.  Even if trial and error is involved to find the optimal setting, at least the rule of thumb can be a _starting_point_ for the users, and they can tweak it from there based on application, model, scale, dataset, etc.

Thinking farther down the road, with progressive file layout, what algorithm will be used as the default?

To be clear, the PFL implementation does not currently have an algorithmic layout, but rather a series of thresholds based on file size that select different layouts (initially stripe counts, but this could be anything, including stripe size, OST pools, etc.).  The PFL size thresholds and stripe counts _could_ be set up (manually) as a geometric series, but they can also be totally arbitrary if you want.
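For example, a manually configured geometric series might look like this, using the extent-based setstripe syntax proposed for PFL (the extent boundaries and stripe counts here are arbitrary examples, not defaults):

# lfs setstripe -E 256M -c 1 -E 4G -c 4 -E -1 -c 16 /mnt/testfs/dir

Each component applies to the portion of the file up to the given extent, and the final "-E -1" component covers everything beyond the last threshold.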

Understood.  However, Lustre will still need to have some sort of default layout.  I was thinking that it would be good to match that future code with current best-practice recommendations and whatever ends up in lfs_migrate for auto-striping.


If Lustre gets to the point where it can rebalance OST capacity behind the scenes, could it also make some intelligent choice about restriping very large files to spread out load and better balance capacity?  (Would that mean we need a bit set on the file to flag whether the stripe info was set explicitly by the user, set automatically by Lustre tools, or simply inherited from the system default?)  Could the filesystem track concurrent access to a file, and perhaps migrate the file and adjust its stripe count based on the number of active clients?

I think this would be an interesting task for RobinHood, since it already has much of this information.  It could find large files with low stripe counts and restripe them during OST rebalancing.

Yes, the need to rebalance OSTs when adding new ones to the file system is in part what prompted this topic.  We have only experimented with Robinhood as a low-priority task, but hope to use it more in the future.

I was picturing that the general rebalance process (without Robinhood) would be something like the steps below (a consolidated sketch follows the list):

* Identify the most full OSTs with something like:
# lfs df $FS | grep OST | sort -k 4 -n | head -n 4

* Search for singly-striped, large, and inactive files on those OSTs with:
# lfs find * -type f -mtime +30 -size +8G -c 1 -O A,B,N,X > filelist

* Restripe those files with:
# lfs_migrate -A -y < filelist
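Tying those steps together into one pass (a rough sketch; $FS and the age/size thresholds are placeholders to adapt per site, and -A is the auto-stripe option from the draft patch):

# OSTS=$(lfs df $FS | grep OST | sort -k 4 -n | head -n 4 | cut -d' ' -f1 | paste -sd, -)
# lfs find $FS -type f -mtime +30 -size +8G -c 1 -O $OSTS > filelist
# lfs_migrate -A -y < filelist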


One last comment on the patch below:
Instead of involving "bc", which is not guaranteed to be installed, why not just have a simple "divide by 2, increment stripe_count" loop after converting bytes to GiB?  That would cost a few cycles for huge files, but it is probably still faster than the fork/exec of an external binary, since it runs at most 63 - 30 = 33 iterations and usually many fewer.
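Something like this (a minimal sketch; variable names are illustrative, not from the patch):

    gib=$((size_bytes >> 30))     # bytes -> GiB (divide by 2^30)
    stripe_count=0
    while [ "$gib" -gt 1 ]; do    # integer log2 by repeated halving
        gib=$((gib >> 1))
        stripe_count=$((stripe_count + 1))
    done
    [ "$stripe_count" -lt 1 ] && stripe_count=1   # at least one stripe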

Good point.  I made a note to that effect in the Jira ticket.  In general, I would think that external commands in the "coreutils" package are OK (cut, wc, head, tr, comm) but others (bc, sed, awk, grep) should be avoided.

Cheers,
Nathan


_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

