[lustre-discuss] stripe count recommendation, and proposal for auto-stripe tool

Thu May 26 10:13:26 PDT 2016

Andreas,

Thanks very much for your comments...

On Wed, May 18, 2016 at 1:30 PM, Dilger, Andreas <andreas.dilger at intel.com>
wrote:

>
> On 2016/05/18, 11:22, "Nathan Dauchy - NOAA Affiliate" <
> nathan.dauchy at noaa.gov> wrote:
>
> I'm looking for your experience and perhaps some lively discussion
> regarding "best practices" for choosing a file stripe count.  The Lustre
> manual has good tips on "Choosing a Stripe Size", and in practice the
> default 1M rarely causes problems on our systems. Stripe count, on the other
> hand, is far harder: it is difficult to choose a single value that is
> efficient for a general-purpose, multi-use, site-wide file system.
> What do you all recommend as a reasonable rule of thumb that works for
> "most" users' needs, where stripe count can be determined based only on
> static data attributes (such as file size)?
>
> Using the log2() value seems reasonable.
>
> Ideally, I would like to have a tool to give the users and say "go
> restripe your directory with this command" and it will do the right thing
> in 90% of cases.  See the rough patch to lfs_migrate (included below) which
> should help explain what I'm thinking.  Probably there are more efficient
> ways of doing things, but I have tested it lightly and it works as a
> proof-of-concept.
>
>
> I'd welcome this as a patch submitted to Gerrit.
>
>
A Jira ticket has been created:
https://jira.hpdd.intel.com/browse/LU-8207

The draft patch is there, and probably needs a bit of work before pushing
into Gerrit.  If anyone wants to tackle that, assistance appreciated of
course! :)

> With a good programmatic rule of thumb, we (as a Lustre community!) can
> eventually work with application developers to embed the stripe count
> selection into their code and get things at least closer to right up
> front.  Even if trial and error is involved to find the optimal setting, at
> least the rule of thumb can be a _starting_point_ for the users, and they
> can tweak it from there based on application, model, scale, dataset, etc.
>
> Thinking farther down the road, with progressive file layout, what
> algorithm will be used as the default?
>
>
> To be clear, the PFL implementation does not currently have an algorithmic
> layout, rather a series of thresholds based on file size that will select
> different layouts (initially stripe counts, but could be anything including
> stripe size, OST pools, etc).  The PFL size thresholds and stripe counts
> _could_ be set up (manually) as a geometric series, but they can also be
> totally arbitrary if you want.
>

Understood.  However, Lustre will still need to have some sort of default
layout.  I was thinking that it would be good to match that future code
with current best-practice recommendations and whatever ends up in
lfs_migrate for auto-striping.

>
> If Lustre gets to the point where it can rebalance OST capacity behind the
> scenes, could it also make some intelligent choice about restriping very
> large files to spread out load and better balance capacity?  (Would that
> mean we need a bit set on the file to flag whether the stripe info was set
> specifically by the user or automatically by Lustre tools or it was just
> using the system default?)  Can the filesystem track concurrent access to a
> file, and perhaps migrate the file and adjust stripe count based on number
> of active clients?
>
>
> I think this would be an interesting task for RobinHood, since it already
> has much of this information.  It could find large files with low stripe
> counts and restripe them during OST rebalancing.
>

Yes, the need to rebalance OSTs when adding new ones to the file system is
in part what prompted this topic.  We have only experimented with Robinhood
as a low-priority task, but hope to use it more in the future.

I was picturing that the general rebalance process (without Robinhood)
would be something like:

* Identify the most full OSTs with something like:
# lfs df $FS | grep OST | sort -k 4 -n | head -n 4

* Search for singly-striped, large, and inactive files on those OSTs with:
# lfs find * -type f -mtime +30 -size +8G -c 1 -O A,B,N,X > filelist

* Restripe those files with:
# lfs_migrate -A -y < filelist
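
As a rough sketch of how the first step could feed the second, the helper
below turns "lfs df" output into a comma-separated OST list. The function
name fullest_osts is hypothetical, and it assumes the usual "lfs df" layout
where the first column is the OST UUID and the fourth column is the
available space in 1K blocks. Like the command above, it ranks by least
available space, which only means "most full" when the OSTs are the same
size.

```shell
#!/bin/bash
# Hypothetical helper (not part of Lustre): read "lfs df" output on stdin
# and print the UUIDs of the N OSTs with the least available space,
# joined by commas.
fullest_osts() {
    local n=${1:-4}
    grep OST |                 # keep only OST lines (drops header, MDTs, summary)
        sort -k 4 -n |         # ascending by the Available column
        head -n "$n" |         # the N OSTs with the least free space
        while read -r uuid rest; do
            printf '%s\n' "$uuid"   # first column is the OST UUID
        done |
        paste -sd, -           # join the UUIDs into one comma-separated line
}
```

It would be used as "lfs df $FS | fullest_osts 4", with the result passed to
the -O option of lfs find in place of the A,B,N,X placeholder. Whether a
given lfs version expects UUIDs or OST indices there is worth checking with
"lfs help find" first.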

> One last comment on the patch below:
> Instead of involving "bc", which is not guaranteed to be installed, why
> not just have a simple "divide by 2, increment stripe_count" loop after
> converting bytes to GiB?  That would be a few cycles for huge files, but
> probably still faster than fork/exec of an external binary as it could be
> at most 63 - 30 = 33 loops and usually many fewer.
>

Good point.  I made a note to that effect in the Jira ticket.  In general,
I would think that external commands in the "coreutils" package are OK
(cut, wc, head, tr, comm) but others (bc, sed, awk, grep) should be avoided.
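
For reference, a minimal sketch of the "divide by 2, increment stripe_count"
loop in pure bash, with no external commands. The function name
stripe_count_for_size is hypothetical, and the exact mapping (one stripe up
to 1 GiB, then one more stripe per doubling) is an assumption for
illustration rather than what the draft patch does. It also does not cap
the result at the number of available OSTs.

```shell
#!/bin/bash
# Hypothetical helper (not taken from lfs_migrate): estimate a stripe count
# of roughly log2(size in GiB) + 1 using only shell arithmetic.
stripe_count_for_size() {
    local size_bytes=$1
    local size_gib=$(( size_bytes / 1073741824 ))   # bytes -> whole GiB
    local count=1                                    # never fewer than 1 stripe

    # Halve the size until it drops to 1 GiB or less, adding one stripe
    # per halving. A 64-bit size needs at most 33 iterations.
    while [ "$size_gib" -gt 1 ]; do
        size_gib=$(( size_gib / 2 ))
        count=$(( count + 1 ))
    done

    echo "$count"
}

stripe_count_for_size 8589934592      # 8 GiB -> prints 4
stripe_count_for_size 1099511627776   # 1 TiB -> prints 11
```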

Cheers,
Nathan