[Lustre-discuss] Recovery without end
Brian J. Murrell
Brian.Murrell at Sun.COM
Wed Feb 25 08:03:23 PST 2009
On Wed, 2009-02-25 at 16:09 +0100, Thomas Roth wrote:
>
> Our /proc/sys/lustre/timeout is 1000
That's way too high. Long recoveries are exactly the reason you don't
want this number to be huge.
> - there has been some debate on
> this large value here, but most other installations will not run in a
> network environment with a setup as crazy as ours.
What's so crazy about your setup? Unless your network is very flaky
and/or you have not tuned your OSSes properly, there should be no need
for such a high timeout, and if there is, you need to address the
problems requiring it.
> Putting the timeout
> to 100 immediately results in "Transport endpoint" errors, impossible to
> run Lustre like this.
300 is the max that we recommend and we have very large production
clusters that use such values successfully.
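As a rough illustration of that guideline, here is a minimal Python
sketch that compares a timeout value against the 300-second ceiling
mentioned above. The helper names (check_timeout, read_timeout,
RECOMMENDED_MAX) are hypothetical and not part of Lustre; the only thing
taken from this thread is the /proc/sys/lustre/timeout path and the
figure of 300.

```python
from pathlib import Path

# Upper bound quoted in this thread; treat it as a guideline, not a hard limit.
RECOMMENDED_MAX = 300


def check_timeout(value, recommended_max=RECOMMENDED_MAX):
    """Return a short verdict for a timeout value given in seconds."""
    if value <= 0:
        raise ValueError("timeout must be positive")
    if value > recommended_max:
        return f"timeout {value} exceeds recommended max {recommended_max}"
    return f"timeout {value} is within recommended max {recommended_max}"


def read_timeout(path="/proc/sys/lustre/timeout"):
    """Read the timeout from the given file; return None if it is absent."""
    p = Path(path)
    if not p.is_file():
        return None
    return int(p.read_text().strip())


if __name__ == "__main__":
    current = read_timeout()
    if current is None:
        print("no Lustre timeout file found on this host")
    else:
        print(check_timeout(current))
```

With the value from the original post, check_timeout(1000) reports that
the timeout exceeds the recommended maximum.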
> Since this is a 1.6.5.1 system, I activated the adaptive timeouts - and
> put them to equally large values,
> /sys/module/ptlrpc/parameters/at_max = 6000
> /sys/module/ptlrpc/parameters/at_history = 6000
> /sys/module/ptlrpc/parameters/at_early_margin = 50
> /sys/module/ptlrpc/parameters/at_extra = 30
This is likely not good either. I will let somebody more knowledgeable
about AT comment in detail, though. It's a new feature and is not in
wide use yet, so real-world experience with it is still limited.
b.