[Lustre-discuss] Recovery without end

Wed Feb 25 08:03:23 PST 2009

On Wed, 2009-02-25 at 16:09 +0100, Thomas Roth wrote:
> 
> Our /proc/sys/lustre/timeout is 1000

That's way too high.  Long recoveries are exactly the reason you don't
want this number to be huge.

>  - there has been some debate on
> this large value here, but most other installations will not run in a
> network environment with a setup as crazy as ours.

What's so crazy about your setup?  Unless your network is very flaky
and/or you have not tuned your OSSes properly, there should be no need
for such a high timeout, and if there is, you need to address the
problems requiring it.

> Putting the timeout
> to 100 immediately results in "Transport endpoint" errors, impossible to
> run Lustre like this.

300 is the max that we recommend and we have very large production
clusters that use such values successfully.
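
For anyone who wants to check a node against that ceiling, here is a
minimal Python sketch.  It assumes the timeout is exposed at the
/proc/sys/lustre/timeout path quoted above and treats 300 seconds as the
recommended maximum; the function names are made up for illustration and
are not part of any Lustre tooling.

```python
from pathlib import Path

# Path as quoted in this thread; 300 s is the recommended ceiling above.
# The helper names below are hypothetical, for illustration only.
TIMEOUT_PATH = Path("/proc/sys/lustre/timeout")
RECOMMENDED_MAX = 300


def check_timeout(value, recommended_max=RECOMMENDED_MAX):
    """Return a one-line verdict for a timeout value in seconds."""
    if value > recommended_max:
        return f"timeout={value} exceeds recommended max of {recommended_max}"
    return f"timeout={value} is within recommended max of {recommended_max}"


def read_timeout(path=TIMEOUT_PATH):
    """Read the current timeout, or return None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # File missing, unreadable, or not a plain integer.
        return None


if __name__ == "__main__":
    current = read_timeout()
    if current is None:
        print(f"could not read {TIMEOUT_PATH}")
    else:
        print(check_timeout(current))
```

Run on a node with the value quoted at the top of this thread, it would
report that 1000 exceeds the recommended maximum of 300.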

> Since this is a 1.6.5.1 system, I activated the adaptive timeouts  - and
> put them to equally large values,
> /sys/module/ptlrpc/parameters/at_max = 6000
> /sys/module/ptlrpc/parameters/at_history = 6000
> /sys/module/ptlrpc/parameters/at_early_margin = 50
> /sys/module/ptlrpc/parameters/at_extra = 30

This is likely not good either.  I will let somebody more knowledgeable
about AT comment in detail, though.  It's a new feature and not yet in
wide use, so real-world experience with it is still limited.

b.
