[Lustre-discuss] Recovery without end

Charles Taylor taylor at hpc.ufl.edu
Wed Feb 25 08:22:49 PST 2009


I'm going to pipe in here.  We too use a very large (1000) timeout
value.  We have two separate Lustre file systems: one consists of
two rather beefy OSSs with 12 OSTs each (FalconIII FC-SATA RAID);
the other consists of 8 OSSs with 3 OSTs each (Xyratex 4900FC).  We
have about 500 clients and support both tcp and o2ib NIDs.  We run
Lustre 1.6.4.2 on a patched 2.6.18-8.1.14 CentOS/RH kernel.  It has
worked *very* well for us for over a year now - very few problems
with very good performance under very heavy loads.
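
For anyone curious how we serve both network types, the lines below
are roughly what an LNET module configuration for that looks like on
1.6.x - the interface names are placeholders, not our actual devices:

    # /etc/modprobe.conf on a Lustre 1.6.x node: declare both LNET
    # networks so each node gets a NID on both (interface names are
    # examples only)
    options lnet networks="o2ib0(ib0),tcp0(eth0)"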

We've tried setting our timeout to lower values but settled on the
1000 value (despite the long recovery periods) because if we don't,
our Lustre connectivity starts to break down and our mounts come and
go with errors like "transport endpoint failure" or "transport
endpoint not connected" or some such (it's been a while now).  File
system access comes and goes randomly on nodes.  We tried many
tunings and looked for other sources of problems (underlying network
issues).  Ultimately, the only thing we found that fixed this was to
extend the timeout value.
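
To be concrete, "extend the timeout" just means bumping the same
/proc knob Thomas mentions below.  A rough sketch, run on clients
and servers alike (the setting does not survive a reboot, so it also
lives in our provisioning scripts):

    # check the current obd timeout, in seconds (Lustre 1.6.x path)
    cat /proc/sys/lustre/timeout

    # raise it on every client and server
    echo 1000 > /proc/sys/lustre/timeout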

I know you will be tempted to tell us that our network must be flaky
but it simply is not.  We'd love to understand why we need such a
large timeout value and why, if we don't use a large value, we see
these transport endpoint failures.  However, after spending several
days trying to understand and resolve the issue, we finally just
accepted the long timeout as a suitable workaround.

I wonder if there are others who have silently done the same.  We'll
be upgrading to 1.6.6 or 1.6.7 in the not-too-distant future.  Maybe
then we'll be able to do away with the long timeout value, but until
then, we need it.  :(

Just my two cents,

Charlie Taylor
UF HPC Center

On Feb 25, 2009, at 11:03 AM, Brian J. Murrell wrote:

> On Wed, 2009-02-25 at 16:09 +0100, Thomas Roth wrote:
>>
>> Our /proc/sys/lustre/timeout is 1000
>
> That's way too high.  Long recoveries are exactly the reason you don't
> want this number to be huge.
>
>> - there has been some debate on
>> this large value here, but most other installations will not run in a
>> network environment with a setup as crazy as ours.
>
> What's so crazy about your setup?  Unless your network is very flaky
> and/or you have not tuned your OSSes properly, there should be no need
> for such a high timeout, and if there is, you need to address the
> problems requiring it.
>
>> Putting the timeout to 100 immediately results in "Transport
>> endpoint" errors; it is impossible to run Lustre like this.
>
> 300 is the max that we recommend and we have very large production
> clusters that use such values successfully.
>
>> Since this is a 1.6.5.1 system, I activated the adaptive timeouts
>> and set them to equally large values:
>> /sys/module/ptlrpc/parameters/at_max = 6000
>> /sys/module/ptlrpc/parameters/at_history = 6000
>> /sys/module/ptlrpc/parameters/at_early_margin = 50
>> /sys/module/ptlrpc/parameters/at_extra = 30
>
> This is likely not good either.  I will let somebody more
> knowledgeable about AT comment in detail, though.  It's a new feature
> and not yet in wide use, so real-world experience is still limited.
>
> b.
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss



