[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Craig Prescott
prescott at hpc.ufl.edu
Tue Mar 4 16:37:05 PST 2008
Hi Aaron,
As Charlie mentioned, we have 400 clients and a timeout
value of 1000 is "enough" for us. How many clients do you
have? If it is more than 400, or the ratio of your o2ib/tcp
clients is not like ours (80/20), you may need a bigger value.
Also, we have observed that occasionally when we set the timeout
on our MGS/MDS machine via:
lctl conf_param <fsname>.sys.timeout=1000
it does not "take" everywhere. That is, you should check
your OSSes and clients to confirm that the correct timeout
is reflected in /proc/sys/lustre/timeout. If it isn't, just echo
the correct number in there. If you have already checked this, maybe
try a bigger value?
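The check-and-fix step above can be wrapped in a small shell function; this is
just a sketch, and the function name (ensure_lustre_timeout) and the pdsh host
list in the usage note are hypothetical, not Lustre tools:

```shell
# Sketch: verify the effective Lustre timeout on the local node and, if the
# conf_param setting did not "take", echo the desired value in directly.
# The proc path is parameterized so the function can be exercised elsewhere.
ensure_lustre_timeout() {
    want=$1
    proc=${2:-/proc/sys/lustre/timeout}
    current=$(cat "$proc" 2>/dev/null)
    if [ "$current" != "$want" ]; then
        # Setting did not take here; write the value directly.
        echo "$want" > "$proc"
    fi
    # Report the value now in effect.
    cat "$proc"
}
```

You could then run it on every OSS and client, for example with something like
pdsh -w oss[01-16],client[001-400] (adjust to your own host names):
ensure_lustre_timeout 1000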
Hope that helps,
Craig Prescott
Aaron Knister wrote:
> I made this change and clients are still being evicted. This is very
> frustrating. It happens over tcp and infiniband. My timeout is 1000.
> Anybody know why the clients don't reconnect?
>
> On Mar 4, 2008, at 3:55 PM, Aaron S. Knister wrote:
>
>> I think I tried that before and it didn't help, but I will try it
>> again. Thanks for the suggestion.
>>
>> -Aaron
>>
>> ----- Original Message -----
>> From: "Charles Taylor" <taylor at hpc.ufl.edu <mailto:taylor at hpc.ufl.edu>>
>> To: "Aaron S. Knister" <aaron at iges.org <mailto:aaron at iges.org>>
>> Cc: "lustre-discuss" <lustre-discuss at clusterfs.com
>> <mailto:lustre-discuss at clusterfs.com>>, "Thomas Wakefield"
>> <twake at cola.iges.org <mailto:twake at cola.iges.org>>
>> Sent: Tuesday, March 4, 2008 3:41:04 PM GMT -05:00 US/Canada Eastern
>> Subject: Re: [Lustre-discuss] Cannot send after transport endpoint
>> shutdown (-108)
>>
>> We've seen this before as well. Our experience is that the
>> obd_timeout is far too small for large clusters (ours is 400+
>> nodes); the only way we avoid these errors is by setting it to
>> 1000, which seems high to us but appears to work and puts an end
>> to the transport endpoint shutdowns.
>>
>> On the MDS....
>>
>> lctl conf_param srn.sys.timeout=1000
>>
>> You may have to do this on the OSSes as well unless you restart
>> them, but I could be wrong on that. You should check it everywhere
>> with...
>>
>> cat /proc/sys/lustre/timeout
>>
>>
>> On Mar 4, 2008, at 3:31 PM, Aaron S. Knister wrote:
>>
>> > This morning I've had both my infiniband and tcp lustre clients
>> > hiccup. They are evicted from the server presumably as a result of
>> > their high load and consequent timeouts. My question is- why don't
>> > the clients re-connect. The infiniband and tcp clients both give
>> > the following message when I type "df" - Cannot send after
>> > transport endpoint shutdown (-108). I've been battling with this on
>> > and off now for a few months. I've upgraded my infiniband switch
>> > firmware, all the clients and servers are running the latest
>> > version of lustre and the lustre patched kernel. Any ideas?
>> >
>> > -Aaron
>> > _______________________________________________
>> > Lustre-discuss mailing list
>> > Lustre-discuss at lists.lustre.org <mailto:Lustre-discuss at lists.lustre.org>
>> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>
> Aaron Knister
> Associate Systems Analyst
> Center for Ocean-Land-Atmosphere Studies
>
> (301) 595-7000
> aaron at iges.org <mailto:aaron at iges.org>
>