[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Craig Prescott
prescott at hpc.ufl.edu
Tue Mar 4 16:37:05 PST 2008
Hi Aaron,
As Charlie mentioned, we have 400 clients and a timeout
value of 1000 is "enough" for us. How many clients do you
have? If it is more than 400, or the ratio of your o2ib/tcp
clients is not like ours (80/20), you may need a bigger value.
Also, we have observed that occasionally when we set the timeout
on our MGS/MDS machine via:
lctl conf_param <fsname>.sys.timeout=1000
it does not "take" everywhere. That is, you should check
your OSSes and clients to confirm that the correct timeout
is reflected in /proc/sys/lustre/timeout. If it isn't, just echo
the correct number in there. If you have already checked this, maybe
try a bigger value?
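The check-and-fix step above can be wrapped in a small shell function; this is
just a sketch, and the function name (ensure_lustre_timeout) and the pdsh host
list in the usage note are hypothetical, not Lustre tools:

```shell
# Sketch: verify the effective Lustre timeout on the local node and, if the
# conf_param setting did not "take", echo the desired value in directly.
# The proc path is parameterized so the function can be exercised elsewhere.
ensure_lustre_timeout() {
    want=$1
    proc=${2:-/proc/sys/lustre/timeout}
    current=$(cat "$proc" 2>/dev/null)
    if [ "$current" != "$want" ]; then
        # Setting did not take here; write the value directly.
        echo "$want" > "$proc"
    fi
    # Report the value now in effect.
    cat "$proc"
}
```

You could then run it on every OSS and client, for example with something like
pdsh -w oss[01-16],client[001-400] (adjust to your own host names):
ensure_lustre_timeout 1000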
Hope that helps,
Craig Prescott
Aaron Knister wrote:
> I made this change and clients are still being evicted. This is very
> frustrating. It happens over tcp and infiniband. My timeout is 1000.
> Anybody know why the clients don't reconnect?
>
> On Mar 4, 2008, at 3:55 PM, Aaron S. Knister wrote:
>
>> I think I tried that before and it didn't help, but I will try it
>> again. Thanks for the suggestion.
>>
>> -Aaron
>>
>> ----- Original Message -----
>> From: "Charles Taylor" <taylor at hpc.ufl.edu <mailto:taylor at hpc.ufl.edu>>
>> To: "Aaron S. Knister" <aaron at iges.org <mailto:aaron at iges.org>>
>> Cc: "lustre-discuss" <lustre-discuss at clusterfs.com
>> <mailto:lustre-discuss at clusterfs.com>>, "Thomas Wakefield"
>> <twake at cola.iges.org <mailto:twake at cola.iges.org>>
>> Sent: Tuesday, March 4, 2008 3:41:04 PM GMT -05:00 US/Canada Eastern
>> Subject: Re: [Lustre-discuss] Cannot send after transport endpoint
>> shutdown (-108)
>>
>> We've seen this before as well. Our experience is that the
>> obd_timeout is far too small for large clusters (ours is 400+
>> nodes); the only way we avoid these errors is by setting it to
>> 1000, which seems high to us but appears to work and puts an end
>> to the transport endpoint shutdowns.
>>
>> On the MDS....
>>
>> lctl conf_param srn.sys.timeout=1000
>>
>> You may have to do this on the OSSes as well unless you restart
>> them, but I could be wrong on that. You should check it everywhere
>> with...
>>
>> cat /proc/sys/lustre/timeout
>>
>>
>> On Mar 4, 2008, at 3:31 PM, Aaron S. Knister wrote:
>>
>> > This morning I've had both my infiniband and tcp lustre clients
>> > hiccup. They are evicted from the server presumably as a result of
>> > their high load and consequent timeouts. My question is- why don't
>> > the clients re-connect. The infiniband and tcp clients both give
>> > the following message when I type "df" - Cannot send after
>> > transport endpoint shutdown (-108). I've been battling with this on
>> > and off now for a few months. I've upgraded my infiniband switch
>> > firmware, all the clients and servers are running the latest
>> > version of lustre and the lustre patched kernel. Any ideas?
>> >
>> > -Aaron
>> > _______________________________________________
>> > Lustre-discuss mailing list
>> > Lustre-discuss at lists.lustre.org <mailto:Lustre-discuss at lists.lustre.org>
>> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>
> Aaron Knister
> Associate Systems Analyst
> Center for Ocean-Land-Atmosphere Studies
>
> (301) 595-7000
> aaron at iges.org <mailto:aaron at iges.org>
>