[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)

Aaron Knister aaron at iges.org
Tue Mar 4 14:42:07 PST 2008


I made this change and clients are still being evicted. This is very
frustrating. It happens over both tcp and infiniband. My timeout is 1000.
Does anybody know why the clients don't reconnect?
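
For anyone who wants to confirm that the timeout really is the same on
every node, here is a minimal sketch. It assumes the value is exposed at
/proc/sys/lustre/timeout, as in the check quoted further down the thread.
The function name check_lustre_timeout is hypothetical and is not part of
Lustre; it only reads a file and compares it to the value you expect.

```shell
#!/bin/sh
# Hypothetical helper (not a Lustre tool): compare the timeout a node
# reports against the value we expect it to have.
# Usage: check_lustre_timeout EXPECTED [FILE]
# FILE defaults to /proc/sys/lustre/timeout (assumed location).
check_lustre_timeout() {
    expected="$1"
    file="${2:-/proc/sys/lustre/timeout}"

    # Bail out early if the file is missing or unreadable.
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi

    actual=$(cat "$file")
    if [ "$actual" = "$expected" ]; then
        echo "OK: timeout is $actual"
    else
        echo "MISMATCH: timeout is $actual, expected $expected"
        return 1
    fi
}
```

Running "check_lustre_timeout 1000" on the MDS, each OSS, and a few
clients should show quickly whether any node is still using an older
value.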

On Mar 4, 2008, at 3:55 PM, Aaron S. Knister wrote:

> I think I tried that before and it didn't help, but I will try it  
> again. Thanks for the suggestion.
>
> -Aaron
>
> ----- Original Message -----
> From: "Charles Taylor" <taylor at hpc.ufl.edu>
> To: "Aaron S. Knister" <aaron at iges.org>
> Cc: "lustre-discuss" <lustre-discuss at clusterfs.com>, "Thomas  
> Wakefield" <twake at cola.iges.org>
> Sent: Tuesday, March 4, 2008 3:41:04 PM GMT -05:00 US/Canada Eastern
> Subject: Re: [Lustre-discuss] Cannot send after transport endpoint  
> shutdown (-108)
>
> We've seen this before as well. Our experience is that the
> obd_timeout is far too small for large clusters (ours is 400+
> nodes), and the only way we avoid these errors is by setting it to
> 1000, which seems high to us but appears to work and puts an end to
> the transport endpoint shutdowns.
>
> On the MDS....
>
> lctl conf_param srn.sys.timeout=1000
>
> You may have to do this on the OSSs as well unless you restart the
> OSSs, but I could be wrong on that. You should check it everywhere
> with...
>
> cat /proc/sys/lustre/timeout
>
>
> On Mar 4, 2008, at 3:31 PM, Aaron S. Knister wrote:
>
> > This morning I've had both my infiniband and tcp lustre clients
> > hiccup. They are evicted from the server, presumably as a result of
> > their high load and consequent timeouts. My question is: why don't
> > the clients reconnect? The infiniband and tcp clients both give
> > the following message when I type "df": Cannot send after
> > transport endpoint shutdown (-108). I've been battling with this on
> > and off now for a few months. I've upgraded my infiniband switch
> > firmware, and all the clients and servers are running the latest
> > version of lustre and the lustre patched kernel. Any ideas?
> >
> > -Aaron
> > _______________________________________________
> > Lustre-discuss mailing list
> > Lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
aaron at iges.org



