[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Charles Taylor
taylor at hpc.ufl.edu
Wed Mar 5 03:56:46 PST 2008
Sure, we will provide you with more details of our installation, but
let me first say that, if memory serves, we did not pull that
number out of a hat. I believe there is a formula in one of the
Lustre tuning manuals for calculating the recommended timeout
value. I'll have to take a moment to go back and find it. Anyway,
if you use that formula for our cluster, the recommended timeout
value, I think, comes out to be *much* larger than 1000.
Later this morning, we will go back and find that formula and share
with the list how we came up with our timeout. Perhaps you can show
us where we are going wrong.
One more comment.... We just brought up our second large lustre file
system. It is 80+ TB served by 24 OSTs on two (pretty beefy)
OSSs. We just achieved over 2GB/sec of sustained (large block,
sequential) I/O from an aggregate of 20 clients. Our design target
was 1.0 GB/sec/OSS and we hit that pretty comfortably. That said,
when we first mounted the new (1.6.4.2) file system across all 400
nodes in our cluster, we immediately started getting "transport
endpoint failures" and evictions. We looked rather intensively for
network/fabric problems (we have both o2ib and tcp nids) and could
find none. All of our MPI apps are/were running just fine. The
only way we could get rid of the evictions and transport endpoint
failures was by increasing the timeout. Also, we knew to do this
based on our experience with our first lustre file system (1.6.3 +
patches) where we had to do the same thing.
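(For anyone who wants to experiment with this themselves: on 1.6 the
timeout can be checked and raised either per-node or cluster-wide.
A sketch of what we mean, assuming a Lustre 1.6.x system; "testfs"
below is a placeholder file system name, not ours:)

```shell
# Check the current obd timeout (in seconds) on any client or server:
cat /proc/sys/lustre/timeout

# Temporary change on one node; reverts on remount/reboot:
echo 300 > /proc/sys/lustre/timeout

# Persistent, cluster-wide change, run on the MGS node
# ("testfs" is a placeholder file system name):
lctl conf_param testfs.sys.timeout=300
```

The conf_param route is what we used, since echoing into /proc on
400 nodes doesn't survive a remount.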
Like I said, a little bit later, Craig or I will post more details
about our implementation. If we are doing something wrong with
regard to this timeout business, I would love to know what it is.
Thanks,
Charlie Taylor
UF HPC Center
On Mar 4, 2008, at 4:04 PM, Brian J. Murrell wrote:
> On Tue, 2008-03-04 at 15:55 -0500, Aaron S. Knister wrote:
>> I think I tried that before and it didn't help, but I will try it
>> again. Thanks for the suggestion.
>
> Just so you guys know, 1000 seconds for the obd_timeout is very, very
> large! As you could probably guess, we have some very, very big Lustre
> installations and to the best of my knowledge none of them are using
> anywhere near that. AFAIK (and perhaps a Sun engineer with closer
> experience to some of these very large clusters might correct me) the
> largest value that the largest clusters are using is in the
> neighbourhood of 300s. There has to be some other problem at play here
> that you need 1000s.
>
> Can you both please report your lustre and kernel versions? I know you
> said "latest" Aaron, but some version numbers might be more solid to
> go on.
>
> b.
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss