[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Charles Taylor
taylor at hpc.ufl.edu
Wed Mar 5 03:56:46 PST 2008
Sure, we will provide you with more details of our installation, but
let me first say that, if memory serves, we did not pull that
number out of a hat. I believe there is a formula in one of the
Lustre tuning manuals for calculating the recommended timeout
value. I'll have to take a moment to go back and find it. Anyway,
if you use that formula for our cluster, the recommended timeout
value, I think, comes out to be *much* larger than 1000.
Later this morning, we will go back and find that formula and share
with the list how we came up with our timeout. Perhaps you can show
us where we are going wrong.
One more comment.... We just brought up our second large lustre file
system. It is 80+ TB served by 24 OSTs on two (pretty beefy)
OSSs. We just achieved over 2GB/sec of sustained (large block,
sequential) I/O from an aggregate of 20 clients. Our design target
was 1.0 GB/sec/OSS and we hit that pretty comfortably. That said,
when we first mounted the new (1.6.4.2) file system across all 400
nodes in our cluster, we immediately started getting "transport
endpoint failures" and evictions. We looked rather intensively for
network/fabric problems (we have both o2ib and tcp nids) and could
find none. All of our MPI apps are/were running just fine. The
only way we could get rid of the evictions and transport endpoint
failures was by increasing the timeout. Also, we knew to do this
based on our experience with our first lustre file system (1.6.3 +
patches) where we had to do the same thing.
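(For anyone who wants to experiment with this themselves: on 1.6 the
timeout can be checked and raised either per-node or cluster-wide.
A sketch of what we mean, assuming a Lustre 1.6.x system; "testfs"
below is a placeholder file system name, not ours:)

```shell
# Check the current obd timeout (in seconds) on any client or server:
cat /proc/sys/lustre/timeout

# Temporary change on one node; reverts on remount/reboot:
echo 300 > /proc/sys/lustre/timeout

# Persistent, cluster-wide change, run on the MGS node
# ("testfs" is a placeholder file system name):
lctl conf_param testfs.sys.timeout=300
```

The conf_param route is what we used, since echoing into /proc on
400 nodes doesn't survive a remount.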
Like I said, a little bit later, Craig or I will post more details
about our implementation. If we are doing something wrong with
regard to this timeout business, I would love to know what it is.
Thanks,
Charlie Taylor
UF HPC Center
On Mar 4, 2008, at 4:04 PM, Brian J. Murrell wrote:
> On Tue, 2008-03-04 at 15:55 -0500, Aaron S. Knister wrote:
>> I think I tried that before and it didn't help, but I will try it
>> again. Thanks for the suggestion.
>
> Just so you guys know, 1000 seconds for the obd_timeout is very, very
> large! As you could probably guess, we have some very, very big Lustre
> installations and to the best of my knowledge none of them are using
> anywhere near that. AFAIK (and perhaps a Sun engineer with closer
> experience to some of these very large clusters might correct me) the
> largest value that the largest clusters are using is in the
> neighbourhood of 300s. There has to be some other problem at play here
> that you need 1000s.
>
> Can you both please report your lustre and kernel versions? I know you
> said "latest" Aaron, but some version numbers might be more solid to
> go on.
>
> b.
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss