[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)

Aaron Knister aaron at iges.org
Wed Mar 5 10:09:53 PST 2008


Are you running DDR or SDR IB? Also, what hardware are you using for
your storage?

On Mar 5, 2008, at 11:34 AM, Charles Taylor wrote:

> Well, go figure.  We are running...
>
> Lustre: 1.6.4.2 on clients and servers
> Kernel: 2.6.18-8.1.14.el5Lustre (clients and servers)
> Platform: X86_64 (opteron 275s, mostly)
> Interconnect: IB,  Ethernet
> IB Stack: OFED 1.2
>
> We already posted our procedure for patching the kernel, building
> OFED, and building Lustre, so I won't go into that again.  Like I
> said, we just brought a new file system online.  Everything looked
> fine at first with just a few clients mounted.  Once we mounted all
> 408 (or so), we started getting all kinds of "transport endpoint
> failures", and the MGSs and OSTs were evicting clients left and
> right.  We looked for network problems and could not find anything
> of substance.  Once we increased the obd/lustre/system timeout
> setting as previously discussed, the errors vanished.  This was
> consistent with our experience with 1.6.3 as well.  That file system
> has been online since early December.  Both file systems appear to
> be working well.
>
> I'm not sure what to make of it.  Perhaps we are just masking
> another problem.  Perhaps there are other, related values that need
> to be tuned.  We've done the best we could, but I'm sure there is
> still much about Lustre we don't know.  We'll try to get someone out
> to the next class, but until then we're on our own, so to speak.
>
> Charlie Taylor
> UF HPC Center
>
>>>
>>> Just so you guys know, 1000 seconds for the obd_timeout is very,
>>> very large!  As you could probably guess, we have some very, very
>>> big Lustre installations, and to the best of my knowledge none of
>>> them are using anywhere near that.  AFAIK (and perhaps a Sun
>>> engineer with closer experience with some of these very large
>>> clusters might correct me) the largest value the largest clusters
>>> are using is in the neighbourhood of 300s.  There has to be some
>>> other problem at play here if you need 1000s.
>>
>> I can confirm that at a recent large installation with several
>> thousand clients, the default of 100 is in effect.
>>
>>>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
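
For anyone who lands on this thread later: on a 1.6.x system the
timeout being discussed can be checked and raised roughly like this.
This is only a sketch -- "testfs" is a placeholder for the real
fsname, and 300 is just the ballpark figure mentioned above, not a
recommendation:

   # Check the current obd timeout (in seconds) on a client or server:
   cat /proc/sys/lustre/timeout

   # Raise it temporarily on that node (not persistent across remounts):
   echo 300 > /proc/sys/lustre/timeout

   # Or set it persistently for the whole filesystem; run on the MGS:
   lctl conf_param testfs.sys.timeout=300

As the quoted reply notes, needing values anywhere near 1000s usually
points to some other underlying problem rather than a timeout that is
genuinely too small.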

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
aaron at iges.org
