[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)

Charles Taylor taylor at hpc.ufl.edu
Wed Mar 5 08:34:28 PST 2008


Well, go figure.    We are running...

Lustre: 1.6.4.2 on clients and servers
Kernel: 2.6.18-8.1.14.el5Lustre (clients and servers)
Platform: X86_64 (opteron 275s, mostly)
Interconnect: IB,  Ethernet
IB Stack: OFED 1.2

We already posted our procedure for patching the kernel, building
OFED, and building Lustre, so I won't go into that again. Like I
said, we just brought a new file system online. Everything looked
fine at first with just a few clients mounted. Once we mounted all
408 (or so) clients, we started getting all kinds of "transport
endpoint failures", and the MGSs and OSTs were evicting clients left
and right. We looked for network problems and could not find any of
any substance. Once we increased the obd/lustre system timeout
setting (obd_timeout) as previously discussed, the errors vanished.
This was consistent with our experience with 1.6.3 as well; that
file system has been online since early December. Both file systems
appear to be working well.
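
(For anyone who wants to try the same tuning, here is a rough sketch
of how we understand the timeout can be raised on 1.6.x. The value
300 and the file system name "testfs" are only placeholders, and the
exact proc path / lctl syntax may differ by version, so treat this
as an illustration rather than a recipe.)

    # Runtime change, on each client and server (lost at reboot):
    echo 300 > /proc/sys/lustre/timeout

    # Persistent change, run once on the MGS node; "testfs" is a
    # placeholder file system name:
    lctl conf_param testfs.sys.timeout=300
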

I'm not sure what to make of it. Perhaps we are just masking another
problem. Perhaps there are some other, related values that need to
be tuned. We've done the best we could, but I'm sure there is still
much about Lustre we don't know. We'll try to get someone out to the
next class, but until then we're on our own, so to speak.

Charlie Taylor
UF HPC Center

>>
>> Just so you guys know, 1000 seconds for the obd_timeout is very, very
>> large!  As you could probably guess, we have some very, very big  
>> Lustre
>> installations and to the best of my knowledge none of them are using
>> anywhere near that.  AFAIK (and perhaps a Sun engineer with closer
>> experience to some of these very large clusters might correct me) the
>> largest value that the largest clusters are using is in the
>> neighbourhood of 300s.  There has to be some other problem at play  
>> here
>> that you need 1000s.
>
> I can confirm that at a recent large installation with several  
> thousand
> clients, the default of 100 is in effect.
>
>>


