[Lustre-discuss] MDS crashes daily at the same hour

Christopher J.Walker C.J.Walker at qmul.ac.uk
Mon Jan 25 08:45:02 PST 2010


Brian J. Murrell wrote:
> On Sun, 2010-01-24 at 22:54 -0700, Andreas Dilger wrote: 
>> If they are call traces due to the watchdog timer, then this is somewhat
>> expected for extremely high load.
> 
> Andreas,
> 
> Do you know, does adaptive timeouts take care of setting the timeout
> appropriately on watchdogs?
> 

I don't think this is quite what you are asking, but here are some
details on our setup.

We have a mixture of 1.6.7.2 clients and 1.8.1.1 clients. The 1.6.7.2 
clients were not using adaptive timeouts when the problem occurred[1]. 
At least one of the 1.6 machines gets regularly swamped with network 
traffic - leading to packet loss.

The problem was caused by 40 of the 1.8.1.1 clients running updatedb.

Chris

[1] One machine is the interface to the outside world and runs
1.6.7.2. I see packet loss to this machine at times and have observed
Lustre hanging for a while. I suspect that it is occasionally
overloaded with network packets; Lustre packets are then lost
(probably at the router), followed by a timeout and recovery. I've
now enabled adaptive timeouts on this machine and will install a
10GigE card too.


