[Lustre-discuss] MDS crashes daily at the same hour
Christopher J. Walker
C.J.Walker at qmul.ac.uk
Mon Jan 25 08:45:02 PST 2010
Brian J. Murrell wrote:
> On Sun, 2010-01-24 at 22:54 -0700, Andreas Dilger wrote:
>> If they are call traces due to the watchdog timer, then this is somewhat
>> expected for extremely high load.
>
> Andreas,
>
> Do you know, does adaptive timeouts take care of setting the timeout
> appropriately on watchdogs?
>
I don't think this is quite what you are asking, but here are some
details on our setup.
We have a mixture of 1.6.7.2 clients and 1.8.1.1 clients. The 1.6.7.2
clients were not using adaptive timeouts when the problem occurred[1].
At least one of the 1.6 machines gets regularly swamped with network
traffic - leading to packet loss.
The problem itself was caused by 40 clients, all running 1.8.1.1,
running updatedb.
Chris
[1] One machine is the interface to the outside world and runs
1.6.7.2. I see packet loss to this machine at times and have observed
Lustre hanging for a while. I suspect the problem is that it is
occasionally overloaded with network packets; Lustre packets are then
lost (probably at the router), followed by a timeout and recovery. I've
now enabled adaptive timeouts on this machine and will install a
10GigE card too.