[Lustre-discuss] MDS overload, why?
Arne Brutschy
arne.brutschy at ulb.ac.be
Fri Oct 9 01:26:44 PDT 2009
Hi everyone,
Two months ago, we switched our ~80-node cluster from NFS to Lustre: 1
MDS, 4 OSTs, Lustre 1.6.7.2 on Rocks 4.2.1/CentOS 4.2/Linux
2.6.9-78.0.22.
We were quite happy with Lustre's performance, especially because the
bottlenecks caused by /home disk access were history.
On Saturday, the cluster went down (i.e., it became inaccessible). After
some investigation, I found that the cause seems to be an overloaded
MDS. Over the following 4 days, this happened multiple times and could
only be resolved by 1) killing all user jobs and 2) hard-resetting the
MDS. The MDS did not respond to any command; when I managed to get a
video signal (which was rare), the load was >170. Additionally, kernel
oopses were displayed twice, but unfortunately I have no record of them.
The clients showed the following error:
> Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at f6222800 x8702488/t0 o250->MGS at 10.255.255.206@tcp:26/25 lens 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0
> Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar messages
So, my question is: what could cause such a load? The cluster was not
being used excessively... Is this a bug, or a user's job creating the
load? How can I protect Lustre against this kind of failure?
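In the meantime, I was considering a crude load watchdog on the MDS so we at least get a warning before it becomes unreachable. This is just a generic shell sketch, not a Lustre tool; the threshold is an arbitrary guess on my part:

```shell
#!/bin/sh
# Hypothetical MDS load watchdog (my own sketch, nothing Lustre-specific).
# check_load LOAD THRESHOLD -> prints a warning if LOAD >= THRESHOLD.
check_load() {
    # strip the fractional part of the load average for integer comparison
    if [ "${1%.*}" -ge "$2" ]; then
        echo "WARNING: load $1 exceeds threshold $2"
    fi
}

# Read the 1-minute load average and compare against a configurable limit;
# this could be run from cron and mailed to the admin.
check_load "$(cut -d' ' -f1 /proc/loadavg)" "${THRESHOLD:-50}"
```

Of course this only gives early warning; it does not explain where the load comes from.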
Thanks in advance,
Arne
--
Arne Brutschy
Ph.D. Student Email arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6 Web iridia.ulb.ac.be/~abrutschy
Universite' Libre de Bruxelles Tel +32 2 650 3168
Avenue Franklin Roosevelt 50 Fax +32 2 650 2715
1050 Bruxelles, Belgium (Fax at IRIDIA secretary)