[Lustre-discuss] MDS overload, why?

Michael Kluge Michael.Kluge at tu-dresden.de
Fri Oct 9 07:10:22 PDT 2009


Hmm. That should be enough. I guess you need to set up a loghost for syslog
then, and a reliable serial console to get stack traces. Everything else
would be just a wild guess (as the question about the RAM size was).
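A minimal sketch of such a setup on CentOS 4 (sysklogd); the loghost name, serial port, and baud rate below are placeholders, not values from this thread:

```shell
# /etc/syslog.conf on the MDS: forward all messages to a central loghost
# ("loghost" is a placeholder hostname)
*.*    @loghost

# On the loghost, syslogd must be started with -r to accept remote messages:
#   SYSLOGD_OPTIONS="-r"        (in /etc/sysconfig/syslog)

# Kernel command line (grub.conf) on the MDS: mirror console output,
# including oops traces, to the first serial port as well as the VGA console:
#   console=tty0 console=ttyS0,115200

# Enable SysRq so a task-state dump (Alt-SysRq-t, or "echo t > /proc/sysrq-trigger")
# can be forced over the serial line even when the box no longer accepts logins:
#   echo 1 > /proc/sys/kernel/sysrq
```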

Michael

> Hi,
> 
> 8 GB of RAM, 2x 4-core Intel Xeon E5410 @ 2.33 GHz
> 
> Arne
> 
> On Fri, 2009-10-09 at 12:16 +0200, Michael Kluge wrote:
> > Hi Arne,
> > 
> > Could be memory pressure, with the OOM killer running and shooting at
> > things. How much memory does your server have?
> > 
> > 
> > Michael
> > 
> > On Friday, 2009-10-09 at 10:26 +0200, Arne Brutschy wrote:
> > > Hi everyone,
> > > 
> > > Two months ago, we switched our ~80-node cluster from NFS to Lustre: 1
> > > MDS, 4 OSTs, Lustre 1.6.7.2 on Rocks 4.2.1 / CentOS 4.2 / Linux
> > > 2.6.9-78.0.22.
> > > 
> > > We were quite happy with Lustre's performance, especially because the
> > > bottlenecks caused by /home disk access were history.
> > > 
> > > On Saturday, the cluster went down (i.e., was inaccessible). After some
> > > investigation, I found that the cause seems to be an overloaded MDS.
> > > Over the following four days this happened multiple times, and it could
> > > only be resolved by 1) killing all user jobs and 2) hard-resetting the MDS.
> > > 
> > > The MDS did not respond to any command; when I managed to get a video
> > > signal (not often), the load was >170. Additionally, a kernel oops was
> > > displayed twice, but unfortunately I have no record of them.
> > > 
> > > The clients showed the following error:
> > > > Oct  8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5  req at f6222800 x8702488/t0 o250->MGS at 10.255.255.206@tcp:26/25 lens 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0
> > > > Oct  8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar messages
> > > 
> > > So, my question is: what could cause such a load? The cluster was not
> > > excessively used... Is this a bug, or is a user's job creating the load?
> > > How can I protect Lustre against this kind of failure?
> > > 
> > > Thanks in advance,
> > > Arne 
> > > 
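As a side note on the errors quoted above: the `status -5` in those `request_out_callback()` messages is a negated Linux errno, which can be decoded quickly (assuming a Python interpreter is at hand; the exact message text is Linux-specific):

```shell
# -5 is -EIO ("Input/output error"): the outgoing RPC to the MGS failed,
# which is consistent with the MDS/MGS node being unresponsive rather than
# with a problem on the client itself.
python3 -c 'import errno, os; print(errno.errorcode[5], os.strerror(5))'
# prints: EIO Input/output error
```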
-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:    (+49) 351 463-37773
e-mail: michael.kluge at tu-dresden.de
WWW:    http://www.tu-dresden.de/zih
