[Lustre-discuss] MDS overload, why?

Michael Kluge Michael.Kluge@tu-dresden.de
Fri Oct 9 07:28:33 PDT 2009


LMT (http://code.google.com/p/lmt) might be able to give some hints as to
whether users are using the FS in a 'wild' fashion. As for the question
"what can cause this behaviour of my MDS", I guess the answer is: a
million things ;) There is no way to be more specific without more input
about the problem itself.
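
If you just want a quick, crude look before setting up LMT, reading the
MDS operation counters under /proc already tells you whether the load is
dominated by opens, getattrs, unlinks and so on. The little Python sketch
below is only an illustration; the /proc/fs/lustre/mds/*/stats path and
the "<opname> <count> ..." line format are assumptions based on a typical
1.8-era setup, so check what your MDS actually exposes.

#!/usr/bin/env python
# Rough sketch (not part of LMT): print the busiest MDS operation counters.
# ASSUMPTION: stats live in /proc/fs/lustre/mds/*/stats and each line looks
# like "<opname> <count> samples ...". Verify the path on your own MDS.
import glob

def read_counters(path):
    counters = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            # keep only lines whose second field is an integer counter
            if len(fields) >= 2 and fields[1].isdigit():
                counters[fields[0]] = int(fields[1])
    return counters

for stats_file in glob.glob('/proc/fs/lustre/mds/*/stats'):
    counters = read_counters(stats_file)
    print(stats_file)
    # ten busiest operations, largest first
    for name, count in sorted(counters.items(),
                              key=lambda item: item[1], reverse=True)[:10]:
        print('  %-20s %12d' % (name, count))

Running this a couple of times during a load spike and comparing the
counts usually makes it obvious whether you are looking at stat/getattr
traffic or create/unlink churn.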

Michael

On Friday, 2009-10-09 at 16:15 +0200, Arne Brutschy wrote:
> Hi,
> 
> thanks for replying!
> 
> I understand that without further information we can't do much about the
> oopses. I was hoping more for some information regarding possible
> sources of such an overload. Is it normal that an MDS gets overloaded
> like this while the OSTs have nothing to do, and what can I do about
> it? How can I find the source of the problem?
> 
> More specifically, which operations generate a lot of MDS load but
> none for the OSTs? Although our MDS (8 GB RAM, 2x4 cores, SATA) is
> not a top-notch server, it's fairly recent, and I feel the load we're
> experiencing is more than any single MDS could be expected to handle.
> 
> My problem is that I can't identify any major problems in the users'
> jobs running on the cluster, and I can't quantify or track down the
> problem because I don't know what behavior might have caused it.
> 
> As I said, oopses appeared only twice; all the other incidents were
> only apparent through a non-responsive MDS.
> 
> Thanks,
> Arne
> 
> 
> On Fri, 2009-10-09 at 07:44 -0400, Brian J. Murrell wrote:
> > On Fri, 2009-10-09 at 10:26 +0200, Arne Brutschy wrote:
> > > 
> > > The clients showed the following error:
> > > > Oct  8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5  req@f6222800 x8702488/t0 o250->MGS@10.255.255.206@tcp:26/25 lens 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0
> > > > Oct  8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar messages
> > > 
> > > So, my question is: what could cause such a load? The cluster was not
> > > excessively used... Is this a bug or a user's job that creates the load?
> > > How can I protect lustre against this kind of failure?
> > 
> > Without any more information we could not possibly know.  If you really
> > are getting oopses then you will need console logs (i.e. serial console)
> > so that we can see the stack trace.
> > 
> > b.
> > 
> > _______________________________________________
> > Lustre-discuss mailing list
> > Lustre-discuss@lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:    (+49) 351 463-37773
e-mail: michael.kluge@tu-dresden.de
WWW:    http://www.tu-dresden.de/zih