[Lustre-discuss] MDS overload, why?

Brian J. Murrell Brian.Murrell at Sun.COM
Tue Oct 20 08:20:22 PDT 2009


On Tue, 2009-10-20 at 16:13 +0200, Arne Brutschy wrote:
> 
> As suggested, I set up a serial console monitor on the MDS and installed
> the lmt (not particularly easy :). This morning, the MDS was overloaded
> again. The CPU load spiked beyond 150 and the logs were showing the
> following errors:

Well, prior to this there are lots of quota errors you really ought to
look into fixing.  Try searching bugzilla for similar problems.  What
version of Lustre is this MDS running?

As for the MDS problems, the first sign of trouble looks like:

Lustre: Request x7178129 sent from lustre-OST0002-osc to NID 10.255.255.202@tcp 50s ago has timed out (limit 50s).
Lustre: lustre-OST0002-osc: Connection to service lustre-OST0002 via nid 10.255.255.202@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: Request x7178131 sent from lustre-OST0002-osc to NID 10.255.255.202@tcp 50s ago has timed out (limit 50s).
Lustre: Request x7178132 sent from lustre-OST0002-osc to NID 10.255.255.202@tcp 50s ago has timed out (limit 50s).
Lustre: Request x7178178 sent from lustre-OST0002-osc to NID 10.255.255.202@tcp 5s ago has timed out (limit 5s).

Which leads you to OST0002, which is at 10.255.255.202, yes?

So what's going on there that caused the MDS to lose its connection to
that OST?
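
If the log is long, a quick tally of the timeout messages per target can
confirm that they all point at the same OST. The following is only a rough
sketch, not a Lustre tool: the function name, the regular expression and the
sample list are my own assumptions, modelled on the message format quoted
above. It accepts the NID written either as "@tcp" or as " at tcp".

```python
import re
from collections import Counter

# Pattern modelled on the timeout lines quoted above (an assumption, not an
# official format definition). The NID separator may appear as "@" or " at ".
TIMEOUT_RE = re.compile(
    r"Request x(?P<xid>\d+) sent from (?P<osc>\S+) to NID "
    r"(?P<addr>[\d.]+)(?:@| at )(?P<net>\w+) (?P<age>\d+)s ago has timed out"
)


def timeouts_by_target(lines):
    """Count timed-out requests per (OSC device, NID) pair."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            nid = f"{match['addr']}@{match['net']}"
            counts[(match["osc"], nid)] += 1
    return counts


# Sample input: the five log lines from this message.
sample = [
    "Lustre: Request x7178129 sent from lustre-OST0002-osc to NID 10.255.255.202 at tcp 50s ago has timed out (limit 50s).",
    "Lustre: lustre-OST0002-osc: Connection to service lustre-OST0002 via nid 10.255.255.202 at tcp was lost; in progress operations using this service will wait for recovery to complete.",
    "Lustre: Request x7178131 sent from lustre-OST0002-osc to NID 10.255.255.202 at tcp 50s ago has timed out (limit 50s).",
    "Lustre: Request x7178132 sent from lustre-OST0002-osc to NID 10.255.255.202 at tcp 50s ago has timed out (limit 50s).",
    "Lustre: Request x7178178 sent from lustre-OST0002-osc to NID 10.255.255.202 at tcp 5s ago has timed out (limit 5s).",
]

for (osc, nid), count in timeouts_by_target(sample).most_common():
    print(f"{osc} -> {nid}: {count} timed-out requests")
```

On a real MDS you would feed it the lines of the syslog or console capture
instead of the sample list. The "Connection ... was lost" line is
deliberately not counted, since it is not a request timeout.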

b.
