[Lustre-discuss] MDS overload, why?

Arne Brutschy arne.brutschy at ulb.ac.be
Wed Oct 21 02:26:14 PDT 2009


On Di, 2009-10-20 at 11:20 -0400, Brian J. Murrell wrote:
> Well, prior to this there are lots of quota errors you really ought to
> look into fixing.  Try searching bugzilla for similar problems.  What
> version of Lustre is this MDS running?

As far as I understand, these 'errors' just report that a user ran out
of quota. The whole cluster is running Lustre 1.6.7.2.
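
The corresponding usage and limits can be checked per user with lfs,
e.g. (user name and mount point below are of course just placeholders
for our setup):

  # block/inode usage and limits for a single user
  lfs quota -u someuser /mnt/lustre
  # verbose form additionally breaks the usage down per OST
  lfs quota -v -u someuser /mnt/lustre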

> As for the MDS problems, the first sign of trouble looks like:
> 
> Lustre: Request x7178129 sent from lustre-OST0002-osc to NID 10.255.255.202@tcp 50s ago has timed out (limit 50s).
> Lustre: lustre-OST0002-osc: Connection to service lustre-OST0002 via nid 10.255.255.202@tcp was lost; in progress operations using this service will wait for recovery to complete.
> Lustre: Request x7178131 sent from lustre-OST0002-osc to NID 10.255.255.202@tcp 50s ago has timed out (limit 50s).
> Lustre: Request x7178132 sent from lustre-OST0002-osc to NID 10.255.255.202@tcp 50s ago has timed out (limit 50s).
> Lustre: Request x7178178 sent from lustre-OST0002-osc to NID 10.255.255.202@tcp 5s ago has timed out (limit 5s).
> 
> Which leads you to OST0002, which is at 10.255.255.202, yes?
> So what's going on there that the MDS lost its connection to the OST
> there?

Well, that's the big question. The OSSs are basically idle, as we mostly
use small files. The OSS log for that time shows:

Oct 19 11:59:33 compute-2-3 kernel: LustreError: 166-1: MGC10.255.255.206@tcp: Connection to service MGS via nid 10.255.255.206@tcp was lost; in progress operations using this service will fail.
Oct 19 12:00:08 compute-2-3 kernel: LustreError: 4421:0:(ldlm_lib.c:552:target_handle_reconnect()) lustre-mdtlov_UUID reconnecting from MGC10.255.255.206@tcp_0, handle mismatch (ours 0x9f3a42f18876fee7, theirs 0x1b158acd85bf638)
Oct 19 12:00:08 compute-2-3 kernel: LustreError: 4421:0:(ldlm_lib.c:786:target_handle_connect()) lustre-OST0003: NID 10.255.255.206@tcp (lustre-mdtlov_UUID) reconnected with 1 conn_cnt; cookies not random?
Oct 19 12:00:08 compute-2-3 kernel: LustreError: 4421:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-114)  req@f56e9e00 x10/t0 o8-><?>@<?>:0/0 lens 240/144 e 0 to 0 dl 1255946508 ref 1 fl Interpret:/0/0 rc -114/0

It is just reporting that it lost the connection. It seems that somehow
the MDS loses its connection to at least one of the OSSs, and then the
situation deteriorates as it gets overwhelmed with reconnect attempts,
which it refuses. In the end, it seems that every client is refused by
the MDS. This looks similar to the problem described here:
http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010244.html
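
Next time this happens I could probably also check basic connectivity
and the import state from the MDS side, something along these lines
(NID and OSC device name taken from the logs above):

  # is the OSS reachable on the LNET level?
  lctl ping 10.255.255.202@tcp
  # list Lustre devices and their state on the MDS
  lctl dl
  # current state of the OSC import for OST0002 (FULL, DISCONN, ...)
  cat /proc/fs/lustre/osc/lustre-OST0002-osc/ost_server_uuid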

Currently, the connection to the clients and between the MDS and OSSs
is a simple gigabit link (two 48-port managed switches). We don't have
a separate storage network, as traffic on the job side of things is
very low. Looking at the interface statistics, I can't see any errors
or dropped packets during normal operation.
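
(For reference, those are the usual per-interface counters; eth0 below
is simply whichever interface the Lustre NIDs are bound to on our
nodes:)

  # kernel view of RX/TX errors and drops
  ifconfig eth0 | grep -E 'errors|dropped'
  # driver/NIC level counters, where the driver supports them
  ethtool -S eth0 | grep -iE 'err|drop'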

Maybe I should connect the Lustre servers to each other directly and
see whether the latency/timeout problems persist.
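
In the meantime I might also try raising the obd timeout from its
default, so that a short hiccup does not immediately turn into the
reconnect storm described above. If I read the 1.6 manual correctly,
this can be done on the fly or persistently via the MGS (300s is just
an example value):

  # on the fly, on every server and client
  echo 300 > /proc/sys/lustre/timeout
  # persistently, run on the MGS node
  lctl conf_param lustre.sys.timeout=300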

Regards,
Arne

-- 
Arne Brutschy
Ph.D. Student                    Email    arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6                  Web      iridia.ulb.ac.be/~abrutschy
Universite' Libre de Bruxelles   Tel      +32 2 650 3168
Avenue Franklin Roosevelt 50     Fax      +32 2 650 2715
1050 Bruxelles, Belgium          (Fax at IRIDIA secretary)



