[Lustre-discuss] Connection losses to MGS/MDS
Thomas Roth
t.roth at gsi.de
Thu Dec 18 10:30:31 PST 2008
Hi all,
In a cluster with 375 clients, over a 12-hour period I get about 500
messages of the type
> Connection to service MGS via nid A.B.C.D@tcp was lost; in progress
> operations using this service will fail.
and about 800 messages of the type
> Connection to service MDT0000 via nid A.B.C.D@tcp was lost; in
> progress operations using this service will wait for recovery to complete.
These clients are batch farm nodes; they continuously run all kinds of
user jobs that read and write data on Lustre.
I have no way of telling how bad this situation is, since I only have
the error logs of our own cluster to go by. I have seen these messages
ever since we started testing this cluster, but I did not try to count
them back then, because performance was splendid at the time.
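
For anyone who wants to compare numbers, here is a rough sketch of how
such messages could be counted per service. It assumes the client logs
have been collected as plain text with each loss message on a single
line; the function name count_losses and the sample lines (timestamp,
host name, "kernel:" prefix) are made up for illustration, and only the
message text itself is taken from the logs quoted above.

```python
import re
from collections import Counter

# Matches the message text quoted above. Everything before "Connection"
# (timestamp, host name, etc.) is ignored, since that prefix depends on
# the local syslog setup and is an assumption here.
LOST = re.compile(r"Connection to service (\S+) via nid .*? was lost")


def count_losses(lines):
    """Count connection-loss messages per service name (e.g. MGS, MDT0000)."""
    counts = Counter()
    for line in lines:
        match = LOST.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; in practice, pass an open log file instead.
    sample = [
        "Dec 18 09:00:01 node001 kernel: Connection to service MGS via nid "
        "A.B.C.D@tcp was lost; in progress operations using this service will fail.",
        "Dec 18 09:00:05 node002 kernel: Connection to service MDT0000 via nid "
        "A.B.C.D@tcp was lost; in progress operations using this service will "
        "wait for recovery to complete.",
        "Dec 18 09:00:09 node002 kernel: some unrelated message",
    ]
    for service, n in sorted(count_losses(sample).items()):
        print(service, n)
```

Since the function accepts any iterable of lines, an open log file can
be passed to it directly. Dividing the totals by the number of clients
gives a per-client rate that is easier to compare between sites.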
So what is your experience? Should there be no errors of this kind at
all, is it something to be expected on a busy network, should there be a
few connection losses due to specific machine problems, or is this just
normal?
Thanks,
Thomas