[Lustre-discuss] Connection losses to MGS/MDS

Thomas Roth t.roth at gsi.de
Thu Dec 18 10:30:31 PST 2008


Hi all,

In a cluster with 375 clients, I get about 500 messages of the following
type over a 12-hour period:

 > Connection to service MGS via nid A.B.C.D@tcp was lost; in progress
 > operations using this service will fail.

and about 800 messages of the type

 > Connection to service MDT0000 via nid A.B.C.D@tcp was lost; in
 > progress operations using this service will wait for recovery to complete.
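
For anyone who wants to compare numbers with their own cluster, here is a
minimal sketch of how the two message types quoted above could be counted
per service and normalized per client. It assumes the client logs have
already been collected as plain text lines; the function names and the
sample lines are hypothetical, and the pattern is based only on the message
wording shown in this mail.

```python
import re
from collections import Counter

# Pattern derived from the message text quoted in this mail. The nid part is
# matched loosely so that both "A.B.C.D@tcp" and "A.B.C.D at tcp" are accepted.
LOST_RE = re.compile(r"Connection to service (\S+) via nid (.+?) was lost")


def count_connection_losses(lines):
    """Count 'connection lost' messages per service name (MGS, MDT0000, ...)."""
    counts = Counter()
    for line in lines:
        match = LOST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def losses_per_client(total_messages, num_clients):
    """Average number of messages per client over the observed period."""
    return total_messages / num_clients


if __name__ == "__main__":
    # Made-up sample lines that reuse the wording of the quoted messages.
    sample = [
        "Connection to service MGS via nid A.B.C.D@tcp was lost; in progress "
        "operations using this service will fail.",
        "Connection to service MDT0000 via nid A.B.C.D@tcp was lost; in "
        "progress operations using this service will wait for recovery to "
        "complete.",
        "some unrelated log line",
    ]
    print(dict(count_connection_losses(sample)))

    # Figures from this mail: 375 clients, 12-hour period.
    print(round(losses_per_client(500, 375), 2))  # MGS: 1.33 per client
    print(round(losses_per_client(800, 375), 2))  # MDT0000: 2.13 per client
```

With the figures above, this works out to roughly one to two connection
losses per client and service in 12 hours, assuming the messages are spread
evenly, which I have not checked.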

These clients are batch farm nodes that continuously run all kinds of
user jobs reading and writing data on Lustre.

I have no way of telling how bad this situation is, since the only error
logs I know are those of our own cluster. I have seen these messages right
from the start of testing this cluster, but did not try to count them,
since performance was splendid at the time.

So what is your experience? Should there be no errors of this kind at 
all, is it something to be expected on a busy network, should there be a 
few connection losses due to specific machine problems, or is this just 
normal?

Thanks,
Thomas



