[Lustre-discuss] Connection losses to MGS/MDS
Thomas Roth
t.roth@gsi.de
Fri Dec 19 03:33:56 PST 2008
Hi Wojciech,
Wojciech Turek wrote:
> Hi,
>
> It doesn't look healthy. I assume that those messages and the numbers
> are from the client side, what do you see on the MDS server itself?
I haven't yet found a good correlation between the client-side and MDS
messages.
On the MDS itself I do see evictions, connections refused due to
left-over active RPCs, and timeouts because "Request x55349122 took
longer than estimated" - the whole spectrum, I think.
> It seems to me that your network connection to the MDS is flaky, hence
> the many disconnection messages. It may not noticeably hurt your
> bandwidth, but it will certainly kill your metadata
> performance. I suggest running some tests and seeing for yourself. From your
> email I see that you are using Ethernet to connect the MDS to the rest
> of the cluster. It may be worth checking the cable or the interface
> for errors and dropped packets.
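A quick way to check those counters, as a sketch (column positions assume
the standard /proc/net/dev layout; adjust for your interfaces):

```shell
# Sketch: flag interfaces with non-zero RX error/drop counters in
# /proc/net/dev (columns: iface, rx_bytes, rx_packets, rx_errs, rx_drop, ...).
check_ifaces() {
    awk 'NR > 2 {
        gsub(/:/, " ")                    # "eth0:123..." -> "eth0 123..."
        if ($4 + $5 > 0)                  # $4 = rx_errs, $5 = rx_drop
            printf "%s rx_errs=%d rx_drop=%d\n", $1, $4, $5
    }' "$1"
}
# Run on the MDS and a sample of clients:
# check_ifaces /proc/net/dev
```

ethtool -S <iface> gives the NIC/driver-level counters if more detail is
needed.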
The major trouble as seen from the user's side is of the type "Node A
doesn't see Lustre". Jobs dispatched to such a node then cannot run and
exit with failure. On inspection the node is doing fine, Lustre is
mounted and accessible - it just took too long to reactivate the
connection. So indeed, metadata performance is dead.
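One mitigation we could try on the batch side - a hedged sketch, assuming
the filesystem is mounted at /lustre and coreutils' timeout is available
(and keeping in mind a stat may be served from cache, so it is only a rough
liveness probe) - is a prologue check that fails fast when a metadata
operation hangs, so the scheduler holds the node instead of dispatching
jobs that will die:

```shell
# Sketch of a batch prologue check: stat forces a metadata round trip to
# the MDS; if it does not answer within 10 s, report the node unhealthy.
check_lustre() {
    dir=${1:-/lustre}                 # mount point is an assumption; adjust
    timeout 10 stat "$dir" >/dev/null 2>&1
}
# In the prologue:
# check_lustre /lustre || exit 1     # non-zero keeps jobs off this node
```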
>
> I have a 600-node cluster here, 100% utilized with jobs most of the
> time; Lustre serves the /home and /scratch file systems and I don't see
> these messages in the logs. I use Lustre 1.6.6 for RHEL4.
Thanks. That's what I wanted to hear.
Regards,
Thomas
>
> cheers
>
> Wojciech
>
> Thomas Roth wrote:
>> Hi all,
>>
>> In a cluster with 375 clients, over a 12-hour period I get about 500
>> messages of the type
>>
>> > Connection to service MGS via nid A.B.C.D@tcp was lost; in progress
>> operations using this service will fail.
>>
>> and about 800 messages of the type
>>
>> > Connection to service MDT0000 via nid A.B.C.D@tcp was lost; in
>> progress operations using this service will wait for recovery to
>> complete.
>>
>> Those clients are batch farm nodes; they continuously run all kinds of
>> user jobs that read and write data on Lustre.
>>
>> I have no way of telling how bad this situation is, since I only know
>> the error logs of our own cluster. I have seen these messages right
>> from the start of testing this cluster, but did not try to count them,
>> since the performance at the time was splendid.
>>
>> So what is your experience? Should there be no errors of this kind at
>> all? Is some amount to be expected on a busy network? Should there be
>> a few connection losses due to specific machine problems, or is this
>> just normal?
>>
>> Thanks,
>> Thomas
>>
>> _______________________________________________
>> Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>
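P.S. As for counting the messages: a sketch for summarizing the losses per
service from aggregated client syslogs (file paths and exact log format are
assumptions based on the messages quoted above):

```shell
# Sketch: count "Connection to service X ... was lost" events per service,
# most frequent first. Assumes client syslogs are gathered in the given files.
loss_summary() {
    grep -h 'was lost' "$@" |
        sed -n 's/.*Connection to service \([^ ]*\) via.*/\1/p' |
        sort | uniq -c | sort -rn
}
# loss_summary /var/log/messages*
```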
--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453 Fax: +49-6159-71 2986
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de
Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528
Geschäftsführer: Professor Dr. Horst Stöcker
Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt