[Lustre-discuss] Connection losses to MGS/MDS

Thomas Roth t.roth@gsi.de
Fri Dec 19 03:33:56 PST 2008


Hi Wojciech,

Wojciech Turek wrote:
> Hi,
> 
> It doesn't look healthy. I assume that those messages and the numbers 
> are from the client side. What do you see on the MDS server itself?

I haven't found a good correlation between the client-side and MDS-side 
messages yet.
On the MDS itself I do see evictions, connections refused because of 
leftover active RPCs, and also timeouts because "Request x55349122 took 
longer than estimated" - the whole spectrum, I think.
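
Something like the following could tally those messages per hour and per 
service, so the client- and MDS-side logs can be lined up by time. This 
is only a sketch; the syslog path and timestamp format are assumptions 
about our setup:

    #!/usr/bin/env python
    # Sketch: count Lustre "Connection to service ... was lost" messages
    # per hour and per service from a syslog-style log file.
    # Assumes lines like "Dec 19 03:33:56 host kernel: Lustre: ..." in
    # /var/log/messages -- adjust path and pattern as needed.
    import re
    from collections import Counter

    LOST = re.compile(r'Connection to service (\S+) via nid \S+ was lost')

    buckets = Counter()
    with open('/var/log/messages') as log:
        for line in log:
            m = LOST.search(line)
            if m:
                hour = line[:9]        # "Dec 19 03": month, day, hour
                buckets[(hour, m.group(1))] += 1

    for (hour, service), count in sorted(buckets.items()):
        print('%s  %-8s %d' % (hour, service, count))

Running the same script on a client and on the MDS should show whether 
the loss bursts coincide.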


> It seems to me that your network connection to the MDS is flaky, hence 
> the many disconnection messages. It may not noticeably hurt your 
> bandwidth, but it will certainly kill your metadata performance. I 
> suggest running some tests and seeing for yourself. From your email I 
> see that you are using Ethernet to connect the MDS to the rest of the 
> cluster. It may be worth checking the cable or the interface for 
> errors and dropped packets.
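
A quick way to watch those counters is to sample /proc/net/dev twice and 
print the deltas - a minimal sketch, with the interface name eth0 and 
the 60-second window as assumptions:

    #!/usr/bin/env python
    # Sketch: check whether an interface's error/drop counters are growing.
    # Reads /proc/net/dev twice and prints the difference.
    import time

    def counters(iface):
        # After "iface:", fields 0-7 are receive, 8-15 are transmit.
        with open('/proc/net/dev') as f:
            for line in f:
                if line.strip().startswith(iface + ':'):
                    v = [int(x) for x in line.split(':', 1)[1].split()]
                    return {'rx_errs': v[2], 'rx_drop': v[3],
                            'tx_errs': v[10], 'tx_drop': v[11]}
        raise SystemExit('no such interface: ' + iface)

    before = counters('eth0')          # assumed interface name
    time.sleep(60)                     # sample over one minute under load
    after = counters('eth0')
    for key in sorted(before):
        print('%s grew by %d' % (key, after[key] - before[key]))

Any steady growth here under load would point at the cable or NIC rather 
than at Lustre itself.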

The major trouble as seen from the user's side is of the type "node A 
doesn't see Lustre". Jobs dispatched to such a node cannot run and exit 
with failure. On inspection the node is doing fine, Lustre is mounted 
and accessible - it just took too long to reactivate the connection. So 
indeed, metadata performance is dead.
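
To put a number on that, a crude version of the metadata test suggested 
above is to time create/stat/unlink of many small files on the mount - 
each of these operations is a round trip to the MDS. The directory path 
and file count below are placeholders:

    #!/usr/bin/env python
    # Sketch: rough metadata rates (creates, stats, unlinks per second)
    # against a directory on the Lustre mount.
    import os, time

    DIR = '/lustre/mdtest'   # assumed scratch directory on the Lustre mount
    N = 5000                 # number of files per phase

    if not os.path.isdir(DIR):
        os.makedirs(DIR)
    names = [os.path.join(DIR, 'f%05d' % i) for i in range(N)]

    t = time.time()
    for name in names:
        open(name, 'w').close()        # create: one MDS request per file
    print('create: %6.1f ops/s' % (N / (time.time() - t)))

    t = time.time()
    for name in names:
        os.stat(name)                  # stat: pure metadata traffic
    print('stat:   %6.1f ops/s' % (N / (time.time() - t)))

    t = time.time()
    for name in names:
        os.unlink(name)                # unlink: one MDS request per file
    print('unlink: %6.1f ops/s' % (N / (time.time() - t)))

Comparing the rates during a quiet period with the rates while the farm 
is busy should make the stalls visible.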

> 
> I have a 600-node cluster here, 100% utilized with jobs most of the 
> time; Lustre is serving the /home and /scratch file systems, and I 
> don't see these messages in the logs. I use Lustre 1.6.6 for RHEL4.

Thanks. That's what I wanted to hear.

Regards,
Thomas


> 
> cheers
> 
> Wojciech
> 
> Thomas Roth wrote:
>> Hi all,
>>
>> in a cluster with 375 clients, over a 12-hour period I get about 500 
>> messages of the type
>>
>>  > Connection to service MGS via nid A.B.C.D@tcp was lost; in progress 
>> operations using this service will fail.
>>
>> and about 800 messages of the type
>>
>>  > Connection to service MDT0000 via nid A.B.C.D@tcp was lost; in 
>> progress operations using this service will wait for recovery to 
>> complete.
>>
>> Those clients are batch-farm nodes; they continuously run all kinds of 
>> user jobs that read and write data on Lustre.
>>
>> I have no way of telling how bad this situation is, since all I know 
>> is the error logs of our cluster. I have seen these messages right 
>> from the start of testing this cluster, but did not try to count them, 
>> since the performance at the time was splendid.
>>
>> So what is your experience? Should there be no errors of this kind at 
>> all? Are they to be expected on a busy network? Should there be a few 
>> connection losses due to specific machine problems, or is this just 
>> normal?
>>
>> Thanks,
>> Thomas
>>
>> _______________________________________________
>> Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>   
> 

-- 
--------------------------------------------------------------------
Thomas Roth
Department: Information Technology
Location: SB3 1.262
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de

Limited liability company (GmbH)
Registered office: Darmstadt
Commercial register: Amtsgericht Darmstadt, HRB 1528

Managing Director: Professor Dr. Horst Stöcker

Chair of the Supervisory Board: Dr. Beatrix Vierkorn-Rudolph,
Deputy: Ministerialdirigent Dr. Rolf Bernhardt


